Programming Ruby

The Pragmatic Programmer's Guide


The Ruby Language



This chapter is a bottom-up look at the Ruby language. Unlike the previous tutorial, here we're concentrating on presenting facts, rather than motivating some of the language design features. We also ignore the built-in classes and modules where possible. These are covered in depth starting on page 275.

If the content of this chapter looks familiar, it's because it should; we've covered just about all of this in the earlier tutorial chapters. Consider this chapter to be a self-contained reference to the core Ruby language.

Source Layout

Ruby programs are written in 7-bit ASCII.[Ruby also has extensive support for Kanji, using the EUC, SJIS, or UTF-8 coding system. If a code set other than 7-bit ASCII is used, the KCODE option must be set appropriately, as shown on page 137.]

Ruby is a line-oriented language. Ruby expressions and statements are terminated at the end of a line unless the statement is obviously incomplete---for example if the last token on a line is an operator or comma. A semicolon can be used to separate multiple expressions on a line. You can also put a backslash at the end of a line to continue it onto the next. Comments start with `#' and run to the end of the physical line. Comments are ignored during compilation.

a = 1

b = 2; c = 3

d = 4 + 5 +      # no '\' needed
    6 + 7

e = 8 + 9   \
    + 10         # '\' needed

Physical lines between a line starting with =begin and a line starting with =end are ignored by the compiler and may be used for embedded documentation (see Appendix A, which begins on page 511).

Ruby reads its program input in a single pass, so you can pipe programs to the compiler's stdin.

echo 'print "Hello\n"' | ruby

If the compiler comes across a line anywhere in the source containing just ``__END__'', with no leading or trailing whitespace, it treats that line as the end of the program---any subsequent lines will not be compiled. However, these lines can be read into the running program using the global IO object DATA, described on page 217.

BEGIN and END Blocks

Every Ruby source file can declare blocks of code to be run as the file is being loaded (the BEGIN blocks) and after the program has finished executing (the END blocks).

BEGIN {
  begin code
}

END { end code }

A program may include multiple BEGIN and END blocks. BEGIN blocks are executed in the order they are encountered. END blocks are executed in reverse order.

General Delimited Input

There are alternative forms of literal strings, arrays, regular expressions, and shell commands that are specified using a generalized delimited syntax. All these literals start with a percent character, followed by a single character that identifies the literal's type. These characters are summarized in Table 18.1 on page 200; the actual literals are described in the corresponding sections later in this chapter.

Table 18.1: General delimited input

Type     Meaning                      See Page
%q       Single-quoted string         202
%Q, %    Double-quoted string         202
%w       Array of tokens              204
%r       Regular expression pattern   205
%x       Shell command                218

Following the type character is a delimiter, which can be any character. If the delimiter is one of the characters ``('', ``['', ``{'', or ``<'', the literal consists of the characters up to the matching closing delimiter, taking account of nested delimiter pairs. For all other delimiters, the literal comprises the characters up to the next occurrence of the delimiter character.

%q/this is a string/
%q-string-
%q(a (nested) string)

Delimited strings may continue over multiple lines.

%q{def fred(a)
     a.each { |i| puts i }
   end}
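
As a rough illustration of the delimiter rules above, the following sketch writes several kinds of literal in general delimited form. The method name delimited_samples is hypothetical and exists only to collect the results in one place.

```ruby
# Hypothetical helper: returns literals written in general delimited form.
def delimited_samples
  [
    %q/plain 'single-quoted' text/,   # any character can act as delimiter
    %q(a (nested) string),            # bracket-style delimiters may nest
    %Q{1 + 1 = #{1 + 1}},             # %Q substitutes like double quotes
    %w[fred wilma barney],            # array of tokens
    %r{^/usr/local/}                  # pattern; slashes need no escaping
  ]
end

delimited_samples.each { |literal| p literal }
```

Choosing a delimiter that does not occur in the content, as with %r{...} for a path, avoids the need for backslash escapes.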

The Basic Types

The basic types in Ruby are numbers, strings, arrays, hashes, ranges, symbols, and regular expressions.

Integer and Floating Point Numbers

Ruby integers are objects of class Fixnum or Bignum. Fixnum objects hold integers that fit within the native machine word minus 1 bit. Whenever a Fixnum exceeds this range, it is automatically converted to a Bignum object, whose range is effectively limited only by available memory. If an operation with a Bignum result has a final value that will fit in a Fixnum, the result will be returned as a Fixnum.

Integers are written using an optional leading sign, an optional base indicator (0 for octal, 0x for hex, or 0b for binary), followed by a string of digits in the appropriate base. Underscore characters are ignored in the digit string.

You can get the integer value corresponding to an ASCII character by preceding that character with a question mark. Control and meta combinations of characters can also be generated using ?\C-x, ?\M-x, and ?\M-\C-x. The control version of ch is ch&0x9f, and the meta version is ch | 0x80. You can get the integer value of a backslash character using the sequence ?\\.

123456                    # Fixnum
123_456                   # Fixnum (underscore ignored)
-543                      # Negative Fixnum
123_456_789_123_345_789   # Bignum
0xaabb                    # Hexadecimal
0377                      # Octal
-0b1010                   # Binary (negated)
0b001_001                 # Binary
?a                        # character code
?A                        # upper case
?\C-a                     # control a = A - 0x40
?\C-A                     # case ignored for control chars
?\M-a                     # meta sets bit 7
?\M-\C-a                  # meta and control a
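
The base indicators can be checked with a short sketch that pairs each literal with its decimal value. The method name integer_literal_pairs is hypothetical. The sketch compares values only; it deliberately leaves out the ?x character form and the integer class names, since those may behave differently depending on the Ruby release in use.

```ruby
# Hypothetical helper: each pair is [literal, the same value in decimal].
def integer_literal_pairs
  [
    [123_456,   123456],  # underscores in the digit string are ignored
    [0xaabb,    43707],   # hexadecimal
    [0377,      255],     # octal
    [0b001_001, 9],       # binary
    [-0b1010,   -10]      # binary, negated
  ]
end

integer_literal_pairs.each do |literal, decimal|
  puts "#{literal} == #{decimal}: #{literal == decimal}"
end
```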

A numeric literal with a decimal point and/or an exponent is turned into a Float object, corresponding to the native architecture's double data type. You must follow the decimal point with a digit, as 1.e3 tries to invoke the method e3 in class Fixnum.

12.34 » 12.34
-.1234e2 » -12.34
1234e-2 » 12.34

Strings

Ruby provides a number of mechanisms for creating literal strings. Each generates objects of type String. The different mechanisms vary in terms of how a string is delimited and how much substitution is done on the literal's content.

Single-quoted string literals (' stuff ' and %q/stuff/) undergo the least substitution. Both convert the sequence \\ into a single backslash, and the form with single quotes converts \' into a single quote.

'hello' » hello
'a backslash \'\\\'' » a backslash '\'
%q/simple string/ » simple string
%q(nesting (really) works) » nesting (really) works
%q no_blanks_here ; » no_blanks_here

Double-quoted strings ("stuff", %Q/stuff/, and %/stuff/) undergo additional substitutions, shown in Table 18.2 on page 203.

Table 18.2: Substitutions in double-quoted strings

\a    Bell/alert (0x07)       \nnn       Octal nnn
\b    Backspace (0x08)        \xnn       Hex nn
\e    Escape (0x1b)           \cx        Control-x
\f    Formfeed (0x0c)         \C-x       Control-x
\n    Newline (0x0a)          \M-x       Meta-x
\r    Return (0x0d)           \M-\C-x    Meta-control-x
\s    Space (0x20)            \x         x
\t    Tab (0x09)              #{expr}    Value of expr
\v    Vertical tab (0x0b)
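
To make the difference in substitution concrete, this small sketch puts the same four characters between single and double quotes and compares the resulting lengths. The method name quoting_lengths is hypothetical.

```ruby
# Hypothetical helper: the same characters between different quotes.
def quoting_lengths
  single = 'a\nb'   # backslash and "n" remain two separate characters
  double = "a\nb"   # \n is replaced by a single newline character
  [single.length, double.length]
end

p quoting_lengths   # prints [4, 3]
```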

a  = 123
"\123mile" » Smile
"Say \"Hello\"" » Say "Hello"
%Q!"I said 'nuts'," I said! » "I said 'nuts'," I said
%Q{Try #{a + 1}, not #{a - 1}} » Try 124, not 122
%<Try #{a + 1}, not #{a - 1}> » Try 124, not 122
"Try #{a + 1}, not #{a - 1}" » Try 124, not 122

Strings can continue across multiple input lines, in which case they will contain newline characters. It is also possible to use here documents to express long string literals. Whenever Ruby parses the sequence <<identifier or <<quoted string, it replaces it with a string literal built from successive logical input lines. It stops building the string when it finds a line that starts with the identifier or the quoted string. You can put a minus sign immediately after the << characters, in which case the terminator can be indented from the left margin. If a quoted string was used to specify the terminator, its quoting rules will be applied to the here document; otherwise, double-quoting rules apply.

a = 123
print <<HERE
Double quoted \
here document.
Sum = #{a + 1}
HERE

print <<-'THERE'
    This is single quoted.
    The above used #{a + 1}
    THERE
produces:
Double quoted here document.
Sum = 124
    This is single quoted.
    The above used #{a + 1}
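
The same two here document forms can also be used inside a method. The sketch below is only an illustration; the method name here_doc_samples and the terminators PLAIN and QUOTED are hypothetical.

```ruby
# Hypothetical helper: builds one string with each here document form.
def here_doc_samples(total)
  # Unquoted terminator: double-quoting rules, terminator at the left margin.
  interpolated = <<PLAIN
Sum = #{total}
PLAIN

  # Single-quoted terminator after <<-: no substitution, and the
  # terminator may be indented.
  literal = <<-'QUOTED'
    Sum = #{total}
    QUOTED

  [interpolated, literal]
end

print(*here_doc_samples(124))
```

Note that the minus sign only relaxes where the terminator may appear; the leading spaces on the content line remain part of the string.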

Adjacent single- and double-quoted strings in the input are concatenated to form a single String object.

'Con' "cat" 'en' "ate" » "Concatenate"

Strings are stored as sequences of 8-bit bytes,[For use in Japan, the jcode library supports a set of operations of strings written with EUC, SJIS, or UTF-8 encoding. The underlying string, however, is still accessed as a series of bytes.] and each byte may contain any of the 256 8-bit values, including null and newline. The substitution mechanisms in Table 18.2 on page 203 allow nonprinting characters to be inserted conveniently and portably.

Every time a string literal is used in an assignment or as a parameter, a new String object is created.

for i in 1..3
  print 'hello'.id, " "
end
produces:
537767360 537767070 537767040

The documentation for class String starts on page 363.

Ranges

Outside the context of a conditional expression, expr .. expr and expr ... expr construct Range objects. The two-dot form is an inclusive range; the one with three dots is a range that excludes its last element. See the description of class Range on page 359 for details. Also see the description of conditional expressions on page 222 for other uses of ranges.
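
A minimal sketch of the two forms, using a hypothetical method named range_forms, shows the effect of the extra dot on the last element.

```ruby
# Hypothetical helper: expands both range forms over the same endpoints.
def range_forms(first, last)
  inclusive = (first..last).to_a    # two dots keep the last element
  exclusive = (first...last).to_a   # three dots leave it out
  [inclusive, exclusive]
end

p range_forms(1, 5)   # prints [[1, 2, 3, 4, 5], [1, 2, 3, 4]]
```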

Arrays

Literals of class Array are created by placing a comma-separated series of object references between square brackets. A trailing comma is ignored.

arr = [ fred, 10, 3.14, "This is a string", barney("pebbles"), ]

Arrays of strings can be constructed using a shortcut notation, %w, which extracts space-separated tokens into successive elements of the array. A space can be escaped with a backslash. This is a form of general delimited input, described on pages 200--201.

arr = %w( fred wilma barney betty great\ gazoo )
arr » ["fred", "wilma", "barney", "betty", "great gazoo"]

Hashes

A literal Ruby Hash is created by placing a list of key/value pairs between braces, with either a comma or the sequence => between the key and the value. A trailing comma is ignored.

colors = { "red"   => 0xf00,
           "green" => 0x0f0,
           "blue"  => 0x00f
         }

There is no requirement for the keys and/or values in a particular hash to have the same type.

Requirements for a Hash Key

The only restriction for a hash key is that it must respond to the message hash with a hash value, and the hash value for a given key must not change. This means that certain classes (such as Array and Hash, as of this writing) can't conveniently be used as keys, because their hash values can change based on their contents.

If you keep an external reference to an object that is used as a key, and use that reference to alter the object and change its hash value, the hash lookup based on that key may not work.

Because strings are the most frequently used keys, and because string contents are often changed, Ruby treats string keys specially. If you use a String object as a hash key, the hash will duplicate the string internally and will use that copy as its key. Any changes subsequently made to the original string will not affect the hash.

If you write your own classes and use instances of them as hash keys, you need to make sure that either (a) the hashes of the key objects don't change once the objects have been created or (b) you remember to call the Hash#rehash method to reindex the hash whenever a key hash is changed.
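
The following sketch contrasts the two cases described above: a String key, which the hash copies, and an Array key, which it does not. Both method names are hypothetical, and the value read before the call to rehash is returned only for inspection, because that lookup is the one that may not work.

```ruby
# Hypothetical helper: a String key is copied when it is stored.
def string_key_demo
  name = "fred".dup        # .dup yields a string that is safe to modify
  ages = {}
  ages[name] = 30
  name << "dy"             # change the original after storing it
  [ages["fred"], ages["freddy"]]
end

# Hypothetical helper: an Array key is not copied, so changing it alters
# its hash value and the table has to be reindexed.
def array_key_demo
  key = [1, 2]
  table = {}
  table[key] = "found"
  key << 3                 # the key's hash value changes here
  before = table[key]      # this lookup may now fail
  table.rehash             # reindex using the current hash values
  after = table[key]
  [before, after]
end

p string_key_demo
p array_key_demo
```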

Symbols

A Ruby symbol is the internal representation of a name. You construct the symbol for a name by preceding the name with a colon. A particular name will always generate the same symbol, regardless of how that name is used within the program.

:Object
:myVariable

Other languages call this process ``interning,'' and call symbols ``atoms.''
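
One way to observe that a name always yields the same symbol is to compare object identity rather than equality. In this sketch the method name same_symbol? is hypothetical, and String#to_sym is used to build the symbol from a string at run time.

```ruby
# Hypothetical helper: true only if the name maps to the very same object
# as the literal :myVariable.
def same_symbol?(name)
  name.to_sym.equal?(:myVariable)
end

puts same_symbol?("myVariable")   # prints true
puts same_symbol?("Object")       # prints false
```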

Regular Expressions

Regular expression literals are objects of type Regexp. They can be created by explicitly calling the Regexp.new constructor, or by using the literal forms, /pattern/ and %r{ pattern }. The %r construct is a form of general delimited input (described on pages 200--201).

/pattern/
/pattern/options
%r{pattern}
%r{pattern}options
Regexp.new( 'pattern' [, options ] )

Regular Expression Options

A regular expression may include one or more options that modify the way the pattern matches strings. If you're using literals to create the Regexp object, then the options comprise one or more characters placed immediately after the terminator. If you're using Regexp.new, the options are constants used as the second parameter of the constructor.

i Case Insensitive. The pattern match will ignore the case of letters in the pattern and string. Matches are also case-insensitive if the global variable $= is set.
o Substitute Once. Any #{...} substitutions in a particular regular expression literal will be performed just once, the first time it is evaluated. Otherwise, the substitutions will be performed every time the literal generates a Regexp object.
m Multiline Mode. Normally, ``.'' matches any character except a newline. With the /m option, ``.'' matches any character.
x Extended Mode. Complex regular expressions can be difficult to read. The `x' option allows you to insert spaces, newlines, and comments in the pattern to make it more readable.
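
The effect of the i, m, and x options can be seen by matching the same text with and without them. This is a sketch only; the method name option_effects is hypothetical, and Regexp::IGNORECASE is used as the constant form of the i option.

```ruby
# Hypothetical helper: each entry is the match position, or nil on failure.
def option_effects
  extended = /
    [a-z]+   # some letters
    \d+      # then some digits
  /x
  [
    "HELLO" =~ /hello/,                                  # case matters
    "HELLO" =~ /hello/i,                                 # case ignored
    "HELLO" =~ Regexp.new("hello", Regexp::IGNORECASE),  # same, via constant
    "a\nb"  =~ /a.b/,                                    # "." stops at newline
    "a\nb"  =~ /a.b/m,                                   # "." matches newline
    "ab12"  =~ extended                                  # layout is ignored
  ]
end

p option_effects   # prints [nil, 0, 0, nil, 0, 0]
```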

Regular Expression Patterns

regular characters
All characters except ., |, (, ), [, \, ^, {, +, $, *, and ? match themselves. To match one of these characters, precede it with a backslash.

^
Matches the beginning of a line.

$
Matches the end of a line.

\A
Matches the beginning of the string.

\z
Matches the end of the string.

\Z
Matches the end of the string unless the string ends with a ``\n'', in which case it matches just before the ``\n''.

\b, \B
Match word boundaries and nonword boundaries respectively.

[ characters ]
A character class matches any single character between the brackets. The characters |, (, ), [, ^, $, *, and ?, which have special meanings elsewhere in patterns, lose their special significance between brackets. The sequences \nnn, \xnn, \cx, \C-x, \M-x, and \M-\C-x have the meanings shown in Table 18.2 on page 203. The sequences \d, \D, \s, \S, \w, and \W are abbreviations for groups of characters, as shown in Table 5.1 on page 59. The sequence c1-c2 represents all the characters between c1 and c2, inclusive. Literal ] or - characters must appear immediately after the opening bracket. An uparrow (^) immediately following the opening bracket negates the sense of the match---the pattern matches any character that isn't in the character class.

\d, \s, \w
Are abbreviations for character classes that match digits, whitespace, and word characters, respectively. \D, \S, and \W match characters that are not digits, whitespace, or word characters. These abbreviations are summarized in Table 5.1 on page 59.

. (period)
Appearing outside brackets, matches any character except a newline. (With the /m option, it matches newline, too).

re *
Matches zero or more occurrences of re.

re +
Matches one or more occurrences of re.

re {m,n}
Matches at least ``m'' and at most ``n'' occurrences of re.

re ?
Matches zero or one occurrence of re. The *, +, and {m,n} modifiers are greedy by default. Append a question mark to make them minimal.

re1 | re2
Matches either re1 or re2. | has a low precedence.

(...)
Parentheses are used to group regular expressions. For example, the pattern /abc+/ matches a string containing an ``a,'' a ``b,'' and one or more ``c''s. /(abc)+/ matches one or more sequences of ``abc''. Parentheses are also used to collect the results of pattern matching. For each opening parenthesis, Ruby stores the result of the partial match between it and the corresponding closing parenthesis as successive groups. Within the same pattern, \1 refers to the match of the first group, \2 the second group, and so on. Outside the pattern, the special variables $1, $2, and so on, serve the same purpose.
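
Several of the pattern elements above are easy to confuse, so the sketch below compares line anchors with string anchors, and greedy with minimal repetition, and then uses a backreference. Both method names are hypothetical.

```ruby
# Hypothetical helper: line anchors versus string anchors.
def anchor_samples
  text = "one\ntwo\n"
  [
    text =~ /^two/,    # 4: ^ matches at the start of any line
    text =~ /\Atwo/,   # nil: \A matches only at the start of the string
    text =~ /two\Z/,   # 4: \Z allows for a trailing newline
    text =~ /two\z/    # nil: \z requires the very end of the string
  ]
end

# Hypothetical helper: greedy and minimal repetition, then a backreference.
def repetition_samples
  tags = "<a><b>"
  position = "hello" =~ /(.)\1/   # \1 repeats whatever group 1 matched
  doubled  = $1                   # save $1 before the next match resets it
  [tags[/<.+>/], tags[/<.+?>/], position, doubled]
end

p anchor_samples       # prints [4, nil, 4, nil]
p repetition_samples   # prints ["<a><b>", "<a>", 2, "l"]
```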

Substitutions

#{...}
Performs an expression substitution, as with strings. By default, the substitution is performed each time a regular expression literal is evaluated. With the /o option, it is performed just the first time.

\0, \1, \2, ... \9, \&, \`, \', \+
Substitutes the value matched by the nth grouped subexpression, or by the entire match, pre- or postmatch, or the highest group.
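
As a sketch of the backslash sequences, assume they appear in the replacement string passed to String#sub. The method names swap_names and bracket_match are hypothetical. The replacement strings are single-quoted so that the backslashes reach the substitution unchanged.

```ruby
# Hypothetical helper: reorders two words using the first two groups.
def swap_names(full_name)
  full_name.sub(/(\w+) (\w+)/, '\2, \1')
end

# Hypothetical helper: wraps the entire match (\0) in brackets.
def bracket_match(text)
  text.sub(/\d+/, '[\0]')
end

puts swap_names("Fred Smith")            # prints Smith, Fred
puts bracket_match("room 101 is free")   # prints room [101] is free
```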

Extensions

In common with Perl and Python, Ruby regular expressions offer some extensions over traditional Unix regular expressions. All the extensions are entered between the characters (? and ). The parentheses that bracket these extensions are groups, but they do not generate backreferences: they do not set the values of \1 and $1 etc.

(?# comment)
Inserts a comment into the pattern. The content is ignored during pattern matching.

(?:re)
Makes re into a group without generating backreferences. This is often useful when you need to group a set of constructs but don't want the group to set the value of $1 or whatever. In the example that follows, both patterns match a date with either slashes or colons between the month, day, and year. The first form stores the separator character in $2 and $4, while the second pattern doesn't store the separator in an external variable.

date = "12/25/01"
date =~ %r{(\d+)(/|:)(\d+)(/|:)(\d+)}
[$1,$2,$3,$4,$5] » ["12", "/", "25", "/", "01"]
date =~ %r{(\d+)(?:/|:)(\d+)(?:/|:)(\d+)}
[$1,$2,$3] » ["12", "25", "01"]

(?=re)
Matches re at this point, but does not consume it (also known charmingly as ``zero-width positive lookahead''). This lets you look forward for the context of a match without affecting $&. In this example, the scan method matches words followed by a comma, but the commas are not included in the result.

str = "red, white, and blue"
str.scan(/[a-z]+(?=,)/) » ["red", "white"]

(?!re)
Matches if re does not match at this point. Does not consume the match (zero-width negative lookahead). For example, /hot(?!dog)(\w+)/ matches any word that contains the letters ``hot'' that aren't followed by ``dog'', returning the end of the word in $1.

(?>re)
Nests an independent regular expression within the first regular expression. This expression is anchored at the current match position. If it consumes characters, these will no longer be available to the higher-level regular expression. This construct therefore inhibits backtracking, which can be a performance enhancement. For example, the pattern /a.*b.*a/ takes exponential time when matched against a string containing an ``a'' followed by a number of ``b''s, but with no trailing ``a.'' However, this can be avoided by using a nested regular expression /a(?>.*b).*a/. In this form, the nested expression consumes all the input string up to the last possible ``b'' character. When the check for a trailing ``a'' then fails, there is no need to backtrack, and the pattern match fails promptly.

require "benchmark"
include Benchmark
str = "a" + ("b" * 5000)
bm(8) do |test|
  test.report("Normal:") { str =~ /a.*b.*a/ }
  test.report("Nested:") { str =~ /a(?>.*b).*a/ }
end
produces:
              user     system      total        real
Normal:   0.420000   0.000000   0.420000 (  0.414843)
Nested:   0.000000   0.000000   0.000000 (  0.001205)

(?imx)
Turns on the corresponding ``i,'' ``m,'' or ``x'' option. If used inside a group, the effect is limited to that group.

(?-imx)
Turns off the ``i,'' ``m,'' or ``x'' option.

(?imx:re)
Turns on the ``i,'' ``m,'' or ``x'' option for re.

(?-imx:re)
Turns off the ``i,'' ``m,'' or ``x'' option for re.
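
Two of the extensions above have no example of their own, so here is a brief sketch of negative lookahead and of an option applied to part of a pattern. The method names after_hot and greeting? are hypothetical.

```ruby
# Hypothetical helper: returns the rest of the word after "hot", unless
# "hot" is followed by "dog", in which case there is no match.
def after_hot(word)
  $1 if word =~ /hot(?!dog)(\w+)/
end

# Hypothetical helper: only the first letter is matched without regard
# to case; the rest of the pattern stays case-sensitive.
def greeting?(word)
  !(word =~ /\A(?i:h)ello\z/).nil?
end

p after_hot("hothouse")   # prints "house"
p after_hot("hotdog")     # prints nil
p greeting?("Hello")      # prints true
p greeting?("HELLO")      # prints false
```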

Names

Ruby names are used to refer to constants, variables, methods, classes, and modules. The first character of a name helps Ruby to distinguish its intended use. Certain names, listed in Table 18.3 on page 210, are reserved words and should not be used as variable, method, class, or module names.

Table 18.3: Reserved words

__FILE__   and     def        end      in       or       self    unless
__LINE__   begin   defined?   ensure   module   redo     super   until
BEGIN      break   do         false    next     rescue   then    when
END        case    else       for      nil      retry    true    while
alias      class   elsif      if       not      return   undef   yield

In these descriptions, lowercase letter means the characters ``a'' through ``z'', as well as ``_'', the underscore. Uppercase letter means ``A'' through ``Z,'' and digit means ``0'' through ``9.'' Name characters means any combination of upper- and lowercase letters and digits.

A local variable name consists of a lowercase letter followed by name characters.

fred  anObject  _x  three_two_one

An instance variable name starts with an ``at'' sign (``@'') followed by an upper- or lowercase letter, optionally followed by name characters.

@name  @_  @Size

A class variable name starts with two ``at'' signs (``@@'') followed by an upper- or lowercase letter, optionally followed by name characters.

@@name  @@_  @@Size

A constant name starts with an uppercase letter followed by name characters. Class names and module names are constants, and follow the constant naming conventions. By convention, constant variables are normally spelled using uppercase letters and underscores throughout.

module Math
  PI = 3.1415926
end
class BigBlob
end

Global variables, and some special system variables, start with a dollar sign (``$'') followed by name characters. In addition, there is a set of two-character variable names in which the second character is a punctuation character. These predefined variables are listed starting on page 213. Finally, a global variable name can be formed using ``$-'' followed by any single character.

$params  $PROGRAM  $!  $_  $-a  $-.

Method names are described in the section beginning on page 225.

Variable/Method Ambiguity

When Ruby sees a name such as ``a'' in an expression, it needs to determine if it is a local variable reference or a call to a method with no parameters. To decide which is the case, Ruby uses a heuristic. As Ruby reads a source file, it keeps track of symbols that have been assigned to. It assumes that these symbols are variables. When it subsequently comes across a symbol that might be either a variable or a method call, it checks to see if it has seen a prior assignment to that symbol. If so, it treats the symbol as a variable; otherwise it treats it as a method call. As a somewhat pathological case of this, consider the following code fragment, submitted by Clemens Hintze.

def a
  print "Function 'a' called\n"
  99
end

for i in 1..2
  if i == 2
    print "a=", a, "\n"
  else
    a = 1
    print "a=", a, "\n"
  end
end
produces:
a=1
Function 'a' called
a=99

During the parse, Ruby sees the use of ``a'' in the first print statement and, as it hasn't yet seen any assignment to ``a,'' assumes that it is a method call. By the time it gets to the second print statement, though, it has seen an assignment, and so treats ``a'' as a variable.

Note that the assignment does not have to be executed---Ruby just has to have seen it. This program does not raise an error.

a = 1 if false; a
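
The same rule can be seen inside method bodies, where each body starts with no assignments seen. In this sketch all three names are hypothetical: counter is a method, and the two helpers differ only in whether an assignment to counter appears before it is read.

```ruby
# Hypothetical method, defined so that a bare "counter" can be a call.
def counter
  "method result"
end

# No assignment has been seen in this body, so "counter" is a method call.
def read_before_assignment
  counter
end

# The assignment is parsed but never executed, so "counter" is treated as
# a variable here, and that variable still holds nil.
def read_after_unexecuted_assignment
  counter = 1 if false
  counter
end

p read_before_assignment             # prints "method result"
p read_after_unexecuted_assignment   # prints nil
```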

Variables and Constants

Ruby variables and constants hold references to objects. Variables themselves do not have an intrinsic type. Instead, the type of a variable is defined solely by the messages to which the object referenced by the variable responds.[When we say that a variable is not typed, we mean that any given variable can at different times hold references to objects of many different types.]

A Ruby constant is also a reference to an object. Constants are created when they are first assigned to (normally in a class or module definition). Ruby, unlike less flexible languages, lets you alter the value of a constant, although this will generate a warning message.

MY_CONST = 1
MY_CONST = 2   # generates a warning
produces:
prog.rb:2: warning: already initialized constant MY_CONST

Note that although constants should not be changed, you can alter the internal states of the objects they reference.

MY_CONST = "Tim"
MY_CONST[0] = "J"   # alter string referenced by constant
MY_CONST » "Jim"

Assignment potentially aliases objects, giving the same object different names.
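
A short sketch of aliasing follows, with hypothetical method names alias_demo and copy_demo. The first assigns one variable to another, so both refer to one object; the second uses dup to create a separate object before modifying it.

```ruby
# Hypothetical helper: both variables name the same String object.
def alias_demo
  person1 = "Tim".dup
  person2 = person1              # no copy is made
  person2[0] = "J"               # the change is visible through person1
  [person1, person1.equal?(person2)]
end

# Hypothetical helper: dup creates a second object, so no aliasing occurs.
def copy_demo
  person1 = "Tim".dup
  person2 = person1.dup
  person2[0] = "J"
  [person1, person1.equal?(person2)]
end

p alias_demo   # prints ["Jim", true]
p copy_demo    # prints ["Tim", false]
```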

Scope of Constants and Variables

Constants defined within a class or module may be accessed unadorned anywhere within the class or module. Outside the class or module, they may be accessed using the scope operator, ``::'' prefixed by an expression that returns the appropriate class or module object. Constants defined outside any class or module may be accessed unadorned or by using the scope operator ``::'' with no prefix. Constants may not be defined in methods.

OUTER_CONST = 99
class Const
  def getConst
    CONST
  end
  CONST = OUTER_CONST + 1
end
Const.new.getConst » 100
Const::CONST » 100
::OUTER_CONST » 99

Global variables are available throughout a program. Every reference to a particular global name returns the same object. Referencing an uninitialized global variable returns nil.

Class variables are available throughout a class or module body. Class variables must be initialized before use. A class variable is shared among all instances of a class and is available within the class itself.

class Song
  @@count = 0
  def initialize
    @@count += 1
  end
  def Song.getCount
    @@count
  end
end
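
To show the sharing in action, here is a sketch built on the same shape as Song. The class Ticket and its method names are hypothetical; the instance method is added only to show that instances read the same variable as the class method.

```ruby
# Hypothetical class, shaped like Song above.
class Ticket
  @@issued = 0           # initialized before any use

  def initialize
    @@issued += 1        # every new instance updates the shared variable
  end

  def Ticket.issued      # class method reads the shared variable
    @@issued
  end

  def issued_so_far      # instance method sees the same value
    @@issued
  end
end

first = Ticket.new
Ticket.new
puts Ticket.issued         # prints 2
puts first.issued_so_far   # prints 2
```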

Class variables belong to the innermost enclosing class or module. Class variables used at the top level are defined in Object, and behave like global variables. Class variables defined within singleton methods belong to the receiver if the receiver is a class or a module; otherwise, they belong to the class of the receiver.

class Holder
  @@var = 99
  def Holder.var=(val)
    @@var = val
  end
end

a = Holder.new
def a.var
  @@var
end



Extracted from the book "Programming Ruby - The Pragmatic Programmer's Guide"
Copyright © 2001 by Addison Wesley Longman, Inc. This material may be distributed only subject to the terms and conditions set forth in the Open Publication License, v1.0 or later (the latest version is presently available at http://www.opencontent.org/openpub/).

Distribution of substantively modified versions of this document is prohibited without the explicit permission of the copyright holder.

Distribution of the work or derivative of the work in any standard (paper) book form is prohibited unless prior permission is obtained from the copyright holder.