TDParser Programmers Guide

TDParser is a Ruby component that helps us to construct a top-down parser using method calls. This document describes how to use TDParser in two styles. Both of styles are similar to one of JavaCC on the surface. However, one is a style in which we define rules of a grammar as methods (like shown in sample4.rb). The other is a style in which each rule is defined as if it is a property of a grammar (see also sample5.rb).

Defining Rules in Module

The following class is a parser class, and it accepts expressions that consists of digits and +.

class MyParser
  include TDParser

  def expr
    token(/\d+/) - token("+") - rule(:expr) >> proc{|x| x[0].to_i + x[2] } |
    token(/\d+/) >> proc{|x| x[0].to_i }
  end
end

In this class, the method expr represents the following production rule.

expr := int '+' expr
      | int

In addition, at the first line of the method expr, values accepted by token(/\d+/), token("+") and rule(:expr) are assigned to x[0], x[1] and x[2] respectively. After that, in order to parse 1 + 2, we first split it into an array of tokens like ["1", "+", "2"], and then call the parse method of a parser object, which is created by MyParser.new(), as follows.

parser = MyParser.new()
parser.expr.parse(["1", "+", "2"])

Note that we can pass one of the following objects to the parse method.

In that syntax, + is right-associative. However, we can’t write as follows.

def expr
  rule(:expr) - token("+") - token(/\d+/) >> proc{|x| x[0].to_i + x[2].to_i }
  token(/\d+/) >> proc{|x| x[0].to_i }
end

This problem is called left-recursion problem. So we have to use one of the following rules instead.

def expr
  token(/\d+/) - (token("+") - token(/\d+/))*0 >> proc{|x|
    x[1].inject(x[0]){|acc,y|
      case y[0]
      when "+"
        acc + y[1]
      end
    }
  }
end

def expr  # javacc style
  n = nil
  (token(/\d+/) >> proc{|x| n = x }) -
  (token("+") - rule(/\d+/) >> proc{|y|
    case y[0]
    when "+"
      n += y[1].to_i
    end
  })*0 >> proc{|x| n }
end

In the rules, (...)*N represents N or more rules (...). x[1] has multiple sequences of tokens accepted by (...)*0. For example, if ["1", "+","1","+","2"] is parsed by the rule: token(/\d+/) - (token("+") - token(/\d+/))*0, we obtain [["+", "1"], ["+", "2"]] by x[1].

Defining Rules using TDParser.define()

The rule defined in the first sample script, shown in the previous section, can also be defined as follows.

parser = TDParser.define{|g|
  g.expr =
    g.token(/\d+/) - g.token("+") - g.expr >> proc{|x| x[0].to_i + x[2] } |
    g.token(/\d+/) >> proc{|x| x[0].to_i }
}

(See also sample5.rb and sample6.rb)

Parser Combinators

StringTokenizer

There is a simple tokenizer called TDParser::StringTokenizer in the library tdparser/utils. (See MyParser#parse in sample2.rb)