|
When I said HelloCup I was looking at a yacc based parser in a
language that didn't require me to handle my dirty pointers. Another
alternative to play with is Ruby which now has a yaccish parser
built in to the standard library - inevitably called racc. Racc has an interesting interplay between ruby and grammar
syntax. You define the grammar with a racc file which will generate
a parser class. Again I'll do my simple hello world case. The input text is
item camera
item laser
I'll populate item objects inside a catalog, using the following
model classes.
class Item
attr_reader :name
def initialize name
@name = name
end
end
class Catalog
extend Forwardable
def initialize
@items = []
end
def_delegators :@items, :size, :<<, :[]
end
Forwardable is a handy library that allows me to
delegate methods to an instance variable. In this case I delegate a
bunch of methods to the @items list.
I test what I read with this.
class Tester < Test::Unit::TestCase
def testReadTwo
parser = ItemParser.new
parser.parse "item camera\nitem laser\n"
assert_equal 2, parser.result.size
assert_equal 'camera', parser.result[0].name
assert_equal 'laser', parser.result[1].name
end
def testReadBad
parser = ItemParser.new
parser.parse "xitem camera"
fail
rescue #expected
end
end
To build the file and run the tests I use a simple rake file.
# rakefile...
task :default => :test
file 'item.tab.rb' => 'item.y.rb' do
sh 'racc item.y.rb'
end
task :test => 'item.tab.rb' do
require 'rake/runtest'
Rake.run_tests 'test.rb'
end
The racc command needs to be installed on your
system. I did it the easy way on Ubuntu with
apt-get. It takes the input file and creates one named
inputFileName.tab.rb. The parser grammar class is a special format, but one that's
pretty familiar to yaccish people. For this simple example it looks
like this:
#file item.y.rb...
class ItemParser
token 'item' WORD
rule
catalog: item | item catalog;
item: 'item' WORD {@result << Item.new(val[1])};
end
The tokens clause declares the token's we get from the lexer. I
use the string 'item' and WORD as a
symbol. The rule clause starts the production rules which are in the
usual BNF form for yacc. As you might expect I can write actions
inside curlies. To refer to the elements of the rule I use the
val array, so val[1] is the equivalent to
$2 in yacc (ruby uses 0 based array indexes, but I've
forgiven it). Should I wish to return a value from the rule
(equivalent to yacc's $$) I assign
it to the variable result. The most complicated part of using racc is to sort out the lexer.
Racc expects to call a method that yields tokens, where each token is a
two-element array with the first element being the type of token
(matching the token declaration) and the second element the value
(what shows up in val - usually the text). You mark the
end of the token stream with [false, false]. The sample
code with racc uses regular expression matching on a string. A better
choice for most cases is to use StringScanner, which is
in the standard ruby library. I can use this scanner to convert a string into an array of tokens.
#file item.y.rb....
---- inner
def make_tokens str
require 'strscan'
result = []
scanner = StringScanner.new str
until scanner.empty?
case
when scanner.scan(/\s+/)
#ignore whitespace
when match = scanner.scan(/item/)
result << ['item', nil]
when match = scanner.scan(/\w+/)
result << [:WORD, match]
else
raise "can't recognize <#{scanner.peek(5)}>"
end
end
result << [false, false]
return result
endTo integrate the scanner into the parser, racc allows you to
place code into the generated parser class. You do this by adding code
to the grammar file. The declaration ---- inner marks the
code to go inside the generated class (you can also put code at the
head and foot of the generated file). I'm calling a parse
method in my test, so I need to implement that.
#file item.y.rb....
---- inner
attr_accessor :result
def parse(str)
@result = Catalog.new
@tokens = make_tokens str
do_parse
end
The do_parse method initiates the generated
parser. This will call next_token to get at the next
token, so we need to implement that method and include it in the
inner section.
#file item.y.rb....
---- inner
def next_token
@tokens.shift
end
This is enough to make racc work with the file. However as I play
with it I find the scanner more messy than I would like. I really
just want it to tell the lexer what patterns to match and what to
return with them. Something like this.
#file item.y.rb....
---- inner
def make_lexer aString
result = Lexer.new
result.ignore /\s+/
result.keyword 'item'
result.token /\w+/, :WORD
result.start aString
return result
end
To make this work I write my own lexer wrapper over the base
functionality provided by StringScanner. Here's the code to set up
the lexer and and handle the above configuration.
class Lexer...
require 'strscan'
def initialize
@rules = []
end
def ignore pattern
@rules << [pattern, :SKIP]
end
def token pattern, token
@rules << [pattern, token]
end
def keyword aString
@rules << [Regexp.new(aString), aString]
end
def start aString
@base = StringScanner.new aString
end
To perform the scan I need to use StringScanner to compare the
rules against the input stream.
class Lexer...
def next_token
return [false, false] if @base.empty?
t = get_token
return (:SKIP == t[0]) ? next_token : t
end
def get_token
@rules.each do |key, value|
m = @base.scan(key)
return [value, m] if m
end
raise "unexpected characters <#{@base.peek(5)}>"
end
I can then alter the code in the parser to call this lexer
instead.
#file item.y.rb....
---- inner
def parse(arg)
@result = Catalog.new
@lexer = make_lexer arg
do_parse
end
def next_token
@lexer.next_token
end As well as giving me a better way to define the rules, this also
allows the grammar to control the lexer because it's only grabbing
one token at a time - this would give me a mechanism to implement
lexical states later on. On the whole racc is pretty easy to set up and use - providing
you know yacc. The documentation is on the minimal side of
sketchy. There's a simple manual on the website and some sample
code. There's also a very helpful presentation on racc. I also
got a few tips from our Mingle team who've used it for a nifty customization language inside Mingle.
|