|
I've done a small amount of playing around with SableCC
recently. It took a bit of effort to get a "Hello World" style
parser going, so I thought I'd put some notes here as to what I did
to get it working. I'm not saying this is the best way to do it, but
it may be useful if you're looking to play with it. SableCC is a compiler-compiler tool for the Java environment. It
handles LALR(1) grammars (for those who remember their grammar
categories). In other words it's a bottom up parser (unlike JavaCC
and Antlr which are top-down). Like with most compiler-compiler tools, you define a grammar for
your language in a grammar file and run the compiler-compiler to
generate a parser for this language. Since this is a hello-world
example my language is a bare minimum one just to get the
compiler-compiler going. The language just is a list of items like
this:
item camera
item laser
Where 'item' is a keyword and the second word is a name. I want
the parser to turn this into an instance of a Configuration class
which contains a list of items.
public class Configuration {
private Map<String, Item> items = new HashMap<String, Item>();
public Item getItem(String key) {
return items.get(key);
}
public void addItem(Item arg) {
items.put(arg.getName(), arg);
}
public class Item {
private String name;
public Item(String name) {
this.name = name;
}
The input I showed earlier should be read in to pass the following
test.
@Test public void itemsAddedToItemList() {
Reader input = null;
try {
input = new FileReader("rules.txt");
} catch (FileNotFoundException e) {
throw new RuntimeException(e);
}
Configuration config = CatalogParser.parse(input);
assertNotNull(config.getItem("camera"));
assertNotNull(config.getItem("laser"));
}
Of course using a compiler-compiler for this is total overkill
for this kind of case, but the point of a hello-world example is to
get the environment working on a simple case so you can move
onto the interesting cases with the basics working. Whenever I play
with a new technology I always like to get something like this going
first, before I delve into the interesting aspects. Using a compiler-compiler makes the build process a bit more
involved. You have to first run the compiler-compiler on the
grammar file to generate the parser, then compile both your custom
code and the generated parser to create the overall program, then run
(and test it). So at this point I can't get away with just doing
everything inside my IntelliJ and actually have to put together an
ant file. It was a while since I'd put together an ant file and so
it took me a bit to re-remember how to use ant language. To run
SableCC I used the Java task:
<property name = "gendir" value = "gen"/>
<target name = "gen" >
<mkdir dir="${gendir}"/>
<java jar = "lib/sablecc.jar" fork = "true" failonerror="true">
<arg value = "-d"/>
<arg value = "${gendir}"/>
<arg value = "catalog.sable"/>
</java>
</target>
I generate the code into the gen directory, to keep
it separate from the code I write myself in the src and
test directories. I then compile it all together with a
javac task.
<property name = "builddir" value = "classes/production/sable"/>
<path id="classpath">
<fileset dir = "lib">
<include name = "*.jar"/>
</fileset>
<pathelement path = "${builddir}"/>
</path>
<target name = "compile" depends = "gen, copyDats">
<mkdir dir="${builddir}"/>
<javac destdir="${builddir}" classpathref="classpath">
<src path = "src"/>
<src path = "${gendir}"/>
<src path = "test"/>
</javac>
</target>
As well as compiling, I also have to move two data files for the
parser into the build directory. The data contains the tables used
for the parser and lexer. The build directory is nested in the way I
have it so it will work nicely with IntelliJ. (I really ought to
separate the tests into a separate output directory too, but I was
feeling lazy.)
<target name = "copyDats">
<mkdir dir="${builddir}"/>
<copy todir = "${builddir}">
<fileset dir = "${gendir}" includes = "**/lexer.dat"/>
<fileset dir = "${gendir}" includes = "**/parser.dat"/>
</copy>
</target>
I then have a test task to run the tests.
<target name = "test" depends="compile">
<junit haltonfailure = "on">
<formatter type="brief"/>
<classpath refid = "classpath"/>
<batchtest todir="${builddir}" >
<fileset dir = "test" includes = "**/*Tester.java"/>
</batchtest>
</junit>
</target>
To get the parser up and going I need to define the grammar for
my simple language using SableCC's grammar syntax.
Package catalogParser;
Tokens
itemdef = 'item';
string = ['a' .. 'z'] +;
blank = (' ' | 13 | 10)+;
Ignored Tokens
blank;
Productions
configuration =
item *
;
item =
itemdef string
;
Like most compiler-compilers, SableCC splits the work into a
lexer and a parser. The lexer reads in characters and chunks them into
tokens as defined by the Tokens section of the grammar
file. In this case it's really simple: strings are lower case letters,
the keyword item is its own token. I also define what is
blank whitespace and tell the lexer to throw it away in the
Ignores clause. The lexer will then feed a stream of itemdef and string tokens to
the parser. The parser uses two productions to deal with this. It
describes a configuration as multiple items, and each item as an
itemdef and a string (for its name). This defines the grammar for my input, but doesn't say how to get
from the input to my configuration and item objects. In order to do
this I need to write some code to map between what I've parsed and
the objects I want to create. In most compiler-compilers I do this
by embedding actions into the grammar. SableCC, however, works
another way. It automatically creates a parse tree and then gives me
a visitor to walk this parse tree. I can then subclass the visitor
to do interesting things. In this case, as I walk the parse tree, I
take each item node on the parse tree and turn it into the real
items in my model.
public class CatalogParser extends DepthFirstAdapter {
private Configuration result;
public void outAItem(AItem itemNode) {
System.out.println("found item");
result.addItem(new Item(itemNode.getString().toString().trim()));
}
The parse tree uses naming conventions to bind the grammar to the
objects created in the tree. So the grammar creates nodes called
AItem to match the item production in the grammar. The method
outAItem is called as the visitor leaves the item node and allows me
to access whatever is on the item, in this case the underlying
string token. I can then create the item in my model using that
string as the name. The last bit of code is that to run the parser on the file, which
I do by making the catalog parser a command object.
public static Configuration parse(Reader input) {
Configuration result = new Configuration();
new CatalogParser(input, result).run();
return result;
}
public CatalogParser(Reader input, Configuration result) {
this.input = input;
this.result = result;
}
public void run() {
try {
createParser(input).parse().apply(this);
} catch (Exception e) {
throw new RuntimeException(e);
}
}
private Parser createParser(Reader input) {
return new Parser(new Lexer(new PushbackReader(input)));
}
So that's the basics of getting it going. So far I haven't
started to delve further, so my thoughts about SableCC are somewhat
preliminary, but then the whole point of this blog is to write
half-formed thoughts. SableCC is a bit awkward to use. There's little documentation,
other than the author's thesis. Fortunately the thesis is much more
understandable than many others I've come across so I was able to
figure out how to get things going. During my work I made a mistake
in the grammar and found it tricky to get diagnostics. The error
messages weren't too informative and I resorted to debuggers and
print statements inside the generated parser code. Fortunately the
problem was in the tokenizing, so I realized my boo-boo by looking
at the output from the lexer. LALR parsers are notoriously hard to
follow, so I'm glad I didn't have to delve in there. Antlr scores
rather better on this front. Recursive descent parsers are easier to
follow and there is an Antlr book in the works which should help me
a lot as I explore that. So far I'm not convinced by the approach of removing parser
actions and automatically generating a parse tree. Since it's a
parse tree, you have to walk it to do anything useful. Nat Pryce let
me know about tree transformation rules in the latest SableCC, which
look more useful since it defines an abstract syntax tree rather
than a parse tree. You still have to walk it to create objects in
the domain model, but it's easier to walk. (The latest version of
Antlr has similar features.) One plus in tree walking is that
if I'm making changes to the tree walker I don't need to re-generate
- which keeps me in IntelliJ. However Antlr, has AntlrWorks which
plugs into IntelliJ and looks very nice.
|