Tools bliki


Android, BelkinKvmLinux, DebianJava, HelloAntlr, HelloCup, HelloSablecc, HotRod, InstallingDebian, IntelliCsharp, JRake, JRubyVelocity, KeyringLaptop, Knoppix, MercurialSquashCommit, PervasiveVersioning, PostIntelliJ, Subversion, VcsSurvey, VersionControlTools


VcsSurvey tools 8 March 2010

When I discussed VersionControlTools I said that it was an unscientific agglomeration of opinion. As I was doing it I realized that I could add some spurious but mesmerizing numbers to my analysis by doing a survey. Google's spreadsheet makes the mechanics of conducting a survey really simple, so I couldn't resist.

I conducted the survey from February 23 2010 until March 3 2010 on the ThoughtWorks software development mailing list. I got 99 replies. In the survey I asked everyone to rate a number of version control tools using the following options:

  • Best in Class: Either the best VCS or equal best
  • OK: Not the best, but you're OK with it.
  • Problematic: You would argue that the team really ought to be using something else
  • Dangerous: This tool is really bad and ThoughtWorks should press hard to have it changed
  • No opinion: You haven't used it

The results were this:

Tool        Best    OK  Problematic  Dangerous  No Opinion  Active Responses  Approval %
Subversion    20    72            6          1           0                99         93%
git           65    19            1          0          14                85         99%
Mercurial     33    27            2          0          36                62         97%
ClearCase      0     3           14         41          41                58          5%
TFS            0     0           32         22          45                54          0%
CVS            0    14           59         11          15                84         17%
Bazaar         1    13            3          0          80                17         82%
Perforce       1    26           16          1          54                44         61%
VSS            1     1           11         64          22                77          3%

As well as the raw summary values, I've added two calculated columns here to help summarize the results.

  • Active Responses: The total of responses excluding "No Opinion". (eg for git: 65 + 19 + 1 + 0)
  • Approval %: The sum of best and ok responses divided by active responses, expressed as a percentage. (eg for git: (65 + 19) / 85)
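The arithmetic behind those two columns is simple enough to sketch in code. The following Java is only an illustration, not how the spreadsheet actually worked: the names ApprovalStats, activeResponses, and approvalPercent are made up for this example, and the sketch assumes percentages are rounded to the nearest whole number, which is consistent with the figures in the table.

```java
// Sketch only: class and method names are hypothetical, not part of the survey.
final class ApprovalStats {

    // Active responses: every answer except "No opinion".
    static int activeResponses(int best, int ok, int problematic, int dangerous) {
        return best + ok + problematic + dangerous;
    }

    // Approval %: (best + ok) / active responses, rounded to the nearest whole percent.
    // The rounding rule is an assumption that happens to match the published table.
    static long approvalPercent(int best, int ok, int problematic, int dangerous) {
        int active = activeResponses(best, ok, problematic, dangerous);
        if (active == 0) {
            throw new IllegalArgumentException("no active responses");
        }
        return Math.round(100.0 * (best + ok) / active);
    }

    public static void main(String[] args) {
        // git row: 65 best, 19 OK, 1 problematic, 0 dangerous
        System.out.println(activeResponses(65, 19, 1, 0));  // 85
        System.out.println(approvalPercent(65, 19, 1, 0));  // 99
    }
}
```

Feeding in the Subversion row (20, 72, 6, 1) gives 99 active responses and 93% approval in the same way.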

The graph shows a scatter plot of approval percentage and active responses. As you can see there's a clear cluster around Subversion, git, and Mercurial with high approval and a large number of responses. It's also clear that there's a big divide in approval between those three, together with Bazaar and Perforce, versus the rest.

Although the graph captures the headline information well, there are a couple of other subtleties I should mention.

  • Although the trio of Subversion, git, and Mercurial cluster close together on approval, git does get a notably higher number of best scores (65 versus 20 and 33).
  • VSS got the most "dangerous" responses, but a couple of people approved of it.
  • Neither TFS nor ClearCase is liked much, but ClearCase got more "dangerous" responses than TFS (41 versus 22).
  • Don't read too much into small differences as I'm sure they aren't significant. I'm sure the difference in approval percentage between VSS, TFS, and ClearCase isn't significant, but the difference between these three and the leaders is.

Some caveats. This is a survey of opinion of ThoughtWorkers who follow our internal software development discussion list, nothing more. It's possible some of them may have been biased by my previous article (although unlikely, since I've never managed to get my ThoughtBot opinion-control software to work reliably). Opinions of tools are often colored by processes that are more about the organization than the tool itself. But despite these, I think it's an interesting data point.

I should also stress the important point to take away from this isn't the comparison between those close in the numbers, eg comparing git and Mercurial or comparing TFS and ClearCase. Any survey like this has a certain amount of noise in it, and I suspect the noise here is greater than such a difference. The important point is the big approval gap between the leading tools (Subversion, git, and Mercurial) and the laggards - essentially the point in VersionControlTools.


VersionControlTools tools 17 February 2010

If you spend time talking to software developers about tools, one of the biggest topics I hear about is version control tools. Once you've got to the point of using version control tools, and any competent developer does, then they become a big part of your life. Version control tools are not just important for maintaining a history of a project, they are also the foundation for a team to collaborate. So it's no surprise that I hear frequent complaints about poor version control tools. In our recent ThoughtWorks technology radar, we called out two items as version control tools that enterprises should be assessing for use: Subversion and Distributed Version Control Systems (DVCS). Here I want to expand on that, summarizing many discussions we've had internally about version control tools.

But first some pinches of salt. I wrote this piece based on an unscientific agglomeration of conversations with my colleagues inside ThoughtWorks and various friends and associates outside. I haven't engaged in any rigorous testing or structured comparisons, so like most of my writing this is based on AnecdotalEvidence. My personal experience in recent years is mostly subversion and mercurial, but my usage patterns are not typical of a software development team. Overwhelmingly my contacts like to work in an agile-xp approach (even if many sniff at the label) and need tools that support that work style. I expect many people to be annoyed by this article. I hope that annoyance will lead to good articles that I can link to.

(After writing this I did do a small VcsSurvey which didn't undermine my conclusions.)

Fundamentally there are three version control systems that get broad approval: subversion (svn), git, and mercurial (hg).

Behind the Recommendability Threshold

Many tools fail to pass the recommendability threshold. There are two reasons why: poor capability or poor visibility.

Many tools garner consistent complaints from ThoughtWorkers about their lack of capability. (ThoughtWorkers being what they are, all tools, including the preferred set, get some complaints. Those behind the threshold get mostly complaints.) Two in particular generate a lot of criticism: ClearCase (from IBM) and TFS (from Microsoft). One reason they get a lot of criticism is that they are very popular on client sites, often with company policies mandating their use (I'll describe a coping strategy for that at the end).

It's fair to say that often these problems are compounded by company policies around using VCS. I've heard of some truly bizarre work-flows imposed on teams that make it a constant hurdle to get anything done. Since the VCS is the tool that enforces these work-flows, it does tend to get tarred with that brush.

I'm not going to go into details here about the problems the poor-capability tools have; that would be another article. (This has probably made me even more unpopular in IBM and Microsoft as it is.) I will, at least for the moment, leave it with the fact that developers I respect have worked extensively with, and do not recommend, these products.

The second reason for shuffling a tool behind the recommendability threshold is that I don't hear many comments about some tools. This is an issue because less-popular tools make it difficult to find developers who know how to use them or want to find out. There are many reasons why otherwise good tools can fall behind there. I used to hear people say good things about Perforce, but now the feeling seems to be that it doesn't have compelling advantages over Subversion, let alone the DVCSs. Speaking of DVCSs, there are more than just the two I've highlighted here. Bazaar, in particular, is one I occasionally hear good things about, but again I hear about it much less often than git or Mercurial.

Before I finish with those behind the threshold, I just want to say a few things about a particularly awful tool: Visual Source Safe, or as I call it: Visual Source Shredder. We see this less often now, thank goodness, but if you are using it we'd strongly suggest you get off it. Now. Not only is it a pain to use, I've heard too many tales of repository corruption to trust it with anything more valuable than foo.txt.

So this leaves three tools that my contacts are generally happy with. I find it interesting that all three are open-source. Choosing between these tools involves first deciding between a centralized or distributed VCS model and then, if you choose DVCS, choosing between git and mercurial.

Distributed or Centralized

Most of the time, the choice between centralized and distributed rests on how skilled and disciplined the development team is. A distributed system opens up lots of flexibility in work-flow, but that flexibility can be dangerous if you don't have the maturity to use it well. Subversion encourages a simple central repository model, discouraging large scale branching. In an environment that's using Continuous Integration, which is how most of my friends like to work, that model fits reasonably well. As a result Subversion is a good choice for most environments.

And although DVCSs give you lots of flexibility in how you arrange your work-flows, most people I know still base their work patterns on the notion of a shared mainline repository that's used with Continuous Integration. Although modern VCS have almost magic tools to merge different people's changes, these merges are still just merging text. Continuous Integration is still necessary to get semantic consistency. So as a result even a team using DVCS usually still has the notion of the central master repository.

Subversion has three main downsides compared to its cooler distributed cousins.

Because distributed systems always give you a local disk copy of the whole repository, this means that repository operations are always fast as they don't involve network calls to central servers. This is a palpable difference if you are looking at logs, diffing to old revisions, and anything else that involves the full repository. If this is noticeable on my home network, it is a huge issue if your repository is on another continent - as we find with our distributed projects.

If you travel away from your network connection to the repository, a distributed system will still allow you to work with the repository. You can commit checkpoints of your work, browse history, and compare revisions on an airplane without a network connection.

The last downside is more of a social issue than a true tool issue. DVCS encourages quick branching for experimentation. You can do branches in Subversion, but the fact that they are visible to all discourages people from opening up a branch for experimental work. Similarly a DVCS encourages check-pointing of work: committing incomplete changes, that may not even compile or pass tests, to your local repository. Again you could do this on a developer branch in Subversion, but the fact that such branches are in the shared space makes people less likely to do so.

This last point also leads to the argument against a DVCS: that it encourages wanton branching, which feels good early on but can easily lead you to merge doom. In particular the FeatureBranch approach is a popular one that I don't encourage. As with similar comments earlier I must point out that reckless branching isn't something that's particular to one tool. I've often heard people in ClearCase environments complain of the same issue. But DVCSs encourage branching, and that's the major reason why I indicate that a team needs more skill to use a DVCS well.

There is one particular case where subversion is the better choice even for a team that is skilled at using a DVCS. This case is where the artifacts you're collaborating on are binary and cannot be merged by the VCS - for example Word documents or presentation decks. In this case you need to revert to pessimistic locking with single-writer checkouts - and that requires a centralized system.

Git or Mercurial

So if you're going to go the DVCS route - which one should you choose? Mercurial and git get most of the attention, so I feel the choice is between them. Then the choice boils down to power versus usability, with a dash of mind-share and the shadow of github.

Git certainly seems to be liked for its power. Folks go ga-ga over its near-magical ability to do textual merges automatically and correctly, even in the face of file renames. I haven't seen any objective tests comparing merge capabilities, but the subjective opinion favors git.

(Merge-through-rename, as my colleague Lucas Ward defines it, describes the following scenario. I rename Foo.cs to Bar.cs, Lucas makes some changes to Foo.cs. When we merge his changes are correctly applied to Bar.cs. Both git and Mercurial handle this.)

For many, git's biggest downside is its oft-cryptic commands and mental model. Ben Butler-Cole phrased it beautifully: "there is this amazingly powerful thing writhing around in there that will basically do everything I could possibly ask of it if only I knew how." To its detractors, git lacks discoverability - the ability to gradually infer what it does from its apparent design. Git's advocates say that much of this is because it uses a different mental model to other VCSs, so you have to do more to unlearn your knowledge of VCS to appreciate git. Whatever the reason, git seems to be attractive more to those who enjoy learning the internals while mercurial seems to appeal more to those who just want to do version control.

The shadow of github is important here. Even git-skeptics rate it as a superb place for project collaboration. Mercurial's equivalent, bitbucket, just doesn't inspire the same affection. However there are other sites that may begin to close the gap, in particular Google Code and Microsoft's Codeplex. (I find Codeplex's use of Mercurial very encouraging. Microsoft is often, rightly, criticized for not collaborating well with complementary open source products. Their use of Mercurial on their open-source hosting site is a very encouraging sign.)

Historically git worked poorly on Windows, poorly enough that we'd not suggest it. This has now changed, providing you run it using msysgit and not cygwin. Our view now is that msysgit is good enough to make comparison with Mercurial a non-issue for Windows.

People generally find that git handles branching better than Mercurial, particularly for short-lived branches for experimentation and check-pointing. Mercurial encourages other mechanisms, such as fast cloning of separate repository directories and queue patching, but git's branching is a simpler and better model.

Mercurial does seem to have an issue with large binary files. My general suggestion is that such things are usually better managed with subversion, but if you have too few of them to warrant separate management, then Mercurial may get hung up by the few that you have.

Multiple VCS

There's often value to using more than one VCS at the same time. This is generally where there is a wider need to use a less capable VCS than your team wants to use.

The case that we run into frequently is where there is a corporate standard for a deficient VCS (usually ClearCase) but we wish to work efficiently. In that case we've had success using a different VCS for day-to-day team work and committing to the corporate VCS when necessary. So while the team VCS will see several commits per person per day, the corporate VCS sees a commit every week or two. Often that's what the corporate admins prefer in any case. Historically we've done this using svn as the local VCS but in the future we're more likely to use a DVCS for local fronting.

This dual usage scenario is also common with git-svn where people use git locally but commit to a shared svn system. Git-svn is another reason for preferring git over mercurial. Using a local DVCS is particularly valuable for remote site working, where network outages and bandwidth problems can cripple a site that's remote from a centralized VCS.

A lot of teams can benefit from this dual-VCS working style, particularly if there's a lot of corporate ceremony enforced by their corporate VCS. Using dual-VCS can often make both the local development team happier and the corporate controllers happier as their motivations for VCS are often different.

One Final Note

Remember that although I've jabbered on a lot about tools here, often it's the practices and workflows that make a bigger difference. Tools can certainly make it much easier to use a good set of practices, but in the end it's up to the people to use an effective way of working for their environment. I like to see approaches that allow many small changes that are rapidly integrated using Continuous Integration. I'd rather use a poor tool with CI than a good tool without.


MercurialSquashCommit tools 9 July 2009

I've recently had a bit of a fiddle squashing some commits with Mercurial, so thought it was worth a post in case anyone else is looking to do this. I don't know whether this is the best procedure, but it seemed to work pretty well for me.

hg clone base working
# tip of base is revision 73
cd working
# do work, committing on the way
cd ..
hg clone working squash
cd squash
hg qimport -r 74:tip
hg qgoto 74.diff
hg qfold $(hg qunapp)
hg qrefresh -m "reorganized files"
hg qfinish -a
cd ../base
hg pull ../squash

The basic task I was doing was some fairly severe moving around of files and folders. I wanted to do this in several steps to checkpoint my work as I went, but I wanted a single commit in the version history. (I gather git does this more easily with rebase.) Making a single commit makes it easier to understand what happened - particularly since moving files tends to complicate looking at repository logs. Moving files also complicates the process - a couple of times I ended up with a procedure that didn't work because it lost the ability to track the moves - I want to be able to go hg log -f and see when and what the original commits were before the move.

To begin I needed to enable the mq extension (mercurial queues) and set my diffs to git style. Git style diffs help to track file moves properly.

# in ~/.hgrc
[extensions]
mq=

[diff]
git=true 

When using Mercurial in this way, it seems the general way of working is to have multiple repositories. Mercurial encourages different repositories where other systems, eg git or svn, would use different branches. People argue about this, but it's the Mercurial way of working. For this example I had 'base' as my original repos.

My first step was to clone base into a working repos.

hg clone base working

At this point the tip of base (and working) was revision 73. I did the file moves, with several checkpoint revisions as I went.

cd working
hg mv foo1 newdir/foo1
.. more hg mv ..
hg ci -m "moving around"
.. more hg mv ..
hg ci -m "moving around"
.. more hg mv and hg ci..
cd ..

By the time I was done the last revision was 80.

To squash them down into a single commit I cloned another repos.

hg clone working squash

It's important to clone at this point because I was about to edit history, so wanted to keep the original history handy until I knew it had worked. I now moved into there.

cd squash

Now I turned all the commits I'd done for the revisions into patches for the mercurial patch queue mechanism.

hg qimport -r 74:tip

I made the first change the current patch

hg qgoto 74.diff

I squashed all the patches together into a single patch

hg qfold $(hg qunapp)

The commit message for this folded patch would be all the individual commit messages linked together. I wanted a single message for my clean commit.

hg qrefresh -m "reorganized files"

I then turned the patch into a regular commit.

hg qfinish -a

I now had a single commit with all that work. I looked through it to see that everything was sane, in particular testing hg log -f on some moved files to ensure the history was still there. Once I was convinced all was well, I pulled the single changeset into the base repos.

 
cd ../base
hg pull ../squash
  

It's interesting to see how the attention on version control systems has changed over the years. Early on the primary and only purpose was audit - to be able to safely go back to an older revision - mainly to diagnose problems. Then attention switched to how they enabled collaboration between people. This didn't replace the need for audit, but built on top of it. Now there's more attention to using them to provide a narrative of how a code base changes - hence the desire for history rewriting commands like this. Again this need is built on top of the other two, but introduces new capabilities and new tensions.

My thanks to my colleague Chris Turner for his help and I also found this page very useful.


Android tools 6 July 2009

One of the side benefits of speaking at the Google IO conference last month was that I got a new phone - the HTC Magic android phone that Google gave to all attendees. I was actually in the market for changing my phone to something like this, so it came at a good time. Here are my impressions after carrying it around for a month or so.

My previous phone was a Nokia E61. I liked the E61 as a phone, but found its web browser to be slow and unreliable, and that, together with the relatively small screen, was beginning to bug me - hence the desire for something else. Naturally I considered an iPhone, but although the company phone plan that I use is AT&T, it isn't possible to use the iPhone on it and I didn't fancy the hassle of sorting out a new phone plan. I tried a Blackberry Storm for a few days, but (how about this for irony) the email was no good for me. Blackberries copy every email that comes into the email account, so it doesn't work well for an IMAP account with server-side filters - which is how I use my gmail account.

The short statement is that I do like the HTC Magic android phone.

The Good

  • Physically the device works very well for me. It's small, light, and fits well in my hand. The screen is bright, making web browsing much nicer than with the Nokia.
  • Battery life seems reasonable, a day or two with my usual usage.
  • The app market seems to have a fair few useful things, I've downloaded a bunch of little apps which have seemed handy.
  • Video play works well. I've watched some TED videos and transcoded some other video using Handbrake which played well on the screen.
  • I like that I can upgrade the memory using a micro-SD card. It came with 2GB, and I'm upgrading to 8GB since it's pretty cheap.
  • I use gmail and Google calendar and the phone syncs nicely with those.
  • The phone charges via a mini-USB connection. One less charger to have to carry around.
  • I read one of the Prags' books using an ebook reader and it worked pretty well.

The Bad

  • My biggest irritant so far is that it makes it hard to browse local HTML pages. It doesn't support file:// URLs. This is a big issue for me as I often copy static HTML files to my phone for reference purposes. There is a work-around, but it's kludgy.
  • Like every other calendar app on the planet, Google calendar suffers from TimeZoneUncertainty. This is a big issue with a phone that you want to change time zone as you travel.
  • I miss the Nokia's keyboard when typing. The soft keyboard just doesn't work as well.
  • While the touch navigation works pretty well, I'm sure I'd prefer the iPhone's multi-touch gestures.

The Uncertain

  • I haven't tried writing an app for it. I'd like to experiment, but I'm not allowing myself any such fun until I get the book finished.

HelloCup tools 13 May 2007

As I explore parser generator tools for external DomainSpecificLanguages, I've said HelloAntlr and HelloSablecc. If you spend much time looking at parser generators you can't really avoid looking at the old stalwarts lex and yacc (or their gnu counterparts flex and bison). I want to explore the way lex and yacc operate, but my C has got too rusty. As Erich Gamma quipped, I've got too lazy to take out my own garbage. Fortunately there is an implementation of a yaccish system for Java, which is just what I need.

The Java implementation, like the classic lex and yacc, is two independent tools: JFlex and CUP. Although they are developed separately they do provide hooks to work together.

As with my previous posts along these lines, this is an overly simple example just to get the tools working. I take an input file which says:

item camera
item laser

and turn them into item objects inside a configuration using the following model:

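// Configuration.java (package model, since the parser imports model.*)
package model;
import java.util.HashMap;
import java.util.Map;

public class Configuration {
  private Map<String, Item> items = new HashMap<String, Item>();
  public Item getItem(String key) {
    return items.get(key);
  }
  public void addItem(Item arg) {
    items.put(arg.getName(), arg);
  }
}

// Item.java
package model;

public class Item {
  private String name;
  public Item(String name) {
    this.name = name;
  }
  public String getName() { return name; }  // used as the map key in addItem
}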

to pass the following test

    @Test public void itemsAddedToItemList() {
      Reader input = null;
      try {
        input = new FileReader("rules.txt");
      } catch (FileNotFoundException e) {
        throw new RuntimeException(e);
      }
      Configuration config = CatalogParser.parse(input);
      assertNotNull(config.getItem("camera"));
      assertNotNull(config.getItem("laser"));
    }

The first issue is just to get the build going. As with my previous examples I want to take the grammar input files and generate the lexer and parser into a gen directory. Unlike my previous examples I don't do this directly in ant, instead I'm using ant to call a ruby script.

--- build.xml
 <target name = "gen" >
    <exec executable="ruby" failonerror="true">
      <arg line = "gen.rb"/>
    </exec>
  </target>

--- gen.rb
require 'fileutils'
include FileUtils

system "java -cp lib/JFlex.jar JFlex.Main -d gen/parser src/parser/catalog.l"

system "java -jar lib/java-cup-11a.jar src/parser/catalog.y"
%w[parser.java sym.java].each {|f| mv f, 'gen/parser'} 

Yes, I know it's a long way around, but with a lot of source files I'm using the approach in FlexibleAntlrGeneration to do my generation and I can't be bothered to sort it out in ant as well.

(When I attended CITCON recently, I was surprised to find out that people were much happier with ant than I thought. Grumpy me thinks it's a case of Stockholm Syndrome. Even when less grumpy I'm keeping my eye on things like Raven and BuildR which has now got some documentation. I'm so ready to ditch ant.)

You'll notice that CUP puts its output files in the current directory and I couldn't see how to override that behavior. So I generated them and moved them with a separate command.

Once I generate the code I then compile and test it with ant.

<target name = "compile" depends = "gen">
    <mkdir dir="${dir.build}"/>
    <javac destdir="${dir.build}" classpathref="path.compile">
      <src path = "${dir.src}"/>
      <src path = "${dir.gen}"/>
      <src path = "${dir.test}"/>
    </javac>
  </target>

  <target name = "test" depends="compile">
     <junit haltonfailure = "on" printsummary="on">
      <formatter type="brief"/>
      <classpath refid = "path.compile"/>
      <batchtest todir="${dir.build}" >
        <fileset dir = "test" includes = "**/*Test.java"/>
      </batchtest>
     </junit>
   </target>

Lex and yacc separate the lexer and parser into different files. Each is generated independently and combined during compilation. I'll start with the lexer file (catalog.l). The opening declares the output file's package and imports.

package parser;
import java_cup.runtime.*;

JFlex uses %% markers to break the file into sections. The second section consists of various declarations. The first bit names the output class and tells it to interface with CUP.

%%
%class Lexer
%cup

The next bit is code to be folded into the lexer. Here I define a function to create Symbol objects - again to hook into CUP.

%{
  private Symbol symbol(int type) {
    return new Symbol(type, yytext());
  }
%}

The Symbol class is defined in CUP and is part of its runtime jar. There are various constructors taking various information about the symbol and where it is.

Next up are some macros to define words and whitespace.

Word = [:jletter:]*
WS = [ \t\r\n]

The final section is the actual lexer rules. I define one to return the item keyword and the other to return simple words to the parser.

%%
"item"      {return symbol(sym.K_ITEM);}
{Word}      {return symbol(sym.WORD);}
{WS}        {/* ignore */}
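To make the effect of these three rules concrete, here is a hand-written Java stand-in for what a lexer built from them does. It is a sketch, not JFlex's output: LexerSketch, Token, and Kind are hypothetical names, and the real lexer returns CUP Symbol objects rather than this Token class. It relies on JFlex defining [:jletter:] as the characters accepted by Character.isJavaIdentifierStart, and on JFlex preferring the longest match and, on a tie, the rule listed first.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written stand-in for the generated lexer; all names here are hypothetical.
final class LexerSketch {
    enum Kind { K_ITEM, WORD }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + "(" + text + ")"; }
    }

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                i++;  // {WS}: ignore
            } else if (Character.isJavaIdentifierStart(c)) {
                int start = i;
                // take the longest run of letters, as a longest-match lexer would
                while (i < input.length() && Character.isJavaIdentifierStart(input.charAt(i))) {
                    i++;
                }
                String text = input.substring(start, i);
                // "item" is listed before {Word}, so it wins when both match the same text
                tokens.add(new Token(text.equals("item") ? Kind.K_ITEM : Kind.WORD, text));
            } else {
                throw new IllegalArgumentException("unexpected character '" + c + "' at " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [K_ITEM(item), WORD(camera), K_ITEM(item), WORD(laser)]
        System.out.println(tokenize("item camera\nitem laser"));
    }
}
```

Note that "items" comes out as a single WORD rather than K_ITEM followed by a stray letter, because the longer match takes priority over rule order.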

So the lexer will send a stream of K_ITEM and WORD tokens to the parser. I define the parser in catalog.y. Again it starts with package and import declarations.

package parser;
import java_cup.runtime.*;
import model.*;

I'm parsing the data into a configuration object, so I need to declare a place to put that result. Again this code is copied directly into the parser object.

parser code {: Configuration result = new Configuration(); :}

With CUP I need to define all the rule elements that I'll use in the productions.

terminal K_ITEM;
terminal String WORD;
non terminal  catalog, item;

The terminals are the tokens I get from the lexer, the non terminals are the rules I'll build up myself. If I want to get a payload from the token, I need to declare its type, so WORD is a string.

The catalog is a list of items. Unlike with antlr or sablecc I don't have EBNF here, so I can't say item*, instead I need a recursive rule.

catalog ::= item | item catalog;
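The shape of that recursive rule may be easier to see written out by hand. The Java below is an illustration only: CUP generates a table-driven LALR parser, not recursive methods like these, and CatalogRecursionSketch and its methods are hypothetical names. The sketch splits the input on whitespace instead of using the lexer, and collects item names in a list in place of the Configuration object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: CUP does not generate code of this form.
final class CatalogRecursionSketch {

    // catalog ::= item | item catalog
    static int catalog(List<String> tokens, int pos, List<String> names) {
        pos = item(tokens, pos, names);          // both alternatives start with an item
        if (pos < tokens.size()) {
            pos = catalog(tokens, pos, names);   // second alternative: item catalog
        }
        return pos;
    }

    // item ::= K_ITEM WORD
    static int item(List<String> tokens, int pos, List<String> names) {
        if (pos + 1 >= tokens.size() || !tokens.get(pos).equals("item")) {
            throw new IllegalArgumentException("expected 'item <name>' at token " + pos);
        }
        names.add(tokens.get(pos + 1));          // stands in for an embedded action
        return pos + 2;
    }

    static List<String> parse(String input) {
        List<String> tokens = Arrays.asList(input.trim().split("\\s+"));
        List<String> names = new ArrayList<>();
        catalog(tokens, 0, names);
        return names;
    }

    public static void main(String[] args) {
        System.out.println(parse("item camera\nitem laser"));  // [camera, laser]
    }
}
```

Because the rule has no empty alternative, a catalog must contain at least one item; the sketch rejects empty input for the same reason.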

The item rule itself contains the embedded action to put the item into the configuration.

item ::= K_ITEM WORD:w {: parser.result.addItem(new Item(w)); :}
          ;

A little wrinkle to note here is that the actions are put into a separate class to the parser object, so to get to the result field I defined earlier I have to use the parser field of the actions object. I should also mention that once I go much further with this I start to use an EmbedmentHelper to keep action code simple.

People who have used yacc before might notice that I can label the elements of the rule to refer to them in the actions instead of the $1, $2 convention used in yacc. Similarly if the rule returns a value CUP uses RESULT rather than $$.

My memories of lex and yacc are dim, but these tools do seem to mimic the style of using them pretty well. My biggest beef so far is the error handling, which caused me much more fuss than antlr. My feeling so far is that if you're new to parser generators then antlr is the better choice (particularly due to its book). However if you're familiar with lex and yacc then these two are similar enough to build off that knowledge.


HelloAntlr tools 7 March 2007

After saying HelloSablecc I also wanted to try out Antlr, which is another compiler-compiler for the Java space. As with that entry, this is just about getting Antlr going with a very simple "hello world" style grammar.

Like SableCC, Antlr is a compiler-compiler tool. It's been around for a while, and I've run into a few projects that use it. Unlike SableCC (and the venerable lex/yacc combo) it generates a recursive descent parser using LL grammars. Compiler heads like to argue about whether LL or LALR is better; I'll not step into that debate here.

My simple case is to parse a file of a list of items like this:

item camera
item laser

Each line has the 'item' keyword followed by a single word for the name of an item. I shall load each item object into a configuration object that keeps them all together.

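// Configuration.java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

public class Configuration {
  private Map<String, Item> items = new HashMap<String, Item>();
  public Item getItem(String key) {
    return items.get(key);
  }
  public void addItem(Item arg) {
    items.put(arg.getName(), arg);
  }
  public Collection<Item> getItems() { return items.values(); }  // inferred from the test's getItems().size()
}

// Item.java
public class Item {
  private String name;
  public Item(String name) {
    this.name = name;
  }
  public String getName() { return name; }  // used as the map key in addItem
}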

Here's a test for that, using the file I showed above.

 @Test public void readTwoItems() {
    Reader input = null;
    try {
      input = new FileReader("catalog.txt");
    } catch (FileNotFoundException e) {
      throw new RuntimeException(e);
    }
    Configuration config = ParserCommand.parse(input);
    assertNotNull(config.getItem("camera"));
    assertNotNull(config.getItem("laser"));
    assertEquals(2, config.getItems().size());
  }

As before - using a compiler-compiler for this problem is silly, but then so is printing "hello world" on a console. For the same reason that I always write "hello world" with a new environment, I like to write something dirt simple just to make sure I can get things working at all before I start doing anything real with it.
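To make the "silly" point concrete, this is roughly what a hand-written reader for the same format might look like with no parser generator at all. It is only a sketch for comparison and not part of the original example: the class name HandWrittenCatalog and its load method are made up, and it returns plain item names rather than populating the Configuration and Item classes above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written reader for the "item <name>" format.
// Not part of the Antlr example; shown only as a baseline for comparison.
final class HandWrittenCatalog {

  // Returns the item names in the order they appear in the input.
  static List<String> load(String text) {
    List<String> names = new ArrayList<String>();
    for (String line : text.split("\n")) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) continue;               // ignore blank lines
      String[] words = trimmed.split("\\s+");
      boolean wellFormed = words.length == 2
          && words[0].equals("item")
          && words[1].matches("[a-zA-Z]+");          // letters only, like STRING
      if (!wellFormed) {
        throw new IllegalArgumentException("bad line: " + trimmed);
      }
      names.add(words[1]);
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(load("item camera\nitem laser\n"));  // [camera, laser]
  }
}
```

Unlike the grammar, this version is line-oriented, so it would reject two items written on one line. That kind of quiet difference is one reason to reach for a grammar once the language grows.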

One hassle with using a compiler-compiler like this is that it makes the build process more complicated. I have to run antlr on the grammar file to create java classes for the parser, then include them in the compilation. So it's time to fight with ant again - here's the ant target:

  <property name = "dir.parser" value = "${dir.gen}/parser"/>
  <path id = "path.antlr">
    <fileset dir = "${dir.lib}">
      <include name = "antlr*.jar"/>
      <include name = "stringtemplate*.jar"/>
    </fileset>
  </path>
  <target name = "gen" >
    <mkdir dir="${dir.parser}"/>
    <java classname="org.antlr.Tool" classpathref="path.antlr" fork = "true" failonerror="true">
      <arg value="-o"/>
      <arg value="${dir.parser}"/>
      <arg value="Catalog.g"/>
     </java>
  </target>

This generates code into the gen directory. This way generated code is separate from source code I write myself. Another target does the compilation:

 <property name = "dir.build" value = "classes/production/antlrLair"/> 
 <target name = "compile" depends = "gen">
    <mkdir dir="${dir.build}"/>
    <javac destdir="${dir.build}" classpathref="classpath">
      <src path = "src"/>
      <src path = "${dir.gen}"/>
      <src path = "test"/>
    </javac>
  </target>

I can then run the tests with a final target.

<target name = "test" depends="compile">
    <junit haltonfailure = "on">
      <formatter type="brief"/>
      <classpath refid = "classpath"/>
      <batchtest todir="${dir.build}" >
        <fileset dir = "test" includes = "**/*Test.java"/>
      </batchtest>
     </junit>
   </target>

Antlr works with a grammar file Catalog.g. The grammar file defines the productions in the grammar and also actions that the parser takes when it encounters productions. The grammar file also defines the lexer (you can split them if you want). In this sense Antlr is more traditional (and flexible) than SableCC. SableCC doesn't allow actions; instead you generate a parse tree or AST and walk that with java. Antlr allows arbitrary actions, or it supports building a tree in the same manner as SableCC. (Antlr also uses a grammar file to walk the tree.) Since I'm building up a simple domain model of items and a configuration I'll forgo the tree building and do all the work in my actions.

I'll go through this file in chunks, with descriptions as I go. I start with a grammar heading:

grammar Catalog;

Antlr supports a number of points at which you can inject code into the generated parser (instead of generating a superclass, which is what SableCC does). I put package declarations and imports into the header.

@header{
package parser;
import model.*;
}
@lexer::header {
package parser;
}

The next code injection is to put code into the body of the generated class. Essentially this adds members to the class, hence the name of the command.

@members {
  public Configuration result = new Configuration();
}

Now I can get into the productions of the grammar. I'll do this top down, since it's a top-down parser. I begin by saying that the catalog consists of multiple item clauses followed by the end of the file.

catalog :  item* EOF;

Next I define the item clause as the literal string 'item' followed by a string.

item 	: 'item'  name=STRING 
   {result.addItem(new Item ($name.text));};

Here I also put in the action, which is to create a new item in the model with the name set to the value of the string. The code inside the curlies is java code which is added to the parser after that term is recognized. I can name elements in the production which I then refer to in the action. Here I've given the string the name 'name', which makes sense in context even though it makes for an awkward write-up.

The last productions define the lexer elements for string and whitespace.

STRING 	: ('a'..'z' | 'A'..'Z')+ ;
WS : (' ' |'\t' | '\r' | '\n')+ {skip();} ;

The action of whitespace is to skip (ignore) it.
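As a rough illustration of what those two lexer rules mean, here is a hand-written tokenizer that follows the same logic. It is a sketch and not the code Antlr generates: the class name CatalogLexerSketch, the tokenize method and the string form of the tokens are all made up for this example, and it throws on an unexpected character as a simplification.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration of the STRING and WS rules.
// Hypothetical names; this is not Antlr's generated lexer.
final class CatalogLexerSketch {

  static List<String> tokenize(String input) {
    List<String> tokens = new ArrayList<String>();
    int i = 0;
    while (i < input.length()) {
      char c = input.charAt(i);
      if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
        i++;                                   // WS: matched, then skipped
      } else if (isLetter(c)) {
        int start = i;
        while (i < input.length() && isLetter(input.charAt(i))) i++;
        String text = input.substring(start, i);
        // The literal 'item' in the parser rule acts as a keyword token;
        // any other run of letters is a STRING.
        tokens.add(text.equals("item") ? "'item'" : "STRING(" + text + ")");
      } else {
        throw new IllegalArgumentException(
            "unexpected character '" + c + "' at " + i);
      }
    }
    return tokens;
  }

  // Mirrors ('a'..'z' | 'A'..'Z'): ASCII letters only.
  private static boolean isLetter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
  }

  public static void main(String[] args) {
    // ['item', STRING(camera), 'item', STRING(laser)]
    System.out.println(tokenize("item camera\nitem laser\n"));
  }
}
```

Note that in this sketch the input xitem foo produces two STRING tokens and no 'item' keyword, because the whole run of letters is matched before the keyword check.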

There are a few things that make Antlr easier to work with than SableCC. Antlr has a nice IDE called AntlrWorks that can plug into IntelliJ. The tool will give you syntax highlighting and completion on grammar elements, plot syntax diagrams for your grammar, and allow you to enter test fragments to parse - displaying the resulting parse tree. It's a very helpful tool to see what the parser is doing. However there's no highlighting/completion for the code inside actions, which is an understandable pain.

Another good feature of Antlr is the fact that there is a decent book on it in the works. The book gives detailed coverage of how the tool works and useful background on language and compiler principles. It does assume you're working on a full blown language and that you'll be generating code - which isn't necessarily so for DSL work. However the detail it gives looks like it will be invaluable as I probe deeper.

Antlr's actions seem like an easier bet if you want to populate a model - I'm not sure how useful an intermediate parse tree or AST would be here. Again further investigation will give me a better feel. The more complex the language the more useful it is to have an intermediate tree representation. I like Antlr's flexibility in allowing you to do actions or tree building with transformations.

Inevitably I did have problems even with this simple example. My biggest blocker was that I originally defined the catalog term as catalog : item*;, that is without the EOF. I then got confused because the parser didn't indicate an error when it got spurious input (like xitem foo). This wasn't helped by inconsistencies between Antlr and AntlrWorks (the latter did show an error and older versions of AntlrWorks would handle whitespace differently too.)
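The missing EOF problem is easier to see in a hand-written recursive descent parser for the same grammar. The sketch below is my own illustration under simplifying assumptions, not Antlr's generated code: the class EofSketch and its methods are hypothetical, tokens come from a plain whitespace split rather than a real lexer, and a flag switches the end-of-input check on and off.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written recursive descent sketch of the catalog grammar.
// Hypothetical names throughout; Antlr's generated parser is more elaborate.
final class EofSketch {
  private final String[] tokens;
  private int pos = 0;
  final List<String> items = new ArrayList<String>();

  EofSketch(String input) {
    String trimmed = input.trim();
    // Simplification: split on whitespace instead of running a real lexer.
    tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
  }

  // catalog : item* EOF;   (the EOF check is optional to show the difference)
  void catalog(boolean requireEof) {
    while (pos < tokens.length && tokens[pos].equals("item")) {
      item();
    }
    if (requireEof && pos < tokens.length) {
      throw new IllegalStateException("unexpected token: " + tokens[pos]);
    }
  }

  // item : 'item' name=STRING {action};
  private void item() {
    pos++;                                     // consume 'item'
    if (pos >= tokens.length) {
      throw new IllegalStateException("missing item name");
    }
    items.add(tokens[pos++]);                  // the action: record the name
  }

  static int countItems(String input, boolean requireEof) {
    EofSketch parser = new EofSketch(input);
    parser.catalog(requireEof);
    return parser.items.size();
  }

  public static void main(String[] args) {
    System.out.println(countItems("xitem foo", false));  // 0, and no error
    try {
      countItems("xitem foo", true);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());      // unexpected token: xitem
    }
  }
}
```

Without the end-of-input check, item* simply matches zero items and the rule returns successfully, leaving xitem foo unread. With the check, the leftover input is reported.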

(Another big cause of trouble was getting ant and JUnit to work. I don't want to have to think about the amount of time I've spent over the years trying to diagnose classpath problems, especially with the infamous "Ant could not find the task or a class this task relies upon." message.)

