| EAA-dev Home |

WORK-IN-PROGRESS: - this material is still under development

Implementing an External DSL

Last significant update: 18 Aug 07

[TBD: Need to completely overhaul this narrative as the main patterns get established.]

Working with an external DSL puts you directly into the territory of traditional language tools, the kinds of tools often used for implementing compilers and interpretors for general purpose programming languages. The knowledge and tools of this community are very useful for DSL development, but there is a catch. The tools and writings of the programming language community almost always assume you are working with a general purpose language. DSLs are lucky to get a mention in passing. While many of the principles apply equally to general purpose and domain specific languages, there are differences. In addition you don't need to understand as much to work with DSLs, which essentially means you don't need to go all the way up the learning curve that you need to go for general purpose languages.


Delimiter Directed Translation

Faced with a simple DSL the most obvious approach to parsing it is Delimiter Directed Translation. Essentially you pick some character as a delimiter - usually the line-ending. You chop up the text of the DSL by that delimter and process each line. With each line you figure out what kind of line you have and break it up, usually with some simple tokenization such as separating on white space.

For simple cases where each line is independent of others then this style can work quite well. It breaks down, however, as you get more structure. Consider the case where we group the commands and events together

events
  doorClosed  D1CL
  drawOpened  D2OP
#  ...
end

commands
  unlockPanel PNUL
  lockPanel   PNLK
#  ...
end

It's easy to see here that we can break down the file into lines, but what happens when we process the line doorClosed D1CL, is this an event or is it a command. In order to tell we need to keep information about the state of where we are in the parse with a Context Variable. As a DSL gets more involved, keeping track of all these states gets to be very hard work.

Delimiter Directed Translation runs into these kinds of problems a lot even with only moderately complicated languages. Even a modestly complex parser is a lot of trouble, so we don't talk about it too much because the alternative is much easier for all but the trivial cases.


Syntax-Directed Translation

If you've read any book on programming languages, you'll have come across the notion of a grammar. A grammar is a way of defining the legal syntax of a programming language. So we can define the syntax for the event and command list above with productions like this:

list : eventList commandList;
eventList : 'event' eventDec* 'end';
eventDec  : identifier identifier;
commandList : 'command' commandDec 'end';
commandDec : identifier identifier; 

This tells us pretty clearly what the legal syntax for events is. As a bonus what we are looking is actually a DSL. Take this grammar and feed it into a suitable parser generator, and the parser generator will generate a parser that can test whether some input is legal syntax for our language.

This much is cool, but we can go much further. We can also combine other code with the grammar to allow us to other things too, including translating this simple list of events into an abstract representation. This Syntax Directed Translation allows us to build the entire parser mechanism for an external DSL around the grammar.

We get two benefits from this. The first benefit is that our generated parser is able to keep track of the context of the parse, at least to some extent - enough to tell whether it's in a command or an event. We don't have to write the code to deal with that ourselves, and I'm always in favor of less work.

The greater reason, however, is that becuase the grammar is a DSL, it gives us not just the code to parse our language, but also a compact and clear statement of the syntax of our language. If you want to figure out from the code of Delimiter Directed Translation what a legal syntax is, you'd be faced with a fair bit of work - or need some documentation (and hope it's kept up to date). The grammar is the documentation of the syntax of the DSL and since it's code it's always up to date.

(As we'll see the picture isn't quite as pretty as this. The grammar doesn't contain all of what we might like to think of as the syntax. But it does get most of it.)

The parts of Syntax Directed Translation

When you use Syntax Directed Translation you'll find the process is broken into a sequnce of activities.

You can usually identify and talk about these stages, but it's often not easy to actually see the separation in the code. Most parser generators do a clear job of separating the lexer and parser, but the semantic analysis and output production is often done in code that's embedded into the parser, and as a result there isn't a clear object that does it.

You might notice here a bit of terminological confusion. Often when programmers speak about parsing, they mean taking some text and turning it into something meaningful. In the language community, however, that overall process is called translation. They use 'parsing' to refer to the stage of translation that operates on a token stream and recognizes it as valid syntax or finds errors.

This confusion presents me with a particularly ticklish problem since both senses of the word 'parsing' are widely used and rather hard to avoid. I shall use it both ways with the hope that the context will make clear what I mean. If I really need to make the distinction I'll use the terms 'pure parsing' and 'translation', but in this section I'll only use parsing to mean pure parsing.

With Delimiter Directed Translation you can hand-write the parser yourself, or use a parser-generator toolkit. Most of the time it's easier to use a parser-generator, but there are occasions where you're doing something simple or there isn't a parser generator available. In this case the best kind of parser to write is a Recursive Descent Parser since it's pretty easy to write given the right kind grammar. In either case it makes sense to separate lexing and parsing into separate classes.

Lexing takes an input character stream and turns into a stream of more meaningful tokens. Each token has a type and a payload - the underlying text.

Let's take a line that says command unlockPanel PNUL. On disk this is just a sequence of characters. The tokenizer would break this up into three tokens, the first of which is a keyword, and the next two are identifiers:

<type = command-keyword, text = "command">
<type = identifier, text = "unlockPanel">
<type = identifier, text = "PNUL">

We specify how these tokens are formed with lexer rules, which look something like this:

command-keyword: 'command';
identifier: [:alpha:]*;

The actual syntax of the lexer rules varies depending on the toolkit you are using, but they are usually some form of regular expression syntax.

We write the grammar for the parser in terms of these tokens. So for this kind of line we would have a command clause that consists of the command keyword followed by two identifiers.

command: command-keyword identifier identifer;

We might then say that commands are one kind of statement we can have in our language, the others being events and states.

statement: command | event | state;

The grammar defines a tree structure of what clauses we expect in our language and how they fit together. A parser generator tool can take this grammar and lexer rules and generate parser and lexer classes for us. This parser class takes the token stream produced by the lexer as input and will tell us if it is valid syntax for our language, pointing out any errors on the way.

In order to figure out whether a token stream is valid for a language, the parser notionally builds up a parse-tree. This parse-tree is a tree structure of the elements of the input language.

So assume we have the line command unlockPanel PNUL and a grammar that says (as above)

statement: command | event | state;
command: command-keyword identifier identifer;

Processing these leads to the following parse-tree.

Figure 1

Figure 1: Parse tree for the sample input.

There is no single standard for the syntax of grammar files, although they pretty much all follow some close variant of BNF. Grammars don't just vary in terms of their syntax, they also follow differing grammar types (things like LL and LALR, which we'll discuss later). As a result you can't just copy a grammar from one parser generator into another, even for the same language.

A pure parser does no more than just say whether the input is syntactally correct or not. While this is handy information, we usually require rather more. As a result we like to augment the parser to allow it to do output production and possibly some semantic analysis.

The most obvious form of augmentation is to allow us to put code into the grammar which is executed when a grammar rule is recognized. If we are writing our own parser, this just means adding additional code into the code we wrote for parsing. Parser-generators handle this by using Host Code Embedment to put fragments of code into the grammar itself. So we might augment our command rule with something like:

command: command-keyword identifier identifier 
         {print "recognized command name $1 with code $2"}
 

This host code usually has some mechanism for referencing the tokens recognized in the grammar rule, the exact syntax agian varying with actual parser-generator you use.

Semantic analysis is essentially rules about the input language that can't be satisfied by the grammar alone. The classic example of this kind of rule is one that says that you can't use an identifier without declaring it first. The parser's context is strictly hierarchic - it can tell where in the parser tree you are, but can't make this kind of check. These checks can be done in embedded code.

The output production is the final thing you want to acheive with the overall translation process. As we can embed any code we like, we can obviously do anything we like. In practice, however, there are three categories of things we do with our parser.

Embedded Translation uses the embedded code to create the objects of the abstract representation itself. Once the parser has finished parsing the code you have a fully formed abstract representation which you can interpret or use as the basis for code generation.

Tree Construction also returns a data structure, but this is one that's much closer to the original input and usually needs a further processing step to actually form an abstract representation. This data structure could be the parse tree used by the parser, but usually it is an Abstract Syntax Tree. An abstract syntax tree (AST) is a representation of the input that removes unnecessary nodes and reorders the parse-tree to make it more logical to process. As a result an AST is much easier to manipulate than a parse-tree.

In order to produce an abstract representation, you'll need a further translation step that walks the AST and produces the abstract representation. Naturally this involves an extra step in the whole translation process. The extra step does introduce some complexity, but can be very helpful if there is a complicated transformation between the input language and the abstract representation because doing a complicated task in two simpler steps is easier than trying to do it in one complicated step.

A noticeable consequence of Tree Construction is that since the code that produces the AST is quite regular, you can use a DSL for this too. Antlr, for example, has a DSL for Tree Construction which means that an antlr grammar using this technique needs no embedded host code at all.

In some cases you can execute or code generate off the AST directly. In this case the AST is the abstract representation so you can think of it as either (or both) patterns. There's also a definite gray area between an AST and an abstract represetation. In my mind the key difference is that an AST is a data structure that's designed around the structure of the input while the abstract representation is designed around the purpose of the DSL.

The last case, Embedded Interpretation doesn't use any further representation at all: there is no AST nor abstract represetation. Running the parser code immediately produces the output of the DSL. I generally argue against this case as it's usually trying to do too much in one place.

Another case where you don't have any AST or abstract represetation is where you use Embedded Translation where the embedded logic does code generation directly. I advise against this as the abstract represetation helps is a very useful intermediate step allowing you to break the overall problem into two simpler steps. It also gives you the flexibility to produce different kinds of output from the abstract represetation without altering the parser code.

In a similar vein you see two approaches to how people embed code into the parser. One approach is to embed all necessary code into the parser, while the other is to use an Embedment Helper. In the latter case the embedded code consists only of calls to the Embedment Helper. I much prefer using an Embedment Helper, partly because any host code in the grammar loses any IDE tooling, but mostly because lots of host code obscures the structure of the grammar which removes one of the great advantages of using one.

Significant Revisions

18 Aug 07: