| DSL-WIP Home |

WORK-IN-PROGRESS: - this material is still under development

Implementing an External DSL

Last significant update: 18 Aug 07

Implementing an external DSL differs from internal DSLs in that the parsing process operates on pure text input which is not constrained by any particular language. The techniques we can use to parse text are essentially those that have been in use for decades in parsing programming languages. There is also a long-running language community that's developing these tools and techniques.

But there is a catch. The tools and writings of the programming language community almost always assume you are working with a general purpose language. DSLs are lucky to get a mention in passing. While many of the principles apply equally to general purpose and domain specific languages, there are differences. In addition you don't need to understand as much to work with DSLs, which essentially means you don't need to go all the way up the learning curve that you need to go for general purpose languages.

Within the language community, the key tool for parsing is a parser generator toolkit, of which yacc and antlr are two of the best known examples. A parser generator toolkit allows you to specify the structure of the input language using a grammar and uses that grammar as a DSL to drive a parser, usually through code generation. Using a grammar to drive a parse in this way is called Syntax Directed Translation. Much of the writing in the language community is about how to do Syntax Directed Translation in the face of a general purpose language, which usually implies dealing with large programs in an efficient manner, and in dealing with complexities of input languages that can't be described well with a BNF grammar.

Although DSLs can, and often should, make use of these same parser generator tools, DSLs can usually be far simpler than general purpose languages. DSL scripts are usually much smaller and thus don't require the hoops you need to jump through to get decent performance for larger languages. DSLs also express simpler concepts, and thus can be designed to avoid the kind of awkward cases that require more sophisticated handling of general purpose languages. Consequentially the onus is much more on the writer of a parser to make it clear and easy to follow - particularly since parsers tend to be a specialized area with reputation for being hard.

As a result the discussion I provide in this book is quite different to classical language processing approaches. In particular I'm going to focus on two main decision areas for external DSLs: what I'll call the output production strategy, and the syntactic analysis strategy. Each of these comes with its own patterns and choices.


Output Production Strategy

When we want to parse some input, we have to ensure we know what we want to do with the result - what is our output going to be?. I've already argued that most of the time the output we should build is a Semantic Model, which we can then either interpret directly or use as an input for code generation. I won't rehash that argument again now, other than to point out that this is immediately a significant difference to the underlying assumptions you find in the established language community.

Within that community there is a strong emphasis on code generation, and usually parsers are constructed to directly produce the output code with no Semantic Model in sight. This is a reasonable approach for general putpose languages, but it isn't the approach I suggest for DSLs. That difference is something that's important to bear in mind when you read material produced by the language community - which includes most documentation for tools such as parser generators.

Given that our output is a Semantic Model, there are a couple of broad ways we can get there, which boils down to whether we get there in one or two steps. The single step way is Embedded Translation, where you place calls directly in the parser to create the Semantic Model during the parsing process. In this approach you gradually build up the Semantic Model while you are going through the parse. As soon as you understand enough of the input to recognize a part of the Semantic Model, you go ahead and create it. Often you'll need some intermediate parsing data before you can actually create the objects in the Semantic Model - this usually involves storing some information in Symbol Tables

The alternative route to this is the two step route - Tree Construction. In this approach you parse the input text and produce a syntax tree that captures the essential structure of that input text. You then run a second phase that walks the parse tree and populates the Semantic Model.

The syntax tree structure is one that is very much based on the input script. The script is divided into to small chunks, usualy referred to as tokens, which are then arranged into a hierarchy that represents the logical structure of the DSL script. In its simplest form, this syntax tree captures every piece of text in the input script - this direct syntax tree is called a parse tree. However many elements of the input text are there to mark the structure of the text and don't need to be captured in the tree representation. As a result we can remove these uneccesary tokens and produce an abstract syntax tree, which captures the essential structure of the input yet is a simpler structure to work with.

The great advantage of using Tree Construction is that this splits up the task of parsing into two simpler tasks. While recognizing the input text you can focus only on how to build up an AST. This makes it easier to write the code that's intermingled with the recognizing code. Indeed many parser generators provide a DSL for tree construction that further simplified this part of the process. Walking the tree to populate the Semantic Model is then a more regular programming exercise, and you have the whole tree to examine in order to determine what to do. If you've written code to process XML you can liken Embedded Translation to using SAX and Tree Construction to using a DOM.

There is also a third option - Embedded Interpretation. Embedded Interpretation runs an interpretation process during the parse itself and its output is the final result. A classic example of Embedded Interpretation is a calculator that takes in arithmetic expressions and produces the answer as a result. Therefore Embedded Interpretation doesn't product a Semantic Model. Although it does come up from time time, it's a rare case.

You can also use Embedded Translation and Tree Construction without a semantic model. Indeed this is quite common to see when using code generation. Most examples you see of parser generators will do one of these. Although it may make sense, particularly for simpler cases, it's an approach I reccomend only rarely. Usually I find the Semantic Model overwhelmingly helpful.

So most of the time the choice is between Embedded Translation and Tree Construction. This decision depends on the costs and benefits of that intermediate syntax tree. The great benefit of Tree Construction is that it splits the parsing problem into two. Usually it's easier to combine two simple tasks than it is to write one more complicated task. This becomes increasingly true as complexity of the overall translation increases. The more involved the DSL and the greater the distance between the DSL and the Semantic Model, the more useful it is to have an intermediate syntax tree, particularly if you have tooling support to create an abstract syntax tree.

I seem to be making a convincing argument here for Tree Construction. Certainly a common argument against it - the memory that syntax tree takes up - whithers away when processing small DSLs on modern hardware. But despite the many reasons that favor Tree Construction I'm not entirely convinced - sometimes it feels that building up and walking the tree is more trouble than it's worth. I have to write the code to create the tree and code to walk it - often it's easier to just build the Semantic Model right there and then.

So I'm conflicted on the choice. Other than the vague notion that increasing complexity of translation favors Tree Construction, I have mixed feelings. My best advice is to try a little of both and see which you prefer.


Syntactic Analysis Strategy

Syntactic Analysis is the term common in the language community to recognizing structure in a stream of input text. Let's consider the following that might be a variation on programming my introductory state machine.

event doorClosed  D1CL
event drawOpened  D2OP
command unlockPanel PNUL
command lockPanel PNLK
     

Syntactic analysis is about recognzing that the line event doorClosed D1CL is an event definition and telling it apart from a command definition. Once we know we have an event definition we can then arrange it into a syntax tree if we're using Tree Construction or create objects in the Semantic Model if we're using Embedded Translation. The question is how do we recognize the definition for what it is?

The simplest way to do it, is one that I'm sure you've done yourself, even if you've never dabbled with serious parsing. Divide the input text into lines, then process each line. If it starts with 'event' you know it's an event, if it starts with 'command' you know it's a command. You can then break up the line accordingly to find the key bits of information. This style I refer to as Delimiter Directed Translation. The general formulation is to pick some delimeter characters that breaks the input into statements - usually line-endings, chop the input into separate statements using that delimeter, and then feed each chunk into a separate processing step to figure out what's on the line. Usually there's some clear marker in the line that tells you what kind of statement you are dealing with.

Delimiter Directed Translation is very simple to use and uses tools that most programmers are familiar with - string splitting and regular expressions. It's limitation is that it doesn't give you any inherent way to handle the hierarchic context of your input.

Let's consider that I'd formulated my definitions like this

events
  doorClosed  D1CL
  drawOpened  D2OP
end

commands
  unlockPanel PNUL
  lockPanel   PNLK
end
    

Now splitting it up into lines isn't quite enough. There isn't enough information on the line doorClosed D1CL to tell if this is an event or a command I'm defining. There are ways to do this (and I explore one in an example for Delimiter Directed Translation) but you have to do it yourself. The more hierarchic context you get, the more effort you have to go to in managing it yourself.

In order to handle DSLs with this kind of structure the next step is to use Syntax Directed Translation. In this technique we first define a formal grammar for the input language something like:

list : eventList commandList;
eventList : 'event' eventDec* 'end';
eventDec  : identifier identifier;
commandList : 'command' commandDec 'end';
commandDec : identifier identifier; 

If you've read any book on programming languages, you'll have come across the notion of a grammar. A grammar is a way of defining the legal syntax of a programming language. Grammars are almost always written in some form of BNF. Each line is a production rule and states a name followed by the legal elements of that rule. Once we have a grammar like this, it's then fairly easy to write code that will recognize the input text. We can write this code ourselves, usually to produce a Recursive Descent Parser, or we can feed the BNF directly into a parser generator tool, which will generate a parser for us.

The kind of parser that's generated from Syntax Directed Translation is very capable of handling hierarchic structures like these, after all this kind of thing is essential for general purpose languages. As a result you can handle many things that are awkward with Delimiter Directed Translation much more easily.

Furthermore becuase the grammar is a DSL, it gives us not just the code to parse our language, but also a compact and clear statement of the syntax of our language. If you want to figure out from the code of Delimiter Directed Translation what a legal syntax is, you'd be faced with a fair bit of work - or need some documentation (and hope it's kept up to date). The grammar is the documentation of the syntax of the DSL and since it's code it's always up to date.

The downside of using Syntax Directed Translation is that it's a technique that isn't as widely known as it should be. Many people are under the impression that working with parser generators is quite hard. I think that this fear often comes from the fact that Syntax Directed Translation is usually described in the context of parsing a general purpose language - which introduces a lot of complexities that you don't face with a DSL. I hope this book will encourage you to try and work with Syntax Directed Translation for yourself and you'll discover that it isn't really that tough to do.

I'll go into more detail about Syntax Directed Translation in that section, but one highlight worth knowing about now is the two-stage nature that Syntax Directed Translation is usually done. These two stages are called lexing (also scanning or tokenizing) and syntactic analysis (also referred to, confusingly, as parsing). The lexing stage takes the input text and transforms it into a stream of tokens. Tokens have a type and content. In our state machine language the text "state idle" would turn into two tokens: [content: "state', type: state-keyword][content: "idle", type: identifier].

The syntactic analyser then takes this stream of tokens and arranges it into a parse tree data structure, based on the grammar rules. However the fact that the lexer does its work first has some significant consequences. Firstly it means that I have to be careful about how I use my text. I might want to have a state declared like this state initial state, but this is tricky because the lexer will classify the second "state" as a state keyword rather than an identifier. To avoid this I have use some scheme of Alternative Tokenization.

The second consequence of this is that whitespace is discarded before the parser gets to see anything. This makes syntactic whitespace more difficult to deal with - especially using whitespace to handle code structure in the manner of python.

The reason that this split occurs is that it makes it much easier to write each of the two elements, as well as improving performance, particulary on the more limited hardware that many of these tools were originally designed for.

If you were being exceptionally sharp-eyed earlier on, you might have noticed that I talked about writing a grammar for a language. One language can be described by many different grammars. Grammars would vary because you are using different forms of BNF, you are using different parser generators, or just because you want to factor your grammar in different ways for different purposes.


A short dip into language theory

This is a good moment to dip our toes into some language theory, in particular the way in which the programming language community classifies grammars. This scheme, called the Chomsky Hierarchy, was developed by the linguistist Noam Chomsky in the 1950's. It was based at looking at natural languages rather than computer languages, but basis its classification on the mathematical properties of a grammar used to define their syntactic structure. I'll save you the details (if you're interested there's lots of compiler books and web pages you can work through) but highlight the three key categories: regular, context-free, and context-sensitive.

A regular grammar is important to us because it can be processed using a finite-state machine. One common way to do this is a regular expression. Thus you can, in theory, write a parser for a regular language by just coming up with a suitable regular expression. In practice that regular expression would be too complicated to use lightly, but you can tweak things a bit to get something powerful enough. (I used the phrase "regular language" above, that means a language for which you can come up with a regular grammar.)

In terms of computer languages, regular grammars have one big problem - they can't deal with nested blocks. A regular language can parse an expression like 1 + 2 * 3 + 4 but can't parse 1 + (2 * (3 + 4)). You may hear this as saying that regular grammars "can't count". In parsing terms this means that you can't use a finite state machine to parse a language that has nested blocks. Obviously this is bit of a bummer when it comes to computer languages, as any general purpose language has be able to arithmetic. It also affects block structure - programs like this

  for (int i in numbers) {
    if (isInteresting(i) {
     doSomething(i);
    }
  }
    

need nested blocks so are not regular.

To handle nested blocks you need to step up to what's called a context-free grammar. I find this rather confusing because the way I look things a context-free grammar does add hierarchic context to your grammar, allowing us to count. A context-free grammar can be implemented using a pushdown machine, which is a finite-state machine with a stack. Most language parsers use context-free grammars and most parser generators use them. As a result any modern programming language is usually described using a context-free grammar.

Although context-free grammars are widely used, they can't handle all the syntactic rules that we might like. The common exception case to context-free grammars is the rule that you must declare a variable before you use it. The problem here is that the declaration of a variable often occurs outside the particular branch of a hierarchy you are in when you use the variable. While a context-free grammar hold hierarchic context, that's not enough context to handle this case. The next step up in the Chomsky Hierarchy is context-sensitive grammars. A context sensitive grammar could handle this case, but we don't know how to write general context-sensitive parsers. As a result we have to write our own code to handle rules like defining variables.

The hierarchy is inclusive, in that all regular languages can be described by a context-free grammar, and thus processed by a pushdown machine.

I made this dip into language classification theory primarily because it gives you some insight into which tool to use for processing a DSL. In particular it tells you that if you use nested blocks, you'll need something that can handle a context-free language. In particular it argues that if you need nested blocks you'll need to use Syntax Directed Translation rather than Delimiter Directed Translation.

It also brings out that if you only have a regular language, you don't need to use a parser generator to process it. As it turns out you may find that it's easier to use a parser generator in any case. Once you've got used to using them, they are sufficiently straightforward that it usually isn't overkill to use one even for a regular language.

This division is also part of why Syntax Directed Translation is traditionally divided into separate phases for lexing and syntactic analysis. Lexing is usually done with a finite-state machine while syntactic analysis uses the pushdown machine. This limits what you can do in the lexer, but allows the lexer to be fast. There are exceptions to this, in particular I use Antlr for most examples in this book and Antlr uses a push-down machine for both lexing and syntactic analysis.

There are some parser tools out there that only handle regular grammars. Ragel is one better-known example. Also you can use lexers on their own to recognize a regular grammar. However if you're getting into Syntax Directed Translation I'd suggest starting with a context-free tool.

While the notions of regular and context-free grammars are the most common ones you're likely to run into, there is a relative newcomer that is also interesting. This is a form of grammar called a Parsing Expression Grammar (PEG). PEGs use a different form of grammar which can handle most context-free situations and some context-sensitive ones. PEGs don't separate lexing and syntactical analysis (called scannerless parsing) and it seems that a PEG is more usuable than a context-free grammar in many situations. However as I write this PEGs are still relatively new, and tools are both rare and immature. This may change, of course, by the time you read this, but that's the reason I don't talk much about PEGs in this book. The best known PEG parsers are Packrat parsers.

(The lines between the PEGs and more traditonal parsers are not solid, however. Antlr, for example, has incorporated many ideas from PEGs.)


Mixing-in Another Language

One of the biggest dangers that you face with an external DSL is that may accidentally evolve to become a general purpose language. Even if things don't deteriorate that far, a DSL can easily become overly complex, particularly if you have a lot of special cases which need particular treatment but are only used rarely.

Imagine we have a DSL that allocates sales leads to salesmen based on the kind of product that's being asked for and the state in which the customer is based. We might have rules like:

*** missing file: part/hostCodeEmbed/frags.txt in code/ ***

Now what happens if Scott starts regularly playing golf with a bigshot of Baker Industries, which gives him links into all sorts of companies called Baker This and Baker That. We decide we want to handle this problem by assigning him all floor wax leads with a company whose name starts with "baker" in the New England states.

There may be a dozen special cases like this, all of which need extending the DSL in a particular direction. But including a lot of special tweaks for individual cases will add a lot of complication to the DSL. For those rare cases, it's often worthwhile to handle them using a general purpose language by using Foreign Code. Foreign Code embeds a small bit of a general purpose language into the DSL. This code isn't parsed by the DSL's parser, rather it is just slurped as a string and put into the Semantic Model for later processing. Doing this here may lead to a rule like this

*** missing file: part/hostCodeEmbed/frags.txt in code/ ***

This isn't as clear as extending the DSL would be, but this mechanism can handle a wide range of cases. Should regex matching become a common condition we can always extend the language later.

In this case the general purpose language I've used is Javascript. Using a dynamic language is useful for Foreign Code because it allows you to read and interpret the DSL script. You can also do this with static code, but then you have to use code generation and weave the host code into the generated code. This technique is familiar to people who use parser generators since this is how most parser generators work.

This example uses general purpose code, but you can also use the same technique with another DSL. This gives you a step towards modular grammars. At the moment you have to be cautious with this. Current parser technologies aren't well-suited to mixing different languages together. So often it's better to extend a grammar than to use Foreign Code for another DSL. I do think that a shift to tools that support more modular grammars would be useful in making work with DSLs effectively.

One of the problems with using Foreign Code is that you need to tokenize the embedded code differently to how you scan the code in your main language, so you need to use some approach of Alternative Tokenization.

The simplest approach for Alternative Tokenization is to quote the embedded code inside some clear delimiters that can be spotted by the tokenizer and thus slurped as a single string.

Alternative Tokenization isn't just for handling Foreign Code. There are cases where depending on the parsing context you may want to interpret what should usually be a keyword as part of a name. Quoting can do the trick, but another implementation of Alternative Tokenization may involve less syntactic noise.


Assorted Tactics

Most of what I describe in this section are topics that I like to think of as broad strategic issues in external DSLs. I think of these as strategies because they are large-scale decisions on how you parse the DSL. As well as strategies, there are a plethora of tactical decisions. Many of these tactical decisions depend on the particular strategies and tools you pick, and I don't pretend to be attempting a comprehensive description of them here. However in my exploration of external DSL techniques there were a couple of tactics that seemed to crop up enough to be worth spending some further time on.

If you take a look at compiler books, they almost always start with an example of parsing an arithmetic expression. I understand why they do this: it's a familiar domain, and also one that brings out why you need a context-free grammar rather than a regular one. But in practice these kinds of *** unable to resolve patternRef: NestedExpression *** come with some sharp corners, particularly when you have a top-down parser.

Parsers for context-free languages come in two broad flavors: top -down (often LL) and bottom up (often LR or LALR). The two styles of parser work quite differently internally, and this manifests itself in terms of the grammars they can deal with. Most parser generators designed for context-free grammars can't actually deal with any context-free grammar - there are additional limitations based on the kind of parsing algorithms they use.

Top down parsers have one particularly noticeable limitation: they can't deal with left recursion. The most natural way to write a rule for addition is to say something like

    expr : expr '+' expr; 
  

but this is left recursive because the term on the left hand side of the rule body is the same as the parent term. Bottom-up parsers can handle this just fine, but top down ones need to do some contortions. These contortions are really just one well-undertood idiom, but they do complicate the grammar, which is why many people prefer bottom-up parsers. The good news for top-down parsers is that you only really run into this problem when dealing with *** unable to resolve patternRef: NestedExpression *** and once you know understand the idioms for *** unable to resolve patternRef: NestedExpression *** you can churn them out relatively mechanically. The resulting grammar will still be not as clear as for a bottom-up parser, but knowing the idiom will get you there much quicker.

The other tactical area I've dropped in is dealing with Syntactic Newlines. Newlines are commonly used as statement separators in programming languages, but using them with Syntax Directed Translation has some edge cases that you have to watch for.

This raises a broader question about using statement separators at all. General purpose languages regularly use statement separators, commonly using semi-colons or newlines. Since DSLs are so simple, they can often be designed without needing statement separators. The main problem of avoiding statement separators is that it can make it harder to localise error diagnostics - statement separators act as checkpoints for the parser. If you choose to use statement separators you then have to choose between visible separators (such as a semi-colon) or a newline. Using Syntactic Newlines introduces some niggles in the parse, mostly around dealing properly with blank lines, but has less syntactic noise than a visible statement separator.

I don't have a strong opinion on whether you should use statement separators or what form they should take if you do use them.

[TBD: Add some discussion about using XML][TBD: Discuss dangling else problem (and other classic parsing issues?]

Significant Revisions

18 Aug 07:

17 Nov 08: Rewrote to match current view of patterns

05 Jan 09: Added sections on mixing in another language and assorted tactics