WORK-IN-PROGRESS: this material is still under development

Code Generation

Last significant update: 18 Apr 08

So far in my discussion of implementing DSLs I've talked about parsing some DSL text, usually with the aim of populating a Semantic Model. In many cases once we can populate the Semantic Model our work is done - we can just execute the Semantic Model to get the behavior we're looking for.

While executing the Semantic Model directly is usually the easiest thing to do, there are plenty of times when you can't do that. You may need your DSL-specified logic to execute in a very different environment, one where it's difficult or impossible to build a Semantic Model or parser. It's in these situations that you can reach for code generation. By using code generation you can take the behavior specified in the DSL and run it in almost any environment.

When you use code generation you have two different environments to think about: what I shall call the DSL processor and the target environment. The DSL processor is where the parser, Semantic Model and code generator live. This needs to be an environment in which it is comfortable to develop these things. The target environment is your generated code and its surroundings. The point of using code generation is that you have to separate the target environment from your DSL processor because you can't reasonably build your DSL processor in your target environment.

Target environments come in various guises. One case is an embedded system where you don't have the resources to run a DSL processor. Another is where the target environment needs to be a language that isn't suitable for DSL processing. Ironically the target environment may itself be a DSL. Since DSLs have limited expressiveness, they usually don't provide abstraction facilities that you need for a more complex system. Even if you could extend the DSL to give you abstraction facilities, that would come at the price of complicating the DSL - perhaps enough to turn it into a General Purpose Language. So it can be better to do that abstraction in a different environment and generate code in your target DSL. A good example of this is specifying query conditions in a DSL and then generating SQL. We might do this to allow our queries to run efficiently in the database, but SQL isn't the best way for us to represent our queries.
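To make the query example concrete, here is a minimal sketch of generating a SQL condition from a tiny model of query conditions. The model shape (triples of column, operator and value), the operator table, and the names generate_where and sql_literal are all assumptions made for illustration, not part of any particular library.

```ruby
# Hypothetical condition model: [column, operator, value] triples,
# all of which must hold (they are joined with AND).
CONDITIONS = [
  [:status, :eq, 'active'],
  [:age,    :gt, 21]
]

# Hypothetical mapping from model operators to SQL operators.
OPERATORS = { eq: '=', gt: '>', lt: '<' }

# Render a Ruby value as a SQL literal, doubling any single quotes in strings.
def sql_literal(value)
  case value
  when Numeric then value.to_s
  else "'#{value.to_s.gsub("'", "''")}'"
  end
end

# Walk the conditions and emit a WHERE clause in the target language (SQL).
def generate_where(conditions)
  clauses = conditions.map do |column, operator, value|
    "#{column} #{OPERATORS.fetch(operator)} #{sql_literal(value)}"
  end
  'WHERE ' + clauses.join(' AND ')
end

puts generate_where(CONDITIONS)
```

A real generator would more likely emit bind parameters than inline literals. The sketch inlines them only to keep the generated text easy to read.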

Limitations in the target environment aren't the only reasons to code generate. One reason may be lack of familiarity with the target environment. It may be easier to specify behavior in a more familiar language and then generate the less familiar one. Another reason for code generation is to better enforce static checking. We might characterize the interface of some system with a DSL, but the rest of the system wishes to talk to that interface using C#. In this case you might generate a C# API so that you get compile time checking and IDE support. When the interface definition changes you can then regenerate the C# and have the compiler help you identify some of the damage.


Choosing What to Generate

One of the first things to decide when generating code is what kind of code you are going to generate. The way I look at things there are two styles of code generation you can use: Model-Aware Generation and Model Ignorant Generation. The difference between the two lies in whether or not you have an explicit representation of the Semantic Model in the target environment.

Figure 1: A very simple state machine

As an example let's consider a state machine. Two classic alternatives for implementing a state machine are nested conditionals and state tables. If we take a very simple state model, such as Figure 1, a nested conditional approach would look like this:

[TBD: There will be more detail on this when I write the computational model section on state machines.]
  #ruby
  def handle event
    case @current_state
    when 'off'
      case event
      when 'switchUp'
        @current_state = 'on'
      end
    when 'on'
      case event
      when 'switchDown'
        @current_state = 'off'
      end
    end
  end  
[TBD: Consider replacing the ruby here with java/c#]

We have two conditional tests nested one inside the other. The outer conditional looks at the current state of the machine and the inner conditional switches on the event that's just been received. This is Model Ignorant Generation because the logic of the state machine is embedded into the flow of control of the language - there's no explicit representation of the Semantic Model.

With Model-Aware Generation we put some representation of the Semantic Model into the generated code. This needn't be exactly the same as that used in the DSL processor, but it will be some form of data representation. For this case our state machine is a touch more complicated:

  class StateMachine
    def initialize start_state
      # source state => { trigger event => target state }
      @states = {}
      @current_state = start_state
    end
    def define_transition source, trigger, target
      @states[source] ||= {}
      # add to the state's row rather than replacing it,
      # so one state can have several outgoing transitions
      @states[source][trigger] = target
    end
    def handle event
      puts "received #{event}"
      state_row = @states[@current_state]
      return if state_row.nil?
      new_state = state_row[event]
      # events with no transition from the current state are ignored
      @current_state = new_state unless new_state.nil?
    end
  end

Here I'm storing the transitions as nested maps: a map keyed by state, whose values are maps keyed by event with the target state as the value. I may not have explicit state, transition, and event classes - but the data structure captures the behavior of the state machine. As a result of being data-driven, this code is entirely generic and needs to be configured by some specific code to make it work.

  sm = StateMachine.new 'off'
  sm.define_transition 'off', 'switchUp'  , 'on'
  sm.define_transition 'on' , 'switchDown', 'off'

By putting a representation of the Semantic Model into the generated code, the generated code takes on the same split between generic framework code and specific configuration code that I talked about in the introduction. Model-Aware Generation preserves the generic/specific separation while the Model Ignorant Generation folds the two together by representing the Semantic Model in control flow.

The upshot of this is that if I use Model-Aware Generation the only code I need to generate is the specific configuration code. I can build the basic state machine entirely in the target environment and test it there. With Model Ignorant Generation, I have to generate much more code. I can pull out some code into library functions which don't need to be generated, but most of the critical behavior has to be generated.
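As a rough illustration of how little needs generating in the Model-Aware case, here is a minimal sketch that emits configuration code of the kind shown earlier. It assumes the Semantic Model can be reduced to a list of source, trigger and target triples. The constant TRANSITIONS and the function generate_configuration are hypothetical names introduced for this example.

```ruby
# Hypothetical semantic model: a list of [source, trigger, target] triples.
TRANSITIONS = [
  ['off', 'switchUp',   'on'],
  ['on',  'switchDown', 'off']
]

# Emit the specific configuration code that drives the generic state machine.
# Only these few lines are generated; the StateMachine class itself is
# hand-written in the target environment.
def generate_configuration(start_state, transitions)
  lines = ["sm = StateMachine.new '#{start_state}'"]
  transitions.each do |source, trigger, target|
    lines << "sm.define_transition '#{source}', '#{trigger}', '#{target}'"
  end
  lines.join("\n") + "\n"
end

puts generate_configuration('off', TRANSITIONS)
```

The generator knows nothing about how events are handled. All of that behavior stays in the generic code, which can be tested without running the generator at all.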

As a result it's much easier to generate code using Model-Aware Generation. The generated code is usually very simple. You do have to build the generic section, but since you can run and test it independently of the code generation system, building it is usually much easier too.

For this reason my inclination is to use Model-Aware Generation as much as possible. However it often isn't possible. Often the whole reason for using code generation is that the target language can't represent a model easily as data. Even if it can, there may be processing limitations. Embedded systems often use Model Ignorant Generation because the processing overhead of code generated with Model-Aware Generation would be too great.

There's another factor to bear in mind if it's possible to use Model-Aware Generation. If you need to change the specific behavior of the system, you can replace only the artifact corresponding to the configuration code. Imagine we're generating C code. We can put the configuration code into a different library than the generic code - this would allow us to alter the specific behavior without replacing the whole system (although we'd need some run-time binding mechanism to pull this off).

We can go even further here and generate a representation that can be read entirely at run-time. We could generate a simple text table like

off switchUp   on
on  switchDown off

This would allow us to change the specific behavior of the system at run-time, at a cost of the generic system having the code to load the data file at start-up.
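The loading code in the generic system can be very small. Here is a minimal sketch, assuming each line of the table holds a source state, an event and a target state separated by whitespace. The function name load_transitions is hypothetical, and the table is read from a string here only to keep the example self-contained; a real system would read it from a file at start-up.

```ruby
# Build the nested-map representation (state => { event => target })
# from a whitespace-separated transition table.
def load_transitions(text)
  table = Hash.new { |hash, state| hash[state] = {} }
  text.each_line do |line|
    source, trigger, target = line.split
    next if source.nil?   # skip blank lines
    table[source][trigger] = target
  end
  table
end

# The generated table from above, inlined as a string for the example.
TABLE_TEXT = <<~TABLE
  off switchUp   on
  on  switchDown off
TABLE

states = load_transitions(TABLE_TEXT)
puts states['off']['switchUp']
```

Splitting on whitespace is all the parsing there is, which is the point of choosing such a format.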

At this point you're probably thinking that I've just generated another DSL which I'm parsing in the target environment. You could think of it this way, but I don't. To my mind the little table above isn't really a DSL because it isn't designed for human manipulation. The textual format does make it human readable, but that's more of a useful feature for debugging. It's primarily designed to be really easy to parse so that we can quickly load it into the target system. When designing such a format, human readability comes a distant second to simplicity of parsing. With a DSL human readability is a high priority.

[TBD: Is this classification a reasonable one to use for generating classes from data descriptions? If not what are the consequences.]

How to Generate

Once you've thought about what kind of code to generate, the next decision is how to go about the generation process. When generating a textual output there are two main styles you can follow: Transformer Generation and Templated Generation. With Transformer Generation you write code that reads the Semantic Model and generates statements in the target source code. So for the states example you might get hold of the events, generate the output code to declare each event, likewise with the commands, and again for each state. Since the states contain transitions, your generation for each state would involve navigating to the transitions and generating code for each of these too.
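Here is a minimal sketch of Transformer Generation for the light switch, producing the nested conditional shown earlier. It assumes a simplified Semantic Model held as a nested map of state to event to target state. The names LIGHT_SWITCH and generate_handler are hypothetical, and the model here has no separate event or command declarations.

```ruby
# Hypothetical semantic model: state name => { event name => target state }.
LIGHT_SWITCH = {
  'off' => { 'switchUp'   => 'on'  },
  'on'  => { 'switchDown' => 'off' }
}

# Walk the model and emit target source code one statement at a time.
# For each state we navigate to its transitions and generate code for those too.
def generate_handler(model)
  out = []
  out << "def handle event"
  out << "  case @current_state"
  model.each do |state, transitions|
    out << "  when '#{state}'"
    out << "    case event"
    transitions.each do |event, target|
      out << "    when '#{event}'"
      out << "      @current_state = '#{target}'"
    end
    out << "    end"
  end
  out << "  end"
  out << "end"
  out.join("\n") + "\n"
end

puts generate_handler(LIGHT_SWITCH)
```

The shape of the generator follows the shape of the model: one loop per level of the structure being navigated. This sketch assumes every state has at least one transition.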

With Templated Generation you begin by writing a sample output file. In this output file, wherever there is something that is specific to a particular state machine, you place special template markers that allow you to call out to the Semantic Model to generate the appropriate code. If you've done templated web pages with tools like ASP, JSP and the like you should be familiar with this mechanism. When the template processor runs, it replaces the template references with generated code.
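For comparison, here is a minimal sketch of Templated Generation producing the same kind of nested conditional, using Ruby's standard ERB library as the template processor. The model shape (a nested map of state to event to target state) and the names SWITCH_MODEL, HANDLER_TEMPLATE and render_handler are assumptions made for the example.

```ruby
require 'erb'

# Hypothetical semantic model: state name => { event name => target state }.
SWITCH_MODEL = {
  'off' => { 'switchUp'   => 'on'  },
  'on'  => { 'switchDown' => 'off' }
}

# The template is a sample of the output, with markers wherever the text
# depends on the particular state machine. Lines ending in -%> produce no output.
HANDLER_TEMPLATE = <<~'TEMPLATE'
  def handle event
    case @current_state
  <% model.each do |state, transitions| -%>
    when '<%= state %>'
      case event
  <% transitions.each do |event, target| -%>
      when '<%= event %>'
        @current_state = '<%= target %>'
  <% end -%>
      end
  <% end -%>
    end
  end
TEMPLATE

# Process the template, exposing the model to it as the local variable `model`.
def render_handler(model)
  ERB.new(HANDLER_TEMPLATE, trim_mode: '-').result_with_hash(model: model)
end

puts render_handler(SWITCH_MODEL)
```

Most of the template is static text that reads like the output itself, with the loops over the model as the only dynamic parts.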

With Templated Generation you are driven by the structure of your output. With Transformer Generation you may be driven by either input, output or both.

Both approaches to code generation work well, and to choose between them you're usually best off experimenting with each to see which one works best for you. I find that Templated Generation works best when there's a lot of static code in the output and only a few dynamic bits - particularly since I can look at the template file and get a good sense of what gets generated. As a consequence of this I think that you're more likely to use Templated Generation if you are using Model Ignorant Generation. Otherwise, which actually is most of the time, I like Transformer Generation.

I've discussed these as opposite approaches, but that doesn't mean you can't mix them. Indeed usually you do. If you're using Transformer Generation you'll probably use string format statements to write out a little chunk of code - and these are miniature cases of Templated Generation. Despite this I think it's useful to have a clear idea of what your overall strategy is and be conscious about switching over. As with most things involving programming, the time you stop being thoughtful about what you are doing is the time when you make an unmaintainable mess.

[TBD: Add something about byte-code manipulation]

[TBD: Add some material on handling relationships between hand and generated code. Eg calling relationships use of Generation Gap, Partial Classes etc]

Significant Revisions

18 Apr 08: