The Status Quo for DSL Toolkits
Domain-specific languages and code generation have proved to be very helpful tools in many of our projects. We have successfully employed them to describe protocols, APIs of event-driven embedded systems, different hardware platforms, components and their collaboration in a given software, deployment scenarios (which combinations of software components can be linked into meaningful programs?), input/output elements in data-driven systems and much more.
However, several colleagues and I share the impression that the tools of the trade come with cumbersome restrictions and dependencies:
Fully integrated language workbenches like MPS enforce their way of doing things, and it is often hard to integrate the input and output artifacts, as well as the transformation process, into normal revision tools and build systems. If you decide to use MPS, then this decision will influence your entire development setup in a drastic fashion.
Fully-featured DSL toolkits like XText are slightly easier to integrate, but they still introduce some heavy-weight infrastructure and a lot of concepts. XText, for example, requires you to include Java and Eclipse in the environment, which is particularly silly if the production code itself is written in other languages (and other IDEs). It is also far from easy to use XText as a command-line tool as required within proper build systems (cf. the endless forum threads on how to achieve this with Maven). And what’s more, XText requires your DSL toolsmith to master several languages at once (e.g., Java, Xtend, and Xpand in XText1). Unfortunately, some of these languages are so restricted that it can become a real nuisance to add certain features (e.g., preserving inline comments in the input/output transformation).
Old-fashioned parser/generator toolkits like bison, yacc and friends are less demanding with respect to the infrastructure required for their operation, but they are not very accessible, and many of them force you to implement the transformation in programming languages that are known to be particularly ill-suited for handling strings and data structures dynamically.
What We Really Need
The DSLs we have designed for real projects so far have almost always been relatively simple: They are much less expressive than programming languages (in technical terms, such DSLs tend to fall into the LL(1) category nearly by themselves) and usually come with very few keywords and linguistic devices. By the way, this is not because we didn’t manage to include more complex features; it simply turned out that simple languages suffice to pick all the fruit we were looking for — we got a lot of value for a small investment.
This fact leads to the question of whether we should employ heavy-weight DSL toolkits for our lightweight DSLs. And to the answer. Which is NO.
It is important to note that our line of argument becomes even more valid for safety-critical projects. For instance, what customer would like to verify XText, and all the infrastructure it depends on, in its entirety, just for the sake of implementing a rather simple DSL?
Therefore, a while back, we posed the following hypothesis: We should investigate lightweight approaches to parser/generator toolkits that integrate particularly well with our usual development environments and build systems. Ideally, such a toolkit would be tailored to our exact requirements (accessible, extensible, no premature restrictions, no accidental complexity, etc.), and it would not introduce any new infrastructural dependencies (such as specific IDEs, additional programming languages, or problematic licenses).
Towards a Case Study
The above considerations led to the following idea: Why don’t we try and implement a slim DSL toolkit in Ruby? After all, Ruby is the scripting language of choice in many of our projects — particularly when they make use of the gaudi/rake build system developed by Vassilis Rizopoulos. Ruby is also very expressive, allows for efficient string processing (by means of regular expressions and functions such as split/gsub/strip), and has excellent facilities for handling dynamically growing data structures.
The most important design decision for a DSL toolkit, be it written in Ruby or any other language, is arguably how to implement the parser. This is simply because parsing is almost always the most difficult part of the transformation chain (tokenize, parse, maybe validate, generate). But since we aim for an implementation that is as accessible and easy to maintain as possible, we should look for the most straightforward (but still correct) way of doing the job. Which leads us to a small excursion into the world of parsing as it is done today.
A DSL toolkit necessarily includes a tokenizer, a parser, and generators. The tokenizer splits the input file, which is just a stream of characters, into a stream of basic tokens: keywords, identifiers, constants, etc. The parser reads this stream of tokens and builds up an internal model of what the author of the input file tried to express. The generators read the resulting model, which usually is a tree of typed elements, and use it to produce output files that contain the desired results (often code, but also documents, configuration files and whatever else lends itself to automatic generation).
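To make the first stage concrete, here is a minimal tokenizer sketch in Ruby. The token set (words, numbers, a few punctuation symbols, `#` line comments) is a hypothetical example, not Diesel’s actual code, but it shows how little machinery the job requires:

```ruby
# Minimal tokenizer sketch (hypothetical token set, not Diesel's actual code).
# It turns an input string into a flat array of Token structs.
Token = Struct.new(:kind, :value)

def tokenize(input)
  tokens = []
  until input.empty?
    case input
    when /\A\s+/          then nil # skip whitespace
    when /\A#.*/          then nil # skip line comments
    when /\A\d+/          then tokens << Token.new(:number, $&.to_i)
    when /\A[A-Za-z_]\w*/ then tokens << Token.new(:word, $&)
    when /\A[{}(),;:]/    then tokens << Token.new(:symbol, $&)
    else raise "unexpected character: #{input[0].inspect}"
    end
    input = $' # continue with the rest of the string after the match
  end
  tokens
end
```

Feeding it `"message Foo { id: 42; }"` yields a flat stream of word, symbol, and number tokens that the parser can then consume one by one.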
As outlined above, the hardest part in this transformation chain is the parser. There is a lot of theory behind parsing (such as LR grammars, pushdown automata, shift/reduce tables, abstract syntax trees and the like), and that makes the correct implementation of parsers a nontrivial task.
Or so we are taught.
The fundamental problem with classic parser construction, simply put, is that it fails to reflect the production rules of the underlying grammar in a natural fashion. For example, it is prohibitively difficult to understand what an LR language is all about based on its shift/reduce tables. And this is quite disappointing because parsing is simply the deduction of which production rules of the grammar were used, and in which order, to form the given input.
Fortunately, there is a different approach: Recursive descent parsing. The idea is to write parsers that hide nothing — that is, parsers that reflect the recursive application of production rules by recursive calls to parse functions, where each parse function reflects a fundamental element of the language (such as “a type definition” or “a message specification”, for example). That way, there is no need to rely on intermediate artifacts that are extremely hard to interpret.
(See http://en.wikipedia.org/wiki/Recursive_descent_parser for a more detailed description of this concept, including example code.)
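As a self-contained illustration (using a toy grammar of my own, not one from the case study), here is what a recursive descent parser looks like in Ruby. Each parse function mirrors exactly one production rule, and the error messages fall out naturally — position, what was expected, what was found:

```ruby
# Recursive descent sketch for a toy grammar (hypothetical example):
#   list  := '(' items ')'
#   items := number { ',' number }
# Each parse function mirrors exactly one production rule.
class ListParser
  def initialize(tokens)
    @tokens = tokens # e.g. ["(", "1", ",", "2", ")"]
    @pos = 0
  end

  def parse_list
    expect("(")
    items = parse_items
    expect(")")
    items
  end

  private

  def parse_items
    items = [expect_number]
    items << expect_number while accept(",")
    items
  end

  def accept(token)
    return false unless @tokens[@pos] == token
    @pos += 1
    true
  end

  def expect(token)
    got = @tokens[@pos]
    raise "position #{@pos}: expected #{token.inspect}, got #{got.inspect}" unless accept(token)
    got
  end

  def expect_number
    got = @tokens[@pos]
    raise "position #{@pos}: expected a number, got #{got.inspect}" unless got =~ /\A\d+\z/
    @pos += 1
    got.to_i
  end
end
```

`ListParser.new(["(", "1", ",", "2", ")"]).parse_list` returns `[1, 2]`; malformed input raises an error that names the position, the expectation, and the offending token.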
Along Came Diesel
Given our design goals, it was obvious that our lightweight DSL toolkit (a private side-project dubbed “Diesel” by then) should be implemented as a recursive descent parser.
Another key decision was to understand Diesel as an *approach* to writing specific DSL toolkits, not as a complex software that generates specific DSL toolkits based on specifications in a meta language. Since recursive descent parsers directly reflect the production rules of the input language, such a meta language would only constitute an artificial additional level without any relevant gain — violating the very idea of a lightweight implementation.
In that sense, Diesel is mostly a style of implementing a project-specific DSL toolkit. Every instantiation of such a toolkit consists of these steps:
- Define the core elements of the DSL using Ruby structs, where the elements relate to each other just as their representations in the input language do (e.g., the root element could be a protocol specification, comprised of type definitions and arbitrarily many message specifications, which in turn are comprised of payload definitions).
- Extend the default tokenizer by any additional features you need (e.g., handling of special kinds of comments).
- Using previous instances of Diesel as a reference, write down the parse functions that reflect the production rules of the input language. If this is done in a clean fashion, the recursive structure of these functions should match the type recursions seen in the first step above.
- If desired, add validators that work on the internal model produced by a run of the parser.
- Implement one generator for each type of output artifact.
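The first step can be sketched in a few lines of Ruby. The struct and field names below are taken from the protocol example above but are otherwise assumptions, not Diesel’s actual definitions:

```ruby
# Sketch of step one: model the DSL's core elements as Ruby structs.
# The structs nest exactly like the corresponding constructs in the
# input language do (names and fields are illustrative assumptions).
Protocol = Struct.new(:name, :types, :messages)
TypeDef  = Struct.new(:name, :base)
Message  = Struct.new(:name, :payload)
Field    = Struct.new(:name, :type)

# A parse result is then just an ordinary Ruby object tree that
# validators and generators can walk with plain method calls.
model = Protocol.new(
  "Heating",
  [TypeDef.new("Temp", "int16")],
  [Message.new("SetTarget", [Field.new("target", "Temp")])]
)
```

Because the parse functions mirror this type recursion, the parser for such a model practically writes itself: one function per struct.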
In order to evaluate Diesel properly, we decided to start with a DSL for the description of the API of an event-driven system — a realistic application we know from many real projects.
To cut things short, these were our main results:
- Coding the tokenizer in Ruby is a five-minute job.
- Coding the parser is as straightforward as we had hoped.
- Helpful error messages for cases of bad input are easy to produce (where did it happen, what did the parser expect, what did it get instead?).
- Adding useful validators turns out to be a lot easier than in XText1, for example, because one has full access to both the complete model produced by the parser and the expressiveness of Ruby (thanks to Tobias Kniep for an educated comparison).
- Coding different generators (C code, build system configs) is easy.
- The entire package — tokenizer, parser, validator, generators — integrates very nicely with the gaudi/rake build system (thanks to Vassilis Rizopoulos for doing this part).
- The total amount of code written, and hence the entire footprint of the DSL toolkit, for this specific instance of Diesel is on the order of only ten kilobytes.
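To give an impression of the procedural generator style used so far, here is a deliberately tiny sketch (the model and output format are hypothetical, mirroring the kind of C output mentioned above, not the actual case-study code):

```ruby
# Sketch of a procedural generator: walk a simple model and emit
# a C header as a string. Model and output format are hypothetical.
Msg = Struct.new(:name, :id)

def generate_header(messages)
  lines = ["/* generated -- do not edit */"]
  messages.each { |m| lines << "#define MSG_#{m.name.upcase} #{m.id}" }
  lines.join("\n") + "\n"
end
```

Calling `generate_header([Msg.new("Ping", 1)])` produces a header with one `#define` per message — nothing more than straightforward string assembly over the model.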
A general question we did not resolve so far is whether the generators should be procedural (as they are now) or template-based (as many people prefer when writing code generators). This question is also a matter of taste, and it was not crucial to the evaluation of Diesel itself.
From Evaluation to Production
As of the writing of this report, we are in the process of taking Diesel into the real world: It will be used to generate protobuf specifications and C++/C# code from a DSL that describes the public test interface of a prototype for a highly integrated medical device.
This second application of the Diesel approach made us introduce a new feature to the tokenizer, namely the distinction of pure comments (which are simply stripped from the input on tokenization) from transparent comments (that is, inline documentation that is to be preserved in parts of the output produced by the generators). Again, because the design of Diesel is so minimalistic and we are in full control of our Ruby code, this feature was absolutely trivial to incorporate.
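To illustrate how small such a tokenizer extension can be, here is one possible way to draw the distinction. The concrete syntax (`#` for pure comments, `##` for transparent ones) is an assumption for illustration, not Diesel’s actual convention:

```ruby
# Hypothetical comment handling: '#' comments are stripped, while '##'
# comments become :doc tokens that generators can copy to the output.
# (The concrete comment syntax is an assumption, not Diesel's actual one.)
def scan_comments(line)
  case line
  when /\A\s*##\s?(.*)/ then [:doc, $1]  # transparent: keep the text
  when /\A\s*#/         then nil         # pure comment: dropped
  else [:code, line]                     # everything else goes to the parser
  end
end
```

The `:doc` tokens travel through the model alongside the elements they annotate, so each generator can decide for itself whether to reproduce them.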
Even though the detailed requirements of the message specification DSL and its generators are quite open at the moment, we are convinced that Diesel will do the job smoothly. After all, the DSL in question is no more complex than the event API DSL used in the first case study.