Here’s an idea.
Suppose every language compiler or interpreter worth your while starts providing you with a public library of AST — Abstract Syntax Tree — nodes for that language. Instead of just compiling or just interpreting, you could point the executable (or a library method) at a file of code in that language and say “please parse”, and you’re handed a tree of AST nodes in a standard interchange format, like JSON, XML or YAML.
Suppose this interchange format has some very thin but solid common ground — a way to represent the offset and length of each node in the file (including whitespace and comments), if the expression is indeed visible (accessing instance variables via implicit self/this or using an implicit type conversion are examples of constructs that aren’t directly visible but certainly exist; they often have subnodes), and good rules for nesting and cross-referencing to other nodes. A method invocation could refer to the ID of the method explicitly if known, or use of a variable could carry a reference ID to the variable’s definition.
Suppose that libraries were written for common languages to deal with the plain metal interchange format and handle any nodes in a generic way, but that specific libraries could stack on top and offer better semantics for a certain language/AST node combination, such that you could, say, write a Ruby script to parse and manipulate Objective-C, if only you had an Objective-C model or the knowledge to manually work out what to do with the Objective-C nodes.
Clarification: Since this wasn’t obvious, I’d like to point out that this is not a proposal to make everything have the same standardized AST model. That’ll fall apart quickly. I’m proposing rather the opposite – that the tree itself, and its facilities, and a standard serialization is invented, and that these trees can carry any AST node you dare think up from any AST model, but that common operations are done right once. (Clarification ends.)
Suppose that a general algorithm was built that was easy to any implementing library to adopt, that allowed creation of differential trees; where a copy of the original tree could be made, manipulated and the new representation molded in the shape of the original source code with the modifications in place, or a diff file generated.
If we did all that, and we should, we would finally be able to write intelligent code to manipulate code — cross-language. Consistently, using open and established formats. We know exactly why regular expressions fail in cases like refactoring; it’s because of ambiguity. Surgical precision is the domain of the parser part of the compiler or interpreter; advice that it alone can determine. (In most cases, if you can determine edge cases on par with the compiler/interpreter, you’ve just effectively built something that knows as much as a parser and probably spends most of its time doing parsing or resolving.) Why shouldn’t you be able to partake in this?
It’s 2009. Apple and the LLVM project are building a C/C++/Objective-C frontend, clang, that can outsource its knowledge of the parsed code. A future version of C# is planned to be able to host its own compiler and code generation through a set of code model objects, and the .NET DLR — Dynamic Language Runtime — runs entirely by compiling code down to expression trees and running them with polymorphic inline caching.
We’re getting there. This is a big effort, but it is also a big leap for programming kind. And if someone would start doing things like this, no one would be happier than I.
Moreover, I advise that the iPhone software platform must be opened.
The semantics of different languages would not be ironed out by using an AST representation, and for that matter different languages would usually require different AST’s, so I don’t see how this would be beneficial. What am I missing?
By Eric · 2009.02.27 01:30
What you’re missing is that each language would provide its own AST model with its own nodes, but that there would be a common way of solving common problems, like correlating a node with its text range in the source file. (Where did the
ifstart? Where did the parentheses around the condition fall? And so on.)I’m not proposing that everything is retrofit into the same AST model, rather the opposite, which is that the tree serialization is standardized and extensible. This in turn makes it possible for a base library to deal with the tree and libraries stacked on top to deal natively with various AST models, but since the base library supports the essentials, even if someone hadn’t ported a library supporting your target AST model (say, C#’s) to the language in which you intend to do the processing (say, Ruby), you could work with the raw nodes and cobble together what you’d need.
By Jesper · 2009.02.27 01:37
A related project: the Parrot Virtual Machine, which is meant to be a target bytecode, AST, and smart-assembly for dynamic languages in general. Last time I looked at the recommended compiler structure, you could hand any of the three to the system and have it compile it down to bytecode and run it for you.
It’s not quite as general as your idea, but it’s an interesting and relevant project nonetheless.
By Jordy/Jediknil · 2009.02.28 11:09
That’s true, but it doesn’t have to do with transforming source code using ASTs. Which is to say that it’s interesting, but it is related only in that it uses ASTs.
By Jesper · 2009.02.28 13:25