Sherman: A Compiler

Introduction

Sherman is compiler for a REBOL-like language. Depending on the back end, a program compiled with Sherman typically runs between e4 to e7 times faster than the same program interpreted by REBOL 1.0. (The native code back end is not supported, yet; expect performance in the e4 range unless you want to hack the native code back end yourself.)

Sherman is made "open-source" under a fairly liberal license. The various environments on which the runtime system is hosted have liberal licences, but they vary from system to system.

Sherman 0.5 is not a turnkey product. This is a shaky alpha release developed in a short time on an extremely limited budget. If you want a full-featured REBOL development environment, get it from REBOL Technologies. However, if you want performance and features that you cannot get from REBOL Technologies, and you don't mind getting your hands a little dirty, Sherman is for you.

Source Language

The source language for Sherman is very much like the REBOL language version 1.0. It is not identical. It is possible to write programs in REBOL that cannot be compiled by Sherman, and it is possible to write code for Sherman that cannot run in REBOL. However, the overlap is intentionally large, and many, if not most, REBOL 1.0 programs can be compiled with Sherman with little or no changes, and it is rather easy to write in a style acceptable to both REBOL and Sherman.

Sherman has the following features not found in REBOL 1.0:

Sherman does not support the following features of REBOL 1.0:

These are less onerous restrictions that one might expect. An analysis of the example code posted on the REBOL Technologies web site shows that control-flow constructs typically have literal blocks and that generating code on the fly at runtime is an unusual activity.

Should your application require these features, I'd suggest rethinking the application or sticking with a REBOL interpreter.

Intermediate Language

Sherman is unusual in that it does not compile directly to machine code but to an intermediate language. This was done for a number of reasons:

For Sherman 0.5, the intermediate language is Scheme. Scheme offers a wide variety of implementations and platforms. This allows you to select a runtime platform that provides the features you need. Those desiring a very small footprint or the ability to run on a palmtop computer might want to use SIOD, those wanting a persistant object oriented database might select Rscheme, those wanting the extremely high performance available with a closed-world compiler might want Stalin.

Sherman could be retargetted to use Common Lisp, C, ML, or Java as an intermediate language. C would be portable, but there would be the difficulty of writing a runtime system. Java offers an extraordinary amount of hype and would be an ideal back end for those whose primary interest is in generating attention.

Back End and Runtime Environment

Currently MzScheme from Rice University is the supported back end for Sherman. MzScheme offers a medium performance bytecode compiler, a small runtime footprint, and a rich environment for implementing the Sherman runtime system, including the ability to generate native compiled code and link in other compiled modules. MzScheme is under active development, and upcoming implementations will include COM and ActiveScripting interfaces to allow code compiled with Sherman to plug into other products.

The native code benchmarks for Sherman were derived from the Liar compiler for MIT Scheme. It is easy to use MIT Scheme as the back end for Sherman, but since it is no longer in active development, I'd rather focus attention on MzScheme.

The Stalin compiler assumes a "closed-world" model, i.e. that code is to be compiled into a monolithic executable and that new code will not be loaded into the runtime environment. This is the least flexible, but allows for extraordinary optimization to take place. Stalin has been known to generate code that performs better than the equivalent C code. Stalin currently only works under Unix, but a portable version should be available soon.

REBOL 2.0 Compatability

REBOL 2.0 has yet to be released, preliminary indications are that the language model of REBOL 2.0 will be quite different from that of REBOL 1.0. The easiest way of dealing with this may be to change the underlying host language. Depending on how the language features of REBOL evolve, changes could be made to the compiler and/or to the runtime host to track these features.

Sherman 0.5

Sherman 0.5 consists of several Scheme files. Some of these files are taken from other sources, like MIT Scheme (why re-invent the wheel?)

Instructions

First, install the latest version of MzScheme from Rice University. In the directory in which you install MzScheme, you will find a directory called "collects". Create a subdirectory in "collects" called "sherman".

Second, download the Sherman files as listed above. Copy the sherman collects files (block.ss, free.ss, runtime.ss, set.ss, sherman.ss, tables.ss, and stream.ss) into the "collects/sherman" directory you created in the previous step. If you are a MzScheme wizard, feel free to hack the collects path.

Third, get an executable version of r2s. You can compile r2s with FLEX, and then compile the resulting scanner with a C compiler, or you can download the latest executable for Windows.

Fourth, start up MzScheme and load up Sherman. I usually write a small Scheme startup script to do this so I can just start scheme.

Call (sherman "script.r") to compile your program. There are several intermediate files that are generated.

  1. .rs - This is a MzScheme readable file that is the result of translating your program with r2s.
  2. .ss - This is a Scheme program that Sherman generated.
  3. .zo - This is the byte code compiled final result.

You can load the byte-code compiled program into the Sherman runtime environment and call it from Scheme.

There is a batch file for windows that automates these last few steps. You can call

sherman compile trivial.r

sherman run trivial.zo

The first line will compile the file trivial.r into trival.zo, the second will run the compiled file.

The compiler prints out dots while it is working. Some relatively small files can be very hard to compile, so if you are getting an infinite amount of dots, check out the troublshooting guide below.

Three small rebol files are provided: ack.r, change.r and trivial.r After running these, the interpreter enters a read-eval-print loop. This is a Scheme interpreter so it will not accept non-Scheme input. Type "(quit)" to leave the interpreter.

Troubleshooting and FAQ

The compiler doesn't stop compiling, or the output file is huge.

The source language has an ambiguous syntax. For a given block of n elements, there are slightly less than (n + 1)! possible meanings. REBOL 1.0 determines the meaning incrementally at runtime, but Sherman attempts to deduce the meaning at compile time. When in doubt, it produces a conditional statement that handles all the possible meanings. For even a moderate sized block, this could easily exceed, say, the number of silicon atoms in the universe.

There are a few simple things that you can do that will reduce or even eliminate the ambiguity of the language. If you do these, the generated file will be the same size or smaller than the original.

Feature X is missing.

Write to me to ask, or add it yourself. The whole point of open source is that you can change the features of the language as you need them, rather than having to wait for some faceless corporation to decide that you need them.

Joe Marshall