April 2004

Regular Expressions: Rapid Development of An Assembler Using Python

by Miki Tebeka and Cameron Laird

In February, Regular Expressions profiled HLA as an example of how "high-level" a mature system for x86 assembly can become. This month, we turn several of those dimensions around, and look at a technique to use a high-level language to develop a lightweight assembler for an architecture that's not in the mainstream. Guest columnist Miki Tebeka of Zoran Corporation illustrates his approach with a model assembly language, which he implements as a Python-coded application. One of the bonuses Python provides is a strong pre-processor.

New assembly languages

Developing an assembler for the novel architectures with which Zoran works is generally tedious. The most common route is to port GNU's binutils, but this is an uninviting package that demands careful C coding. Read the "Preliminary Notes on Porting BFD" to appreciate the difficulty. It's not fun. By leveraging Python, though, we eliminate the need to write or maintain a special-purpose parser and lexer. Python makes it possible and even pleasant to construct a new assembler quickly.

The fundamental idea of this approach is to write in a syntax that can be interpreted as well-formed Python, while setting up the semantics of an assembler. Although the language looks different from traditional assembly languages, it's still easy to read. Each statement invokes an architecture-specific Python function created to implement the assembler. Each function then corresponds to a single machine-level assembly instruction.

We define a distinct class for each instruction. Instantiating a class saves the instruction's operands and appends the instance to a global COMMANDS array. Each class has a method called genbits(), which generates the machine code for that instruction. A concrete example makes this construction clear.

An example assembly language
Imagine a model processor with these characteristics:

  • Eight general registers r0-r7
  • ALU operators: move, add, sub
  • Memory operators: load, store
  • Unconditional jump: jmp

The binary format of each machine-language command is a 16-bit number. The three most significant bits are the instruction code; arguments follow. The registers are coded according to their number: r0 <- 0, and so on.

The result
A sample program looks like:

  add(r0, r2, r3)
  sub(r2, r4, r4)
  load(r2, 0x200)
  label("L1")
  move(r2, r7)
  jmp(L1)

Astute applications programmers will complain that this program never terminates! As a model for embedded programs, though, that's not necessarily such a bad fit; many embedded programs have a main loop that ends only when you switch off the power. In any case, all we need for now is a language large enough to illustrate the technique of assembling with Python, while still small enough to be immediately comprehensible.

Assembly

The technique takes advantage of Python's execfile() function. execfile() reads any file, and interprets it as Python source. It reminds some C programmers, for example, of #include, which also reads source files from the file system.

To make best use of execfile() in our scheme, we need to initialize the environment for the particular assembler we're defining. execfile() provides a mechanism for this — an optional globals argument, which sets the Python interpreter's "global" environment. In our case, we store assembly-language instructions in a global COMMANDS array (or "list", as Python programmers commonly call it).

Initializing the environment is easy enough:

    ENV = globals().copy()  # Clean user environment

Next, we enumerate the registers:

    for i in range(8):
        ENV["r%d" % i] = i

and the instructions:

    for op in (add, sub, move, load, store, label, jmp):
        ENV[op.__name__] = op

This allows our parse() function to be as simple as:

    def parse(fname):
        global COMMANDS

        execfile(fname, ENV, {})
        return COMMANDS
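
As a rough illustration of the same idea on a current interpreter, here is a minimal sketch written for Python 3, where execfile() no longer exists and exec() on compiled source takes its place. The function parse_text() and the stand-in add() are hypothetical names used only for this example; they are not part of asm.py.

```python
# Python 3 sketch: exec() stands in for Python 2's execfile().
COMMANDS = []          # instructions collected while the source runs

def add(src1, src2, dest):
    """Stand-in for the real add class: just record the operands."""
    COMMANDS.append(("add", src1, src2, dest))

# Build the environment the assembly source will see as its globals.
ENV = {"add": add}
for i in range(8):
    ENV["r%d" % i] = i             # r0 -> 0, r1 -> 1, ...

def parse_text(text, fname="<asm>"):
    """Run assembly source held in a string; fname only labels tracebacks."""
    exec(compile(text, fname, "exec"), ENV)
    return COMMANDS

result = parse_text("add(r0, r2, r3)\nadd(r1, r1, r7)\n")
print(result)   # [('add', 0, 2, 3), ('add', 1, 1, 7)]
```

The register names resolve to plain integers because they are ordinary entries in the dictionary passed as the globals of the executed source.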

Implementing instructions

The base class for all instructions is the ASM class:

    class ASM:
        '''Base ASM instruction'''
        def __init__(self):
            COMMANDS.append(self)

A class called ALU3 defines our three-operand instructions:

    OP_SHIFT = 13
    SLOT1_SHIFT = OP_SHIFT - 3
    SLOT2_SHIFT = SLOT1_SHIFT - 3
    SLOT3_SHIFT = SLOT2_SHIFT - 3
    
    class ALU3(ASM):
        def __init__(self, src1, src2, dest):
            ASM.__init__(self)
            self.src1 = src1
            self.src2 = src2
            self.dest = dest
    
        def genbits(self):
            return (self.code << OP_SHIFT) | \
                   (self.src1 << SLOT1_SHIFT) | \
                   (self.src2 << SLOT2_SHIFT)  | \
                   (self.dest << SLOT3_SHIFT)

With this definition, each concrete instruction class simply specifies its own instruction code:

    class sub(ALU3):
        '''`sub' operator'''
        code = 1

The full implementation in asm.py (Listing 1) shows that all the commands use the same scheme.
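
To see the bit packing at work, we can encode sub(r2, r4, r4) by hand. The sketch below repeats the shift constants from above and uses a hypothetical helper, encode_alu3(), that mirrors what genbits() computes; it is an illustration written for Python 3, not code from asm.py.

```python
OP_SHIFT = 13                    # opcode occupies bits 15-13
SLOT1_SHIFT = OP_SHIFT - 3       # bits 12-10
SLOT2_SHIFT = SLOT1_SHIFT - 3    # bits 9-7
SLOT3_SHIFT = SLOT2_SHIFT - 3    # bits 6-4 (bits 3-0 stay zero)

def encode_alu3(code, src1, src2, dest):
    """Pack an opcode and three register numbers into one 16-bit word."""
    return ((code << OP_SHIFT) | (src1 << SLOT1_SHIFT) |
            (src2 << SLOT2_SHIFT) | (dest << SLOT3_SHIFT))

# sub has code 1; r2 -> 2, r4 -> 4
word = encode_alu3(1, 2, 4, 4)
print("0x%04X" % word)            # 0x2A40
print(format(word, "016b"))       # 0010101001000000
```

Reading the bits from the left: 001 is the sub opcode, 010 is r2, 100 and 100 are r4, and the low four bits are unused.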

Labels

Labels are merely symbolic names for given locations. When an assembly-language program declares a label, we just add the new symbolic name to our environment, bound to the index of the next instruction:

    def label(name):
        ENV[name] = len(COMMANDS)
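
One consequence of this label() is worth spelling out: a label becomes a name only when its label() call runs, so a jump can refer only to a label declared earlier in the source. The Python 3 sketch below demonstrates this with a stand-in jmp() that merely records its target; the full listing may treat forward references differently, so read this as an assumption about the simplified version shown here.

```python
COMMANDS = []
ENV = {}

def label(name):
    """Bind name to the index of the next instruction."""
    ENV[name] = len(COMMANDS)

def jmp(addr):
    """Stand-in for the real jmp class: record the target address."""
    COMMANDS.append(("jmp", addr))

ENV.update(label=label, jmp=jmp)

# Backward reference: L1 exists by the time jmp(L1) runs.
exec('jmp(0)\nlabel("L1")\njmp(0)\njmp(L1)\n', ENV)
print(COMMANDS[-1])               # ('jmp', 1)

# Forward reference: L2 is looked up before label("L2") has run.
try:
    exec('jmp(L2)\nlabel("L2")\n', ENV)
    forward_ok = True
except NameError:
    forward_ok = False
print(forward_ok)                 # False
```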

Generating code

Python makes it easy to "twiddle bits". The language's "array" module is handy in this role. We use 16-bit assembly instructions and simply write the sequence of them to an external file:

    from array import array
    from sys import argv

    # Main operation
    commands = parse(argv[1]) # Parse input file
    a = array("H") # Use unsigned short (16bit)
    for cmd in commands:
        a.append(cmd.genbits())
    open(outfile, "wb").write(a.tostring()) # outfile: output file name

Note: You can call a.byteswap() before writing the output to change its endianness.
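
The array module stores values in the host's native byte order, so the same assembler can write different files on different machines. A minimal Python 3 sketch of forcing big-endian output follows; the two words are the first two of the dump shown in the next section, and tobytes() is the Python 3 name for the tostring() call used above.

```python
import sys
from array import array

words = [0x0130, 0x2A40]          # two 16-bit instruction words
a = array("H", words)             # "H" = unsigned short

# Swap on little-endian hosts so the output is always big-endian.
if sys.byteorder == "little":
    a.byteswap()

data = a.tobytes()                # raw bytes, ready to write to a file
print(data.hex())                 # 01302a40
```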

Viewing results

You can use any convenient hex editor to view the output .o file. Lead author Miki's favorite utility is xxd, which is available for most operating systems and works well with Vim (try :%!xxd and :%!xxd -r). Run xxd -c2 -b a.o, and the output from the demonstration program should be:

    0000000: 00000001 00110000  .0
    0000002: 00101010 01000000  *@
    0000004: 01101010 00000000  j.
    0000006: 01001011 10000000  K.
    0000008: 10100000 00000011  ..

Our little assembler is complete in less than a hundred lines of Python source. Next, let's review more advanced issues this assembly-language technique raises.

Error Handling

For error handling, we need file and line information. Luckily, Python's inspect module exposes the interpreter's call stack, from which we can retrieve such data. The following black-magic function returns the file and line of the statement currently being assembled:

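    from inspect import getouterframes, currentframe
    def here():
        '''Return (file, line) of the assembly statement being executed'''
        try:
            # Frames: 0 = here, 1 = ASM.__init__, 2 = subclass __init__,
            # 3 = the statement in the assembly source
            return getouterframes(currentframe())[3][1:3]
        except Exception:
            return "???", 0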

Then add the following line to ASM.__init__:

    self.file, self.line = here()
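
To make the frame lookup less mysterious, here is a small self-contained Python 3 sketch. It uses a hypothetical stand-in class, Instr, with a single __init__, and the made-up file name "demo.s"; the caller's frame is reached through f_back rather than by indexing getouterframes(), which is a simplification rather than the article's exact method.

```python
import inspect

LOCATIONS = []

class Instr:
    """Stand-in instruction that remembers where it was created."""
    def __init__(self):
        # f_back is the frame of the statement that called Instr().
        caller = inspect.currentframe().f_back
        self.file = caller.f_code.co_filename
        self.line = caller.f_lineno
        LOCATIONS.append((self.file, self.line))

source = "Instr()\n\nInstr()\n"   # instructions on lines 1 and 3
exec(compile(source, "demo.s", "exec"), {"Instr": Instr})
print(LOCATIONS)                   # [('demo.s', 1), ('demo.s', 3)]
```

Each additional __init__ between the assembly statement and the lookup adds one frame, so the number of steps outward has to match the depth of the class hierarchy.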

Let's define error and warning functions:

    def out(type, file, line, msg):
        '''Output message'''
        print "%s:%d: %s: %s" % (file, line, type, msg)

    def error(file, line, msg):
        '''Print error message'''
        out("error", file, line, msg)

    def warn(file, line, msg):
        '''Print warning message'''
        out("warning", file, line, msg)

Now, if we want to check whether an address fits into 16 bits, we can do the following:

    class MemOp(ASM):
        '''Memory operation'''
        def __init__(self, reg, addr):
            ASM.__init__(self)
            self.reg = reg
            if addr >= (1 << 16): # Check that address is valid
                warn(self.file, self.line, "0x%X too big, will truncate" % addr)
                addr &= ((1 << 16) - 1) # Mask all bits above 16
            self.addr = addr

We also want the interpreter to report errors it finds in the assembly source. The built-in SyntaxError exception and the traceback information from sys.exc_info() provide this capability:

    from sys import exc_info
    try:
        commands = parse(infile)
    except SyntaxError, e:
        error(e.filename, e.lineno, e.msg)
        raise SystemExit(1)
    except Exception, e:
        # Get last traceback and print it
        # Most of this code is taken from traceback.py:print_exception
        etype, value, tb = exc_info()
        while tb: # Find last traceback
            last = tb
            tb = tb.tb_next
        lineno = last.tb_lineno # Line number
        f = last.tb_frame
        co = f.f_code
        error(co.co_filename, lineno, e)
        etype = value = tb = None # Release objects (not sure this is required ...)
        raise SystemExit(1)
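
The SyntaxError branch can be tried in isolation. The following Python 3 sketch (note the "except ... as e" spelling) compiles source text without running it and formats any syntax error in the same file:line style as error() above. The helper check_syntax() and the file names are hypothetical, and the exact wording of the message depends on the Python version.

```python
def check_syntax(text, fname):
    """Return None if text compiles, else a 'file:line: error: msg' string."""
    try:
        compile(text, fname, "exec")
    except SyntaxError as e:
        return "%s:%d: error: %s" % (e.filename, e.lineno, e.msg)
    return None

good = check_syntax("add(r0, r2, r3)\n", "good.s")
bad = check_syntax("add(r0, r2, r3)\nsub(r2,, r4)\n", "bad.s")
print(good)                        # None
print(bad.split(": error: ")[0])   # bad.s:2
```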

Linking

Building a linker by itself is a significant task. However, in this scheme you may be able to avoid writing a linker. Python's "import" mechanism simply adds the assembly-language instructions, in order, to the COMMANDS array. In some environments, a linker is required to do little more than this.

Finishing Touches

A minimal command-line interface will give our assembler a more "professional" look. Python's new "optparse" module makes this a breeze:

    from os.path import splitext
    from optparse import OptionParser
    parser = OptionParser(usage="usage: %prog [options] FILE", version="0.1")
    parser.add_option("-o", "--output", help="output file", dest="outfile",
        default="")

    opts, args = parser.parse_args()
    if len(args) != 1:
        parser.error("wrong number of arguments") # Will exit
    infile = args[0]
    if not opts.outfile:
        opts.outfile = splitext(infile)[0] + ".o"

Conclusion

Python's built-in parser gives the ability to construct an assembler for a simple assembly language rapidly. If you have an occasion for an assembler, you can make your own, rather quickly, without having to design and implement a parser.

More than that, the great advantage of this assembler is that you have the power of Python as a pre-processing/macro system. You can define symbolic names and use them, read .ini files, and set conditional-compilation variables. You can also set such variables with the traditional -D command-line option: simply add another option to the command-line parser. For more ideas on how to modify and use this assembler, please see the sidebar.
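
A -D option might look something like the following Python 3 sketch, which still uses optparse. The helper apply_defines(), the rule that a bare NAME is set to 1, and the restriction to integer values are all assumptions made for illustration; in the assembler, the target dictionary would be the ENV passed to the source.

```python
from optparse import OptionParser

parser = OptionParser(usage="usage: %prog [options] FILE")
parser.add_option("-D", action="append", dest="defines",
                  metavar="NAME[=VALUE]", help="define NAME for the source")

def apply_defines(env, defines):
    """Add each NAME or NAME=VALUE to env; a bare NAME is set to 1."""
    for item in defines or []:
        name, _, value = item.partition("=")
        env[name] = int(value, 0) if value else 1
    return env

# Simulated command line (hypothetical file name and symbols)
opts, args = parser.parse_args(["-D", "DEBUG", "-D", "BASE=0x200", "prog.s"])
env = apply_defines({}, opts.defines)
print(env)                         # {'DEBUG': 1, 'BASE': 512}
```

Because the assembly source is Python, it can then test such a name with an ordinary if statement, or pass BASE straight to load().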

Miki Tebeka is a tool developer working at Zoran Corporation.
Cameron Laird is vice president of Phaseit, Inc, and contributes regularly to
UnixReview.com and Sys Admin magazine.

 

What's Next?

There are several important areas where we can enhance this assembler.

Output Format

If we use a conventional linker, it'll expect a standard format, such as "elf". This requires that we adjust the binary output to match "elf" format.

Another common output format is a "C" array. This is most often used in environments that expose a special application for loading code. In these cases, we teach our Python-coded assembler to output the "C" source directly, or use xxd's "-i" switch.
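
Emitting the "C" source directly takes only a few lines. In this Python 3 sketch, to_c_array() is a hypothetical helper, and the array name and element type are arbitrary choices rather than anything a particular loader requires.

```python
def to_c_array(words, name="program"):
    """Format 16-bit words as the source of a C array."""
    body = ", ".join("0x%04X" % w for w in words)
    return "const unsigned short %s[%d] = { %s };" % (name, len(words), body)

c_source = to_c_array([0x0130, 0x2A40, 0x6A00])
print(c_source)
# const unsigned short program[3] = { 0x0130, 0x2A40, 0x6A00 };
```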

Debug Information

To support debugging, we must emit debug information. Again, if we use a mainstream debugger (such as gdb), we need to emit that information in a format it understands; "stabs" is an example of such a format. A custom debugger can make an even simpler approach possible: we might use a file where each line corresponds to a program address (line 0 is address 0, line 1 is address 1, ...), and in each line we place file:line.

All it takes for our Python-coded assembler to provide this is:

  dbg = open("debug.txt", "w")
  for cmd in COMMANDS:
      print >> dbg, "%s:%s" % (cmd.file, cmd.line)
  dbg.close()
  

Making it "gas" compatible

To start the revolution toward easy assembly, it will help to make our assembler compatible with "gas" or another standard assembler. This way, we can plug it into an existing system and start replacing an old system step by step. There are several issues we need to address if we want to be gas-compatible:

  • A gas-compatible assembler must support all gas command-line switches. The most recent release appears to total 33 of these command-line switches!
  • We will also need to emit a "standard" output format, such as "elf".
  • Many systems run the C pre-processor first and the assembler operates on its output. This changes the way we compute file and line information since the C pre-processor emits file and line information in its output file.
  • We need to support a linker in our object files. Most of the work for this is to add relocation information to the object files.
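
On the pre-processor point, the markers that GCC's cpp writes have the form # LINE "FILE", which Python already treats as comments, so only the line bookkeeping needs attention. The Python 3 sketch below recovers the original file and line for each statement; map_lines() is a hypothetical helper, the sample text is invented, and the regular expression covers only this basic marker form.

```python
import re

LINEMARKER = re.compile(r'^#\s*(?:line\s+)?(\d+)\s+"([^"]*)"')

def map_lines(preprocessed):
    """Yield (file, line, text) for each non-marker line of cpp output."""
    fname, lineno = "<unknown>", 1
    for text in preprocessed.splitlines():
        m = LINEMARKER.match(text)
        if m:
            lineno, fname = int(m.group(1)), m.group(2)
            continue
        yield fname, lineno, text
        lineno += 1

sample = '# 1 "main.s"\nadd(r0, r2, r3)\n# 10 "macros.h"\nsub(r2, r4, r4)\n'
located = list(map_lines(sample))
print(located)
# [('main.s', 1, 'add(r0, r2, r3)'), ('macros.h', 10, 'sub(r2, r4, r4)')]
```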
