lupus is currently certified at Master level.

Name: Paolo Molaro
Member since: 1999-12-29 16:53:59
Last Login: 2011-11-11 09:50:34


No personal information is available.


Recent blog entries by lupus

The MS CLR 4.0 catches up to a Mono 1.0 feature

This morning Google Reader brought up this blog post about a technique used internally by the MS CLR version 4.0.
It is basically a way to share complex code between different architectures, thus reducing the maintenance burden and allowing better compatibility in the tricky area of marshalling (when the runtime goes from managed code to unmanaged code it needs to perform many operations, including massaging the data exchanged between the two worlds).

At the beginning Mono had specialized x86 chunks of code to do the work, but we soon realized the problems with that.
At the time I proposed to use a small, specialized bytecode set (after all the operations involved are very few), which would have been great for the interpreter I was working on.
Dietmar instead pushed for the use of IL bytecode: this would have been a slower solution for the interpreter, but it would allow access to a wider set of operations and it would reduce the burden on the JIT backend, which, after all, already had an IL frontend.
We went for the IL solution (the time frame was the beginning of 2002) and we're glad to see that MS chose to adopt the same technique a few years later.

There are two important considerations about this story in the context of Mono development:

  • we started to always choose techniques that would favour the JIT over the interpreter, both in terms of speed and maintenance burden. This was a path that eventually led to the discontinuation of the interpreter.
  • we built a codebase that is easily portable across architectures, to the point that we consider a JIT port only slightly more complicated than an interpreter port (because of the many features of Mono/CLR even an interpreter needs quite a bit of very low-level knowledge about how an architecture works)

As an extension, we later used the same IL-based technique to implement many other runtime helper methods that would otherwise have had to be written in low-level architecture-specific code: remoting helpers, garbage collection fast paths, delegate runtime methods etc.
For the curious, most of this code is in metadata/marshal.c in the Mono source code.

Novell hack week

One of the things that has been sitting on my TODO list for a few years is improving the performance of the Regex engine in Mono, both by speeding up the interpreter and by compiling the regular expressions to IL code so that the JIT can optimize them. This seemed like a good project for hack week, and Zoltan joined me in doing the implementation.

I worked on a new compiler/interpreter combination that uses simplified opcodes, with the aim of making their execution faster and also making the translation to IL code easier. As an example, the old interpreter used a 'Char' opcode to match a single char, but this opcode has several runtime options to ignore the case, negate the assertion and go backward in the string. Each option required decoding and conditional branching at runtime, so removing this overhead should improve performance significantly.

Zoltan made a clever observation: he could reuse the new interpreter for bytecodes that the IL engine he was working on couldn't yet handle, so he used dynamic methods with the same signature as the interpreter method that evaluates the regex bytecode. This design also has the nice property that compiled regular expressions will be garbage collected (as opposed to the MS runtime which, at least in version 1.1, leaks in this case).

IL-compiling regular expressions has several benefits: it completely removes dispatch and decoding overhead and the JIT will use very fast instructions like compare with immediate to implement the 'Char' bytecode.

I used a few microbenchmarks to test the speed of the new engines, but I'll report here just the results of running the regexdna test from the language shootout: in other cases the speedup is even bigger.

  • Old interpreter: 10.1 seconds
  • New interpreter: 5.4 seconds
  • IL engine: 1.3 seconds

Most of the new code is in svn, though it's not enabled since it's still incomplete. We'd need another week or so to make it usable in place of the old engine, and we haven't yet allocated the time to complete this work, but it sure looks promising.

The future of C#

C# as a programming language is still young and has evolved nicely in the last few years: most of the new features take it closer to a high-level language, allowing shorter and nicer code for common patterns. A language needs to be able to cover a wide spectrum of usages to be considered mature, but there is one side of the spectrum that has been neglected in C# for far too long: the low-level side, close to IL code.

Someone has already hacked up a program that uses a preprocessor and ilasm to inject IL into C# code, but this approach has many drawbacks (too many to list here).

Inline IL code should integrate with the rest of the C# program as closely as possible, allowing one, for example, to reference labels defined in C# code from branches in IL code, or to use the names of local variables and arguments with opcodes like ldloc and ldarg.

The proposal here is to allow IL code as a new statement in the form:

  unsafe @"IL code here";

This is similar to the traditional way inline assembly has been done (gcc's __asm__ ("code") statement); it's very unlikely to clash with other possible uses of the unsafe keyword, and it also conveys the notion that IL code may break type safety, IL language rules etc. It also has the added property that all the code needed to implement this support could easily be copied into a separate library and used in standalone programs to, say, simplify code emission for Reflection.Emit (this inline IL support has been implemented inside the Mono C# compiler, so it's C# code that uses Reflection.Emit as the backend).

So, without further ado, the standard sample program written with inline IL:

class InlineIL {
  static void Main () {
    unsafe @"
      ldstr ""Hello, world!""
      call void class [mscorlib]System.Console::WriteLine(string)
    ";
  }
}
OK, that kind of code is more easily written in C# proper, so what about things that IL code can do but C# code can't? Ever wanted to change the value of a boxed integer? In C# you can't, but it is very easy with inline IL:

static void ChangeInt (object intval, int new_value) {
  unsafe @"
    ldarg intval
    unbox [mscorlib]System.Int32
    ldarg new_value
    stind.i4
  ";
}
The following code will print 2:

  object boxed = 1;
  ChangeInt (boxed, 2);
  Console.WriteLine (boxed);
Of course you can access types and fields defined by the C# program currently being compiled, consider the following example:

class InlineIL {
  static int field1;
  static void Main () {
    int localvar = 1;
    unsafe @"
      ldloc localvar
      stsfld int32 class InlineIL::field1
    ";
  }
}
Note that in this case the compiler won't emit warnings about field1 and localvar never being used, and of course you'll get an error if you misspell the field name in the IL code, just as you would in C# code.

The main usage of the new feature would be for some corlib methods in Mono, or for more easily implementing test cases for the JIT and runtime test suites: specific IL code patterns (that can't be expressed in C#, or that C# is not guaranteed to compile to the expected code) can be written easily, while the rest of the boilerplate needed by the unit testing program can be written in much more readable C#. That said, this opens many possibilities for a creative hacker, finally free of the constraints of C#.

Happy hacking!

Memory savings with magic trampolines in Mono

Mono is a JIT compiler and as such it compiles a method only when needed: the moment the execution flow requires the method to execute. This mode of execution greatly improves startup time of applications and is implemented with a simple trick: when a method call is compiled, the generated native code can't transfer execution to the method's native code address, because it hasn't been compiled yet. Instead it will go through a magic trampoline: this chunk of code knows which method is going to be executed, so it will compile it and jump to the generated code.

The way the trampoline knows which method to compile is pretty simple: for each method a small specific trampoline is created that will pass the pointer to the method to execute to the real worker, the magic trampoline.

Different architectures implement this trampoline in different ways, but always with the aim of reducing its size: many trampolines are generated, so they use quite a bit of memory.

Mono in svn has quite a few improvements in this area compared to mono 1.2.5 which was released just a few weeks ago. I'll try to detail the major changes below.

The first change is related to how the memory for the specific trampolines is allocated: this is executable memory, so it is not allocated with malloc but with a custom allocator, called the Mono Code Manager. Since the code manager is used primarily for methods, it allocates chunks of memory aligned to multiples of 8 or 16 bytes depending on the architecture: this allows the CPU to fetch the instructions faster. But the specific trampolines are not performance critical (we'll spend lots of time JITting the method anyway), so they can tolerate a smaller alignment. Considering that most trampolines are allocated one after the other and that on most architectures they are 10 or 12 bytes, this change alone saved about 25% of the memory used (they used to be aligned up to 16 bytes).

To give a rough idea of how many trampolines are generated I'll give a few examples:

  • MonoDevelop startup creates about 21 thousand trampolines
  • IronPython 2.0 running a benchmark creates about 17 thousand trampolines
  • a "hello, world" style program creates about 800
This change in the first case saved more than 80 KB of memory (plus about as much again, because reviewing the code also allowed me to fix a related overallocation issue).

So reducing the size of the trampolines is great, but it's really not possible to shrink them much further, if at all. The next step is simply not to create them in the first place.
There are two primary ways a trampoline is generated: a direct call to the method is made, or a virtual table slot is filled with a trampoline for the case when the method is invoked through a virtual call. I'll note here that in both cases, after compiling the method, the magic trampoline makes the changes needed so that the trampoline is not executed again and execution goes directly to the newly compiled code. In the direct call case the callsite is changed so that the branch or call instruction transfers control to the new address. In the virtual call case the magic trampoline changes the virtual table slot directly.

The sequence of instructions used by the JIT to implement a virtual call are well-known and the magic trampoline (inspecting the registers and the code sequence) can easily get the virtual table slot that was used for the invocation. The idea here then is: if we know the virtual table slot we know also the method that is supposed to be compiled and executed, since each vtable slot is assigned a unique method by the class loader. This simple fact allows us to use a completely generic trampoline in the virtual table slots, avoiding the creation of many method-specific trampolines.

In the cases above, the number of generated trampolines goes from 21000 to 7700 for MonoDevelop (saving 160 KB of memory), from 17000 to 5400 for the IronPython case and from 800 to 150 for the hello world case.

I'll describe more optimizations (both already committed and forthcoming) in the next blog posts.

Mono GC updates

I guess some of you expected a blog entry about the generational GC in Mono, given the title. From my understanding, many expect the new GC to solve all the issues they think are caused by the GC, and so they await it with trepidation.
As a matter of fact, from my debugging of all or almost all of those issues, the existing GC is not the culprit. Sometimes there is an unmanaged leak, sometimes a managed or unmanaged excessive retention of objects, but basically 80% of the issues that get attributed to the GC are not GC issues at all.
So, instead of waiting for the holy grail, provide test cases or as much data as you can for the bugs you experience, because chances are the bug can be fixed relatively easily, without waiting for the new GC to stabilize and get deployed.
Now, this is not to say that the new GC won't bring great improvements, but those improvements are mainly in allocation speed and mean pause time; both, while measurable, are not bugs per se, and so are not among the few issues people hit with the current Boehm-GC based implementation.

After the long introduction, let's get to the purpose of this entry: Mono in svn can now perform an object allocation entirely in managed code. Let me explain why this is significant.

The Mono runtime (including the GC) is written in C code and this is called unmanaged code as opposed to managed code which is all the code that gets JITted from IL opcodes.
The JIT and the runtime cooperate so that managed code is compiled in a way that lets the runtime inspect it, inject exceptions, unwind the stack and so on. Unmanaged code, on the other hand, is compiled by the C compiler, and on most systems and architectures there is no information available about it that would allow the same operations. For this reason, whenever a program needs to make a transition from managed code to unmanaged code (for example for an internal call implementation or for calling into the GC), the runtime needs to perform some additional bookkeeping, which can be relatively expensive, especially if the amount of code to execute in unmanaged land is tiny.

For a while now we have made use of the Boehm GC's ability to allocate objects in a thread-local fast path, but we couldn't take full benefit of it because the cost of the managed-to-unmanaged transition and back was bigger than the allocation cost itself.
Now the runtime can create a managed method that performs the allocation fast-path entirely in managed code, avoiding the cost of the transition in most cases. This infrastructure will be also used for the generational GC where it will be more important: the allocation fast-path sequence there is 4-5 instructions vs the dozen or more of the Boehm GC thread local alloc.

As for actual numbers, a benchmark that repeatedly allocates small objects is now more than 20% faster overall (overall includes the time spent collecting the garbage objects, the actual allocation speed increase is much bigger).



lupus certified others as follows:

  • lupus certified lupus as Journeyer
  • lupus certified joey as Journeyer
  • lupus certified davidw as Journeyer
  • lupus certified cgabriel as Apprentice
  • lupus certified dhd as Journeyer
  • lupus certified timj as Journeyer
  • lupus certified shawn as Journeyer
  • lupus certified martin as Journeyer
  • lupus certified blizzard as Journeyer
  • lupus certified alan as Master
  • lupus certified Telsa as Apprentice
  • lupus certified jamesh as Journeyer
  • lupus certified miguel as Master
  • lupus certified federico as Master
  • lupus certified hp as Master
  • lupus certified yosh as Master
  • lupus certified notzed as Journeyer
  • lupus certified vincent as Journeyer
  • lupus certified srivasta as Journeyer
  • lupus certified branden as Journeyer
  • lupus certified jgg as Journeyer
  • lupus certified ciro as Journeyer
  • lupus certified panta as Journeyer
  • lupus certified sama as Apprentice
  • lupus certified antirez as Journeyer
  • lupus certified paci as Journeyer
  • lupus certified joke as Apprentice
  • lupus certified dido as Apprentice
  • lupus certified eugenia as Journeyer
  • lupus certified mvo as Journeyer
  • lupus certified Skud as Master
  • lupus certified merlyn as Master
  • lupus certified chip as Master
  • lupus certified Simon as Journeyer
  • lupus certified ebizo as Apprentice
  • lupus certified jackson as Journeyer
  • lupus certified duncanm as Journeyer
  • lupus certified harinath as Journeyer

Others have certified lupus as follows:

  • lupus certified lupus as Journeyer
  • LotR certified lupus as Journeyer
  • timj certified lupus as Apprentice
  • dhd certified lupus as Journeyer
  • cgabriel certified lupus as Journeyer
  • davidw certified lupus as Master
  • feldspar certified lupus as Journeyer
  • joey certified lupus as Journeyer
  • bombadil certified lupus as Journeyer
  • marcel certified lupus as Journeyer
  • vincent certified lupus as Master
  • broonie certified lupus as Journeyer
  • srivasta certified lupus as Journeyer
  • branden certified lupus as Journeyer
  • jamesh certified lupus as Journeyer
  • bma certified lupus as Journeyer
  • Joy certified lupus as Journeyer
  • ciro certified lupus as Journeyer
  • chaos certified lupus as Journeyer
  • jpick certified lupus as Journeyer
  • csurchi certified lupus as Journeyer
  • Denny certified lupus as Journeyer
  • dido certified lupus as Master
  • panta certified lupus as Journeyer
  • antirez certified lupus as Journeyer
  • joke certified lupus as Journeyer
  • eugenia certified lupus as Master
  • exa certified lupus as Master
  • ebizo certified lupus as Master
  • baux certified lupus as Journeyer
  • timur certified lupus as Master
  • fxn certified lupus as Master
  • etbe certified lupus as Master
  • ewsdk certified lupus as Master
  • kiwnix certified lupus as Master
  • mdupont certified lupus as Journeyer
  • jackson certified lupus as Master
  • cesar certified lupus as Master
  • lypanov certified lupus as Master
  • jluke certified lupus as Master
  • freax certified lupus as Master
  • pvanhoof certified lupus as Master
  • zbowling certified lupus as Master
  • mkestner certified lupus as Master
  • jserv certified lupus as Master
  • andrea certified lupus as Master
  • cinamod certified lupus as Master
  • shana certified lupus as Master

