I’ll start this blog with something mundane to avoid arousing any unwarranted excitement.

I’ve spent some time recently looking at the ELF binary format for Linux executables, as well as the DWARF format used for storing debugging information. In my gut, I feel like there’s a lot of interesting information to be mined here, particularly for program analysis. Some possible ideas:

  • Write an infrastructure similar to CIL, but operating on binaries instead of C code. (I’ll write more about this topic in the future.) This approach has several advantages. First, you’re seeing the actual code that the machine executes, rather than the CIL parser’s own interpretation of the source. Second, you don’t have to go to the trouble of inserting any tools into the build process; you just need to ensure that the program is compiled with “-g”. This is especially important for bigger programs that use lots of libraries. Third, it’s easier to support other languages, particularly C++.
  • Write a program understanding tool to look up definitions and uses of symbols, fields, macros, etc. DWARF includes all the necessary information. Eventually such a tool could be integrated with emacs like ctags currently is. I pretty much want something like cscope, but without having to worry about the fidelity of cscope’s “fuzzy parser.”
  • Generate stack maps for precise garbage collection. Currently the only GC that works with GCC by default is Boehm’s, which is conservative and not particularly snappy in my experience. Accurate garbage collection requires a map describing all the locations on the stack containing pointers. GCC does not generate such maps, and modifying it to do so would probably be hard. LLVM claims that they support accurate collection, but their GC page suggests that they do so via a shadow stack, which is no better than what GCC offers. Instead, I would like to use DWARF information to figure out where the pointers are. I’ve done a few tests at -O3 to see how much information GCC actually preserves. As far as I can tell, it gives you everything you need, even after optimizations like function inlining and scalar replacement of aggregates, which you might expect to be problematic.
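The shadow-stack scheme mentioned in the last bullet (the one LLVM’s GC page describes, and the thing DWARF stack maps would let you avoid) is easy to sketch. Here is a toy version in Python; the names and structure are illustrative, not any real GC’s API. Each “function” pushes a frame listing its pointer-valued locals, so a collector can enumerate roots without any compiler-generated maps:

```python
# Toy sketch of a shadow stack: a side structure, maintained by
# instrumented code, that records which locals hold pointers.
shadow_stack = []

def push_frame(roots):
    """Called on function entry with that frame's pointer locals."""
    shadow_stack.append(roots)

def pop_frame():
    """Called on function exit."""
    shadow_stack.pop()

def gc_roots():
    """Walk every registered frame, as a collector would at a GC."""
    return [r for frame in shadow_stack for r in frame]

# Usage: a "function" registers its two heap references as roots.
obj_a, obj_b = object(), object()
push_frame([obj_a, obj_b])
assert gc_roots() == [obj_a, obj_b]
pop_frame()
assert gc_roots() == []
```

The cost is obvious from the sketch: every call pays for the push/pop bookkeeping, which is exactly why reading root locations out of DWARF instead is attractive.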

Perhaps none of these ideas will pan out, but I think it’s an interesting area to explore. There’s a lot of research on binary analysis and instrumentation in academia (mostly for security and optimization), but mostly the code seems to be proprietary. I’ve already developed some simple tools in this direction. I plan to release them when they mature a little. I’ll try to post on this tomorrow.

Now to the topic at hand. I was disassembling some code the other day (just using objdump, nothing fancy) and I noticed a weird pattern emitted by GCC.

billm@plums:~$ objdump -d /bin/ls | less
8049ba9:       8d bc 27 00 00 00 00    lea    0x0(%edi,%eiz,1),%edi

This pattern appears all over the place (though sometimes with other registers substituted for edi). Mostly I’ve noticed it in between functions (after one returns and before the next begins) but it also appears within function bodies.

What does this instruction do? The weirdo syntax makes it particularly difficult to determine. Apparently %eiz is a pseudo-register that just evaluates to zero at all times (like r0 on MIPS). I decoded the same instruction using an older version of objdump, and it instead gave the more normal result “lea 0x0(%edi),%edi”. I think the binutils people must pull a lot of all-nighters to make their crazy AT&T syntax even more inscrutable with each new version. Why can’t we just use Intel syntax? It seems way more intuitive to me.

Returning to the problem at hand, the instruction “lea 0x0(%edi),%edi” still doesn’t make much sense. Why would you ever want to load the effective address of [edi] into edi? This is semantically equivalent to “mov edi, edi”, or, more simply, to a NOP.
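You can check this reading by decoding the seven bytes from the objdump output by hand. A quick sketch in plain Python: the ModRM byte says the destination is edi and an SIB byte follows; the SIB byte’s index field is 100, the “no index register” encoding, which newer binutils renders as %eiz; the displacement is zero. So the whole thing computes edi = edi + 0.

```python
# The 7-byte pattern from objdump: 8d bc 27 00 00 00 00
insn = bytes([0x8D, 0xBC, 0x27, 0x00, 0x00, 0x00, 0x00])

opcode = insn[0]                      # 0x8d = LEA r32, m
modrm = insn[1]
mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
assert opcode == 0x8D
assert mod == 0b10                    # a 32-bit displacement follows
assert reg == 0b111                   # destination register: edi
assert rm == 0b100                    # an SIB byte follows

sib = insn[2]
scale, index, base = sib >> 6, (sib >> 3) & 7, sib & 7
assert index == 0b100                 # "no index" -- shown as %eiz
assert base == 0b111                  # base register: edi

disp = int.from_bytes(insn[3:7], "little")
assert disp == 0                      # displacement 0x0

print(f"lea 0x{disp:x}(%edi,%eiz,{1 << scale}),%edi")
```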

I eventually found a mailing list post by binutils guru Ian Lance Taylor that reveals the answer. Sometimes GCC inserts NOP instructions into the code stream to ensure proper alignment and stuff like that. The NOP instruction takes one byte, so you would think that you could just add as many as needed. But apparently it’s faster for the chip to execute one long instruction than many short ones. So rather than inserting seven NOP instructions, GCC instead uses one bizarro LEA, which takes up seven bytes and is semantically equivalent to a NOP.
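As an aside, later toolchains stopped needing the LEA trick: Intel’s manuals define dedicated multi-byte NOP encodings (NOP with a memory operand, opcode 0f 1f /0) for exactly this padding job. A quick sanity check that each recommended form has the advertised length:

```python
# Intel's recommended multi-byte NOP encodings, one instruction per
# padding size from 1 to 9 bytes (from the Intel SDM).
multi_byte_nops = {
    1: bytes([0x90]),
    2: bytes([0x66, 0x90]),
    3: bytes([0x0F, 0x1F, 0x00]),
    4: bytes([0x0F, 0x1F, 0x40, 0x00]),
    5: bytes([0x0F, 0x1F, 0x44, 0x00, 0x00]),
    6: bytes([0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00]),
    7: bytes([0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00]),
    8: bytes([0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00]),
    9: bytes([0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00]),
}
for n, encoding in multi_byte_nops.items():
    assert len(encoding) == n
```

So a modern disassembly shows “nopl 0x0(%eax,%eax,1)” and friends where this post’s GCC emitted the LEA.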

I must admit to being a bit confused still. Most of the time, these semantic NOPs are never executed, so you might as well use regular NOPs. Occasionally, the LEAs do occur in places where they might be executed, but in those cases I don’t always understand why a NOP is needed. Sometimes the next instruction after the NOP is a branch target, and so I can understand that the compiler might want the CPU to start loading those instructions on a cache line boundary if possible. But sometimes the instruction after the NOP is not a branch target. In that case, why is there a random NOP in the instruction stream? It just seems wasteful of cache. Perhaps someone can enlighten me.

Anyway, wasn’t that interesting?


One Response to “Why Does GCC LEA EIZ?”

  1. Leonardo Santagada says:

    In PyPy they tried to track all allocations without using a shadow stack, I think by using some inline asm with comments to track the registers being used, but it relies on markers that only GNU as inserts on Linux, so there is no way to run it on Mac OS X.

    Maybe with debugging information it is possible to do this on Macs (and maybe one day on Windows). I will try to talk to the PyPy people about it, thanks for the idea :)
