High Performance Delphi

Additions and Updates

Devoted to making Delphi the high performance computing tool of choice.


Make your voice heard by Borland!

Borland has recently launched Quality Central, where you can report bugs and, more importantly, make suggestions. You can also "rate" other suggestions and even assign special "votes" to suggestions that you think are especially important. I have already submitted a few performance-related suggestions and will continue to add more. Feel free to add your own, but I recommend that you keep them as specific as possible. (Adding "make optimization better" will undoubtedly get a lot of high ratings but I suspect that not much will get done with it.) If you have an idea but are unsure of exactly what to submit, just e-mail it to me and I'll try to work it into shape.

In any case, it is extremely important that you get in there and show your support for both Quality Central and performance-related issues. Note that the downloadable client is still a work in progress even though the database itself is fully operational. This means that you will probably need to devote a small amount of time to getting to know how it works.

Delphi Version 7 optimization developments

News from BorCon 2002: Delphi version 7 is to be ".Net" oriented. Whether you view this as a good thing or a bad thing, it apparently means that the compiler is being modified. Significantly, there has been some vague talk by Danny Thorpe about application speed and compiler optimization. Interestingly, he made a comment about what amounts to automatic function inlining. (My words not his.) From the description, it looks like small routines will be inlined without any intervention whatsoever. Also there was some talk about a code profiling tool being included. See full story.

Cache miss penalties on the rise

I have not had a chance to completely scope this out, but there are strong indications that newer processors really dislike cache misses during data access. Mark Horridge did a quick comparison on a new Athlon with DDR memory and found that it was actually slower than older machines when you force a lot of cache misses. Take-home point: Structure your data and data access properly.
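As a rough illustration of what "structuring your data access properly" can mean, here is a minimal sketch that sums the same matrix in two orders. The matrix size `N` and the routine names are made up for this example, and no timings are claimed; the only point is that the first loop touches memory sequentially while the second jumps a full row between accesses.

```pascal
program CacheWalk;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal accept Delphi syntax
{$APPTYPE CONSOLE}

const
  N = 512;  // hypothetical size, chosen only for illustration

type
  TMatrix = array[0..N - 1, 0..N - 1] of Double;

var
  M: TMatrix;  // global, so the 2 MB array does not live on the stack
  i, j: Integer;

// Last index varies fastest: matches the row-major layout of static arrays,
// so consecutive accesses fall on the same or neighbouring cache lines.
function SumRowMajor: Double;
var
  r, c: Integer;
begin
  Result := 0;
  for r := 0 to N - 1 do
    for c := 0 to N - 1 do
      Result := Result + M[r, c];
end;

// Same arithmetic, but each access is N * SizeOf(Double) bytes away from
// the previous one, which invites a cache miss on nearly every read.
function SumColumnMajor: Double;
var
  r, c: Integer;
begin
  Result := 0;
  for c := 0 to N - 1 do
    for r := 0 to N - 1 do
      Result := Result + M[r, c];
end;

begin
  for i := 0 to N - 1 do
    for j := 0 to N - 1 do
      M[i, j] := 1.0;
  WriteLn(SumRowMajor:0:0);     // both calls give the same total
  WriteLn(SumColumnMajor:0:0);
end.
```

Wrapping each call in your profiler of choice should show how large the gap is on a given machine.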

Delphi earning its keep

Mark Vaughan of Numerical Methods in Pascal has put together a list of real world science and engineering apps written in Delphi/Pascal. And you thought you were the only one using Delphi for real work. He is still expanding the list, so go ahead and drop him a note on what you are doing.

Minor Site Reorganization

I reorganized the site mapping. Look for the "Code" link on the navigation bar. It now contains the examples, memory manager, and distributed computing links.

The Guido Pages

I am pleased to announce that Guido Gybels’ pages on floating point types and Assembler are available again for public consumption. Check them out, you may just learn something today.

Delphi 6 is here!

That is, I finally have it. Below are a few quick observations that you may want to know about.

Major changes to System.pas

System.pas and many of the other RTL units have been revamped for compatibility with Kylix. It is going to take me a while to get through them all but here are a few quick bits.

Trunc/Frac modified

These routines had increasingly become a source of problems because they changed the entire FP control word rather than just the rounding flag. This has been fixed. However, this also makes them even slower.

Version 6 variants are slow

The new variant model introduced in Version 6 is slower than the pre-Version 6 one. A few quick tests indicate that the new variants can easily be a factor of 3 slower. Borland is "looking into the issue". This is particularly bad news as the much awaited complex variable type is actually implemented as a variant.

Complex/Custom Variants are really slow

So how bad is it? Something like 30 times slower than normal variants, 90 times slower than D5 variants, and probably something like 200-1000 times slower than a non-variant approach. Quite a penalty for more elegant equations. More on this after I do some extensive testing.

Many new functions in Math.pas

It looks like Earl F Glynn and Ray Lischner have helped beef up this unit substantially. Many of the additions are "shortcut" routines to simplify the dreary task of comparing/checking floating point numbers plus some other stuff that I probably missed.

New version of the memory manager

I am pleased to say that the memory manager has been taking a pounding on real world apps with hundreds of threads, gigabytes of allocations, and even 8 CPUs! As a result, a couple of low frequency bugs have emerged and been corrected.

Also, due to popular demand I am looking into making a new version that will be slower but more backward compatible with the built-in manager. The goal is for it to use the same block architecture (hopefully making it compatible with MemProof, MemSleuth etc.) and free unused memory back to the OS.

I've moved

...and I have lost all my old e-mails. So if you sent me any feedback in the last couple months and I did not respond it is because I lost it. Send it again if you can as I appreciate the feedback.

Old News

Additions and Updates for Dec 2000

Updated Dec 14: Visual C++ vs Delphi Case Study: TSCP computer chess program

As part of a benchmarking undertaking I have performed a literal translation of a computer chess program written in C.

Added: C to Pascal Converted computer Chess Program

As part of a benchmarking undertaking I have performed a literal translation of a computer chess program written in C. More on this later.

Added: Problems with testing Method Pointers

For no particularly obvious reason, assigned(MethodPointer) does a 16bit (word) sized check for zero rather than the much more efficient 32bit (dword) check.
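One possible way to sidestep the check is to look at the code half of the method pointer yourself. The sketch below is only an illustration: `TNotify`, `TWorker` and `IsAssigned` are hypothetical names, and whether the compiler emits a full 32-bit compare for this form is an assumption you should confirm in the CPU window.

```pascal
program MethodCheck;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal accept Delphi syntax
{$APPTYPE CONSOLE}

uses
  SysUtils;  // some older Delphi versions declare TMethod here, not in System

type
  TNotify = procedure of object;  // hypothetical method pointer type

  TWorker = class
    procedure DoWork;
  end;

procedure TWorker.DoWork;
begin
end;

// A method pointer is a (Code, Data) pair; testing Code alone is a plain
// pointer comparison against nil.
function IsAssigned(const M: TNotify): Boolean;
begin
  Result := TMethod(M).Code <> nil;
end;

var
  W: TWorker;
  Handler: TNotify;
begin
  Handler := nil;
  WriteLn(IsAssigned(Handler));    // FALSE
  W := TWorker.Create;
  try
    Handler := W.DoWork;
    WriteLn(IsAssigned(Handler));  // TRUE
  finally
    W.Free;
  end;
end.
```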

Updated: Small change made to Memory Manager

A rarely occurring problem with Windows not allocating memory when it should has been, well, covered up.

Additions and Updates for May 2000

Updated (May 15): Quexal Link

This is a very interesting MMX/SSE code generator that is compatible with Delphi (i.e. it generates db statements). This product allows you to write MMX/SSE code even if you don't know assembler or the MMX instructions. Definitely worth a look.

Updated: Trunc vs Round

Note: Round() uses the IEEE standard "round to nearest or even" method, not the commonly used "round to nearest or larger" method.
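The difference only shows up on exact halves. The sketch below assumes the default FPU rounding mode is in effect; `RoundHalfUp` is a hypothetical helper written for this example, not an RTL routine.

```pascal
program RoundDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal accept Delphi syntax
{$APPTYPE CONSOLE}

uses
  Math;  // for Floor

// Hypothetical helper: halves always go to the larger neighbour.
function RoundHalfUp(const X: Double): Int64;
begin
  Result := Floor(X + 0.5);
end;

var
  A, B: Double;
begin
  A := 2.5;
  B := 3.5;
  WriteLn(Round(A));        // 2  (half goes to the even neighbour)
  WriteLn(Round(B));        // 4
  WriteLn(RoundHalfUp(A));  // 3
  WriteLn(RoundHalfUp(B));  // 4
end.
```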

Added: MMX Expert Link

Added: Quexal Link

This is a very interesting MMX/SSE code generator that is compatible with Delphi. It still has bugs in it (in fact it doesn't yet generate code properly) but it also has great potential. Definitely worth keeping an eye on.

New: Remote profiling

In my ongoing effort to expand into distributed computing, I have created a modified version of z_prof that presents a DCOM interface allowing remote access to z_prof's timing data. Additionally, a remote program server is included that allows you to copy and run programs on the remote Win98 machine(s) from the comfort of your own chair.

Updated: Memory manager replacement

Just a couple of bug fixes.

Additions and Updates for January 2000

New: performance comparison of large integer types

So you need to work with integers with a range greater than 2^32. This topic covers the performance aspects of all possible types that could be used to handle the load with good old longint thrown in as a comparison.

New: Memory manager replacement

Introducing HPMM, a high performance memory manager that returns highly aligned memory. You can use it separately or as a drop-in replacement for the built-in memory manager.

New: Fixing longstring performance

Wish you could turn back the clock on the infamous lock instruction added in D5 to longstrings to make them more threadsafe? Well you can. All you need to do is recompile the system.pas after making a couple of simple changes. Additionally, you can make some other changes that can result in even better than D4 performance.

New: Fast exponential and logarithms

If you have a need for fast, low precision exponentials and logarithms then this collection of routines is just the ticket. The most likely uses include neural networks and other "tuned" simulations that rely more on the shape of these functions than on the exact value.

Updated: Changing the Floating Point Control Word

Another error correction. Changing the FP control word does in fact change the result for addition, subtraction and multiplication operations in addition to division and square root. However, the execution time is only affected for division and square root. This goes against popular belief but is documented in the Intel manuals.

Updated: BitCount examples

A couple of new versions of DWord sized bit counters have been added, including a new champion based on a lookup table. Lookup tables are about as interesting as a rock, so honorable mention goes to Charles Doumar, who gave me a wonderfully complex version that beats all the non-lookup versions.
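For readers who have not seen the lookup approach, here is a minimal sketch of the general idea. It is not the routine from the examples page: the table name, the way it is built, and `BitCount` itself are assumptions made for illustration.

```pascal
program BitCountDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal accept Delphi syntax
{$APPTYPE CONSOLE}

var
  BitsInByte: array[0..255] of Byte;  // number of set bits in each byte value

// Builds the table once: a value has the bits of its upper seven bits
// plus its lowest bit.
procedure InitTable;
var
  i: Integer;
begin
  BitsInByte[0] := 0;
  for i := 1 to 255 do
    BitsInByte[i] := BitsInByte[i shr 1] + (i and 1);
end;

// Counts the set bits of a DWord with four table lookups, one per byte.
function BitCount(Value: Cardinal): Integer;
begin
  Result := BitsInByte[Value and $FF]
          + BitsInByte[(Value shr 8) and $FF]
          + BitsInByte[(Value shr 16) and $FF]
          + BitsInByte[Value shr 24];
end;

begin
  InitTable;
  WriteLn(BitCount(0));          // 0
  WriteLn(BitCount($F0F0));      // 8
  WriteLn(BitCount($FFFFFFFF));  // 32
end.
```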

Updated: StrLen examples

Two routines have been modified and a comparison of code positioning has been added.

Additions and Updates for November 1999

New: Distributed Computing

What better way to speed up your favorite computationally intensive program than to spread it over several computers? This month I take a first shot at this topic, including an example.

New in Version 5: Alignment of local extended variables

Extended variables declared within a routine used to always be misaligned, now they are dword aligned.

New in Version 5: Thread safety of strings and dynamic arrays

The thread safety of strings and dynamic arrays has been improved by preventing reference count problems. Previously, reference counts were read, altered, and then saved, leaving room for another thread holding another reference to read or write in between those operations. This has been fixed by directly altering the reference count and locking that single instruction to prevent preemption. Everything has a price, unfortunately. The lock CPU instruction prefix used to achieve this thread safety is quite expensive on Pentium II processors. My measure of the effect of this change is an additional 28 cycles per refcount adjustment, which in a worst case scenario can result in a 2x decrease in performance. Real world reports have placed the impact in the 0 to 20% range.

New in Version 5: Working with Shortstrings

The default method of handling shortstrings has changed. Presumably in an effort to phase out all the old shortstring methods and maintain only one set of string routines, shortstrings are now converted to longstrings prior to many manipulations. This effectively makes these shortstring operations much slower.

Added: How to avoid floating point checks for zero

Under certain circumstances it can be beneficial to avoid a direct comparison when checking for a zero in a floating point variable and instead use typecasting to test the underlying representation of the variable. This is because a direct comparison requires a true floating point based zero check, which can be avoided by taking advantage of the way zero is stored.
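A minimal sketch of the idea for a Single follows. It relies on the IEEE 754 layout, in which zero is stored as all zero bits apart from the sign bit; `PBits32` and `IsExactZero` are names invented for this example, and whether it is actually faster depends on where the value lives when you test it.

```pascal
program ZeroCheck;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal accept Delphi syntax
{$APPTYPE CONSOLE}

type
  PBits32 = ^Cardinal;  // hypothetical alias for viewing 32 raw bits

// Integer test on the raw bits of a Single. The mask drops the sign bit
// so that both +0.0 and -0.0 are reported as zero.
function IsExactZero(X: Single): Boolean;
begin
  Result := (PBits32(@X)^ and $7FFFFFFF) = 0;
end;

var
  A, B: Single;
begin
  A := 0.0;
  B := 1.5;
  WriteLn(IsExactZero(A));  // TRUE
  WriteLn(IsExactZero(B));  // FALSE
end.
```

A Double can be handled the same way by testing its 64 bits as an Int64 with the sign bit masked off.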

Coming attractions...

  • More on COM
  • Profiler Review (hopefully)
  • Memory Manager issues

Additions and Updates for August 1999

Surprisingly, I have a fair amount to add this time around. Mostly little stuff though. I expect (hope) that there will be lots of news once D5 finally hits the street. I would also like to say that I appreciate the feedback/input that I've been getting. Keep it up.

First Word on Delphi 5

Version 5 is due to be released within a month or so. Unfortunately, it appears that the hopes and dreams of the performance minded have not turned into new features. There isn't even a hint of compiler related enhancements on the feature matrix. Check it out at www.borland.com/delphi/.

Database Related optimizations ????

I've been getting some prodding to expand into database related optimization techniques. The basic problem is that I am not the Delphinian for the job, as I know little about databases or the flurry of acronyms that seems to surround them. However, I am willing to act as organizer if you send me tips. A second alternative is to somehow utilize John Kaster's budding CodeCentral as the repository for this info. Feedback?

New: Downloadable Version

Due to popular demand, I've added a simple means to download all the pages contained in this site as HTML.

Updated: Concatenating Strings

Well, I have to confess that there was an error of sorts relating to this topic. Format is NOT the best way to concatenate strings. In all cases the much simpler and more readable "addition" method is the best. NOTE: This may not hold in Delphi version 2.

Added: Case statements

Case statements are actually pretty well optimized. However, they can't predict the frequency of occurrence of the individual cases. You have to do this yourself.
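One way to "do this yourself" is to test the value you know to be most common before the case statement is reached. The sketch below is an assumption about how that might look, not a measured recommendation: `Classify` is a hypothetical routine, and the premise that spaces dominate the input is invented for the example.

```pascal
program CaseOrder;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal accept Delphi syntax
{$APPTYPE CONSOLE}

// Returns 0 for a space, 1 for a digit, 2 for a letter, 3 for anything else.
function Classify(Ch: Char): Integer;
begin
  // Assumed for illustration: spaces are by far the most frequent input,
  // so they are handled with a single compare before the case statement.
  if Ch = ' ' then
  begin
    Result := 0;
    Exit;
  end;
  case Ch of
    '0'..'9':
      Result := 1;
    'A'..'Z', 'a'..'z':
      Result := 2;
  else
    Result := 3;
  end;
end;

begin
  WriteLn(Classify(' '));  // 0
  WriteLn(Classify('7'));  // 1
  WriteLn(Classify('q'));  // 2
  WriteLn(Classify('#'));  // 3
end.
```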

Added: Floating Point Exceptions

This is really only an issue of importance if you write your own floating point assembler routines. FP exceptions (such as divide by zero) aren't actually triggered when the error occurs. Instead they are delayed until the next floating point instruction.

Added: Pentium II specific bottlenecks

It has occurred to me that while many of the techniques presented here are based upon how Pentium II processors bottleneck, I have never actually stated how this works. The long and detailed version can be found in Intel's documentation and in Agner Fog's Pentium Optimization Guide. Here I present a quickie version slanted towards Delphi's compiler output. Having a general understanding of this process may help you decide what needs optimizing and what does not.

New: Moving and zeroing memory

The provided methods for moving memory and filling it with zeros are move and fillchar respectively. These are based around the rep movsd and rep stosd assembler instructions, which are fairly efficient. However, there is some extra cleanup code associated with each of these that can reduce their efficiency, especially on smaller tasks. Additionally, there are special data alignment considerations on Pentium II CPUs that can have a substantial effect.
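For reference, a minimal sketch of the two standard calls. The buffer type and its size are arbitrary choices for this example; note that both routines take a count in bytes, not in elements.

```pascal
program MoveFill;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal accept Delphi syntax
{$APPTYPE CONSOLE}

const
  Count = 1000;  // arbitrary element count for illustration

type
  TBuffer = array[0..Count - 1] of Integer;

var
  Src, Dst: TBuffer;
  i: Integer;
begin
  for i := 0 to Count - 1 do
    Src[i] := i;

  // FillChar(Variable, ByteCount, ByteValue): zero the whole destination.
  FillChar(Dst, SizeOf(Dst), 0);
  WriteLn(Dst[Count - 1]);  // 0

  // Move(Source, Dest, ByteCount): copy the whole source buffer.
  Move(Src, Dst, SizeOf(Src));
  WriteLn(Dst[Count - 1]);  // 999
end.
```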

Added: For statement information

There is really nothing new here. However, questions about how the for statement works come up often enough that I thought I would spell it out here.

Updated: Algorithm Optimization page

Thinking of abstract algorithm optimizations out of the blue is tough, so thanks to Dr. John Stockton for pointing out one that I had overlooked: saving and reusing intermediate results.
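A small sketch of what saving and reusing intermediate results can look like in practice. The rotation example, `TPoint2D`, and both routine names are made up for illustration; the second routine computes the sine and cosine once instead of on every pass through the loop.

```pascal
program ReuseDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal accept Delphi syntax
{$APPTYPE CONSOLE}

type
  TPoint2D = record  // hypothetical record for this example
    X, Y: Double;
  end;

// Recomputes Cos(Angle) and Sin(Angle) twice each for every point.
procedure RotateNaive(var P: array of TPoint2D; Angle: Double);
var
  i: Integer;
  OldX: Double;
begin
  for i := 0 to High(P) do
  begin
    OldX := P[i].X;
    P[i].X := OldX * Cos(Angle) - P[i].Y * Sin(Angle);
    P[i].Y := OldX * Sin(Angle) + P[i].Y * Cos(Angle);
  end;
end;

// Same result, but the two intermediate values are saved and reused.
procedure RotateCached(var P: array of TPoint2D; Angle: Double);
var
  i: Integer;
  OldX, C, S: Double;
begin
  C := Cos(Angle);
  S := Sin(Angle);
  for i := 0 to High(P) do
  begin
    OldX := P[i].X;
    P[i].X := OldX * C - P[i].Y * S;
    P[i].Y := OldX * S + P[i].Y * C;
  end;
end;

var
  A, B: array[0..1] of TPoint2D;
begin
  A[0].X := 1; A[0].Y := 0;
  A[1].X := 0; A[1].Y := 2;
  B := A;
  RotateNaive(A, 0.5);
  RotateCached(B, 0.5);
  WriteLn(Abs(A[1].X - B[1].X) < 1e-12);  // TRUE: both give the same rotation
end.
```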

Updated: Linked lists vs. Arrays

This had grown pretty stale as it only considered Pentium performance. On a Pentium II, arrays are faster in every situation.

New: Interfaces and performance

I waited until the 11th hour to look into performance and interfaces, and of course promptly realized that it would take much longer than I wanted to spend. Consequently, this is just a first stab into interfaces.

Coming attractions...

Only one really. It's called D5. I expect that there will be a flurry of information on this in the newsgroups, so check there for the blow-by-blow. I'll try to get all the details compiled and organized as quickly as possible.

Additions and Updates for June 1999

Added: use the {$MINENUMSIZE 4} compiler directive

If you use enumerated types (such as TSuits=(Diamonds,Hearts,Clubs,Spades) ) include the {$MinEnumSize 4} directive to force all enum variables to be 32bit. If you have compatibility issues you can simply turn it on for the type declarations of interest. For instance:
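The example that belonged here is missing from this copy of the page. A minimal sketch of what such a declaration might look like follows; `TSmallSuits` is a hypothetical second type added only to show the contrast, and the directive is switched back afterwards so that other declarations keep their original size.

```pascal
program EnumSize;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal accept Delphi syntax
{$APPTYPE CONSOLE}

{$MINENUMSIZE 1}  // Delphi's default, stated explicitly for clarity
type
  TSmallSuits = (sDiamonds, sHearts, sClubs, sSpades);  // stored in one byte

{$MINENUMSIZE 4}  // turned on only for the declaration of interest
type
  TSuits = (Diamonds, Hearts, Clubs, Spades);           // stored in 32 bits
{$MINENUMSIZE 1}  // restored for everything that follows

begin
  WriteLn(SizeOf(TSmallSuits));  // 1
  WriteLn(SizeOf(TSuits));       // 4
end.
```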


Utilization of this directive is especially effective for enumerated types greater than 256 elements. These result in word sized variables which are quite slow.

Updated: put constants together in front of variables in floating point code

If possible move floating point constants to the front of the expression. The compiler will combine them at compile time.
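A small sketch of the two forms. The function names are hypothetical, and the comments describe the behaviour stated above rather than something verified for every compiler version; check the generated code in the CPU window if it matters.

```pascal
program ConstFront;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal accept Delphi syntax
{$APPTYPE CONSOLE}

// Evaluated left to right as (X * 2.0) * 3.0, so the two constants are
// never adjacent and both multiplies may be left for run time.
function ScaleLate(X: Double): Double;
begin
  Result := X * 2.0 * 3.0;
end;

// The constants come first and sit next to each other, so the compiler
// can combine them into a single 6.0 at compile time.
function ScaleEarly(X: Double): Double;
begin
  Result := 2.0 * 3.0 * X;
end;

begin
  WriteLn(ScaleLate(1.5):0:1);   // 9.0
  WriteLn(ScaleEarly(1.5):0:1);  // 9.0
end.
```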

Added: movzx is faster on PII than xor/mov but slower on pentium

A common requirement is to load values smaller than 32 bits into a register. Since they do not overwrite the entire register, it is necessary to zero out the register first. Alternatively, you can use the built-in instruction movzx (move with zero extend). On Pentiums and earlier processors this instruction was slower than using xor reg,reg/mov reg,{value}. However, the PII has streamlined this instruction so that it is now preferred over the xor/mov combination. Note that the compiler chooses between these two options based on a set of rules that is apparently fairly complicated, as I have yet to figure them out.

New: Algorithm optimization page

Added: virtual methods vs static methods

I finally got around to pulling this comparison out of dejanews. The penalty for virtual methods is substantially higher than for static methods on a PII. The misprediction penalty has actually _increased_ by a factor of 2.5 to over 5, depending on how you measure, relative to other code. This means that the overhead for indirect calls, the mechanism behind virtual methods, has gone up accordingly. Unlike conditional jumps (if statements) indirect calls are not well predicted. Consequently, it is quite easy to have a misprediction penalty on every call. Here are the numbers (times relative to statically called procedure with no parameters):

  • static call=1
  • virtual call (no penalty)=1.6
  • virtual call (penalty)=3.9

More detail inside.
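For anyone unsure which kind of call they are making, here is a minimal sketch of the two declarations. The class and method names are hypothetical; the program only shows how each call is resolved and does not reproduce the timings above.

```pascal
program CallKinds;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}  // lets Free Pascal accept Delphi syntax
{$APPTYPE CONSOLE}

type
  TShape = class
    function StaticSides: Integer;            // bound at compile time
    function VirtualSides: Integer; virtual;  // indirect call through the VMT
  end;

  TSquare = class(TShape)
    function VirtualSides: Integer; override;
  end;

function TShape.StaticSides: Integer;
begin
  Result := 0;
end;

function TShape.VirtualSides: Integer;
begin
  Result := 0;
end;

function TSquare.VirtualSides: Integer;
begin
  Result := 4;
end;

var
  Shape: TShape;
begin
  Shape := TSquare.Create;
  try
    WriteLn(Shape.StaticSides);   // 0: TShape's code, a direct call
    WriteLn(Shape.VirtualSides);  // 4: TSquare's override, found at run time
  finally
    Shape.Free;
  end;
end.
```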

Updated: Bit Counting example

Dave Shapiro sent me a new bit counting algorithm that is about 3x faster than the one I previously had. All the original versions plus two new ones are now included.

Coming attractions...

Not much but here are the things still in the works:

  • more info on case statements
  • more info on for statements


Copyright © 2002 Robert Lee (rhlee@optimalcode.com)