Code optimization is a funny thing...
You start out with a desire to show proof of concept, and then all of a sudden you're riding the momentum of user demand. The focus of this page is to share information I learned on an adventure driven by a lust for code optimization. If you're a tweak freak, the information on this page will surely shock and amaze you. If you enjoy reading this article, let me know and I'll consider writing more of this kind of stuff. First, a newbie brief..
memcpy is defined in a Win32 C/C++ programming environment as
void *memcpy(void *dst, const void *src, size_t nbytes)
It serves a single purpose: copying a user-defined number of bytes (nbytes) from a source location (src) to a destination location (dst).
This reliable and often used function call can take up large amounts of cpu time. Obviously this amount of cpu time varies with the amount of memory being copied, the host memory bus type and speed, host cpu type and speed, ad infinitum..
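For the newbies, here is a minimal sketch of what a typical call looks like. The pixel struct and the copy_pixels name are made up for illustration; only memcpy itself comes from the standard library.

```c
#include <string.h>

/* Hypothetical 24-bit pixel type, for illustration only. */
struct pixel { unsigned char r, g, b; };

/* Copy `count` pixels from src to dst.
   memcpy requires that the two regions do not overlap. */
void copy_pixels(struct pixel *dst, const struct pixel *src, size_t count)
{
    memcpy(dst, src, count * sizeof *src);
}
```

Note that the size passed to memcpy is in bytes, not elements, which is why the element count is multiplied by the element size.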
Today media-enabled software is all the rage. Full screen video, high definition sound, high resolution texture mapped sprites.. Large amounts of memory are required for these kinds of applications, and lots of copying is being done in real time.
Consider this simple example...
30 frames per second, 1024*768 24-Bit color video.
Cost: 70,778,880 Bytes per second (2,359,296 bytes per frame)
Yep... and that's uncompressed. But don't get too excited.. today's systems are capable of handling this.. The testbench for my adventure, an ASUS Pentium III 933MHz w/ 512MB, is able to copy 120MB/Sec using memcpy.
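If you want to check the arithmetic for other resolutions, a small sketch like this will do. The function names are my own invention, not part of any library.

```c
#include <assert.h>

/* Bytes needed for one uncompressed frame. */
unsigned long long frame_bytes(unsigned width, unsigned height,
                               unsigned bits_per_pixel)
{
    assert(bits_per_pixel % 8 == 0);  /* whole bytes per pixel only */
    return (unsigned long long)width * height * (bits_per_pixel / 8);
}

/* Bytes moved per second when every frame is copied once. */
unsigned long long stream_bytes_per_second(unsigned width, unsigned height,
                                           unsigned bits_per_pixel,
                                           unsigned frames_per_second)
{
    return frame_bytes(width, height, bits_per_pixel) * frames_per_second;
}
```

For 1024*768 at 24 bits, frame_bytes gives 2,359,296, and at 30 frames per second stream_bytes_per_second gives 70,778,880, matching the figures above.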
So from a coding perspective, if you're using memcpy to do an insane amount of copying, do you know anything more about memcpy aside from its function call parameters? How was it written? Does it take advantage of any 'special' CPU features to make it faster? I bring this up because memcpy is an exported function from Microsoft's Kernel32.DLL, and Kernel32 updates are few and far between. Think about it - CPU updates happen more frequently than operating system upgrades!
I made some assumptions...
90% of desktops today have some form of MMX extension built into the CPU.
(Late pentiums, Pentium II's or higher)
85% of the MMX capable desktops have some form of write combining capability.
(Pentium II's or higher)
50% of the MMX capable desktops have some form of prefetching capability.
(Pentium III's or higher)
I do some research...
Does memcpy use any CPU specific instructions to speed up copying?
The answer is, sadly... no.
Scary stuff.. All this memcpy and no integrated support to detect the underlying processor and 'do the write (pun intended) thing'? Damn.. that's nuts. And so the seemingly harmless curiosity becomes an adventure that consumes three days of my life.
So where's the beef? What did I learn? What's to share?
Instead of trying to reproduce what's already been written, I'm going to first lead you to this page at SGI which provides four source code listings. Have a quick look and come back when you're done. Don't bother compiling anything yet.. it does work.
All done? Super!
For reasons of national security and the protection of Canadian beavers I will only list pseudocode on this page. If you follow my instructions carefully you'll be memcpy'ing at 400MB/Sec instead of a measly 120MB/Sec!
Pull out that assembler of yours... you're going to have some fun!
Create a new project, call it anything you like, define two functions, an initialization function and a replacement function for memcpy, say.... something inventive like... fastmemcpy!
void fastmemcpy(void *dst, void *src, long nbytes)
In the initialization function we need to detect the system CPU.
If you already have a facility to test for MMX and processor version, you're set. If you don't know how to detect these kinds of settings safely in a Windows environment then have a look at this page.
So let's define two global variables.
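As a rough sketch of what the two globals and the initialization function might look like: the names g_cpu_family, g_cpu_features and fastmemcpy_init are my own placeholders, and I'm assuming a GCC or Clang compiler on x86, where <cpuid.h> provides __get_cpuid. On any other compiler or CPU this sketch simply reports no features. The bit positions (MMX in EDX bit 23, SSE in EDX bit 25 of CPUID leaf 1) come from the standard CPUID feature flags.

```c
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define HAVE_CPUID_H 1
#endif

/* CPUID leaf 1, EDX feature bits. */
#define CPU_FEATURE_MMX (1u << 23)
#define CPU_FEATURE_SSE (1u << 25)

/* Hypothetical globals: processor family and feature bits. */
unsigned int g_cpu_family = 0;
unsigned int g_cpu_features = 0;

/* Fill in the globals once at startup. Leaves both at 0 when
   CPUID is not available through <cpuid.h>. */
void fastmemcpy_init(void)
{
    g_cpu_family = 0;
    g_cpu_features = 0;
#ifdef HAVE_CPUID_H
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        g_cpu_family = (eax >> 8) & 0xFu;
        if (g_cpu_family == 0xFu)            /* extended family field */
            g_cpu_family += (eax >> 20) & 0xFFu;
        g_cpu_features = edx & (CPU_FEATURE_MMX | CPU_FEATURE_SSE);
    }
#endif
}
```

With Microsoft's compiler you would swap in its own CPUID intrinsic instead; the rest of the logic should carry over.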
Now in the fastmemcpy routine we will jump to whatever copy routine best suits the processor capabilities.
If we have streaming SIMD extensions (SSE), we can call SGI's example #4.
If we're at least a Pentium II with MMX and don't have streaming extensions, we can call SGI's example #2.
If we have MMX but aren't at least a Pentium II, we can call SGI's example #1.
If we don't have MMX or streaming extensions then we'll use the plain old memcpy.
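The selection logic above could be sketched like this. The enum and function names are hypothetical, and I'm assuming that "at least a Pentium II" can be approximated as CPU family 6 or higher with the MMX flag set, and that example #1 is meant for MMX parts below that. The actual copy routines are not shown; this only picks which one to use.

```c
/* Which copy routine to use; the SGI listings are referred to
   by number only, since their code is not reproduced here. */
enum copy_path {
    COPY_PLAIN_MEMCPY,   /* no MMX, no streaming extensions */
    COPY_SGI_EXAMPLE_1,  /* MMX, older than Pentium II (assumed) */
    COPY_SGI_EXAMPLE_2,  /* MMX, Pentium II or later, no SSE */
    COPY_SGI_EXAMPLE_4   /* streaming extensions available */
};

/* Pick the best path from the detected capabilities. */
enum copy_path choose_copy_path(int has_mmx, int has_sse, int cpu_family)
{
    if (has_sse)
        return COPY_SGI_EXAMPLE_4;
    if (has_mmx && cpu_family >= 6)
        return COPY_SGI_EXAMPLE_2;
    if (has_mmx)
        return COPY_SGI_EXAMPLE_1;
    return COPY_PLAIN_MEMCPY;
}
```

Doing this choice once in the initialization function and caching the result keeps the per-call overhead in fastmemcpy down to a single branch or switch.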
Now we're not done just yet.
The SGI examples work fast by copying blocks of memory in fixed amounts, 64 (P/PII) or 2048 (PIII) bytes at a time. You'll need to make some adjustments unless you intend to *always* copy those fixed amounts.
In the fastmemcpy routine, check nbytes and do the right thing. For example..
if (nbytes < block copy size of target copy routine)
------use plain old memcpy
else
------run inlined SGI ASM copy routine on the whole blocks
------if (nbytes % block copy size) > 0
----------copy the balance of bytes using memcpy or your own homebrew ASM code
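In C, the size check and the leftover handling might look something like the sketch below. The block_copy function is only a stand-in I wrote for the inlined SGI ASM routine: it uses memcpy on each 64-byte block so the example runs anywhere, and BLOCK_SIZE is an assumed constant for the P/PII case.

```c
#include <string.h>

#define BLOCK_SIZE 64L  /* assumed block size of the target routine */

/* Stand-in for the inlined SGI ASM routine: copies whole blocks only. */
static void block_copy(unsigned char *dst, const unsigned char *src,
                       long nblocks)
{
    while (nblocks-- > 0) {
        memcpy(dst, src, (size_t)BLOCK_SIZE);
        dst += BLOCK_SIZE;
        src += BLOCK_SIZE;
    }
}

void fastmemcpy(void *dst, void *src, long nbytes)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    long nblocks, tail;

    if (nbytes <= 0)
        return;
    if (nbytes < BLOCK_SIZE) {          /* too small for the block routine */
        memcpy(d, s, (size_t)nbytes);
        return;
    }
    nblocks = nbytes / BLOCK_SIZE;
    tail = nbytes % BLOCK_SIZE;
    block_copy(d, s, nblocks);          /* the fast path */
    if (tail > 0)                       /* copy the balance of bytes */
        memcpy(d + nblocks * BLOCK_SIZE, s + nblocks * BLOCK_SIZE,
               (size_t)tail);
}
```

For the PIII routine you would use the larger block size instead, which means more calls fall through to plain memcpy, so it pays to measure where the break-even point sits on your own hardware.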
And now my adventure becomes yours.
I'm going to leave you with some interesting links which I hope outlive the interest in this page.
Do with it magical things. Peace be with you...