We first briefly assess the merits and limits of the two existing
quicksort algorithms, especially considering their cache locality.
We present two new quicksort alternatives for improving memory
performance further. Experimental results will be reported in the
LaMarca and Ladner in the same paper  present two quicksort
algorithms for cache optimization. The first one is called
memory-tuned quicksort, which is a modification of the base
Instead of saving small subarrays to sort in the end, the memory-tuned
quicksort sorts these subarrays when they are first encountered in order to
reuse the data elements in the cache.
The second algorithm is called multiquicksort. This algorithm applies
a single pass to divide the full data set into multiple subarrays,
with the hope that each subarray will be smaller than the cache capacity.
The performance gain of these two algorithms from experiments
reported in  is modest. We implemented the two algorithms
on simulated machines and on various high-end workstations and
obtained consistent performance. We also found that the performance
of quicksort and its cache-optimized alternatives are very sensitive
to the types of the data set being used. These algorithms were not
efficient on unbalanced data sets.
In practice, the quicksort algorithms exploit cache locality well on
balanced data. A challenge is to make the quicksort perform well
on unbalanced data sets. We present two cache-optimized quicksort
alternatives that work well on both balanced and unbalanced data sets.
Flashsort  is extremely fast for sorting balanced data sets.
The maximum and minimum values are first identified in the data set to
identify the data range. The data range is then evenly divided into classes
to form subarrays. The algorithm consists of three steps:
``classification" to determine the size of each class,
``permutation" to move each element into its class by using a single
temporary variable to hold the replaced element, and
``straight insertion" to sort elements in each class by using
Sedgewick's insertion sort .
This algorithm works very well on balanced data sets because the sizes
of the subarrays after the first two steps are similar and are small
enough to fit in the cache.
This makes flashsort highly effective (O(N)). However, when the
data set is not balanced, the sizes of the generated subarrays are
disproportionate, causing ineffective usage of the cache,
and making flashsort as slow as insertion sort (O(N2)) in the worst case.
In comparison with the pivoting process of quicksort,
the classification step of flashsort is more likely to generate
balanced subarrays, which favors better cache utilization.
On the other hand, quicksort outperforms insertion
sort on unbalanced subarrays. By combining the advantages of
flashsort and quicksort, we present a new quicksort alternative,
flash quicksort, where
the first two steps are the same as in flashsort
and the last step uses quicksort to sort the elements in each class.
To further improve overall performance, we employ another cache
optimization to improve temporal locality in flash quicksort.
We call this alternative inplaced flash quicksort.
In this algorithm, the first and third steps are the same as in
flash quicksort. In the second step, an additional array is used
as a buffer to hold the permuted elements.
In the original flashsort, a single
temporary variable is used to hold the replaced element. A cache line
normally holds more than one element. The data structure of
the single variable minimizes the chance of data reusage.
Using the additional array, we attempt to reuse elements in
a cache line before their replacement and to reduce the instruction
count for copying data elements.
Although this approach increases the required memory space, it
improves both cache and overall performance.
Figure 6 shows the instruction counts
(left figure) and the L1 misses (right figure) of
memory-tuned quicksort, flashsort,
flash quicksort, and inplaced flash quicksort,
on the Unbalanced data set on the simulated Pentium III memory
architecture, which has a higher memory latency and a larger L2 cache
(512 KBytes) than the Pentium II.
The instruction count curve of flashsort was too high to be presented
in the left figure of Figure 6.
The same figure shows that the instruction count of memory-tuned quicksort
also increases rapidly as the data set size grows.
In contrast, the instruction counts of
flash quicksort and inplaced flash quicksort change little as the data
set size increases.
The simulation also shows that the number of L1 misses increases much
more rapidly as the size of the data set grows in the memory-tuned
quicksort and flashsort than in the flashsort and inplaced flashsort
The simulation results are consistent with our algorithm analysis,
and show the effectiveness of
our new quicksort alternatives on unbalanced data sets.
Simulation comparisons of the instruction counts (left figure)
and the L1 misses (right figure) of the
quicksort algorithms on the Unbalanced data set
on the simulated Pentium III.
(The instruction count curve of the flashsort was too high to be presented
in the left figure).