XML Offload and Acceleration with Cell Broadband Engine
- , , ,
- , ,
XML processing is becoming performance critical for an increasingly wide range of applications, occupying up to 90% of the CPU time. Novel approaches to XML processing can thus have a decisive impact on market competitiveness. Inspired by a novel state-machine based approach to XML processing, we have developed a high-performance XML parser that runs efficiently both on Intel processors and the new Cell Broadband Engine processor. It offers an industry-standard Simple API for XML interface and is, on an Intel Pentium M processor, 2.3 times as fast as libxml although it offers the same interface. With eight parallel parse threads running on a single Cell Broadband Engine processor, the total parse throughput is estimated to be 3.0 times that of libxml on an Intel Pentium M processor. At this point, our parser processes XML 1.0 documents encoded in UTF-8, but can easily be extended to support XML 1.1 and other encodings.
The high-performance XML parser described in this paper is part of the system architecture for XML offload and acceleration presented in Cell Processor-Based Workstation for XML Offload System Architecture and Design  and System Architecture for XML Offload to a Cell Processor-Based Workstation . This system architecture aims to exploit the novel Cell Broadband Engine (Cell BE) processor architecture for XML processing and consists of two elements: a basic offload infrastructure that delegates XML processing tasks from Java-based application systems to designated offload systems, and a high-performance XML parser on these offload systems. While our paper at the XML 2005 Conference  focused on the basic offload infrastructure and the technology underlying the XML accelerator parser, this paper gives more details about the parsers design and implementation.
The challenge of parsing XML
XML is a World Wide Web Consortium (W3C) recommended document format ,  and widely used even in places that it was not originally designed for, e.g., inter-process communication using the Simple Object Access Protocol (SOAP) . Unfortunately, XML parsing is inherently inefficient a problem that is ultimately due to XMLs design goal of being human-readable and that coupled with the wide use of XML generates a bottleneck in an increasing number of applications. It is not unusual for an application to spend 30% of its time just parsing XML, and the situation is much worse for some database applications that process very large XML files.
Cell BE as an XML accelerator
Our project goal was to employ the unmatched cost/performance ratio of the new Cell BE processor architecture implemented in the new Cell BE processor to offer XML parsing services on a dedicated processor and thus free up resources on more costly computer systems, e.g., mainframe computers.
The Cell BE processor features nine cores on a single chip: one Power Processor Element (PPE) running the operating system and eight independent Synergistic Processing Elements (SPEs) running the application threads . Figure 1 illustrates this structure. While the PPE is an almost fully-fledged IBM Power Architecture processor, the SPEs are in essence vector units optimized for Single Instruction Multiple Data (SIMD) operations such as parallel arithmetics: For example, each SPE can execute eight 16bit-additions in parallel in a single instruction. The SPEs instruction set supports the execution of self-contained programs, however due to the size restrictions (since the Cell BE processor has nine cores on a single chip, each core has a limited chip area and a limited number of transistors at its disposal) the SPEs do not offer advanced control structures such as branch prediction, out-of-order speculative execution, or multiple instruction issue.
As XML parsing is inherently serial in nature, it appears practically infeasible to make use of all eight SPEs in a single parse thread. Instead, our approach is to parse eight documents in parallel, which is just as good in many applications, especially in an offload environment.
Low-memory state-machine approach
Each of the Cell BE processors eight SPEs has only 256KBytes local store, a fast memory in SRAM technology. The only access to the much larger main memory is through Direct Memory Access (DMA), an asynchronous protocol for memory transfers. Due to the latency of DMA requests the program itself and all immediately accessible data must reside in the local store. The Zrich XML Accelerator (ZuXA) state machine approach developed by IBM researchers features a very low memory footprint and is thus ideally suited for the limited memory available on the Cell BE processor.
Cell BE XML parser architecture
Our approach to employ the parallelism of the Cell BE processor architecture is to use each SPE as a stand-alone parsing engine; thus eight XML documents can be parsed in parallel. This approach assumes, of course, that there are always enough parsing requests at a time in order to use the Cell BE processors full potential. As the parser is targeted at a dedicated XML offload environment, this should not be an issue in practice.
The parser consists of a front-end, which is located on the PPE core, and eight independent SPE parsing units, that form the parsers back-end. Figure 2 shows this design.
The front-end acts as a controller that delegates incoming parsing requests to idle SPE parsing units. This mechanism is implemented by a job stack onto which incoming parse requests are pushed.
Each SPE thread has its corresponding PPE thread that is responsible for fetching an outstanding parse job from the job stack, controlling data streaming back and forth between PPE and SPE through DMA operations, and invoking Simple API for XML (SAX) event callbacks on the registered parsing client functions. SPE and PPE communicate through the Cell BE processors intrinsic mailbox mechanism.
As stated before, XML parsing is done by a program loaded onto each SPE. The parsing process itself can be seen as a transformation from plain XML into a normalized internal stream of binary records each representing a SAX event. Because these binary event streams directly encode SAX events, the controller threads on the PPE can efficiently extract the according arguments and trigger SAX events by calling the appropriate registered callback functions.
The SAX interface provided by the front-end makes it easy to integrate our XML parser with new as well as existing applications. If so desired, reading the binary event stream directly would eliminate the overhead of creating SAX events and thus might offer a slight performance advantage.
Another important fact is that we developed our parser both on the Cell BE platform and a conventional single processor platform. Front-end and back-end parts of our parser can be compiled into a single executable by setting a compiler flag. This allows our parser to be employed even beyond the Cell BE platform it was targeted at in the first place. We were surprised about how well our parser performed on a conventional processor. For more details on our performance metrics see the section Performance measurements. The next section will give a brief introduction into the distinctive challenges of XML parsing on the Cell BE processor and the parsing algorithm we implemented.
XML-parsing state machine
As mentioned before, ZuXA is a state-machine based approach to XML processing. It has a very small memory footprint and is thus ideally suited for the Cell BE processors SPEs, each of which has only 256KBytes of local memory. It is based on a novel finite state machine (FSM) technology that is described in ,  and , and consists of:
A set of states.
A collection of internal data structures such as a state stack and storage area for namespace declarations, element names, and attributes.
A set of transitions between the states, which take the state machine from one state to another when they fire. The firing of each transition depends on various aspects of the state machines state and on the input character, and upon firing a transition executes a sequence of instructions from a predefined instruction set.
Thus the state machine can be modeled as a graph with one node for each state, and directed arcs representing the transitions. Each arc is then labeled with the conditions and instructions associated with that transition.
Having internal data structures such as stacks renders the state machine much more powerful than a traditional finite state machine a normal FSM could not detect a^nb^n and similar languages, which can easily be proved by the Pumping Lemma, described for example on http://en.wikipedia.org/wiki/Pumping_lemma, let alone parse XML.
A refined set of states and transitions has been developed to implement the logic of XML parsing as described by  respectively . The resulting state diagram is specified in an abstract format from which central parts of the parsers source code, namely rule selection and instruction handling, are generated in an automated fashion.
Compiling the state machine
Only the main loop and the instruction set implementation along with the data structures they work on have been implemented by hand. The remaining parser code was generated from an abstract specification of the parser using a short Perl script.
Internal data structures
As said before, a number of basic data structures in the parser have been written by hand. Among these are: a memory containing the keywords that are defined by the W3Cs XML recommendation , e.g., DOCTYPE and CDATA; a stack of strings that stores the element names as the FSM descends into deeper nesting levels of the document structure; a table containing the names of attributes, their namespace prefixes as well as the namespace Uniform Resource Identifiers (URIs); and a tree storing the namespace declarations as they occur.
We employed several techniques for speeding up the parsing process. Most of these are generic and speed up parsing on any processor, i.e., they are not specific to the Cell BE processor.
Grouping transitions into buckets
From each state originates a set of up to seven state transitions, each consisting of a destination state, a condition, and an associated sequence (or bundle) of instructions. The transitions originating from a state are given in decreasing order of priority, i.e., the first transition in this sequence whose condition matches will fire.
To avoid having to check up to seven conditions until we find the first matching one, we divide the transitions originating from a state into 16 groups, or buckets. These groups overlap, i.e., a transition can occur in more than one or even in all 16 buckets. The buckets are indexed by four bits from the UTF-8 input characters first byte. Although a state can contain up to seven outgoing transitions, through our careful selection of these bits we can guarantee that each bucket holds at most three transitions. A transition is then in all buckets for which the four index bits being part of the input have values under which it could potentially fire. Thus, e.g., a transition matching only one specific character such as < will be in only one bucket, while instructions whose conditions do not depend on the input will end up in all 16 buckets.
When we then read an input character, from the state number and four bits out of this input character we can calculate the number of the corresponding bucket. This bucket then contains a reduced set of transitions that can potentially fire up to three, compared to the maximum of seven rules in a state.
Compiling rule selection logic
We use seven bits to identify a state. These seven bits together with the four bits from the input can take 2^11 = 2048 values, hence there are 2048 buckets in total. Many of the 2048 buckets are identical, i.e., they contain the same conditional instruction bundles (CIB). The CIB is defined as the sequence of state transitions without the destination states, i.e., a sequence of (up to three) pairs of a condition and a sequence of instructions. We assign an unique identifier to each CIB and then map each of the 2048 buckets to one of the CIBs. We have to keep a separate array of size 2048 * 3 for the destination states, because there are up to three transitions per bucket and because the CIBs do not contain the destination state.
The small number of CIBs then allows us to compile the rule selection within each CIB directly into C code. Remember that the transitions for a state (and thus within a bucket) are sorted in decreasing order of priority and we have to identify and execute the first transition whose condition matches. We thus generate a conditional instruction bundle handler that contains a large
switch statement. Each
case of the
switch statement corresponds to one CIB and contains up to three transitions. Every bucket that maps to a specific CIB thus contains these transitions in the given order, even if they may point to different destination states. Using the index of the transition that fired, the corresponding destination state for each of the 2048 buckets is looked up in a 2048 * 3 array.
Efficient selection of transition rules
One of the most performance critical steps is the selection of the state machine transition that will fire depending on the current internal state (the state number and the internal data structures) and input character, as this step is in the main loop and has to be executed for every single character. (Actually, at some points, e.g., within content, where several cycles are spent in the same state because a transition loops back onto it, it is possible to extend the chunk of input processed at once by the state machine beyond a single character.)
Limitations and caveats
Given the relatively short time in which the parser was developed, it is obvious that some features had to be neglected in order to provide a fast and stable implementation. This chapter is to describe these limitations and enable the reader to judge the impact on performance metrics and effort to be undertaken to overcome them.
At this point, our parser can only handle XML 1.0 documents encoded in UTF-8. A further limitation is that our parser does not parse internal (or external) DTDs but skips them instead. Given that, it is clear that no validity checking can be performed. Skipping a DTD has some impact on well-formedness checking, too. Documents that contain DTDs and make use of parameter entities might be wrongly considered well-formed, as entity references cannot be resolved by our parser. Further minor limitations are: event stream records can not exceed four KBytes, and multi-byte characters that do not conform to UTF-8 might remain undetected.
We think that all these limitations, apart from the validation issue, can be overcome with some very justifiable effort. However, as the parser already uses most of the SPEs 256KBytes of local store, we see no way to add validation to the features list, as far as the Cell BE processor is concerned.
As various parts of the implementation rely on UTF-8 as character encoding, the most straightforward way to overcome this limitation would be to introduce a character encoding transformation layer. This would of course imply some penalty on the end-to-end latency.
We think that DTD/Schema validation and support for different character encodings are the only features that could negatively impact performance.
Support for XML 1.1
It should be possible with not too much work to extend the parser in such a way that it supports XML 1.1. The main difference between XML 1.0 and 1.1 is that some character classes have changed, and thus of course the character classifier has to be changed accordingly.
Classification during the execution of the state machine is handled by the character classifier which is generated by a Perl script, thus the latter has to be extended to respect the new XML 1.1 character classes.
Remember that a rule has to be in a bucket if it is at all possible that this rule will fire if the selected four bits of the first input byte have the correct value. Thus if new characters were to be added to a class, the selection might have to change. If the parser was to support XML 1.0 and 1.1, a rule would have to be in a bucket if it is possible to fire in that bucket for XML 1.0 or for 1.1, because the bucketing is done even before compile time, and thus long before we know what kind of document we are to process.
The following performance measurements that were performed under Linux compare our parser with the libxml parser , a widely used validating SAX parser. The benchmark workload consisted of ten files taken from a collection of XML documents from IBM DB2 customers and ranging from two KBytes to multiple MBytes. During each benchmark run, the respective parser was called a few hundred times, each time parsing a document one hundred times in a row.
Figure 3 shows the average results of these benchmark runs using a normalized scale. On an Intel Pentium M processor, our parser (SAXy) is 2.3 times as fast as libxml. On a Cell BE processor, with eight independent parse threads running in parallel, the total parse throughput is estimated to be 3.0 times that of libxml on an Intel Pentium M processor.
Inspired by a novel finite state machine technology, we were able to successfully implement a non-validating high-performance XML parser that fits into the limited local store of the Cell BE processors SPEs. The parser consists of a front-end that provides a SAX application interface, and eight back-end parse threads on the SPEs. Based on the hand-written implementation of the main loop, the data structures, and the instruction set, we developed a process to automatically generate the remaining parser code from a given abstract specification.
Currently, our parser is restricted to XML 1.0 documents encoded in UTF-8 and does not parse DTDs. Both limitations could be resolved, but would impose a certain penalty on the overall performance. Validation does not seem to be feasible because of the SPEs limited memory.
After multiple iterations of performance tuning, our parser is 2.3 times as fast as libxml on an Intel Pentium M processor and, with eight parallel parse threads on a Cell BE processor, is expected to achieve 3.0 times the throughput of libxml on an Intel Pentium M processor.
We thank Jan van Lunteren for his continuous contributions to our project. We thank Joseph Bostian for his valuable input and support.
IBM, Power Architecture, and DB2 are trademarks of International Business Machines in the United-States, other countries, or both.
Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc.
Intel and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
 S. Letz, Cell Processor-Based Workstation for XML Offload System Architecture and Design, University of Leipzig, Department of Computer Science, Leipzig, Germany, May 2005.
 S. Letz, R. Seiffert, J. van Lunteren, and P. Herrmann, System Architecture for XML Offload to a Cell Processor-Based Workstation, Proceedings of the XML 2005 Conference (XML2005), Atlanta, GA, USA, November 2005.
 Extensible Markup Language (XML) 1.0, W3C Recommendation, http://www.w3.org/TR/REC-xml/.
 Extensible Markup Language (XML) 1.1, W3C Recommendation, http://www.w3.org/TR/xml11/.
 Simple Object Access Protocol (SOAP) 1.2, W3C Recommendation, http://www.w3.org/TR/soap/.
 D. Pham et al., The Design and Implementation of a First-Generation CELL Processor, Proceedings of the 2005 IEEE International Solid State Circuits Conference (ISSCC), San Francisco, CA, USA, February 2005.
 J. van Lunteren and A.P.J. Engbersen, A High-Performance Pattern-Matching Engine for Intrusion Detection, Hot Chips 17, Stanford University, Palo Alto, CA, USA, August 2005.
 J. van Lunteren and A.P.J. Engbersen, XML Accelerator Engine, First International Workshop on High Performance XML Processing, in conjunction with the 13th International World Wide Web Conference (WWW2004), New York, NY, USA, May 2004.
 libxml The XML C parser and toolkit for Gnome, http://xmlsoft.org/.