phpQuery vs. QueryPath

Web developers have come to love the simplicity and elegance of the jQuery JavaScript library for working with the DOM (document object model), such as parsing HTML or inserting new data.

However, since jQuery is written in JavaScript, we're left in the dust when parsing HTML documents in PHP and need to resort to the old and tedious XML parsers, right? Well, not anymore! Two PHP libraries modelled on jQuery, phpQuery (0.9.5.386) and QueryPath (2.0.1), let you use the same familiar functions and syntax.

This article compares the two libraries in terms of features, performance, ease of use and documentation.

Loading & creating documents

To start using either library, you just need to include its main PHP file:

Loading phpQuery

require_once('./phpQuery/phpQuery/phpQuery.php');

Loading QueryPath

require_once('./QueryPath/QueryPath.php');

Once that is done, you can create a new document. phpQuery provides a handful of functions to create documents, either from markup, such as phpQuery::newDocumentHTML($html, $charset = 'utf-8'), or from a file, such as phpQuery::newDocumentFileXML($file, $charset = 'utf-8'). Each of them returns a phpQuery object. The easiest way is to let phpQuery auto-detect the document type using the following code:

Document creation in phpQuery

$pq = phpQuery::newDocument($html);

QueryPath works in a similar way but makes things even easier by providing a single document creation function that accepts markup, file paths and even URLs. It then returns a QueryPath object.

Document creation in QueryPath

$qp = qp($html);
$qp = qp($filepath);
$qp = qp($url);

Basic usage

Both libraries follow the same principle of chainability that jQuery is famous for. Every function returns an object (like $pq or $qp from above) that can be reused to apply additional functions, as in $pq->find('...')->children()->not('...').

In addition, both libraries allow you to loop through results using PHP's foreach. The difference is that phpQuery yields a DOMNode object on each iteration, whereas QueryPath yields a QueryPath object.

Looping through results in phpQuery

foreach ($pq->find('...') as $node) {
  echo $node->nodeName; // DOMNode object
  echo pq($node)->attr('...'); // turn node into a phpQuery object
}

Looping through results in QueryPath

foreach ($qp->find('...') as $node) {
  echo $node->attr('...'); // QueryPath object
}

There's a little catch with QueryPath in that the functions are always applied to the current context whereas phpQuery always starts from the object's root. In order for QueryPath to start from the root, a special selector :root has to be applied. See the following examples:

phpQuery always starting from the root

$pq->find('body'); // returns 1 body element
$pq->find('body'); // also returns 1 body element, since phpQuery starts from the root again

QueryPath always starting from the current context

$qp->find('body'); // returns 1 body element
$qp->find('body'); // returns 0 elements because the current context is already the body element itself
$qp->find(':root body'); // returns 1 body element after starting explicitly from the root
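
The difference is easier to see in a toy model. The following sketch uses only PHP's built-in DOM extension, and the class names RootedFinder and ContextFinder are hypothetical: neither class reflects how phpQuery or QueryPath is actually implemented. It merely mimics the two behaviours described above, one finder that always searches from the document root and one that narrows its own context with every call.

```php
<?php
// Toy model only: these classes are illustrations, not code from either library.

final class RootedFinder
{
    private DOMXPath $xpath;

    public function __construct(DOMDocument $doc)
    {
        $this->xpath = new DOMXPath($doc);
    }

    // Always searches the whole document, regardless of earlier calls.
    public function find(string $tag): int
    {
        return $this->xpath->query('//' . $tag)->length;
    }
}

final class ContextFinder
{
    private DOMXPath $xpath;
    /** @var DOMNode[] nodes matched by the previous call */
    private array $context;

    public function __construct(DOMDocument $doc)
    {
        $this->xpath = new DOMXPath($doc);
        $this->context = [$doc];
    }

    // Searches below the current matches, then replaces them with the new matches.
    public function find(string $tag): int
    {
        $matches = [];
        foreach ($this->context as $node) {
            foreach ($this->xpath->query('.//' . $tag, $node) as $hit) {
                $matches[] = $hit;
            }
        }
        $this->context = $matches;
        return count($matches);
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>Hi</p></body></html>');

$rooted = new RootedFinder($doc);
$first = $rooted->find('body');
$second = $rooted->find('body');
echo "rooted: $first, then $second\n";

$contextual = new ContextFinder($doc);
$first = $contextual->find('body');
$second = $contextual->find('body');
echo "contextual: $first, then $second\n";
```

The second call on the contextual finder comes back empty because a body element has no body descendants, which is the same effect the QueryPath example above runs into.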

Functions

Both libraries support the most important and most widely used jQuery functions, such as find(), filter(), append(), remove(), html() and a whole lot more which I'm not going to list here.

Compared to QueryPath, phpQuery also implements some of the more advanced jQuery functions, for instance grep(), toggleClass(), clone(), empty() and trim().

However, unless you absolutely require one of those advanced functions, you should be fine with either library.

Selectors

Certain functions such as find() accept selectors similar to CSS selectors. Both libraries seem to support the same basic selectors such as #id or selector1, selector2, selector3 and hierarchical selectors such as ancestor descendant or parent > child.

Again, it makes no difference which library you use.

Special features

In addition to jQuery support, both libraries offer a couple of special features which you may or may not find useful. These are some of the more important ones:

  • phpQuery:
    • Plugins: It's possible to extend phpQuery through a plugin system.
    • WebBrowser: A plugin that simulates user web browser behavior inside a PHP script.
    • jQueryServer: Can be used to do jQuery Ajax requests.
  • QueryPath:
    • Extensions: Like phpQuery plugins, QueryPath can be extended through an extension system.
    • Templates: A template extension which can be used to format larger chunks of HTML or XML at once.
    • Database layer: An extension that lets you connect to a database, execute queries and merge the results into QueryPath objects.

Documentation

Both libraries offer online documentation on how to use them.

phpQuery's documentation is very user-friendly and describes all features where necessary or links to the respective jQuery page. Furthermore, the library files contain a lot of useful example code to get your hands dirty. Not once did I have to look for other tutorials on the net; everything I needed was right there.

QueryPath's documentation consists of an introductory tutorial, example source files and an API reference. The former two do a decent job of explaining the basics. The API reference, however, is confusing and provides no easy way of finding out which jQuery features are supported. I often found myself forced to look elsewhere for additional tutorials that showed more example code in action.

Invalid markup handling

If you plan on using either library to parse HTML sites you didn't create yourself, chances are they contain invalid markup and don't fully validate.

This can lead to problems when feeding the erroneous HTML to either parser, worst case being a library crash, which I encountered once using phpQuery.

Generally speaking, though, phpQuery handles invalid markup quite well most of the time and isn't too picky about its correctness.

QueryPath, on the other hand, throws a fatal error whenever it encounters even the slightest anomaly, such as an unescaped ampersand (& instead of &amp;).
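
To see why an unescaped ampersand is such a problem for a strict parser, here is a minimal sketch using PHP's built-in DOMDocument in XML mode. It does not use QueryPath and makes no claim about QueryPath's internals; it only shows how a strict XML parser reacts to the same kind of input. The helper name parsesAsXml() is made up for this example.

```php
<?php
// Collect libxml parse errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical helper: true if $markup is well-formed XML, false otherwise.
function parsesAsXml(string $markup): bool
{
    $doc = new DOMDocument();
    $ok = $doc->loadXML($markup);
    libxml_clear_errors(); // discard the collected errors between calls
    return $ok;
}

var_dump(parsesAsXml('<p>Tom & Jerry</p>'));     // bare ampersand: rejected
var_dump(parsesAsXml('<p>Tom &amp; Jerry</p>')); // escaped ampersand: accepted
```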

If you do indeed have to work with input data from external sources, I recommend running the markup through an HTML sanitizer first, such as HTML Tidy for PHP (also available for other languages). The following code snippet demonstrates this:

Cleaning up invalid markup with HTML Tidy

// configuration to output XHTML without line wrapping
$config = array('output-xhtml' => true, 'wrap' => 0);

// clean up ($html contains the invalid markup)
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();
$html = tidy_get_output($tidy);
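
Since Tidy is an optional PHP extension, it may not be installed on every server. One possible way to deal with that, sketched here under the assumption that passing the markup through unchanged is acceptable in that case, is to wrap the cleanup in a small function. The name cleanMarkup() is hypothetical and not part of Tidy or either library.

```php
<?php
/**
 * Returns $html repaired by HTML Tidy, or unchanged if the tidy
 * extension is not available. cleanMarkup() is a hypothetical helper.
 */
function cleanMarkup(string $html): string
{
    if (!extension_loaded('tidy')) {
        return $html; // nothing to clean with; the caller gets the input back
    }

    // same configuration as above: XHTML output without line wrapping
    $config = ['output-xhtml' => true, 'wrap' => 0];

    $tidy = new tidy();
    $tidy->parseString($html, $config, 'utf8');
    $tidy->cleanRepair();

    return tidy_get_output($tidy);
}

$html = cleanMarkup('<p>Tom & Jerry');
```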

Performance benchmark

Now comes the really interesting part: which library performs better?

To measure their performance, several small test queries were written and their execution time evaluated. The resulting score is the average duration of 10 test runs.

Because parsing a document and modifying it are vastly different things, the tests were divided into read and write operations.
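
The original test harness isn't reproduced in this article. As a rough sketch of the procedure, assuming a simple wall-clock measurement, a function along these lines would do the job. The name benchmark() and its parameters are illustrative only.

```php
<?php
/**
 * Runs $snippet $iterations times per run and returns the average
 * duration of a run in seconds. Illustrative sketch, not the original harness.
 */
function benchmark(callable $snippet, int $iterations = 1000, int $runs = 10): float
{
    if ($iterations < 1 || $runs < 1) {
        throw new InvalidArgumentException('iterations and runs must be positive');
    }

    $durations = [];
    for ($run = 0; $run < $runs; $run++) {
        $start = hrtime(true); // monotonic clock, in nanoseconds
        for ($i = 0; $i < $iterations; $i++) {
            $snippet();
        }
        $durations[] = (hrtime(true) - $start) / 1e9;
    }

    return array_sum($durations) / count($durations);
}

// Usage with a stand-in workload; a real test would call find() etc. here.
$calls = 0;
$average = benchmark(function () use (&$calls) {
    $calls++;
});
printf("average run: %.6f s (%d calls in total)\n", $average, $calls);
```

hrtime() requires PHP 7.3 or later; on older versions microtime(true) is the usual substitute.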

Read tests

To test document parsing, the W3C start page was used as input and each of the following code snippets was executed 1'000 times in a loop (the first line for phpQuery and the second for QueryPath):

Test 1: Finding all <a> tags

$pq->find('a');
$qp->find(':root a');

Test 2: Finding nested tags

$pq->find('div p span');
$qp->find(':root div p span');

Test 3: Complex selectors

$pq->find('div > div[class^="w3c"] div *:odd[id]');
$qp->find(':root div > div[class^="w3c"] div *:odd[id]');

Test 4: Chained method calls

$pq->find('div.headline')->children('[class]')->not('h3')->filter('p')->children('span,a');
$qp->find(':root div.headline')->children('[class]')->not('h3')->filter('p')->children('span,a');
[Figure: Read test results (less is better)]

Except for test 3, one can clearly see that QueryPath is heavily outperformed by phpQuery, especially for parsing nested tags in test 2. But even for simple operations, such as finding a specific tag in test 1, QueryPath is more than 3 times slower than phpQuery.

Needless to say, I was quite surprised by this result and first thought there was a bug in the testing code. But after thoroughly analyzing it and re-running the tests the results remained the same.

Write tests

The write tests involved creating a new document in test 1 and then modifying it in the subsequent tests:

Test 1: Creating an HTML file with 10'000 <div> elements

$pq->append('<div/>'); // called 10'000 times
$qp->append('<div/>'); // called 10'000 times

Test 2: Inserting text

$pq->find('div')->text('Hello world!');
$qp->find(':root div')->text('Hello world!');

Test 3: Adding attributes

$pq->find('div')->attr('class', 'test');
$qp->find(':root div')->attr('class', 'test');
[Figure: Write test results (less is better)]

As if the read tests weren't surprising enough, this one takes the cake! Except for test 1, where a new document is filled with empty tags, phpQuery gets totally crushed by QueryPath. phpQuery is about 35 times slower when it comes to writing operations.

Conclusion

So, which library should you choose?

Performance aside, both libraries cover the most important jQuery features and are almost identical in usage. You can't really go wrong with either one.

The fact that QueryPath must be told explicitly to start from the document root makes it slightly less intuitive to use, though. That, together with phpQuery's support for some of the more advanced jQuery functions and its more user-friendly documentation, might make phpQuery the better choice for some quick jQuery-esque document processing.

However, when performance becomes an important factor, one must distinguish between read- and write-heavy tasks. If you need to parse hundreds of documents, the benchmark results speak for phpQuery. For document creation and modification, however, QueryPath is clearly the superior library.