LINQ & Lambda, Part 3: Html Agility Pack to LINQ to XML Converter


The Html Agility Pack brings flexible, versatile HTML parsing to the .NET platform. It's a great toolkit for scraping websites that don't offer formal, documented APIs.

Under the hood, the Agility Pack has its own parser, deftly designed to cope with malformed HTML in all its nasty real-life forms.

For querying and traversing a parsed document, the pack offers an in-memory object graph, XPath and XSLT. Today I'd like to add a LINQ to XML converter for extra HTML agility in this new .NET 3.5 post-MIX '08 world.


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;
using System.IO;

namespace HtmlAgilityPack
{
    public static class HtmlDocumentExtensions
    {
        public static XDocument ToXDocument(this HtmlDocument document)
        {
            using (StringWriter sw = new StringWriter())
            {
                // Ask the pack to emit well-formed XML, then serialize and re-parse it.
                document.OptionOutputAsXml = true;
                document.Save(sw);
                return XDocument.Parse(sw.GetStringBuilder().ToString());
            }
        }
    }
}

To demonstrate how to use this, I'm going to partially solve a scraping problem I came across a few days ago. I needed to find all the images on a webpage, order them by descending size, and list them for the user to choose from. Their choice would be used to generate a thumbnail. Social content sites like Digg and Mixx do something similar, but seem to download the 10 biggest images, thumbnail them anyway and let the user choose. In the following example, I'm simply going to find all the images in the right order and let the reader do the thumbnailing :)

The first step is to download a page and convert its HtmlDocument object graph into an XDocument.

HtmlWeb hw = new HtmlWeb();
string url = @"";
Uri uri = new Uri(url);
HtmlDocument doc = hw.Load(url);
var xdoc = doc.ToXDocument();


Now we can find all the image tags with a simple LINQ expression.

var imgs =
        from el in xdoc.Descendants()
        where el.Name.LocalName == "img"
        select new { 
            Src = el.Attribute(XName.Get("src")).Value
        };
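To try the query without hitting the network, here is a minimal sketch that runs the same filter against any XDocument. The class and method names (ImageQuerySample, FindImageSources) are made up for illustration and are not part of the Agility Pack. Unlike the query above, it skips img elements that have no src attribute instead of throwing.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ImageQuerySample
{
    // Hypothetical helper: returns the src of every <img> element that has one,
    // in document order. Matching on LocalName ignores any XML namespace.
    public static List<string> FindImageSources(XDocument xdoc)
    {
        return (from el in xdoc.Descendants()
                where el.Name.LocalName == "img"
                let src = el.Attribute("src")
                where src != null
                select src.Value).ToList();
    }
}
```

Calling ImageQuerySample.FindImageSources(XDocument.Parse("&lt;p&gt;...&lt;/p&gt;")) on a small hand-written fragment is a quick way to check the filter before pointing it at a real page.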


The last step is to order them by descending size, but there are a few things to consider when reading the "width" and "height" from an image XElement.

  1. "width" and "height" may not exist. These attributes aren't mandatory.
  2. "width" and "height" may be expressed in percentages.
  3. XElement.Attribute(XName.Get("width")).Value will throw a NullReferenceException if the attribute doesn't exist, because Attribute returns null.
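The three points above can also be folded into one small helper. This is only a sketch under my own assumptions: ImageSizeReader and ReadDimension are hypothetical names, a missing or non-numeric value (such as "auto") is treated as 0, and a percentage is read as if it were a pixel count, which is a hack rather than a real measurement.

```csharp
using System;
using System.Globalization;
using System.Xml.Linq;

public static class ImageSizeReader
{
    // Hypothetical helper: returns the attribute's integer value, or 0 when the
    // attribute is missing or is not a plain number.
    public static int ReadDimension(XElement element, string attributeName)
    {
        XAttribute attribute = element.Attribute(attributeName);
        if (attribute == null)
        {
            return 0; // point 1: the attribute is optional
        }

        // Point 2: strip a trailing '%' and treat the rest as a number (a hack).
        string text = attribute.Value.Trim().TrimEnd('%');

        int value;
        return Int32.TryParse(text, NumberStyles.Integer, CultureInfo.InvariantCulture, out value)
            ? value
            : 0;
    }
}
```

Because the null check happens before Value is read, point 3 never arises, and TryParse means an odd value on one tag doesn't abort the whole query.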

Every raster image has a width and height, even if the image tag doesn't. If the size isn't given on the tag, we could download and open the image to find out its true height and width. A fast way to do this would be to read the image header, but this post isn't about the Image class or reading images, just LINQ to HTML. For now, I'm going to use a default width and height of 0.

XName widthAttribute = XName.Get("width");
XName heightAttribute = XName.Get("height");
XName srcAttribute = XName.Get("src");
XAttribute defaultHeight = new XAttribute(heightAttribute, "0");
XAttribute defaultWidth = new XAttribute(widthAttribute, "0");

Now we can read the width and height in the LINQ expression allowing for defaults, and adding a hack for sizes expressed in percentages.

var imgs =
    from el in xdoc.Descendants()
    // Filter to <img> first so only image attributes get parsed.
    where el.Name.LocalName == "img"
    let width = Int32.Parse((el.Attribute(widthAttribute) ?? defaultWidth).Value.TrimEnd('%'))
    let height = Int32.Parse((el.Attribute(heightAttribute) ?? defaultHeight).Value.TrimEnd('%'))
    let metric = Math.Sqrt(width * height)
    orderby metric descending
    select new {
        Src = el.Attribute(srcAttribute).Value,
        Width = width,
        Height = height
    };


foreach (var image in imgs)
{
    Console.WriteLine("{0} Size: {1}x{2}", image.Src, image.Width, image.Height);
}


Hey presto! The image URLs get written to the console in the right order. The output for the New York Times article looks like this:

Found 26 images
Size: 600x350
Size: 300x250
Size: 190x285
Size: 334x105
Size: 334x105
Size: 151x151
Size: 151x151
Size: 151x151
Size: 151x151
Size: 151x151
Size: 151x151
Size: 75x75
Size: 151x26
Size: 151x26
Size: 145x23
Size: 145x23
Size: 3x1
/adx/bin/clientside/7e2e2078Q2FZkQ3F3VQ7BpQ27!l6c6L7cQ7BQ7Bpp82Q27Vp Size: 3x1
Size: 1x1
Size: 0x0
Size: 0x0
Size: 0x0
Size: 0x0
Size: 0x0
Size: 0x0
Size: 0x0
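If the order in that listing looks odd in places (75x75 ahead of 151x26, for instance), it follows from the square-root metric used in the query. Here is a small sketch that ranks a few of those sizes on their own. SizeMetricSample, Metric and Rank are names I've made up for the example, and the cast to double is my own addition so the multiplication can't overflow an int.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SizeMetricSample
{
    // Geometric mean of the two sides, the same metric as in the query.
    public static double Metric(int width, int height)
    {
        return Math.Sqrt((double)width * height);
    }

    // Takes (width, height) pairs and returns "WxH" strings sorted by
    // descending metric. OrderByDescending is a stable sort, so sizes with
    // equal metrics keep their original relative order.
    public static List<string> Rank(IEnumerable<KeyValuePair<int, int>> sizes)
    {
        return sizes
            .OrderByDescending(s => Metric(s.Key, s.Value))
            .Select(s => s.Key + "x" + s.Value)
            .ToList();
    }
}
```

The metrics work out to roughly 232.7 for 190x285, 187.3 for 334x105, 75 for 75x75 and 62.7 for 151x26, which is why a small square image outranks a wide, short banner.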


Some of my assumptions above may seem less than ideal, but remember that it's good practice to include the width and height in the image tag so the browser knows how much visual space to leave while the page loads. For any well-designed page this approach works nicely.

Another neat use for LINQ :)

UPDATE: you can download the code with my modifications and this sample from here.


posted on Monday, May 26, 2008 4:03 PM
# re: LINQ & Lambda, Part 3: Html Agility Pack to LINQ to XML Converter
7/15/2008 9:47 AM
How do you download this thing?
# re: LINQ & Lambda, Part 3: Html Agility Pack to LINQ to XML Converter
Vijay Santhanam
7/15/2008 9:11 PM
Hi Victor,

Download the agility pack from

Simply add my extension method and you're ready to rock.

Hope that helps
