LINQ & Lambda, Part 3: Html Agility Pack to LINQ to XML Converter

 

The Html Agility Pack brings flexible, versatile HTML parsing to the .NET platform. It's a great toolkit for scraping websites that don't offer formal, documented APIs.

Under the hood, the Agility Pack has its own parser, deftly designed to cope with malformed HTML in all its nasty real-life forms.

For querying and traversing a parsed document, the pack offers an in-memory object graph, XPath and XSLT. Today I'd like to add a LINQ to XML converter for extra HTML agility in this new .NET 3.5 post-MIX '08 world.

 

using System.IO;
using System.Xml.Linq;
 
namespace HtmlAgilityPack
{
    public static class HtmlDocumentExtensions
    {
        // Serializes the parsed HTML as well-formed XML, then re-parses it as an XDocument.
        public static XDocument ToXDocument(this HtmlDocument document)
        {
            using (StringWriter sw = new StringWriter())
            {
                // Note: this switches the document itself into XML output mode.
                document.OptionOutputAsXml = true;
                document.Save(sw);
                return XDocument.Parse(sw.ToString());
            }
        }
    }
}

To demonstrate how to use this, I'm going to partially solve a scraping problem I came across a few days ago. Basically, I needed to find all the images on a webpage, order them by descending size, and list them for the user to choose from. Their choice would be used to generate a thumbnail. Social content sites like Digg and Mixx do something similar, but seem to download the 10 biggest images, thumbnail them anyway and let the user choose. In the following example, I'm going to simply find all the images in the right order and let the reader do the thumbnailing :)

The first step is to download a page and convert its HtmlDocument object graph into an XDocument.

HtmlWeb hw = new HtmlWeb();
string url = @"http://www.nytimes.com/2008/05/18/us/politics/18memoirs.html?hp";
Uri uri = new Uri(url);
 
HtmlDocument doc = hw.Load(url);
var xdoc = doc.ToXDocument();

 

Now we can find all the image tags with a simple LINQ expression.

var imgs =
        from el in xdoc.Descendants()
        where el.Name.LocalName == "img"
        select 
        new { 
            Src = el.Attribute(XName.Get("src")).Value,
        };
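One caveat with the expression above: calling .Value on a missing "src" attribute throws. The following is a minimal sketch, not part of the original sample, that runs the same kind of query against a small hand-written document and skips image tags without a "src". The class name, method name and sample markup are made up for illustration. It relies on the explicit string conversion that LINQ to XML defines for XAttribute, which returns null when the attribute is absent.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ImageQuerySketch
{
    // Returns the src of every <img> element that actually has one.
    public static string[] FindImageSources(XDocument xdoc)
    {
        return (from el in xdoc.Descendants()
                where el.Name.LocalName == "img"
                // The explicit cast yields null when the attribute is missing.
                let src = (string)el.Attribute("src")
                where !String.IsNullOrEmpty(src)
                select src).ToArray();
    }

    public static void Main()
    {
        // Hypothetical markup standing in for a converted page.
        XDocument xdoc = XDocument.Parse(
            "<html><body>" +
            "<img src='a.png' />" +
            "<img alt='no source' />" +
            "<p><img src='b.jpg' width='10' /></p>" +
            "</body></html>");

        foreach (string src in FindImageSources(xdoc))
        {
            Console.WriteLine(src);
        }
    }
}
```

With a real page, the XDocument would come from doc.ToXDocument() instead of XDocument.Parse.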

 

The last step is to order them by descending size, but there are a couple of things to consider when reading the "width" and "height" from an image XElement.

  1. "width" and "height" may not exist. These attributes aren't mandatory.
  2. "width" and "height" may be expressed in percentages.
  3. XElement.Attribute(XName.Get("width")).Value will throw a NullReferenceException if the attribute doesn't exist.

Every raster image has a width and height, even if the image tag doesn't. If the size cannot be seen on the tag, we could download and open the image to find out its true height/width. A fast way to do this would be to read the image header, but this post isn't about the Image class or reading images - just LINQ to HTML. For now, I'm going to create a default width and height of 0.

XName widthAttribute = XName.Get("width");
XName heightAttribute = XName.Get("height");
XName srcAttribute = XName.Get("src");
XAttribute defaultHeight = new XAttribute(heightAttribute, "0");
XAttribute defaultWidth = new XAttribute(widthAttribute, "0");

Now we can read the width and height in the LINQ expression allowing for defaults, and adding a hack for sizes expressed in percentages.

var imgs =
    from el in xdoc.Descendants()
    // Filter first, so the parsing below only ever runs on <img> elements.
    where el.Name.LocalName == "img"
    let width = Int32.Parse((el.Attribute(widthAttribute) ?? defaultWidth).Value.TrimEnd('%'))
    let height = Int32.Parse((el.Attribute(heightAttribute) ?? defaultHeight).Value.TrimEnd('%'))
    let metric = Math.Sqrt(width * height)
    orderby metric descending
    select new
    {
        Src = el.Attribute(srcAttribute).Value,
        Width = width,
        Height = height
    };

 

foreach (var image in imgs)
{
    Console.WriteLine("{0} Size: {1}x{2}", 
        image.Src, 
        image.Width, 
        image.Height);
}

 

Hey presto! The image URLs get written to the console in the right order. The output for the New York Times article looks like this:

Found 26 images

http://graphics7.nytimes.com/images/2008/05/18/us/18mem.span.jpg Size: 600x350
http://graphics7.nytimes.com/adx/images/ADS/14/60/ad.146024/sf.gif Size: 300x250
http://graphics7.nytimes.com/images/2008/05/17/us/18memoir_190.jpg Size: 190x285
http://graphics7.nytimes.com/ads/marketing/mm08/opinion_033108.jpg Size: 334x105
http://graphics7.nytimes.com/ads/marketing/mm08/opinion_052608.jpg Size: 334x105
http://graphics8.nytimes.com/images/2008/05/25/opinion/25moth_glanville.jpg Size: 151x151
http://graphics8.nytimes.com/images/2008/05/25/weekinreview/25moth_wong.jpg Size: 151x151
http://graphics8.nytimes.com/images/2008/05/23/fashion/25moth-brain1.ready.jpg Size: 151x151
http://graphics8.nytimes.com/images/2008/05/25/opinion/25moth_classic.gif Size: 151x151
http://graphics8.nytimes.com/images/2008/05/25/nyregion/thecity/25moth_crime.jpg Size: 151x151
http://graphics8.nytimes.com/images/2008/05/25/travel/25moth-trail.jpg Size: 151x151
http://graphics7.nytimes.com/images/blogs/thecaucus/thecaucus75.jpg Size: 75x75
http://graphics7.nytimes.com/ads/marketing/mm07/nyt-logo.png Size: 151x26
http://graphics7.nytimes.com/ads/marketing/mm07/nyt-logo.png Size: 151x26
http://graphics7.nytimes.com/ads/marketing/mm07/vertical_opinion.gif Size: 145x23
http://graphics7.nytimes.com/ads/marketing/mm07/vertical_opinion.gif Size: 145x23
http://up.nytimes.com/?d=0//&t=2&s=0&ui=&r=&u= Size: 3x1
/adx/bin/clientside/7e2e2078Q2FZkQ3F3VQ7BpQ27!l6c6L7cQ7BQ7Bpp82Q27Vp Size: 3x1
http://wt.o.nytimes.com/dcsym57yw10000s1s8g0boozt_9t1x/njs.gif?js=No&WT.tv=1.0.7 Size: 1x1
http://graphics7.nytimes.com/ads/ameriprise/AMP_logo_88x31.gif Size: 0x0
http://graphics7.nytimes.com/images/misc/nytlogo153x23.gif Size: 0x0
http://politics.nytimes.com/images/section/politics/elections/2008.gif Size: 0x0
http://graphics7.nytimes.com/adx/images/ADS/16/69/ad.166980/728x90_T_logo.gif Size: 0x0
http://graphics7.nytimes.com/adx/images/ADS/16/95/ad.169529/youngheart_88x31_8.gif Size: 0x0
http://graphics7.nytimes.com/adx/images/ADS/16/01/ad.160192/TMAG_86x60.gif Size: 0x0
http://graphics7.nytimes.com/ads/blank.gif Size: 0x0
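Note that one entry in the list is a relative path rather than a full URL. Before downloading anything, a "src" like that would need to be resolved against the address of the page. Here is a rough sketch of one way to do that with the framework's Uri class. The helper name and the relative path in Main are made up for illustration, and this is not part of the downloadable sample.

```csharp
using System;

public static class ImageUrlSketch
{
    // Resolves an img src against the page it was found on.
    // Absolute values come back unchanged; relative ones are combined with the page URL.
    // If the value cannot be resolved at all, it is returned as-is.
    public static string ToAbsolute(Uri pageUri, string src)
    {
        Uri resolved;
        return Uri.TryCreate(pageUri, src, out resolved) ? resolved.AbsoluteUri : src;
    }

    public static void Main()
    {
        Uri page = new Uri("http://www.nytimes.com/2008/05/18/us/politics/18memoirs.html?hp");

        // Hypothetical root-relative path, resolved against the page's host.
        Console.WriteLine(ToAbsolute(page, "/images/example.gif"));

        // An already absolute URL passes through untouched.
        Console.WriteLine(ToAbsolute(page, "http://graphics7.nytimes.com/ads/blank.gif"));
    }
}
```

In the scraping example, this could be applied to the Src value inside the select clause, using the Uri built from the page URL.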

 

Some of my assumptions above may seem less than ideal, but remember that it's good practice to include the width and height in the image tag so the browser knows how much visual space to leave while the page loads. For any well-designed page this approach works nicely.
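One of those assumptions is that every width and height parses as an integer once a trailing % is trimmed. Int32.Parse throws a FormatException on a value like "100px", so for less tidy pages a more forgiving reader may be worth having. The sketch below is one possible approach, with a made-up helper name, that falls back to 0 for anything missing or unparseable, matching the default used earlier.

```csharp
using System;
using System.Globalization;
using System.Xml.Linq;

public static class ImageSizeSketch
{
    // Reads a width/height attribute, returning 0 when it is missing
    // or is not a plain integer. A trailing '%' is trimmed, mirroring
    // the percentage hack used in the query earlier in the post.
    public static int ReadDimension(XElement el, string attributeName)
    {
        // The explicit cast yields null when the attribute is absent.
        string raw = (string)el.Attribute(attributeName);
        if (raw == null)
        {
            return 0;
        }

        int value;
        bool ok = Int32.TryParse(
            raw.Trim().TrimEnd('%'),
            NumberStyles.Integer,
            CultureInfo.InvariantCulture,
            out value);
        return ok ? value : 0;
    }

    public static void Main()
    {
        // Hypothetical image tag with one percentage and one unparseable size.
        XElement img = XElement.Parse("<img src='a.png' width='50%' height='100px' />");
        Console.WriteLine("{0}x{1}",
            ReadDimension(img, "width"),
            ReadDimension(img, "height"));
    }
}
```

In the query, ReadDimension(el, "width") would stand in for the Int32.Parse expression in the let clause, and the default XAttribute objects would no longer be needed.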

Another neat use for LINQ :)

UPDATE: you can download the code with my modifications and this sample from here.


posted on Monday, May 26, 2008 4:03 PM
Comments
# re: LINQ & Lambda, Part 3: Html Agility Pack to LINQ to XML Converter
Victor
7/15/2008 9:47 AM
How do you download this thing?
# re: LINQ & Lambda, Part 3: Html Agility Pack to LINQ to XML Converter
Vijay Santhanam
7/15/2008 9:11 PM
Hi Victor,

Download the agility pack from http://codeplex.com/htmlagilitypack

Simply add my extension method and you're ready to rock.

Hope that helps
