In the comments of Do Not... DO NOT! Parse HTML with Regex's and Example of Hard to Parse HTML, a couple of people piped up about how there are legitimate reasons to use regexes when your input is HTML.

Let me reiterate: if all you can say (in the biz we call this "specification") about the input to your program is that "it will be HTML", there is no reason on God's green Earth that you should be using regular expressions to parse it. I can't think of a way to make that any clearer.

Patrick Walton was first up and offered these two regexes:

s/<\s*script.*?>//gis;
s/on[a-zA-Z]+\s*=//gis;

This will, of course, mutilate the contents of a page like:

<html>
<body>
onMyBirthday=FUN!!
onHolidays=FUN!!
ontology=FUN!!
batons=FUN!!
</body>
</html>
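
If you'd rather not take my word for it, here is a minimal sketch that runs Patrick's two substitutions, copied as posted, over the page above. The variable names ($page, $filtered) are mine, not his.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The sample page from above: plain text, no JavaScript anywhere.
my $page = <<'HTML';
<html>
<body>
onMyBirthday=FUN!!
onHolidays=FUN!!
ontology=FUN!!
batons=FUN!!
</body>
</html>
HTML

# Patrick's two substitutions, applied to a copy of the page.
(my $filtered = $page) =~ s/<\s*script.*?>//gis;
$filtered =~ s/on[a-zA-Z]+\s*=//gis;

print $filtered;
```

Every onSomething= prefix vanishes, and batons=FUN!! comes out as batFUN!!, because the pattern is happy to start matching in the middle of a word. The page had no JavaScript in it at all.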

Also, there are additional places other than just <script> tags and on* attributes where one can hide JavaScript code (via Integrating JavaScript into Stylesheets):

<html>
<body>
<style>
background: url("
    javascript:
      document.body.onload = function(){
        ...custom js here...
      }
  ");
</style>
...
...
</body>
</html>
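
As a rough check, here is a sketch that feeds that stylesheet trick through the same two substitutions. The page is the example above with the "..." filler lines dropped, and the variable names are my own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The stylesheet example from above, minus the filler lines.
my $page = <<'HTML';
<html>
<body>
<style>
background: url("
    javascript:
      document.body.onload = function(){
        ...custom js here...
      }
  ");
</style>
</body>
</html>
HTML

# The same two substitutions, applied to a copy of the page.
(my $filtered = $page) =~ s/<\s*script.*?>//gis;
$filtered =~ s/on[a-zA-Z]+\s*=//gis;

print $filtered;
print "javascript: URL survived\n" if $filtered =~ /javascript:/i;
```

The second regex does happen to eat the "onload =" out of the middle of the payload, but only by accident. The javascript: URL itself sails straight through, and a payload that avoids on...= assignments would not be touched at all.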

There are more ways to break Patrick's regular expressions, but I think that is enough for now.
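
For the curious, here is one of those other ways. This is my own illustration, not something from the earlier comments, and the input string is hypothetical. Because s///g makes a single pass and never rescans its own output, deleting a decoy tag can splice a real one together:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: a script tag wrapped around a decoy script tag.
my $input = '<scr<script>ipt>alert("still here")</script>';

# The same two substitutions, applied to a copy of the input.
(my $filtered = $input) =~ s/<\s*script.*?>//gis;
$filtered =~ s/on[a-zA-Z]+\s*=//gis;

print "$filtered\n";
```

The inner <script> is removed, the two halves around it close up, and what comes out is a perfectly good <script> element.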

The next guy to chime in was one d.furuta, who says:

Assuming you know what you're doing, it's fine to use regexes if appropriate. The problem is not experienced programmers using regexes; it's inexperienced programmers not knowing when regexes are acceptable and when a parser is needed.

I have heard this argument before. Usually, I hear it as justification for seeing something like the following code:

($table_data) = $html =~ /<td>(.*?)<\/td>/gis; # pull out data between <td> tags
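
Here is a small sketch of how that one-liner behaves on a row that is only slightly untidy. The table row is made up for illustration; the regex is the one above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical table row; the middle cell carries an attribute.
my $html = '<tr><td>one</td><td class="num">2</td><td>three</td></tr>';

# List assignment to a single scalar keeps only the first capture.
my ($table_data) = $html =~ /<td>(.*?)<\/td>/gis;

# Even collecting every match skips the cell with the attribute,
# because the pattern only knows about a bare <td>.
my @cells = $html =~ /<td>(.*?)<\/td>/gis;

print "first: $table_data\n";
print "all:   @cells\n";
```

The scalar ends up holding just "one", and the list version finds "one" and "three" while silently dropping the 2.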

"But, it works!" they say.
"It's easy!"
"It's quick!"
"It will do the job just fine!"

I berate them for not being lazy. (I also berate them for using .* in a regex; see Death to Dot Star! But that's the subject of a different post.) You need to be lazy as a programmer. Parsing HTML is a solved problem. You do not need to solve it. You just need to be lazy. You have CPAN. You will never reach Perl-Guru until you have mastered CPAN. Be lazy, use CPAN and use HTML::Sanitizer. It will make your coding easier. It will leave your code more maintainable. You won't have to sit there hand-coding regular expressions. Your code will be more robust. You won't have to fix bugs every time the HTML breaks your crappy regex. This is true laziness:

#!/usr/bin/perl
use warnings;
use strict;

use HTML::Sanitizer;
use LWP::Simple;

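# get() returns undef on failure, so bail out rather than filter nothing
my $html = get('http://alpha-geek.com/example/crazy_html.html')
    or die "could not fetch the example page\n";

my $sanity = HTML::Sanitizer->new;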

# pick your poisons
# in the documentation, learn how to specify permissible attributes, also
$sanity->permit_only(qw/ html head title body a p h1 div strong em /);

print $sanity->filter_html($html);

__END__
# the above code outputs
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Crazy HTML -- Can Your Regex Parse This?</title></head><body><h1>Did The Javascript Execute?</h1><div> I will execute here, too, if you mouse over me </div></body></html>

There now, doesn't that code seem so much nicer than 97 regexes and various special-case pre-regex and post-regex string manipulations? Isn't that nice, clean, clear, and concise? Aren't robust libraries wonderful?

One last time:

Parsing HTML is a solved problem. Use a library.

If you still need convincing, I can go on and on and on about this topic. Just let me know. And the best way to let me know is to continue posting regexes in the comments. You set 'em up. I'll knock 'em down.


Comments

To make no mistakes is not in the power of man; but from their errors and mistakes the wise and good learn wisdom for the future.

Posted by Bronson Pinchot on January 13, 2004 02:59 AM

Yeah, I should point out that, a long time ago, I parsed HTML with regexes. Well, actually, I parsed everything with regexes. I used regexes everywhere I possibly could. I was a regex over-user.

But, I learned better. And, now, I can help out others by pointing out that it is better, easier, lazier, and nicer to use libraries for parsing HTML.

Posted by J$ on January 13, 2004 11:38 AM

That was not my point. If you're extracting information, yes, you should use a parser. If you're trying to mangle something beyond browser recognition, regexes are ok, as long as you know the different forms the malicious code can take.

Posted by d.furuta on January 13, 2004 11:58 PM

Furthermore, if you are a good enough programmer to know what you're doing (which I am not claiming to be), and if you want to parse HTML using substr, more power to you. Pedantry and insisting that there's only one way (TIMTOWTDI, eh?) serve no useful purpose.

Posted by d.furuta on January 14, 2004 12:06 AM

I had no idea my quick regexes would be subjected to such scrutiny; I'm flattered! I apologize that I'm not going to play your game further, but those regexes weren't designed to be a perfect solution; for this, I agree entirely that a module is needed. In fact, I *use* a module in my web project for this (unless the module isn't available, in which case regexes are used as a fallback). But for day-to-day work, using a bulky module that builds up an entire parse tree is swatting a fly with a sledgehammer.

I don't dispute that modules are quite often the right tool for the job, but I do dispute that "there's no reason on God's green earth" to use regexes when dealing with HTML. There's actually a library on CPAN which uses regexes to destroy JavaScript: HTML::StripScripts::Regex. HTML::Sanitizer actually advocates using regexes to catch JavaScript embedded in URLs. Its sister module HTML::Scrubber, which is also based on parsing the document, recommends the same method.

(Speaking of which, wouldn't a module better suited to the job - HTML::StripScripts - be a better choice for removing JavaScript?)

Posted by Patrick Walton on January 14, 2004 09:23 PM

What's your point in "mastering CPAN"? Is it the art of separating the junk modules from the good ones? It also always depends on the environment where you use your software! Do you know what happens to the startup time of your Perl interpreter after installing the twentieth CPAN module? Have you run benchmarks comparing the speed of a well-crafted regex and HTML::Sanitizer?
Those are just some pointers that programming is not always about style and code reuse.

Posted by andi on January 22, 2004 05:15 AM
