In the comments of "Do Not... DO NOT! Parse HTML with Regex's" and "Example of Hard to Parse HTML", a couple of people piped up about how there are legitimate reasons to use regexes when your input is HTML.
Let me reiterate: if all you can say (in the biz we call this "specification") about the input to your program is that "it will be HTML", there is no reason on God's green Earth that you should be using regular expressions to parse it. I can't think of a way to make that any clearer.
Patrick Walton was first up and offered these two regexes:
This will, of course, mutilate the contents of a page like:
<html> <body> onMyBirthday=FUN!! onHolidays=FUN!! ontology=FUN!! batons=FUN!! </body> </html>
There are more ways to break Patrick's regular expressions, but I think that is enough for now.
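To make the failure mode concrete, here is a minimal sketch. It is in Python rather than Perl only so that it runs with nothing but the standard library. The pattern is a hypothetical stand-in for a naive "strip the on* handlers" regex, not Patrick's actual expression, and the `Inspector` class is likewise invented for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical stand-in for a naive "strip on* handlers" regex.
# This is NOT Patrick's actual expression, which is not shown here.
NAIVE_HANDLER = re.compile(r"on\w*=\S*", re.IGNORECASE)

page = ("<html> <body> onMyBirthday=FUN!! onHolidays=FUN!! "
        "ontology=FUN!! batons=FUN!! </body> </html>")

# The regex cannot tell markup from text, so it eats visible page content.
mangled = NAIVE_HANDLER.sub("", page)


class Inspector(HTMLParser):
    """Records attribute names and text separately, as a parser sees them."""

    def __init__(self):
        super().__init__()
        self.attr_names = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for real attributes only.
        self.attr_names.extend(name for name, _ in attrs)

    def handle_data(self, data):
        self.text_chunks.append(data)


inspector = Inspector()
inspector.feed(page)
inspector.close()

print(mangled.split())                         # what survives the regex
print(inspector.attr_names)                    # attributes the parser found
print("".join(inspector.text_chunks).split())  # text the parser preserved
```

Under these assumptions the regex leaves only "bat" of the body text, while the parser reports no attributes at all and hands back all four words intact, because it knows they are text and not markup.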
The next guy to chime in was one d.furuta, who says:
Assuming you know what you're doing, it's fine to use regexes if appropriate. The problem is not experienced programmers using regexes; it's inexperienced programmers not knowing when regexes are acceptable and when a parser is needed.
I have heard this argument before. Usually, I hear it as justification for seeing something like the following code:
($table_data) = $html =~ /<td>(.*?)<\/td>/gis; # pull out data between <td> tags
"But, it works!" they say.
"It will do the job just fine!"
I berate them for not being lazy. (I also berate them for using .* in a regex, as covered in Death to Dot Star!, but that's the subject of a different post.) You need to be lazy as a programmer. Parsing HTML is a solved problem. You do not need to solve it. You just need to be lazy. You have CPAN. You will never reach Perl-Guru until you have mastered CPAN. Be lazy, use CPAN and use HTML::Sanitizer. It will make your coding easier. It will leave your code more maintainable. You won't have to sit there hand-coding regular expressions. Your code will be more robust. You won't have to fix bugs every time the HTML breaks your crappy regex. This is true laziness:
There now, doesn't that code seem so much nicer than 97 regexes and various special-case post-regex and pre-regex string manipulations? Isn't that nice, clean, clear, and concise? Aren't robust libraries wonderful?
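For anyone who wants to watch the `<td>` regex from earlier fall over, here is a rough sketch of the contrast. It uses Python's standard-library `html.parser` purely as a stand-in for a real parsing module, so it is not the CPAN code this post recommends. The sample table and the `CellCollector` class are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Made-up input: perfectly ordinary HTML with an attribute and a stray space.
html = ('<table><tr><td>one</td><td class="num">2</td>'
        '<td >three</td></tr></table>')

# Python equivalent of the /<td>(.*?)<\/td>/gis pattern quoted above.
regex_cells = re.findall(r"<td>(.*?)</td>", html, re.IGNORECASE | re.DOTALL)


class CellCollector(HTMLParser):
    """Collects the text of each <td> cell (no nested-table handling)."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


collector = CellCollector()
collector.feed(html)
collector.close()

print(regex_cells)      # only the cell whose tag is literally "<td>"
print(collector.cells)  # every cell, whatever the tag looks like
```

The regex silently drops two of the three cells the moment a tag carries an attribute or an extra space. The parser does not care, and nobody had to write a special case to get there.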
One last time:
If you still need convincing, I can go on and on and on about this topic. Just let me know. And the best way to let me know is to continue posting regexes in the comments. You set 'em up. I'll knock 'em down.