Re: a regex to parse html tags

by wog (Curate)

on Jul 05, 2001 at 06:54 UTC (#93996=note)

in reply to a regex to parse html tags

First, use the HTML::Parser module or HTML::TokeParser, or in your case, probably the HTML::HeadParser module to parse the HTML. Regular expressions won't work reliably. What if you have an HTML document like this:

I changed this, it was just <head><title></title></head>
 - djb (03 Jul 2001)
<meta name="DESCRIPTION" value="About </head> tags.">
That's a whole lot harder to parse with a regular expression.
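
As a rough sketch of the parser route, assuming HTML::HeadParser (from the HTML-Parser distribution on CPAN) is installed: the document below is a made-up test case, and it uses content= on the meta tag (rather than value=) because that is the attribute HTML::HeadParser reads. The parser exposes the title as the Title header and named meta tags as X-Meta-<Name> headers.

```perl
use strict;
use warnings;
use HTML::HeadParser;   # CPAN module, part of the HTML-Parser distribution

# Hypothetical test document: the meta content contains a stray "</head>".
my $html = <<'HTML';
<html><head something="someattribute">
<meta name="description" content="About </head> tags.">
<title>Example page</title>
</head><body><p>Body text</p></body></html>
HTML

my $p = HTML::HeadParser->new;
$p->parse($html);

# The title and the meta tag come back as header-style fields.
my $title = $p->header('Title');
my $desc  = $p->header('X-Meta-Description');

print "title: ", (defined $title ? $title : '(none)'), "\n";
print "description: ", (defined $desc ? $desc : '(none)'), "\n";
```

The stray </head> inside the attribute value is treated as ordinary attribute text, which is exactly what a simple regex cannot tell apart from a real closing tag.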

That said, [.\n] creates a character class matching a literal period or a newline. Inside []s a . is not special. You could use the /s modifier (see perlre) and just use .+ instead; the /s modifier makes . match newlines as well. Another way is to use (?:.|\n), which is the same as (.|\n) except that it doesn't capture anything (into the $<digit> variables).
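
A small sketch of the three variants, using made-up test strings, may make the difference easier to see. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $dot     = "a.b";
my $newline = "a\nb";
my $other   = "axb";

# [.\n] is a class of exactly two characters: a literal period and a newline.
my $class_dot     = $dot     =~ /^a[.\n]b$/ ? 1 : 0;
my $class_newline = $newline =~ /^a[.\n]b$/ ? 1 : 0;
my $class_other   = $other   =~ /^a[.\n]b$/ ? 1 : 0;   # "x" is not in the class

# A bare . does not match a newline unless /s is given.
my $plain_dot = $newline =~ /^a.b$/  ? 1 : 0;
my $with_s    = $newline =~ /^a.b$/s ? 1 : 0;

# (?:.|\n) matches any character without /s and without capturing.
my $alternation = $newline =~ /^a(?:.|\n)b$/ ? 1 : 0;

print "class: $class_dot $class_newline $class_other\n";
print "dot: $plain_dot, dot with /s: $with_s, alternation: $alternation\n";
```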

Also, you need to escape / in regexes if you are using / as the delimiter, like this: \/. Or you can avoid that ugliness by using an alternate delimiter (like m!regex goes here! or m(regex)). I assume the lack of a / at the end of your regex is an error made in posting your code here.
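
To illustrate only the delimiter point, here is a sketch with a made-up one-line string. The pattern is a toy and is not meant as a way to parse real HTML.

```perl
use strict;
use warnings;

my $html = "<title>Example</title>";

# With / as the delimiter, the slash in </title> has to be escaped.
my ($t1) = $html =~ /<title>(.*?)<\/title>/s;

# With another delimiter, no escaping of / is needed.
my ($t2) = $html =~ m!<title>(.*?)</title>!s;
my ($t3) = $html =~ m{<title>(.*?)</title>}s;

print "$t1 $t2 $t3\n";
```

All three forms compile to the same pattern, so the choice is purely about readability.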

update: To give another example of why not to use a regex: <head something="someattribute">...</head> won't be handled by a simple regex either.
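
Both failure cases can be reproduced with a short sketch. The pattern below is an assumed "simple regex" of the kind under discussion, not the one from the original question, and the input strings are made up.

```perl
use strict;
use warnings;

# An assumed naive pattern for grabbing the contents of <head>.
my $naive = qr{<head>(.*?)</head>}s;

# Case 1: an attribute on <head> means the literal "<head>" never appears.
my $with_attr    = '<head something="someattribute"><title>T</title></head>';
my $matched_attr = $with_attr =~ $naive ? 1 : 0;

# Case 2: "</head>" inside an attribute value ends the match too early.
my $tricky = '<head><meta name="DESCRIPTION" value="About </head> tags.">'
           . '<title>T</title></head>';
my ($captured) = $tricky =~ $naive;

print "matched head with attribute: $matched_attr\n";
print "captured: $captured\n";
```

In the second case the capture stops inside the meta tag, so the title is never seen.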

update 2: fixed typo of </head> where </title> was meant and other minor typos.

Re: Re: a regex to parse html tags
by Hofmator (Curate) on Jul 05, 2001 at 12:52 UTC

    Just one small addition to the problem of matching any character. The /s modifier is definitely the way to go, so that . matches everything including newlines.

    If for some reason you don't want to use the /s modifier - maybe you have other dots in your regex which should not match newlines - you should use a character class. The advantage over (?:.|\n) is that no backtracking has to be done.

    # character class matching any one character
    # or equivalent
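
A minimal sketch of such a character class, assuming the common idiom [\s\S] (whitespace or non-whitespace, so any character at all). The choice of [\s\S] is an assumption; [\d\D] and [\w\W] behave the same way.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

# Without /s, . stops at the newline.
my ($dot_only) = $text =~ /^(.+)/;

# [\s\S] matches any one character, newline included, without /s.
my ($any) = $text =~ /^([\s\S]+)/;

# Both can be mixed in one pattern: . stays line-bound, [\s\S] crosses lines.
my ($line1, $line2) = $text =~ /^(.+)[\s\S](.+)$/;

print "dot only: $dot_only\n";
print "lines: $line1 | $line2\n";
```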

    -- Hofmator
