English version below.
You can’t parse [X]HTML with regex. Because HTML can’t be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.
Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions.
Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes.
Artigo completo: http://www.codinghorror.com/blog/archives/001311.html
Nunca cheguei a fazer nenhum parsing sério de HTML com expressões regulares, mas pelo visto não é uma boa idéia. A menos que você queira apenas checar algumas tags simples.
Este comentário no artigo esclarece um pouco mais:
For the brave of heart: Write a regular language to recognize all strings of balanced parens.
Actually, don’t, because this is provably impossible.
Why? Because regular expressions recognize *regular languages*, a specific, well-defined class of languages. HTML, like the balanced parens problem above, doesn’t conform to this pattern.
Every once in a while, I’m reminded of why studying bona fide computer science in college was the right idea. It won’t necessarily make you a better programmer, but it has saved me from doing really stupid things from time to time, like trying to parse html with a regex.
David R. Albrecht on November 16, 2009 10:23 AM
Realmente, regexps não podem ser usadas para nada além de checar tags simples, e nunca para um texto HTML inteiro: (http://www.perlmonks.org/?node_id=8624)
you cannot write a regexp that will match any level of nested balanced parentheses. You can always write one to match a specific number of balanced ones, but not one that can match a non- fixed number of these. I think there is an explaination in the Friedl book.
Um comentário retirado do Regular Expression HOWTO que resume bem a situação:
(Note that parsing HTML or XML with regular expressions is painful. Quick-and-dirty patterns will handle common cases, but HTML and XML have special cases that will break the obvious regular expression; by the time you’ve written a regular expression that handles all of the possible cases, the patterns will be very complicated. Use an HTML or XML parser module for such tasks.)
Portanto… não faça isto.
Visto noSeen at http://www.codinghorror.com/:
You can’t parse [X]HTML with regex. Because HTML can’t be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.
Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions.
Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes.
Full article: http://www.codinghorror.com/blog/archives/001311.html
I never used regular expressions to do any serious HTML parsing, but thinking about it it doesn’t looks a good idea. Unless you just want to check some simple tags.
This comment in the article is interesting:
For the brave of heart: Write a regular language to recognize all strings of balanced parens.
Actually, don’t, because this is provably impossible.
Why? Because regular expressions recognize *regular languages*, a specific, well-defined class of languages. HTML, like the balanced parens problem above, doesn’t conform to this pattern.
Every once in a while, I’m reminded of why studying bona fide computer science in college was the right idea. It won’t necessarily make you a better programmer, but it has saved me from doing really stupid things from time to time, like trying to parse html with a regex.
David R. Albrecht on November 16, 2009 10:23 AM
Indeed, regexps cannot be used to anything else beyond checking simple tags, and never for a full HTML text: (http://www.perlmonks.org/?node_id=8624)
you cannot write a regexp that will match any level of nested balanced parentheses. You can always write one to match a specific number of balanced ones, but not one that can match a non- fixed number of these. I think there is an explaination in the Friedl book.
A comment taken from the Regular Expression HOWTO that sums it up well:
(Note that parsing HTML or XML with regular expressions is painful. Quick-and-dirty patterns will handle common cases, but HTML and XML have special cases that will break the obvious regular expression; by the time you’ve written a regular expression that handles all of the possible cases, the patterns will be very complicated. Use an HTML or XML parser module for such tasks.)
So… don’t do this.