I need to extract some data from a web page that I'm parsing, using
regular expressions to find matches I'm interested in. In particular,
I'm trying to extract the text between <A> and </A>. The <A> tag has
several attributes which I don't care about. The regular expression I'm
trying to use is:
<a class=((.)+?) id=((.)+?)</a>
I'm after anchors that have class and id as their first attributes.
This pattern works for most of the page I'm parsing, but it fails when
there are extra spaces in the tag, for instance if the HTML starts like:
<a  class=t id= .....
or
<a class=t  id= .....
the pattern does NOT match because of the extra spaces between things,
even though it's an anchor that I do want to extract. How can I ignore
those extra spaces?
Thanks.
Chris Smith - 27 Jun 2006 01:48 GMT
> <a class=t id= .....
>
> the pattern does NOT match because of the extra spaces between things,
> even though it's an anchor that I do want to extract. How can I ignore
> those extra spaces?
Use \s+ (if in a string literal, \\s+) instead of the space in the
regular expression.
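
As a rough sketch of that suggestion in Java: the class name, method name and sample HTML below are made up for illustration, and the [^>]* and (.*?) parts are a simplification of the original pattern, not something the poster wrote. The only point is that \\s+ replaces each literal space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorText {
    // "\\s+" in the string literal is \s+ in the regex: one or more whitespace characters.
    // [^>]* skips attribute values and any further attributes up to the end of the opening tag.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+class=[^>]*?\\s+id=[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text between <a ...> and </a> for every anchor whose
    // first two attributes are class and id, in that order.
    static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical input: the second anchor has extra spaces, the third lacks class/id.
        String html = "<a class=t id=x1>one</a> <a  class=t   id=x2 href=u>two</a> <a href=u>skip</a>";
        System.out.println(extract(html)); // prints [one, two]
    }
}
```

Because \s also matches tabs and line breaks, the same pattern should cope with a tag whose attributes are split across lines.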

Chris Smith - Lead Software Developer / Technical Trainer
MindIQ Corporation
cmills28@yahoo.com - 27 Jun 2006 02:26 GMT
Thanks Chris!! That did it!
i30817@gmail.com - 27 Jun 2006 11:33 GMT
If you also want the match to span line breaks (\n), you can do something like this:
<tag>(([^<]*\n)*[^<]*)</tag>
and use the $1 captured group for whatever you want. I think it's
correct. You can test regular expressions quickly in jEdit.
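
For what it's worth, here is a rough Java sketch of the same idea. In java.util.regex a negated character class such as [^<] already matches line terminators, so the sketch uses the shorter <tag>([^<]*)</tag>. The <tag> name, the class name and the sample input are placeholders, not anything from the page being parsed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiLineTag {
    // [^<] matches any character except '<', including \n, so no DOTALL flag is needed.
    private static final Pattern TAG = Pattern.compile("<tag>([^<]*)</tag>");

    // Returns the text of the first <tag>...</tag> element, or null if none is found.
    static String firstTagText(String input) {
        Matcher m = TAG.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input whose content is split over two lines.
        String input = "<tag>line one\nline two</tag>";
        System.out.println(firstTagText(input));
    }
}
```

Like the original pattern, this only matches when there is no nested markup between the opening and closing tag.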