Hi everybody,
I am trying to read the HTML source of a web page and find all the
thumbnails and linked addresses within.
The code below works, but with one strange exception: if I test it on an
HTML page, I get all the links except for one.
So to test and debug it, I took out that html line, and stuck it in a
test example. Whatever I do, I only get the last occurrence. Even though
the first two match the expression, they are not returned.
I always seem to get only the last one - if I delete it, I get the last
one from the newly formed string.
The regular expression tests for the following:
<a href="(1)"><img src="(2)"></a>
It takes into account the fact that there may be other stuff within (like
border, height, width, ...). (1) and (2) are the addresses I should
receive.
Here is the code:
--- code ---
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest
{
    public RegexTest()
    {
        String html =
            "<a href=\"http://www.domain.com/\"><img src=\"domain.gif\" alt=\"Domain\"></a>" +
            "</td><td align=center valign=middle width=\"33%\">" +
            "<a href=\"http://www.domain.org/\"><img border=0 width=130 height=70 src=\"domain.jpg\" " +
            "alt=\"Domain\"></a></td></tr></table></td>" +
            "<a href=\"http://www.domein.be/\"><img border=0 width=130 height=70 src=\"domein.gif\" " +
            "alt=\"Domein\"></a></td></tr></table></td>";
        String expression = "<a.*href=\"([^\"]*)\".*>.*<img.*src=\"([^\"]*)\".*>.*</a>";
        Pattern p = Pattern.compile(expression);
        Matcher m = p.matcher(html);
        while (m.find())
        {
            System.out.println("Found a match!");
            System.out.println(" -> href: " + m.group(1));
            System.out.println(" -> isrc: " + m.group(2));
        }
    }

    public static void main(String[] args)
    {
        new RegexTest();
    }
}
--- /code ---
I do not understand why only the last match is returned - can somebody
please clarify this, or point me in the right direction?
Thanks in advance,
JayCee

Signature
http://jcsnippets.atspace.com/
a collection of source code, tips and tricks
hiwa - 18 Sep 2006 00:05 GMT
jcsnippets.atspace.com wrote:
> String expression = "<a.*href=\"([^\"]*)\".*>.*<img.*src=\"([^\"]*)\".*>.*</a>";
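Your .* is what is called a 'greedy match': it swallows as many
characters as it can, so the first one runs all the way to the last
'href'. That is why you only ever get the last link.
Use a more restrictive regexp string. I would simply use an <a href=
pattern.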
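To make the difference visible, here is a minimal sketch that runs a greedy and a reluctant version of the start of your expression against a shortened input. The class name GreedyDemo, the two helper methods and the two-link input string are made up for this illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo
{
    // Counts how many times the regex matches in the input.
    public static int countMatches(String regex, String input)
    {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find())
        {
            count++;
        }
        return count;
    }

    // Returns group 1 of the first match, or null if nothing matches.
    public static String firstGroup(String regex, String input)
    {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args)
    {
        // Hypothetical two-link input, shortened from the original test string.
        String input = "<a href=\"one\">x</a><a href=\"two\">y</a>";

        // Greedy: the first .* runs to the end, then backs up to the LAST href.
        String greedy = "<a.*href=\"([^\"]*)\".*>";
        // Reluctant: .*? stops at the FIRST href it can use.
        String reluctant = "<a.*?href=\"([^\"]*)\".*?>";

        // prints "1 two": one match covering the whole input
        System.out.println(countMatches(greedy, input) + " " + firstGroup(greedy, input));
        // prints "2 one": one match per link, starting with the first
        System.out.println(countMatches(reluctant, input) + " " + firstGroup(reluctant, input));
    }
}
```

The greedy version reports a single match because that one match already consumes both links, which leaves nothing for the next call to find().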
hiwa - 18 Sep 2006 00:28 GMT
hiwa wrote:
For example, this one works:
"<a href=\"([^\"]*)\".*?>.*?<img.*?src=\"([^\"]*)\".*?>.*?</a>"
But you could simplify it much more than this.
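One possible simplification, as a sketch only: replace every .*? inside a tag with [^>]*, which cannot run past the end of that tag, and allow only whitespace between the tags. The class name ThumbnailLinks and the method extract are invented for this example, and the pattern assumes double-quoted attribute values and no '>' inside an attribute, so it is not a general HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThumbnailLinks
{
    // [^>]* stays inside the current tag, so no reluctant quantifiers are needed.
    private static final Pattern LINKED_IMAGE = Pattern.compile(
        "<a\\s[^>]*href=\"([^\"]*)\"[^>]*>\\s*" +
        "<img\\s[^>]*src=\"([^\"]*)\"[^>]*>\\s*</a>");

    // Returns one {href, src} pair per <a href=...><img src=...></a> found.
    public static List<String[]> extract(String html)
    {
        List<String[]> result = new ArrayList<String[]>();
        Matcher m = LINKED_IMAGE.matcher(html);
        while (m.find())
        {
            result.add(new String[] { m.group(1), m.group(2) });
        }
        return result;
    }

    public static void main(String[] args)
    {
        // Two of the links from the original test string.
        String html =
            "<a href=\"http://www.domain.com/\"><img src=\"domain.gif\" alt=\"Domain\"></a>" +
            "</td><td align=center valign=middle width=\"33%\">" +
            "<a href=\"http://www.domain.org/\"><img border=0 width=130 height=70 " +
            "src=\"domain.jpg\" alt=\"Domain\"></a></td></tr></table></td>";

        for (String[] pair : extract(html))
        {
            System.out.println(" -> href: " + pair[0]);
            System.out.println(" -> isrc: " + pair[1]);
        }
    }
}
```

Because each piece of the pattern is confined to a single tag, a match can never stretch from one link into the next, whichever quantifier style is used.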
jcsnippets.atspace.com - 18 Sep 2006 15:31 GMT
<snipped>
>> String expression = "<a.*href=\"([^\"]*)\".*>.*<img.*src=\"([^\"]*)\".*>.*</a>";
> Your .* is called 'greedy match' that swallows all the characters
> until the last 'href'.
> Use more reasonable regexp string. I would simply use <a href=
> pattern.
Hi Hiwa,
Now I get it - by the way, thank you for posting your working example in
your other post! Another lesson learned.
Best regards,
JayCee
