Java Forum / First Aid / September 2006

Regular expression does not return all matches?

jcsnippets.atspace.com - 17 Sep 2006 14:53 GMT
Hi everybody,

I am trying to read the HTML source of a web page and find all the
thumbnails and linked addresses within it.

The code below works, but with one strange exception: if I test it on an
HTML page, I get all the links except for one.

So to test and debug it, I took out that html line, and stuck it in a
test example. Whatever I do, I only get the last occurrence. Even though
the first two match the expression, they are not returned.

I always seem to get only the last one - if I delete it, I get the last
one from the newly formed string.

The regular expression tests for the following:
       <a href="(1)"><img src="(2)"></a>
It takes into account the fact that there may be other stuff within (like
border, height, width, ...). (1) and (2) are the addresses I should
receive.

Here is the code:
--- code ---
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest
{

    public RegexTest()
    {
        String html =
            "<a href=\"http://www.domain.com/\"><img src=\"domain.gif\" alt=\"Domain\"></a>" +
            "</td><td align=center valign=middle width=\"33%\">" +
            "<a href=\"http://www.domain.org/\"><img border=0 width=130 height=70 src=\"domain.jpg\" " +
            "alt=\"Domain\"></a></td></tr></table></td>" +
            "<a href=\"http://www.domein.be/\"><img border=0 width=130 height=70 src=\"domein.gif\" " +
            "alt=\"Domein\"></a></td></tr></table></td>";
        String expression = "<a.*href=\"([^\"]*)\".*>.*<img.*src=\"([^\"]*)\".*>.*</a>";
        Pattern p = Pattern.compile(expression);
        Matcher m = p.matcher(html);
        while (m.find())
        {
            System.out.println("Found a match!");
            System.out.println("  -> href: " + m.group(1));
            System.out.println("  -> isrc: " + m.group(2));
        }
    }

    public static void main(String[] args)
    {
        new RegexTest();
    }

}
--- /code ---

I do not understand why only the last match is returned - can somebody
please clarify this, or point me in the right direction?

Thanks in advance,

JayCee

http://jcsnippets.atspace.com/
a collection of source code, tips and tricks

hiwa - 18 Sep 2006 00:05 GMT
jcsnippets.atspace.com wrote:

> Hi everybody,
>
[quoted text clipped - 66 lines]
> http://jcsnippets.atspace.com/
> a collection of source code, tips and tricks

> String expression = "<a.*href=\"([^\"]*)\".*>.*<img.*src=\"([^\"]*)\".*>.*</a>";
Your .* is a 'greedy' match: it swallows as many characters as it can,
so a single match runs all the way to the last 'href' in the input.
Use a more restrictive regexp string. I would simply use an <a href= pattern.
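
To see the greedy behaviour in isolation, here is a minimal sketch. The class name GreedyDemo, the helper firstMatch and the two-link sample string are made up for illustration and are not part of the original code. It runs the same input through a greedy pattern and through a reluctant one (.*? instead of .*) and prints what each matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo
{
    // Hypothetical helper: returns {whole match, group 1} for the first
    // match of regex in input, or null when nothing matches.
    static String[] firstMatch(String regex, String input)
    {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find())
        {
            return null;
        }
        return new String[] { m.group(), m.group(1) };
    }

    public static void main(String[] args)
    {
        // Two links on one line, like the test string in the question.
        String html = "<a href=\"one\">1</a><a href=\"two\">2</a>";

        // Greedy: .* first runs to the end of the input, then backs up
        // only as far as needed, so it settles on the LAST href.
        String[] greedy = firstMatch("<a.*href=\"([^\"]*)\".*</a>", html);

        // Reluctant: .*? grows one character at a time and stops at the
        // FIRST position where the rest of the pattern fits.
        String[] reluctant = firstMatch("<a.*?href=\"([^\"]*)\".*?</a>", html);

        System.out.println("greedy    -> " + greedy[0] + " / href: " + greedy[1]);
        System.out.println("reluctant -> " + reluctant[0] + " / href: " + reluctant[1]);
    }
}
```

With the greedy pattern the whole line is one match, so the while (m.find()) loop runs only once and reports the last link. The reluctant pattern stops at the first </a>, which leaves the second link for the next call to find().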
hiwa - 18 Sep 2006 00:28 GMT
hiwa wrote:

For example, this one works:
"<a href=\"([^\"]*)\".*?>.*?<img.*?src=\"([^\"]*)\".*?>.*?</a>"
But you could simplify it much further than this.
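
One possible direction, as a sketch rather than the definitive answer: replace each .*? with a negated character class such as [^>]*, which cannot run past the end of the current tag. The class name SimplerLinkRegex and the method extractLinks are invented for this example. It assumes double-quoted href and src values, no '>' inside attribute values, and nothing but whitespace between the <a> tag and the <img> tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimplerLinkRegex
{
    // [^>]* can never cross a '>', so each piece of the pattern stays
    // inside its own tag and no reluctant quantifiers are needed.
    static final Pattern LINKED_IMAGE = Pattern.compile(
        "<a[^>]*href=\"([^\"]*)\"[^>]*>\\s*<img[^>]*src=\"([^\"]*)\"[^>]*>\\s*</a>");

    // Hypothetical helper: returns one {href, src} pair for every
    // <a ...><img ...></a> found in html.
    static List<String[]> extractLinks(String html)
    {
        List<String[]> links = new ArrayList<String[]>();
        Matcher m = LINKED_IMAGE.matcher(html);
        while (m.find())
        {
            links.add(new String[] { m.group(1), m.group(2) });
        }
        return links;
    }

    public static void main(String[] args)
    {
        // Same three links as the test string in the question.
        String html =
            "<a href=\"http://www.domain.com/\"><img src=\"domain.gif\" alt=\"Domain\"></a>" +
            "</td><td align=center valign=middle width=\"33%\">" +
            "<a href=\"http://www.domain.org/\"><img border=0 width=130 height=70 " +
            "src=\"domain.jpg\" alt=\"Domain\"></a></td></tr></table></td>" +
            "<a href=\"http://www.domein.be/\"><img border=0 width=130 height=70 " +
            "src=\"domein.gif\" alt=\"Domein\"></a></td></tr></table></td>";

        for (String[] link : extractLinks(html))
        {
            System.out.println("href: " + link[0] + "  src: " + link[1]);
        }
    }
}
```

Because the character classes are bounded by the tag delimiters, the result no longer depends on how far the matcher backtracks.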
jcsnippets.atspace.com - 18 Sep 2006 15:31 GMT
<snipped>
>> String expression = "<a.*href=\"([^\"]*)\".*>.*<img.*src=\"([^\"]*)\".*>.*</a>";
> Your .* is called 'greedy match' that swallows all the characters
> until the last 'href'.
> Use more reasonable regexp string. I would simply use <a href=
> pattern.

Hi Hiwa,

Now I get it - by the way, thank you for posting your working example in
your other post! Another lesson learned.

Best regards,

JayCee

http://jcsnippets.atspace.com/
a collection of source code, tips and tricks


