Java Forum / General / March 2006

Java Regex Problem

stevengarcia@yahoo.com - 27 Mar 2006 21:01 GMT
I want to extract all the content between HTML <li> tags.  I'm using
regular expressions and I'm not capturing every match with my regex.
What I have is:

String regex = "<li>(.*)</li>";
String content = "<html><li>aaa</li><li>bbb</li></html>";

Pattern p = Pattern.compile(regex);
Matcher matcher = p.matcher(content);
while (matcher.find()) {
   System.out.println(matcher.group(1));
}

The result of this is "aaa</li><li>bbb" and that is not what I want.  I
instead want to just print "aaa" and "bbb".  What am I doing wrong?

Thanks for your help.
Lars-Åke Aspelin - 27 Mar 2006 21:30 GMT
>I want to extract all the content between HTML <li> tags.  I'm using
>regular expressions and I'm not capturing every match with my regex.
[quoted text clipped - 13 lines]
>
>Thanks for your help.

If you add a '?' after the '*', the quantifier becomes reluctant instead
of greedy, which gives you the expected result.

String regex = "<li>(.*?)</li>";

Hope this helps

Lars-Åke
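
For readers who want to see the difference side by side, here is a minimal sketch that runs both patterns from this thread against the original input. The class name ReluctantDemo and the helper groups() are made up for illustration; only java.util.regex calls from the standard library are used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantDemo {
    // Hypothetical helper: collects capture group 1 of every match.
    static List<String> groups(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String content = "<html><li>aaa</li><li>bbb</li></html>";
        // Greedy: .* runs to the last </li>, so there is a single match.
        System.out.println(groups("<li>(.*)</li>", content));  // [aaa</li><li>bbb]
        // Reluctant: .*? stops at the first </li> that lets the match succeed.
        System.out.println(groups("<li>(.*?)</li>", content)); // [aaa, bbb]
    }
}
```

Note that '.' does not match line terminators by default, so an item whose text spans several lines would also need Pattern.DOTALL.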
Oliver Wong - 27 Mar 2006 23:30 GMT
>I want to extract all the content between HTML <li> tags.  I'm using
> regular expressions and I'm not capturing every match with my regex.
[quoted text clipped - 11 lines]
> The result of this is "aaa</li><li>bbb" and that is not what I want.  I
> instead want to just print "aaa" and "bbb".  What am I doing wrong?

   In general, regular expressions are not sufficient to solve this
problem, since list-items in HTML can be nested, e.g.

<exampleHtmlSnippet>
<ul>
 <li>
   <ol>
     <li>Foo</li>
     <li>Bar</li>
   </ol>
 </li>
 <li>Buntz</li>
</ul>
</exampleHtmlSnippet>

   To solve the problem in general, you might look into an XML parser (if
your HTML is valid XML).

   If you somehow "know" that you'll never get nested list-items, then the
problem is that your regular expression is behaving greedily; i.e. it's
matching as-much-as-possible, as opposed to as-little-as-possible.

   - Oliver
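
As a rough illustration of the XML parser suggestion above, the sketch below uses the JDK's built-in DOM parser and XPath. It assumes the markup is well-formed XML, which real-world HTML often is not. The class name ListItemXPath, the helper leafItems(), and the choice to return only the innermost list items are assumptions made for this example, not something proposed in the thread.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ListItemXPath {
    // Hypothetical helper: text of every <li> that contains no nested <li>.
    static List<String> leafItems(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//li[not(.//li)]", doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent().trim());
            }
            return out;
        } catch (Exception e) {
            // Malformed input (not well-formed XML) ends up here.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<ul><li><ol><li>Foo</li><li>Bar</li></ol></li>"
                   + "<li>Buntz</li></ul>";
        System.out.println(leafItems(xml)); // [Foo, Bar, Buntz]
    }
}
```

Unlike a regex, the parser tracks nesting, so the outer list item that wraps the ordered list is not confused with its children.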
stevengarcia@yahoo.com - 28 Mar 2006 19:31 GMT
>     In general, regular expressions are not sufficient to solve this
> problem, since list-items in HTML can be nested, e.g.
[quoted text clipped - 10 lines]
> </ul>
> </exampleHtmlSnippet>

Generally yes I would use an XML parser but I don't think the HTML will
ever change in my case.  And using regex I think is easier than using
an XML parser and trying to locate particular nodes.  I guess XPath
would help in that case but I'm confident this will work.

>     To solve the problem in general, you might look into an XML parser (if
> your HTML is valid XML).

I'm not sure if it's valid or not.  I guess I could parse it and find
out.  :)

>     If you somehow "know" that you'll never get nested list-items, then the
> problem is that your regular expression is behaving greedily; i.e. it's
> matching as-much-as-possible, as opposed to as-little-as-possible.

I'm looking for something quick and easy, this is not for some big
company project.  Thanks for your time.

-- Steve
Oliver Wong - 28 Mar 2006 19:58 GMT
>>     In general, regular expressions are not sufficient to solve this
>> problem, since list-items in HTML can be nested, e.g.
[quoted text clipped - 15 lines]
> an XML parser and trying to locate particular nodes.  I guess XPath
> would help in that case but I'm confident this will work.

   Not sure I understand; my objection is not that the HTML might change
during the program execution, but that list-items can be nested, as per the
example above (notice that the first <li> you encounter contains further
<li> elements).

   If you honestly mean that the HTML will never ever change at all, why
not just hard-code the return result into your function?

>>     To solve the problem in general, you might look into an XML parser
>> (if
[quoted text clipped - 10 lines]
> I'm looking for something quick and easy, this is not for some big
> company project.  Thanks for your time.

   Are you writing some sort of throw-away program which you'll run once,
and then throw away afterwards? I guess you're trying to do some analysis on
one particular HTML file. You should "describe the goal, not the step". See
http://www.catb.org/~esr/faqs/smart-questions.html#goal

   - Oliver
stevengarcia@yahoo.com - 28 Mar 2006 20:47 GMT
> > Generally yes I would use an XML parser but I don't think the HTML will
> > ever change in my case.  And using regex I think is easier than using
[quoted text clipped - 5 lines]
> example above (notice that the first <li> you encounter contains further
> <li> elements).

For the HTML I'm parsing, there won't be any nested list items.

>     If you honestly mean that the HTML will never ever change at all, why
> not just hard-code the return result into your function?

What's between the list items can be variable, but it will always be
text and not embedded HTML.  So I don't think hardcoding the return
result would work (I'm not sure what I would hard-code anyway.)

>     Are you writing some sort of throw-away program which you'll run once,
> and then throw away afterwards? I guess you're trying to do some analysis on
> one particular HTML file. You should "describe the goal, not the step". See
> http://www.catb.org/~esr/faqs/smart-questions.html#goal

I'm writing something for my own personal use - it's a program that
will screen scrape a website for information that I want.  Because I do
not expect this code to work in perpetuity, I'm looking for a quick and
easy way to reliably extract information from an HTML page.  It's kind
of like a prototype of sorts.

As for not stating the goal, I deliberately abstracted the details away
(for you and everyone else) because I ran into a problem that was not inherent
to my task.  I recognized that regular expressions can be greedy or
reluctant, and I did some research on those, but I didn't get enough
information to help me.  So the problem is really not whether I'm
finding the "most right" solution for parsing HTML.  I am confident
that the program, when I'm finished, will satisfactorily accomplish my
task, despite the real risks you identified (which, BTW, I've already
determined to be low enough risk not to warrant another solution, like
XML parsing.)

The problem I wanted to state to the group was how do I prevent my
regular expression from grouping too much information?  I happen to use
HTML as my example (which has caused the confusion) but could have made
up some other example as well.

>     - Oliver
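
Given the statement above that the text between the tags is always plain text and never embedded HTML, another way to keep the group from capturing too much is a negated character class instead of '.'. This is only a sketch under that assumption; the class name NegatedClassDemo and the helper items() are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedClassDemo {
    // [^<]* matches any run of characters that are not '<', so the group
    // cannot run past the start of the closing tag.
    private static final Pattern ITEM = Pattern.compile("<li>([^<]*)</li>");

    // Hypothetical helper: returns the text of every flat <li> element.
    static List<String> items(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = ITEM.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String content = "<html><li>aaa</li><li>bbb</li></html>";
        System.out.println(items(content)); // [aaa, bbb]
    }
}
```

Because [^<] also matches line terminators, this variant handles items that span several lines without extra flags. It will skip any item that does contain markup, which is consistent with the assumption but worth keeping in mind.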
Oliver Wong - 28 Mar 2006 21:06 GMT
>>     Are you writing some sort of throw-away program which you'll run
>> once,
[quoted text clipped - 3 lines]
>> See
>> http://www.catb.org/~esr/faqs/smart-questions.html#goal

[...]

> As for not stating the goal, I deliberately abstracted the details away
> (for you and everyone else) because I ran into a problem that was not inherent
[quoted text clipped - 11 lines]
> HTML as my example (which has caused the confusion) but could have made
> up some other example as well.

   Okay, fair enough. You saw Lars' post, right? Use '?' to disable greedy
matching:

<quote>
"<li>(.*?)</li>";
</quote>

   - Oliver
Roedy Green - 28 Mar 2006 22:37 GMT
>it's a program that
>will screen scrape a website for information that I want.

You can use plain old indexOf to find the stuff surrounding what you
want and substring to extract it.  It is fast and impervious to all
kinds of non-grammatical stuff in there.

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
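
The indexOf/substring approach suggested above might look something like the following sketch. The class name IndexOfDemo and the helper between() are hypothetical; the delimiters are passed in, so the same loop works for any pair of surrounding strings.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexOfDemo {
    // Hypothetical helper: every substring found between open and close.
    static List<String> between(String text, String open, String close) {
        List<String> out = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(open, from);
            if (start < 0) break;          // no further opening delimiter
            start += open.length();        // skip past the delimiter itself
            int end = text.indexOf(close, start);
            if (end < 0) break;            // unterminated item: stop
            out.add(text.substring(start, end));
            from = end + close.length();   // resume after the closing delimiter
        }
        return out;
    }

    public static void main(String[] args) {
        String content = "<html><li>aaa</li><li>bbb</li></html>";
        System.out.println(between(content, "<li>", "</li>")); // [aaa, bbb]
    }
}
```

Like the reluctant regex, this pairs each opening delimiter with the nearest closing one, so it carries the same limitation with nested list items.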