> > Generally yes I would use an XML parser but I don't think the HTML will
> > ever change in my case. And using regex I think is easier than using
[quoted text clipped - 5 lines]
> example above (notice that the first <li> you encounter contains further
> <li> elements).
For the HTML I'm parsing, there won't be any nested list items.
> If you honestly mean that the HTML will never ever change at all, why
> not just hard-code the return result into your function?
What's between the list items can be variable, but it will always be
text and not embedded HTML. So I don't think hardcoding the return
result would work (I'm not sure what I would hard-code anyway.)
> Are you writing some sort of throw-away program which you'll run once,
> and then throw away afterwards? I guess you're trying to do some analysis on
> one particular HTML file. You should "describe the goal, not the step". See
> http://www.catb.org/~esr/faqs/smart-questions.html#goal
I'm writing something for my own personal use - it's a program that
will screen scrape a website for information that I want. Because I do
not expect this code to work in perpetuity, I'm looking for a quick and
easy way to reliably extract information from an HTML page. It's kind
of like a prototype of sorts.
As for not stating the goal, I actually abstracted the problem for you
(and everyone else) because I ran into a problem that was not inherent
to my task. I recognized that regular expressions can be greedy or
reluctant, and I did some research on those, but I didn't get enough
information to help me. So the problem is really not whether I'm
finding the "most right" solution for parsing HTML. I am confident
that the program, when I'm finished, will satisfactorily accomplish my
task, despite the real risks you identified (which, BTW, I've already
determined to be low enough risk not to warrant another solution, like
XML parsing.)
The problem I wanted to state to the group was how do I prevent my
regular expression from grouping too much information? I happen to use
HTML as my example (which has caused the confusion) but could have made
up some other example as well.
> - Oliver
Oliver Wong - 28 Mar 2006 21:06 GMT
>> Are you writing some sort of throw-away program which you'll run
>> once,
[quoted text clipped - 3 lines]
>> See
>> http://www.catb.org/~esr/faqs/smart-questions.html#goal
[...]
> As for not stating the goal, I actually abstracted the problem for you
> (and everyone else) because I ran into a problem that was not inherent
[quoted text clipped - 11 lines]
> HTML as my example (which has caused the confusion) but could have made
> up some other example as well.
Okay, fair enough. You saw Lars' post, right? Put a '?' after the
quantifier to make it reluctant (non-greedy) instead of greedy:
<quote>
"<li>(.*?)</li>";
</quote>
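For illustration, here is a minimal sketch of how that pattern might be used
with java.util.regex. The class name ListItems, the helper extract, and the
sample input are made up for the example, and Pattern.DOTALL is an added
assumption so that an item's text may span several lines:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class ListItems {
    // Reluctant quantifier: .*? stops at the first </li> that follows,
    // where the greedy .* would run on to the last </li> in the input.
    // DOTALL (an assumption) lets '.' match line terminators too.
    private static final Pattern ITEM =
            Pattern.compile("<li>(.*?)</li>", Pattern.DOTALL);

    // Returns the text of each non-nested <li>...</li> in document order.
    static List<String> extract(String html) {
        List<String> items = new ArrayList<>();
        Matcher m = ITEM.matcher(html);
        while (m.find()) {
            items.add(m.group(1)); // group 1 is the text between the tags
        }
        return items;
    }

    public static void main(String[] args) {
        String html = "<ul><li>one</li><li>two</li></ul>";
        System.out.println(extract(html)); // prints [one, two]
    }
}
```

With the greedy "<li>(.*)</li>" the same input would yield a single match
whose group is "one</li><li>two". This only holds up as long as the list
items are not nested, as you said.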
- Oliver
Roedy Green - 28 Mar 2006 22:37 GMT
>it's a program that
>will screen scrape a website for information that I want.
You can use plain old indexOf to find the stuff surrounding what you
want and substring to extract it. It is fast and impervious to all
kinds of non-grammatical stuff in there.
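A rough sketch of that approach, assuming the items are delimited by
literal "<li>" and "</li>" strings and are not nested. The class name
IndexOfItems and the helper extract are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name, for illustration only.
class IndexOfItems {
    // Collects the text between each "<li>" and the next "</li>".
    static List<String> extract(String html) {
        final String open = "<li>";
        final String close = "</li>";
        List<String> items = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = html.indexOf(open, from);
            if (start < 0) {
                break; // no more opening tags
            }
            start += open.length(); // skip past the tag itself
            int end = html.indexOf(close, start);
            if (end < 0) {
                break; // unterminated item: stop rather than guess
            }
            items.add(html.substring(start, end));
            from = end + close.length(); // resume after the closing tag
        }
        return items;
    }

    public static void main(String[] args) {
        String html = "<ul><li>one</li><li>two</li></ul>";
        System.out.println(extract(html)); // prints [one, two]
    }
}
```

Because indexOf always finds the nearest closing tag after the opening
one, there is no greedy-versus-reluctant question to worry about.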

Signature
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.