Hello all,
I would like to get some advice from someone who knows a lot more than
I do. I need a special-purpose Java HTML parser. I have seen several out there,
but none meet my needs. What I need to do is GET a web page, find some items
with check boxes, set the appropriate selections and post the data back.
Most of the parsers let one search for HTML tags, links, etc., but not items
like check boxes. I need the name and value of the check boxes in question
so I can POST the desired values.
TIA,
Adam
P.S. As you can probably tell from my E-Mail, I am fairly new to web
programming.
Chris Smith - 13 Dec 2005 05:28 GMT
> I would like to get some advice from someone that knows a lot more than
> I. I need a special purpose Java HTML parser. I have seen several out there,
[quoted text clipped - 3 lines]
> like check boxes. I need the name and value of the check boxes in question
> so I can POST the desired values.
Pretty much any HTML parser will do. Two that come to mind are JTidy
(which was historically a formatter, but includes a parser with an API
that can be used stand-alone) or Xerces with NekoHTML (Xerces by itself
is an XML parser; Neko extends it via XNI to cover HTML as well).
There's even a parser in javax.swing.text.html, although the output
format is horrid.
Any of the above will be sufficient for your task. The posting part is
pretty trivial with Jakarta Commons HttpClient, or could be done with
URLConnection (although that's a little messy for anything beyond just
retrieving a file).
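For what it's worth, here is a rough sketch of the posting step using only the JDK's HttpURLConnection, so nothing beyond the standard library is assumed. The class and method names (FormPoster, encode, post) are made up for illustration. It assumes each field name is unique, since it uses a Map; a page with several check boxes sharing one name would need a list of pairs instead. Note that browsers only send the check boxes that are ticked, so leave the unticked ones out of the map.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Map;

// Hypothetical helper for illustration; not part of any library named in this thread.
class FormPoster {

    /** Encodes name/value pairs as an application/x-www-form-urlencoded body. */
    static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> e : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(ex);
        }
        return body.toString();
    }

    /** POSTs the fields to the target URL and returns the HTTP status code. */
    static int post(URL target, Map<String, String> fields) throws IOException {
        // The encoded body contains only ASCII characters.
        byte[] payload = encode(fields).getBytes("US-ASCII");
        HttpURLConnection conn = (HttpURLConnection) target.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        conn.setFixedLengthStreamingMode(payload.length);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload);
        }
        return conn.getResponseCode();
    }
}
```

The target URL would normally come from the form's action attribute, resolved against the URL of the page that was fetched.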

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
Missaka Wijekoon - 13 Dec 2005 05:33 GMT
> Hello all,
>
[quoted text clipped - 5 lines]
> like check boxes. I need the name and value of the check boxes in question
> so I can POST the desired values.
Adam,
You might want to take a look at
http://sourceforge.net/projects/jtidy
The code that lets you look at HTML as a DOM is probably what you need.
-Misk
Chris Uppal - 13 Dec 2005 07:25 GMT
> What I need to do is GET a web page,
> find some items with check boxes, set the appropriate selections and post
> the data back.
If you need to be able to do this with arbitrary web pages then you are
probably hosed. People use JavaScript to generate all sorts of stuff
dynamically on the client-side which makes the problem of recovering forms by
parsing the HTML a little... tricky.
Of course, you may not have a requirement to parse /arbitrary/ web pages, in
which case you can probably[*] use one of the parsers already mentioned to look
for INPUT fields inside FORM elements.
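As a rough illustration of looking for INPUT fields inside FORM elements, the sketch below uses the parser that ships with the JDK in javax.swing.text.html, mainly so that it needs no extra jars. The same idea should carry over to the other parsers mentioned. The names CheckboxFinder and Checkbox are invented for this example, and it only sees what is in the static HTML, not anything generated by JavaScript.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration; not part of any library named in this thread.
class CheckboxFinder {

    /** One check box as it appears in the page source. */
    static final class Checkbox {
        final String name;
        final String value;
        final boolean checked;

        Checkbox(String name, String value, boolean checked) {
            this.name = name;
            this.value = value;
            this.checked = checked;
        }
    }

    /** Collects every named check box that sits inside a FORM element. */
    static List<Checkbox> find(Reader html) throws IOException {
        final List<Checkbox> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int openForms = 0;

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.FORM) {
                    openForms++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t == HTML.Tag.FORM && openForms > 0) {
                    openForms--;
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                // INPUT has no end tag, so the parser reports it as a simple tag.
                if (t != HTML.Tag.INPUT || openForms == 0) {
                    return;
                }
                Object type = a.getAttribute(HTML.Attribute.TYPE);
                Object name = a.getAttribute(HTML.Attribute.NAME);
                if (type == null || name == null
                        || !"checkbox".equalsIgnoreCase(type.toString())) {
                    return;
                }
                Object value = a.getAttribute(HTML.Attribute.VALUE);
                boolean checked = a.getAttribute(HTML.Attribute.CHECKED) != null;
                // Browsers conventionally submit "on" when no value attribute is given.
                found.add(new Checkbox(name.toString(),
                        value == null ? "on" : value.toString(), checked));
            }
        };
        // true = ignore charset declarations inside the document.
        new ParserDelegator().parse(html, callback, true);
        return found;
    }

    /** Convenience overload for HTML that is already held in a String. */
    static List<Checkbox> find(String html) {
        try {
            return find(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling CheckboxFinder.find(pageSource) gives back the name and value of each check box, which is what is needed to build the POST data.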
-- chris
([*] I say "probably" because I haven't looked at any of the suggested parsers
myself.)
zero - 13 Dec 2005 13:02 GMT
> Hello all,
>
[quoted text clipped - 12 lines]
> P.S. As you can probably tell from my E-Mail, I am fairly new to web
> programming.
If you're a brave man with a lot of time you could do the world a favour
and create a good Java HTML parser. I have yet to find one that can
compete with top(*) products like the parsers of IE, Opera or Mozilla.
Of course, if you just want a quick answer with basic functionality, look
at what the others posted :-)
(*) by top I mean they work in most cases, not that they're actually that
great. Each of those products has its failings. But they still do a lot
better than any Java product I've seen.
