Hello,
I would like to process html pages in java. The very first task would
be to ignore unnecessary information like comments (everything in <!--
-->) or images.
What would be the best start point?
I have found JTidy and HTML Parser on SourceForge, but neither of them
seems able to ignore tags - or did I miss it?
Thank you for any clue
Honza
Roedy Green - 29 Nov 2005 11:01 GMT
>I would like to process html pages in java. The very first task would
>be to ignore unnecessary information like comments (everything in <!--
>-->) or images.
>What would be the best start point?
See http://mindprod.com/products1.html#ENTITIES
to strip the HTML out and, optionally, convert the &xxx; entities back
to normal characters.

Signature
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
Roedy Green - 29 Nov 2005 17:49 GMT
On Tue, 29 Nov 2005 11:01:38 GMT, Roedy Green
<my_email_is_posted_on_my_website@munged.invalid> wrote, quoted or
indirectly quoted someone who said :
>See http://mindprod.com/products1.html#ENTITIES
>to strip the HTML out optionally convert the &xxx; entities back to
>normal characters.
With a simple modification, you could strip just comments, not all
HTML tags.
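
For illustration only, here is a minimal regex-based sketch of that idea. It is not the code behind the link above; the class name CommentStripper and the two patterns are assumptions made up for this example, and a regex will not cope with every malformed page.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not taken from any library mentioned in this thread.
class CommentStripper {
    // DOTALL lets '.' match line breaks, so multi-line comments are caught.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // Matches <img ...> tags regardless of case.
    private static final Pattern IMG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Removes comments and image tags, leaving all other markup untouched.
    static String strip(String html) {
        String withoutComments = COMMENT.matcher(html).replaceAll("");
        return IMG.matcher(withoutComments).replaceAll("");
    }
}
```

Calling CommentStripper.strip(page) on a page held in a String returns the same markup minus comments and image tags.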
zero - 29 Nov 2005 11:03 GMT
"Honza" <jan.zeman@gmail.com> wrote in news:1133255497.231778.229120
@g14g2000cwa.googlegroups.com:
> Hello,
>
[quoted text clipped - 7 lines]
> Thank you for any clue
> Honza
I would be very surprised if either of those actually did anything with the
comments. If they do, why not just remove the code that handles them?

Signature
Beware the False Authority Syndrome
Oliver Wong - 29 Nov 2005 16:50 GMT
> Hello,
>
[quoted text clipped - 7 lines]
> Thank you for any clue
> Honza
Haven't used the parsers you're talking about, but if you find any
SAX-based parser, you'll just receive a bunch of "events" representing the
discovery of "things" in an HTML document, and you can just ignore the
"comment" events.
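
A rough sketch of what that looks like, assuming well-formed XHTML and the SAX parser bundled with the JDK (real-world HTML would need an HTML-tolerant SAX parser instead). The class name SaxTextExtractor is made up for this example. In SAX, comments only reach a LexicalHandler, so a plain DefaultHandler never sees them, and leaving startElement alone means img tags are skipped too.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data and ignores everything else.
class SaxTextExtractor extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    // Called for text between tags; may arrive in several chunks.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // startElement is not overridden, so tags such as <img/> are dropped.
    // Comments are never delivered here because no LexicalHandler is set.

    static String extractText(String xhtml) {
        try {
            SaxTextExtractor handler = new SaxTextExtractor();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text.toString();
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

To keep other tags in the output, you would override startElement and endElement and re-emit the ones you care about.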
- Oliver
Abhijat Vatsyayan - 29 Nov 2005 18:16 GMT
> Hello,
>
[quoted text clipped - 7 lines]
> Thank you for any clue
> Honza
Take a look at the classes ParserDelegator (package
javax.swing.text.html.parser) and HTMLEditorKit.ParserCallback (package
javax.swing.text.html).
You can subclass ParserCallback and pass your implementation to the
parse method of a ParserDelegator object. This is quite like using SAX
parsers for XML documents.
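
A minimal sketch of such a callback, assuming you only want the page text; the class name TextOnlyCallback is made up for this example. Because handleComment and handleSimpleTag are left at their empty defaults, comments and img tags simply disappear from the output.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback: keeps text, drops comments and tags.
class TextOnlyCallback extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    // Called for each run of text found between tags.
    @Override
    public void handleText(char[] data, int pos) {
        text.append(data).append(' ');
    }

    // handleComment and handleSimpleTag are not overridden; their default
    // implementations do nothing, so comments and <img> are ignored.

    static String extractText(String html) {
        TextOnlyCallback callback = new TextOnlyCallback();
        try (StringReader reader = new StringReader(html)) {
            // 'true' tells the parser to ignore any charset declaration.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.text.toString().trim();
    }
}
```

For a page on disk or on the network, pass a FileReader or an InputStreamReader to parse instead of the StringReader.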
Abhijat
Honza - 30 Nov 2005 16:09 GMT
Hello Abhijat,
I have tested HTMLEditorKit today. It is really very easy to use and it
would be appropriate for my purpose...
BUT: I've tested it with "real world" HTML pages and I find it not
robust enough. The results are not accurate enough, and the number of
errors is too high when parsing a "badly written" HTML page.
I have found a nice page benchmarking "real world" SAX HTML parsers. I
think I will use one of them...
Link: http://www.portletbridge.org/saxbenchmark
Honza
Honza - 29 Nov 2005 20:57 GMT
Thank you guys, I will check the possibilities.
I have found another interesting application which could also be a
solution to my problem. Its name is Muffin - http://muffin.doit.org/
It is a highly customizable proxy written in Java with which you can
filter HTML content.
I am going to try it out tomorrow.
Thanks a lot
Honza