Java Forum / General / November 2005

HTML Processing in Java

Honza - 29 Nov 2005 09:11 GMT
Hello,

I would like to process HTML pages in Java. The very first task would
be to ignore unnecessary information like comments (everything in <!--
-->) or images.
What would be the best starting point?
I have found JTidy and HTML Parser on SourceForge, but neither of them
seems able to ignore tags - or did I miss it?

Thank you for any clue
Honza
Roedy Green - 29 Nov 2005 11:01 GMT
>I would like to process html pages in java. The very first task would
>be to ignore unnecessary information like comments (everything in <!--
>-->) or images.
>What would be the best start point?

See http://mindprod.com/products1.html#ENTITIES
to strip out the HTML and optionally convert the &xxx; entities back to
normal characters.

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Roedy Green - 29 Nov 2005 17:49 GMT
On Tue, 29 Nov 2005 11:01:38 GMT, Roedy Green
<my_email_is_posted_on_my_website@munged.invalid> wrote, quoted or
indirectly quoted someone who said :

>See http://mindprod.com/products1.html#ENTITIES
>to strip the HTML out optionally convert the &xxx; entities back to
>normal characters.

With a simple modification, you could strip just comments, not all
HTML tags.
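
A minimal sketch of stripping only comments, for illustration. This is not the code from the utility linked above; it is a stand-in that uses a plain regular expression, and the class and method names (CommentStripper, stripComments) are hypothetical. It assumes comments are properly terminated with --> and does not try to cope with comment-like text inside script blocks.

```java
import java.util.regex.Pattern;

class CommentStripper {
    // DOTALL lets '.' match line breaks, so multi-line comments are covered.
    // The reluctant quantifier (.*?) stops at the first "-->" it finds.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    /** Returns the input with every <!-- ... --> span removed; tags are left alone. */
    static String stripComments(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>keep</p><!-- drop\nme --><p>this</p>";
        System.out.println(stripComments(page));
    }
}
```

A regular expression is only adequate for simple filtering like this; once tags such as images have to be handled reliably, a real parser is the safer choice.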

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

zero - 29 Nov 2005 11:03 GMT
"Honza" <jan.zeman@gmail.com> wrote in news:1133255497.231778.229120
@g14g2000cwa.googlegroups.com:

> Hello,
>
[quoted text clipped - 7 lines]
> Thank you for any clue
> Honza

I would be very surprised if either of those actually did anything with the
comments.  If they do, why not just remove the code that handles them?

Beware the False Authority Syndrome

Oliver Wong - 29 Nov 2005 16:50 GMT
> Hello,
>
[quoted text clipped - 7 lines]
> Thank you for any clue
> Honza

   Haven't used the parsers you're talking about, but if you find any
SAX-based parser, you'll just receive a bunch of "events" representing the
discovery of "things" in an HTML document, and you can just ignore the
"comment" events.

   - Oliver
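
A rough sketch of the event-based approach, assuming the input is well-formed XHTML so that the JDK's own SAX parser can read it (real-world HTML would need an HTML-tolerant SAX parser instead). The class name SaxTextCollector is hypothetical. Note that in standard SAX, comments are only reported to an org.xml.sax.ext.LexicalHandler, so a handler that never registers one simply does not see them.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Every element, including <img>, is announced here.
        // Doing nothing means the element itself is ignored.
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Character data between tags; comments are never delivered here.
        text.append(ch, start, length);
    }

    /** Parses well-formed XHTML and returns only its character data. */
    static String collect(String xhtml) throws Exception {
        SaxTextCollector handler = new SaxTextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><!-- note --><p>Hello <img src=\"x.png\"/>world</p></body></html>";
        System.out.println(collect(page));
    }
}
```
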
Abhijat Vatsyayan - 29 Nov 2005 18:16 GMT
> Hello,
>
[quoted text clipped - 7 lines]
> Thank you for any clue
> Honza

Take a look at the classes ParserDelegator (in package
javax.swing.text.html.parser) and HTMLEditorKit.ParserCallback (in
package javax.swing.text.html).

You can subclass ParserCallback and pass your implementation to the
parse method of a ParserDelegator object. This is much like using SAX
parsers for XML documents.
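
A minimal sketch of that approach, for illustration only. The class name TextOnlyExtractor and its extractText method are hypothetical; ParserDelegator, HTMLEditorKit.ParserCallback and the handle* callbacks are the standard Swing API. The callback keeps text and drops comments and stand-alone tags such as img.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class TextOnlyExtractor {

    /** Returns the text content of the page, without comments or images. */
    static String extractText(String html) throws IOException {
        final StringBuilder out = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Keep ordinary text; separate chunks with a space.
                out.append(data).append(' ');
            }

            @Override
            public void handleComment(char[] data, int pos) {
                // Comments arrive here; doing nothing discards them.
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Tags without an end tag, such as <img>, arrive here and are dropped.
            }
        };

        try (Reader reader = new StringReader(html)) {
            // The third argument tells the parser to ignore any charset declaration.
            new ParserDelegator().parse(reader, callback, true);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><!-- note --><p>Hello <img src=\"x.png\"> world</p></body></html>";
        System.out.println(extractText(page));
    }
}
```

The empty overrides are not strictly needed, since the default ParserCallback methods already do nothing; they are spelled out here to show where comments and images end up.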

Abhijat
Honza - 30 Nov 2005 16:09 GMT
Hello Abhijat,

I have tested HTMLEditorKit today. It is really very easy to use and it
would be appropriate for my purpose...

BUT: I've tested it with "real world" HTML pages and I find it not
robust enough. The results are not accurate enough and the number of
errors is too high when parsing any "badly written" HTML page.

I have found a nice page benchmarking "real world" SAX HTML parsers. I
think I will use one of them...

Link: http://www.portletbridge.org/saxbenchmark 

Honza
Honza - 29 Nov 2005 20:57 GMT
Thank you guys, I will check the possibilities.

I have found another interesting application which could also be a
solution to my problem. Its name is Muffin - http://muffin.doit.org/
It is a highly customizable proxy written in Java in which you can
filter HTML content.
I am going to try it out tomorrow.

Thanks a lot
Honza

