Hello,
I have a Java servlet that processes plain text. I'd like to point it at
a specific URL, fetch the web page, and convert it to plain text for
further processing.
I have written some code that simply strips the tags from the HTML, but
it only does an OK job: it fails on poorly written HTML and on embedded
JavaScript, among other things. Are there any Java APIs that would
perform a better conversion? I've looked into JEditorPane and
HTMLEditorKit, but haven't had any luck getting them to perform the
conversion.
Thanks for any help!
Mike
William Brogden - 19 Mar 2006 15:43 GMT
The JTidy library (http://jtidy.sourceforge.net/) will clean up bad
HTML and has "Prettyprint" functions.
Bill
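
For the HTMLEditorKit route mentioned in the question, here is a minimal
sketch that uses only the JDK: it feeds the page through the Swing HTML
parser (ParserDelegator) and collects the text events, ignoring anything
inside <script> or <style>. The class name HtmlToText and the method
toPlainText are made up for illustration, and this is not the JTidy
approach suggested above. The Swing parser targets old HTML (3.2), so
treat the output as a best effort rather than a faithful rendering.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper; the class and method names are illustrative only.
final class HtmlToText {

    // Receives parser events and keeps only the text chunks.
    private static final class TextCollector extends HTMLEditorKit.ParserCallback {
        private final StringBuilder out = new StringBuilder();
        private int skipDepth = 0; // > 0 while inside <script> or <style>

        private static boolean isSkipped(HTML.Tag t) {
            return t == HTML.Tag.SCRIPT || t == HTML.Tag.STYLE;
        }

        @Override
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (isSkipped(t)) {
                skipDepth++;
            }
        }

        @Override
        public void handleEndTag(HTML.Tag t, int pos) {
            if (isSkipped(t) && skipDepth > 0) {
                skipDepth--;
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (skipDepth == 0) {
                // A space after every chunk keeps neighbouring elements apart.
                // Trade-off: a word split by inline tags also gets split here.
                out.append(data).append(' ');
            }
        }
    }

    // Parses HTML from a Reader and returns whitespace-normalized text.
    static String toPlainText(Reader html) {
        TextCollector collector = new TextCollector();
        try {
            // true = ignore charset declarations found inside the document
            new ParserDelegator().parse(html, collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.out.toString().replaceAll("\\s+", " ").trim();
    }

    // Convenience overload for HTML that is already held in a String.
    static String toPlainText(String html) {
        return toPlainText(new StringReader(html));
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b>"
                + "<script>var x = 1;</script><p>Unclosed paragraph</body></html>";
        System.out.println(toPlainText(html));
    }
}
```

In a servlet, the Reader could wrap the stream returned by
java.net.URL.openStream(), using whatever character encoding the page
declares.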