Java Forum / General / March 2006

Best way to convert html to plain text in java?

google@lrlart.com - 19 Mar 2006 08:20 GMT
Hello,

I have a java servlet that processes plain text. I'd like to point to a
specific url and pull over a webpage, then convert it to plain text for
further processing.

I have written some code that simply strips tags from the html, but
this only does an OK job as it fails on poorly written html and
javascript (to name a few). Are there any java APIs that would perform
a better conversion? I've looked into JEditorPane and HTMLEditorKit,
but haven't had any luck in getting these to perform the conversion.
Thanks for any help!
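
On the HTMLEditorKit route mentioned above: a minimal sketch, assuming the JDK's Swing HTML parser (javax.swing.text.html.parser.ParserDelegator) is acceptable inside a servlet, is to skip JEditorPane entirely and collect text from the parser callbacks. The names HtmlText and extract are made up for illustration. How the parser treats script content and badly broken markup depends on the JDK, so the output is worth checking against real pages.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text chunks reported by the Swing HTML parser.
final class HtmlText {
    static String extract(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate chunks with a space so adjacent elements do not run together.
                out.append(data).append(' ');
            }
        };
        try (Reader reader = new StringReader(html)) {
            // true = ignore charset declarations found in the document itself
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collapse runs of whitespace left over from the markup.
        return out.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Calling HtmlText.extract(pageSource) returns the collected text as one whitespace-normalized string; line breaks and layout are not preserved.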
Marcin Wielgus - 19 Mar 2006 13:38 GMT
> Hello,
>
[quoted text clipped - 8 lines]
> but haven't had any luck in getting these to perform the conversion.
> Thanks for any help!

It's a bad solution, but you can always run html2text in a child process ;)

--
SaSol
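
A rough sketch of the child-process idea, assuming an html2text executable is on the PATH and that it reads HTML on standard input and writes plain text to standard output (check this for the html2text you have installed). ExternalFilter and run are hypothetical names. Standard input is fed from a second thread so that a large page cannot block both pipes at once.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper: pipes a string through an external command and returns its stdout.
final class ExternalFilter {
    static String run(List<String> command, String input) {
        try {
            final Process process = new ProcessBuilder(command)
                    .redirectError(ProcessBuilder.Redirect.INHERIT) // child's stderr goes to ours
                    .start();

            // Write the input on a separate thread while this thread reads the output.
            Thread feeder = new Thread(() -> {
                try (OutputStream stdin = process.getOutputStream()) {
                    stdin.write(input.getBytes(StandardCharsets.UTF_8));
                } catch (IOException ignored) {
                    // The child closed its stdin early; the exit code check below reports failures.
                }
            });
            feeder.start();

            ByteArrayOutputStream collected = new ByteArrayOutputStream();
            try (InputStream stdout = process.getInputStream()) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = stdout.read(buffer)) != -1) {
                    collected.write(buffer, 0, read);
                }
            }

            feeder.join();
            int exit = process.waitFor();
            if (exit != 0) {
                throw new IllegalStateException(command + " exited with code " + exit);
            }
            return new String(collected.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Usage would be along the lines of ExternalFilter.run(java.util.Arrays.asList("html2text"), pageSource). UTF-8 is an assumption here; the real page encoding has to be handled separately.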

Dave Mandelin - 21 Mar 2006 03:32 GMT
Can you give some examples of how it fails on poorly written HTML? It
may not be that hard to bulletproof the tag-stripping code you wrote.
Roedy Green - 21 Mar 2006 04:18 GMT
On 20 Mar 2006 18:32:27 -0800, "Dave Mandelin"
<mandelin@cs.berkeley.edu> wrote, quoted or indirectly quoted someone
who said :

>Can you give some examples of how it fails on poorly written HTML? It
>may not be that hard to bulletproof the tag-stripping code you wrote.

I wrote a tag stripper, but it presumes valid HTML.  I suppose that on
hitting a < inside a tag you could presume the > was missing, and insert
one just before the first space after the previous <.

You could look for standard tags.

The other common error is a < or > lying around by itself or next to
an =.

From a practical point of view it might be easiest to run your HTML
through a validator, fix the errors, and then do your strip.  See
http://mindprod.com/jgloss/htmlvalidator.html

Anything else is going to lose some data or insert some junk.

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
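
One way the repair heuristics above could look in code, as a sketch only and not Roedy's actual stripper: when a second < turns up before the tag's >, the tag is assumed to end at the first whitespace after its <, and a < that is not followed by a letter, /, ! or ? is kept as ordinary text. LenientStripper and its methods are hypothetical names, and, as the post says, any such guess can still lose data or let junk through.

```java
// Hypothetical sketch of a tag stripper that tolerates a missing '>' and stray '<'.
final class LenientStripper {
    static String strip(String html) {
        StringBuilder out = new StringBuilder();
        int n = html.length();
        int i = 0;
        while (i < n) {
            char c = html.charAt(i);
            if (c == '<' && looksLikeTagStart(html, i)) {
                // Scan to the end of the tag, or to the next '<' if the '>' is missing.
                int j = i + 1;
                while (j < n && html.charAt(j) != '>' && html.charAt(j) != '<') {
                    j++;
                }
                out.append(' '); // a tag acts as a word separator
                if (j < n && html.charAt(j) == '>') {
                    i = j + 1; // well-formed tag: skip it completely
                } else {
                    // Missing '>': assume the tag ended at the first whitespace after its '<'
                    // and keep whatever follows as text.
                    int k = i + 1;
                    while (k < j && !Character.isWhitespace(html.charAt(k))) {
                        k++;
                    }
                    i = k;
                }
            } else {
                out.append(c); // ordinary text, including a stray '<' or '>'
                i++;
            }
        }
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    // A '<' only opens a tag if it is followed by a letter, '/', '!' or '?'.
    private static boolean looksLikeTagStart(String s, int i) {
        if (i + 1 >= s.length()) {
            return false;
        }
        char next = s.charAt(i + 1);
        return Character.isLetter(next) || next == '/' || next == '!' || next == '?';
    }
}
```

This does not decode entities such as &amp;lt; and does nothing special for script blocks.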

google@lrlart.com - 21 Mar 2006 07:28 GMT
One failure I've run into is with the use of javascript--for example

<script>

function CNN_getCookies() {
    var hash = new Array;
    if ( document.cookie ) {
        var cookies = document.cookie.split( '; ' );
        for ( var i = 0; i < cookies.length; i++ ) {

......
Note: Notice the "less than" symbol in the javascript above.

</script>

This is slightly modified source from CNN's site. The point is that a
"<tag>" pattern is easy to recognize, but it's hard to tell a real tag
apart from a greater-than or less-than sign inside some enclosed
javascript code.

But even if I were to write some code that could handle this case
effectively I'd probably be dealing with loads of other special cases
within poorly written html source.
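
For this particular case, a minimal sketch (not a general HTML parser) is to treat script and style as raw-text elements: once the opening tag is seen, skip everything up to the matching closing tag without looking for tags inside. TagStripper and its helper names are hypothetical, and the other special cases mentioned above are not handled.

```java
// Hypothetical sketch: strips tags and drops the contents of <script> and <style>.
final class TagStripper {
    private static final String[] RAW_TEXT_ELEMENTS = {"script", "style"};

    static String strip(String html) {
        StringBuilder out = new StringBuilder();
        int n = html.length();
        int i = 0;
        while (i < n) {
            char c = html.charAt(i);
            if (c != '<') {
                out.append(c);
                i++;
                continue;
            }
            String raw = rawTextElementAt(html, i);
            if (raw != null) {
                // Jump straight to the closing tag; any '<' in between is code, not markup.
                int close = indexOfIgnoreCase(html, "</" + raw, i);
                if (close < 0) {
                    break; // unterminated script/style: drop the rest
                }
                i = close;
            }
            int end = html.indexOf('>', i);
            if (end < 0) {
                break; // tag never closes: drop the rest
            }
            out.append(' '); // a tag acts as a word separator
            i = end + 1;
        }
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    // Returns "script" or "style" if that element's opening tag starts at index i, else null.
    private static String rawTextElementAt(String html, int i) {
        for (String name : RAW_TEXT_ELEMENTS) {
            String open = "<" + name;
            if (html.regionMatches(true, i, open, 0, open.length())) {
                int after = i + open.length();
                if (after >= html.length() || !Character.isLetterOrDigit(html.charAt(after))) {
                    return name;
                }
            }
        }
        return null;
    }

    private static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }
}
```

A stray < in ordinary text (outside script) would still swallow everything up to the next >, and entities are left undecoded.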
Chris Uppal - 21 Mar 2006 11:18 GMT
> But even if I were to write some code that could handle this case
> effectively I'd probably be dealing with loads of other special cases
> within poorly written html source.

Take it from me: parsing HTML is not trivial.  And that's even without
considering all the invalid HTML out there (I don't mean stuff like incorrectly
nested structures, but unmatched ""s, tags with no >, etc).

JTidy appears to do what you are looking for, it might help (I've never tried
it myself):
   http://jtidy.sourceforge.net/

   -- chris
Dave Mandelin - 21 Mar 2006 20:41 GMT
Ah, I see. Yeah, that looks pretty rough. JTidy looks like a really
nice program.


©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.