Java Forum / General / November 2006

Convert HTML to plain text

Marcel Kessler - 13 Nov 2006 10:25 GMT
Hi there

Does anyone know a good way of converting HTML to plain text, keeping as
 much of the formatting as possible?

The HTML will be produced by an editor like FCKEditor, and
transformation should happen in Java.

So far I've found the following options, none of them really convincing:

# Using w3m or lynx to convert html to plain text
(http://www.biglist.com/lists/xsl-list/archives/200406/msg00689.html)
+ neat output
- need to call C from java

# Google gdata routine
(http://www.biglist.com/lists/xsl-list/archives/200406/msg00689.html)
+ java source available
- only basic stripping, no tables etc

# Use xml & xslt
(http://www-128.ibm.com/developerworks/java/library/x-xmlist1/)
+ good result
- complicated approach, cannot use wysiwyg-editor like FCKEditor

# use other tools like docfraq, detagger, notetab etc.
- no better results than with w3m

Thanks and regards
Marcel
Andy Dingley - 13 Nov 2006 17:21 GMT
> Does anyone know a good way of converting HTML to plain text, keeping as
>   much of the formatting as possible?

Of course not. "Plain text" doesn't have formatting. If you want to
"keep some formatting", then you first have to know just how much is
preservable. Some people claim "RTF" is "plain text" because it's
editable with a text editor rather than in binary -- how much are you
expecting to preserve?

Converting all HTML block elements to a marker, stripping out
everything except text and markers, normalizing whitespace and markers
and then converting markers to something local is usually a good start.
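A minimal Java sketch of the marker steps described above, assuming regular expressions are acceptable for reasonably clean editor output. The class name, the private-use marker character and the list of block tags are illustrative choices, not something from this thread, and character entities such as &amp; are left undecoded.

```java
import java.util.regex.Pattern;

class HtmlMarkerStripper {
    // Private-use character used as the block marker; assumed not to occur in the input.
    private static final String MARKER = "\uE000";

    // Opening or closing tags of a few common block-level elements (not exhaustive).
    private static final Pattern BLOCK_TAG = Pattern.compile(
        "(?i)</?(p|div|br|li|ul|ol|h[1-6]|tr|table|blockquote|pre)\\b[^>]*>");

    // Any other tag.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String toPlainText(String html) {
        // 1. Convert block-level tags to the marker.
        String s = BLOCK_TAG.matcher(html).replaceAll(MARKER);
        // 2. Strip every remaining tag, keeping only text and markers.
        s = ANY_TAG.matcher(s).replaceAll("");
        // 3. Normalize whitespace: collapse runs, drop spaces next to markers.
        s = s.replaceAll("\\s+", " ");
        s = s.replaceAll(" ?" + MARKER + " ?", MARKER);
        // 4. Normalize markers: collapse runs of markers into one.
        s = s.replaceAll(MARKER + "+", MARKER);
        // 5. Convert markers to something local, here a newline.
        s = s.replaceAll(MARKER, "\n");
        return s.trim();
    }
}
```

Calling HtmlMarkerStripper.toPlainText on a fragment returns one line per block element. Tables, nested lists and entity decoding would each need extra rules on top of this.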

If you're already in a web context, then a DOM walker that returns the
set of text nodes might be easier.
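On the Java side, a DOM walker along these lines could look like the sketch below. It assumes the input is already well-formed XML (for example XHTML without named entities such as &nbsp; and without a DOCTYPE), since it uses the JDK's standard XML parser rather than a real HTML parser. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class TextNodeWalker {
    // Returns the non-empty text nodes of a well-formed XHTML string in document order.
    static List<String> textNodes(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        List<String> out = new ArrayList<>();
        collect(doc.getDocumentElement(), out);
        return out;
    }

    // Depth-first walk; whitespace inside each text node is collapsed and trimmed.
    private static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().replaceAll("\\s+", " ").trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }
}
```

The walker only returns the text; deciding where line breaks go would still depend on which element each text node sits in.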

If the HTML is crap to begin with, pre-process it with Tidy.
Marcel Kessler - 14 Nov 2006 07:47 GMT
>> Does anyone know a good way of converting HTML to plain text, keeping as
>>   much of the formatting as possible?
[quoted text clipped - 4 lines]
> editable with a text editor rather than in binary -- how much are you
> expecting to preserve?

Thanks, Andy!
Obviously we can't keep e.g. a header in big letters, but one thing we
need for example is if we have a <li> tag, we don't want

 * Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque
nec est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut
aliquet risus ac velit eleifend scelerisque.

but rather

 * Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque
   nec est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut
   aliquet risus ac velit eleifend scelerisque.

i.e. something that keeps the indentation...
If there is some Java library out there that does this kind of thing,
that would be great... the HTML itself should already be quite nice.
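The hanging indent asked for here is mostly a word-wrapping problem once the text of the <li> has been extracted. A rough sketch in plain Java, where the class name, the method name and the width parameter are all illustrative assumptions:

```java
import java.util.Arrays;

class HangingIndent {
    // Wraps text to the given width. The first line starts with the bullet;
    // continuation lines are indented by the bullet's length.
    // A single word longer than the width is left unbroken on its own line.
    static String wrapListItem(String bullet, String text, int width) {
        char[] pad = new char[bullet.length()];
        Arrays.fill(pad, ' ');
        String indent = new String(pad);

        StringBuilder out = new StringBuilder(bullet);
        int col = bullet.length();
        boolean lineStart = true;
        for (String word : text.trim().split("\\s+")) {
            // Break before the word if it would not fit on the current line.
            if (!lineStart && col + 1 + word.length() > width) {
                out.append('\n').append(indent);
                col = indent.length();
                lineStart = true;
            }
            if (!lineStart) {
                out.append(' ');
                col++;
            }
            out.append(word);
            col += word.length();
            lineStart = false;
        }
        return out.toString();
    }
}
```

For example, wrapListItem(" * ", text, 72) would wrap at 72 columns with continuation lines aligned under the first word after the bullet.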
Karl Uppiano - 14 Nov 2006 07:58 GMT
>>> Does anyone know a good way of converting HTML to plain text, keeping as
>>>   much of the formatting as possible?
[quoted text clipped - 22 lines]
> If there is some Java library out there that does this kind of thing, that
> would be great... the HTML itself should already be quite nice.

It sounds like you want an HTML parser with pluggable handlers that are
customizable. A SAX parser comes pretty close. If you could first convert
the HTML to well-formed HTML (with matching open and close tags, for
example) you might be able to get a non-validating SAX parser to work. Just
a thought. My guess is that it would take a fair bit of work to implement.
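A sketch of that idea using the JDK's built-in non-validating SAX parser, assuming the HTML has already been converted to well-formed XHTML (for instance by Tidy) and contains no named entities such as &nbsp;. The class name and the per-tag rules are hypothetical; a real converter would need more of them.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxTextExtractor extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Illustrative rule: list items start on a new line with a bullet.
        if (qName.equalsIgnoreCase("li")) {
            newLine();
            out.append(" * ");
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Illustrative rule: paragraphs and list items end the current line.
        if (qName.equalsIgnoreCase("p") || qName.equalsIgnoreCase("li")) {
            newLine();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Collapse whitespace runs inside the text chunk.
        out.append(new String(ch, start, length).replaceAll("\\s+", " "));
    }

    // Appends a newline unless the output is empty or already ends with one.
    private void newLine() {
        int n = out.length();
        if (n > 0 && out.charAt(n - 1) != '\n') {
            out.append('\n');
        }
    }

    static String convert(String xhtml) {
        SaxTextExtractor handler = new SaxTextExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return handler.out.toString();
    }
}
```

Each tag-specific rule lives in startElement or endElement, so the "pluggable handlers" amount to adding branches there, or dispatching to a map of handlers keyed by tag name.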

