>> Does anyone know a good way of converting HTML to plain text, keeping as
>> much of the formatting as possible?
[quoted text clipped - 4 lines]
> editable with a text editor rather than in binary -- how much are you
> expecting to preserve?
Thanks, Andy!
Obviously we can't keep things like a header in big letters, but one thing we
do need is sensible wrapping. For example, for a <li> tag we don't want
* Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque
nec est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut
aliquet risus ac velit eleifend scelerisque.
but rather
* Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Quisque
  nec est eu nunc rutrum aliquet. In hac habitasse platea dictumst. Ut
  aliquet risus ac velit eleifend scelerisque.
i.e. something that keeps the indentation of the wrapped lines...
If there is some Java library out there that does this kind of thing,
that would be great... the HTML itself should already be quite nice.
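For reference, the hanging indent described above can be sketched in a few
lines of plain Java. This is only an illustration of the wrapping rule, not a
library recommendation; the class name HangingIndent, the method wrap, and the
width of 72 columns are all made up for the example.

```java
import java.util.Arrays;

class HangingIndent {

    /**
     * Wraps text to at most `width` columns. The first line starts with
     * `bullet`; continuation lines are indented by the bullet's length.
     * A single word longer than the width is left on its own line.
     */
    static String wrap(String text, String bullet, int width) {
        // Build an indent of spaces as wide as the bullet (e.g. "* " -> "  ").
        char[] pad = new char[bullet.length()];
        Arrays.fill(pad, ' ');
        String indent = new String(pad);

        StringBuilder out = new StringBuilder();
        StringBuilder line = new StringBuilder(bullet);
        boolean lineHasWord = false;

        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for empty input
            }
            // Start a new, indented line if the next word would not fit.
            if (lineHasWord && line.length() + 1 + word.length() > width) {
                out.append(line).append('\n');
                line = new StringBuilder(indent);
                lineHasWord = false;
            }
            if (lineHasWord) {
                line.append(' ');
            }
            line.append(word);
            lineHasWord = true;
        }
        if (lineHasWord) {
            out.append(line);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String item = "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. "
                + "Quisque nec est eu nunc rutrum aliquet. In hac habitasse platea "
                + "dictumst. Ut aliquet risus ac velit eleifend scelerisque.";
        System.out.println(wrap(item, "* ", 72)); // 72 columns is an arbitrary choice
    }
}
```

The same helper would work for numbered items by passing e.g. "1. " as the
bullet, since the indent is derived from the bullet's length.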
Karl Uppiano - 14 Nov 2006 07:58 GMT
>>> Does anyone know a good way of converting HTML to plain text, keeping as
>>> much of the formatting as possible?
[quoted text clipped - 22 lines]
> If there is some Java library out there that does this kind of thing, that
> would be great... the HTML itself should already be quite nice.
It sounds like you want an HTML parser with customizable, pluggable handlers.
A SAX parser comes pretty close. If you could first convert the HTML to
well-formed XML (XHTML, with matching open and close tags, for example), you
might be able to get a non-validating SAX parser to work. Just a thought. My
guess is that it would take a fair bit of work to implement.
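
A rough sketch of the SAX idea, using only the JDK's built-in javax.xml.parsers
API. It assumes the input has already been tidied into well-formed XHTML by
some other tool, and it only handles flat <li> elements; the class name
ListItemTextHandler and the method convert are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ListItemTextHandler extends DefaultHandler {

    private final StringBuilder out = new StringBuilder();
    private StringBuilder item; // non-null only while inside an <li>

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if ("li".equalsIgnoreCase(qName)) {
            item = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per text node, so accumulate.
        if (item != null) {
            item.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("li".equalsIgnoreCase(qName) && item != null) {
            // Collapse runs of whitespace the way a browser would.
            String text = item.toString().trim().replaceAll("\\s+", " ");
            out.append("* ").append(text).append('\n');
            item = null;
        }
    }

    String text() {
        return out.toString();
    }

    /** Parses well-formed XHTML and returns one "* ..." line per <li>. */
    static String convert(String xhtml) {
        try {
            // The default factory is non-validating and not namespace-aware,
            // so qName carries the plain tag name.
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            ListItemTextHandler handler = new ListItemTextHandler();
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }
}
```

Each item comes out as a single unwrapped line; wrapping with a hanging indent
would be a separate step applied to that text. Nested lists, other tags, and
HTML entities such as &nbsp; (which a plain XML parser rejects unless a DTD
defines them) are not handled here, which is part of the "fair bit of work"
mentioned above.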