I have a little project which involves parsing some HTML pages. I tried
using JAXP/SAX to build a simple parser. I subclassed the DefaultHandler
class with some simple methods for characters, startElement and
endElement. The default parser is set as 'non-validating' (verified by
calling the isValidating() method).
Problem: The parser still throws 'SAXParseException's when it encounters
'bad' HTML. This is a problem, since the vast majority of web pages are
'bad HTML', which is 'really bad XML'. It's not validating against some
DTD, but I want to turn off the checks for well-formed XML altogether.
Is this possible with the JAXP/SAX packages? If not, can anyone suggest
a better approach to parsing HTML using Java?
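The behaviour can be reproduced with a minimal sketch along these lines.
The class name, the tryParse helper and the sample markup are made up for
illustration and are not taken from the actual project; only the standard
JAXP/SAX classes are used.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: feeds markup to a non-validating JAXP SAX parser.
class WellFormednessDemo {

    // Returns null if the markup parses, otherwise the parser's error message.
    static String tryParse(String markup) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false); // already the default; set explicitly
            SAXParser parser = factory.newSAXParser();
            // A plain DefaultHandler stands in for the real subclass here.
            parser.parse(new InputSource(new StringReader(markup)),
                         new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            // Raised for markup that is not well-formed XML.
            return e.getMessage();
        } catch (Exception e) {
            // Configuration or I/O problems are not expected in this sketch.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String wellFormed = "<p>Hello<br/>world</p>"; // closed <br/>
        String tagSoup = "<p>Hello<br>world</p>";     // typical HTML <br>
        System.out.println("well-formed: " + tryParse(wellFormed));
        System.out.println("tag soup:    " + tryParse(tagSoup));
    }
}
```

The first call returns null, while the second returns an error message
from the parser even though validation is switched off.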
Thanks in advance.

Signature
Paul Hovnanian mailto:Paul@Hovnanian.com
------------------------------------------------------------------
c (velocity of light in a vacuum) = 1.8x10^12 furlongs per fortnight
Rob Skedgell - 17 Feb 2006 09:17 GMT
> I have a little project which involves parsing some HTML pages. I
> tried using JAXP/SAX to build a simple parser. I subclassed the
[quoted text clipped - 11 lines]
> Is this possible with the JAXP/SAX packages? If not, can anyone
> suggest a better approach to parsing HTML using Java.
You might want to have a look at NekoHTML:
<http://www.apache.org/~andyc/neko/doc/html/>
From the package information of the JPackage RPM:
<quote>
NekoHTML is a simple HTML scanner and tag balancer that enables
application programmers to parse HTML documents and access the
information using standard XML interfaces. The parser can scan HTML
files and "fix up" many common mistakes that human (and computer)
authors make in writing HTML documents. NekoHTML adds missing parent
elements; automatically closes elements with optional end tags; and
can handle mismatched inline element tags.
NekoHTML is written using the Xerces Native Interface (XNI) that is
the foundation of the Xerces2 implementation. This enables you to use
the NekoHTML parser with existing XNI tools without modification or
rewriting code.
</quote>
It includes a SAX parser for HTML documents, so this may do what you
want.

Signature
Rob Skedgell <rob+news@nephelococcygia.demon.co.uk>
GnuPG/PGP: 7DA3 1579 C0DD 8748 C05A B984 E2A2 3234 D14B 6DD7