
Java Forum / First Aid / February 2006


SAX Parser and HTML

Paul Hovnanian P.E. - 17 Feb 2006 02:39 GMT
I have a little project which involves parsing some HTML pages. I tried
using JAXP/SAX to build a simple parser. I subclassed the DefaultHandler
class with some simple methods for characters, startElement and
endElement. The default parser is set as 'non validating' (verified by
calling the isValidating() method).

Problem: The parser still throws SAXParseExceptions when it encounters
'bad' HTML. This is a problem, since the vast majority of web pages are
'bad HTML', which is 'really bad XML'. It's not validating against some
DTD, but I want to turn off checks for well-formed XML altogether.

Is this possible with the JAXP/SAX packages? If not, can anyone suggest
a better approach to parsing HTML using Java?

Thanks in advance.

Signature

Paul Hovnanian     mailto:Paul@Hovnanian.com
------------------------------------------------------------------
c (velocity of light in a vacuum) = 1.8x10^12 furlongs per fortnight
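
For anyone who wants to reproduce the behaviour described above, here is a minimal sketch using only the JAXP/SAX classes in the JDK. The class name NonValidatingSaxDemo, the helper tryParse, and the two sample strings are made up for illustration and are not from the original post. It assumes the stock DefaultHandler, whose fatalError() rethrows the exception. The sketch suggests that a non-validating parser only skips DTD validation and still rejects markup that is not well-formed, such as an unclosed <br>.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the name is not from the thread.
class NonValidatingSaxDemo {

    /**
     * Parses the markup with a non-validating JAXP SAX parser.
     * Returns null if it parsed cleanly, otherwise a short description
     * of the SAXParseException that was thrown.
     */
    static String tryParse(String markup) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false); // already the default; stated explicitly
            SAXParser parser = factory.newSAXParser();
            // DefaultHandler.fatalError() rethrows, so malformed input ends the parse.
            parser.parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ": " + e.getMessage();
        } catch (Exception e) {
            // Configuration or I/O problems are not what this demo is about.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed (XHTML-style) markup: every element is closed.
        String wellFormed = "<html><body><p>one<br/>two</p></body></html>";
        // Typical 'bad' HTML: <br> and <p> are never closed.
        String tagSoup = "<html><body><p>one<br>two</body></html>";

        System.out.println("well-formed -> " + tryParse(wellFormed));
        System.out.println("tag soup    -> " + tryParse(tagSoup));
    }
}
```

Well-formedness checking is part of what an XML parser is required to do, so there is no setValidating-style switch for it. That is why the usual way out is an HTML-aware parser rather than a reconfigured XML one.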

Rob Skedgell - 17 Feb 2006 09:17 GMT
> I have a little project which involves parsing some HTML pages. I
> tried using JAXP/SAX to build a simple parser. I subclassed the
[quoted text clipped - 11 lines]
> Is this possible with the JAXP/SAX packages? If not, can anyone
> suggest a better approach to parsing HTML using Java?

You might want to have a look at NekoHTML:
<http://www.apache.org/~andyc/neko/doc/html/>

From the package information of the JPackage RPM:
<quote>
NekoHTML is a simple HTML scanner and tag balancer that enables
application programmers to parse HTML documents and access the
information using standard XML interfaces. The parser can scan HTML
files and "fix up" many common mistakes that human (and computer)
authors make in writing HTML documents.  NekoHTML adds missing parent
elements; automatically closes elements with optional end tags; and
can handle mismatched inline element tags.
NekoHTML is written using the Xerces Native Interface (XNI) that is
the foundation of the Xerces2 implementation. This enables you to use
the NekoHTML parser with existing XNI tools without modification or
rewriting code.
</quote>

It includes a SAX parser for HTML documents, so this may do what you
want.

Signature

Rob Skedgell <rob+news@nephelococcygia.demon.co.uk>
GnuPG/PGP: 7DA3 1579 C0DD 8748 C05A  B984 E2A2 3234 D14B 6DD7



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.