Java Forum / General / February 2007

HTML 4 BDTD?

John W. Kennedy - 30 Jan 2007 04:46 GMT
I'm in the process of de-frame-ing a website with a couple thousand
pages of static HTML, and I've been building a tool that works pretty
well, based on javax.swing.text.html.parser technology, which I've never
used before. Large parts of the website are HTML 3.2, and everything's
just ducky. But there are a good many pages that are HTML 4.0, and my
program goes completely ca-ca on them, because I'm stuck with only the
built-in html32.bdtd file.

A) Is there any good reason that Sun didn't make up an html401.bdtd file
yonks ago?

B) Has anyone an html401.bdtd file to share?

C) Is there any other solution available? (No XML-based tool is going to
come close to handling this stuff. It's all hand-written, not by me,
and it was painful enough doing various text-based global fixes to make
it parse properly as 3.2: lots of <b><i>blah</b></i> and that sort of
thing.)

Signature

John W. Kennedy
"The blind rulers of Logres
Nourished the land on a fallacy of rational virtue."
  -- Charles Williams.  "Taliessin through Logres: Prelude"
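
For readers who have not used javax.swing.text.html.parser, here is a minimal sketch of the kind of callback-driven parse described above. It is not the poster's tool; the class name SwingParseSketch and the event-log format are made up for illustration. It feeds the misnested <b><i>blah</b></i> example to the JDK's ParserDelegator (which loads the bundled HTML 3.2 DTD) and records what the parser reports.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper: logs the events the built-in Swing HTML parser reports. */
final class SwingParseSketch {

    /** Parses the given HTML and returns a flat log such as "start:b" or "text:blah". */
    static List<String> events(String html) {
        final List<String> log = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                log.add("start:" + tag);
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                log.add("end:" + tag);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                log.add("simple:" + tag);
            }

            @Override
            public void handleText(char[] data, int pos) {
                log.add("text:" + new String(data));
            }

            @Override
            public void handleError(String message, int pos) {
                // The parser reports recoverable markup problems here.
                log.add("error:" + message);
            }
        };
        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared inside the document itself
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<html><body><p><b><i>blah</b></i></p></body></html>")) {
            System.out.println(event);
        }
    }
}
```

Running main prints one line per event, which makes it easy to see how the parser recovers from misnested tags and where it gives up on markup it does not know.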

Daniel Pitts - 30 Jan 2007 05:39 GMT
> I'm in the process of de-frame-ing a website with a couple thousand
> pages of static HTML, and I've been building a tool that works pretty
[quoted text clipped - 20 lines]
> Nourished the land on a fallacy of rational virtue."
>    -- Charles Williams.  "Taliessin through Logres: Prelude"

Check out JTidy (or just tidy). It'll clean up your HTML. It might
even be able to translate it to XHTML, and THEN you can use XML
parsing no problem :-)

Standard Java HTML parsing is very limited (as you have discovered).
If all else fails, you may want to work with regular expressions instead.

Oh, and see whether Apache has anything (maybe in Jakarta?); they tend
to have useful utilities of the most surprising kinds :-)

Hope this helps,
Daniel.

John W. Kennedy - 30 Jan 2007 20:41 GMT
> Check out JTidy (or just tidy). It'll clean up your HTML.

Yes, but I'm not trying to tidy it (though my current code does that as
a side effect, since I'm slurping each page into a tree and re-emitting
it in clean HTML4); I'm trying to do major surgery on the content of
every page, so that I can de-frame the whole website, which, although
elegant-looking to the user, has become a nightmare of frame-juggling
whenever I have to link from one page to another that is not a notional
child, parent, or sibling. The last thing I want to do is degrade the
existing HTML 4.0 pages (the majority of which are semantically
marked-up, thoroughly CSSed, and W3C verified) to HTML 3.2. I also want
a stable tool for future use, so that I can revise link menus in the
event of a new branch on the site's conceptual tree; otherwise, I'll
have to use SHTML for every single page.
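
To make the "slurp into a tree, operate, re-emit" idea concrete, here is a minimal sketch using only the JDK's DOM and identity-transform APIs. The thread does not say what the surgery actually consists of, so removing the frame-oriented target attribute from links is an assumed example, and the name DeframeSketch is hypothetical. The sketch also assumes the page is already well-formed XHTML with no DOCTYPE (a DOCTYPE could make the XML parser try to fetch the DTD).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: one small piece of de-framing done on a DOM tree. */
final class DeframeSketch {

    /** Parses well-formed XHTML, drops target="..." from every link, and re-emits the page. */
    static String stripLinkTargets(String xhtml) {
        try {
            // Slurp the page into a tree.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Operate on the tree: links no longer need to name a frame.
            NodeList links = doc.getElementsByTagName("a");
            for (int i = 0; i < links.getLength(); i++) {
                Element link = (Element) links.item(i);
                link.removeAttribute("target"); // no-op if the attribute is absent
            }

            // Re-emit the tree with an identity transform.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not process page", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stripLinkTargets(
                "<html><body><p><a href='poems.html' target='main'>Poems</a></p></body></html>"));
    }
}
```

The same tree walk is where a link menu could be revised or inserted, which is the "stable tool for future use" part of the plan.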

> It might
> even be able to translate it to XHTML, and THEN you can use XML
> parsing no problem :-)

Maybe I'll have to do that, but I'm annoyed that I won't be able to use
real XHTML, but only XHTML-like HTML, thanks to Microsoft stabbing the
W3C in the back. (It's a public-oriented website, so I can't say "Use
Firefox", however much I'd like to.) I suppose I could make up the site
in XHTML and then XSLT it to an HTML4 equivalent.
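
A minimal sketch of that XHTML-to-HTML step, using the XSLT processor bundled with the JDK. The stylesheet is an assumption, not something from the thread: it rebuilds every element under its local name, which drops the XHTML namespace, and asks for the "html" output method so the serializer writes HTML-style markup. The class name XhtmlToHtmlSketch is hypothetical, and comments, processing instructions, and the DOCTYPE are not handled.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper: serializes an XHTML page as plain HTML via XSLT. */
final class XhtmlToHtmlSketch {

    // Copies every element under its local name (no namespace), keeps its
    // attributes, and relies on the built-in rule to copy text nodes.
    private static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='html' indent='no'/>"
          + "<xsl:template match='*'>"
          + "<xsl:element name='{local-name()}'>"
          + "<xsl:copy-of select='@*'/>"
          + "<xsl:apply-templates/>"
          + "</xsl:element>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    /** Transforms a well-formed XHTML string into HTML text. */
    static String toHtml(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHtml(
                "<html xmlns='http://www.w3.org/1999/xhtml'><head><title>t</title></head>"
              + "<body><p>one<br/>two</p></body></html>"));
    }
}
```

With the html output method, empty elements such as br are written without the XML-style self-closing slash, which is the main thing older browsers care about.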

Damn Microsoft! (And damn Apple for their cowardly acquiescence!)

John W. Kennedy - 30 Jan 2007 21:31 GMT
> Check out JTidy (or just tidy).

On investigation, it appears that it can be used as a library to read
HTML into a DOM. I'm more or less doing that now, so it should be
relatively straightforward to slot it in where I am currently using
javax.swing.text.html, etc.

Rogan Dawes - 01 Feb 2007 16:24 GMT
>> Check out JTidy (or just tidy).
>
> On investigation, it appears to be able to be used as a library to read
> HTML into a DOM. I'm more or less doing that now, so it should be
> relatively straightforward to slot it in where I am using import
> javax.swing.text.html, etc..

Also consider htmlparser (htmlparser.sourceforge.net)

Rogan

John W. Kennedy - 02 Feb 2007 05:01 GMT
>>> Check out JTidy (or just tidy).

>> On investigation, it appears to be able to be used as a library to
>> read HTML into a DOM. I'm more or less doing that now, so it should be
>> relatively straightforward to slot it in where I am using import
>> javax.swing.text.html, etc..

> Also consider htmlparser (htmlparser.sourceforge.net)

I looked at it, but liked the feel of JTidy better.

In practice, JTidy (as an in-program DOM-building tool, not as a
standalone application) has worked fine. I plugged it into my program,
replacing the javax.swing.text.html tools, in a few hours, and I can now
read HTML 4 and HTML 3.2 equally well. The end of the project to
de-frame the website and get all the pages 4.01-clean is now in sight.

I do wish the JavaDoc was a little more complete. In a few places, I had
to look at the source.
