
Signature
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |
Hello, it's quite simple what I need to do.
For example, this is a sample of the text from the HTML files:
<table border=1 width="100%" >
<tr>
<td width=20%><noindex>Betreft :</noindex></td>
<td colspan=3>
<betreft><P><A NAME="b_betreft"></A>Kinderrechten: implementatie van
het VN-verdrag<BR>Jaarlijkse verslaggeving van de Vlaamse regering aan
het Vlaams Parlement en aan de kinderrechtencommissaris omtrent de
implementatie van het VN-verdrag van 20 november 1989 inzake de rechten
van het kind<BR>Tweede verslag d.d. 29 september 2000 <A
NAME="e_betreft"></A></betreft>
</td></tr>
For each HTML file I need to extract the contents of these special tags,
<betreft> and others, and create XML files out of them. Is it possible
to read an HTML file as an XML file and run some XPath queries on it?
Or should I just extract the tags as if it were a plain text file?
I found this description of JTidy:
"JTidy provides a DOM interface to the document that is being
processed, which effectively makes you able to use JTidy as a DOM
parser for real-world HTML."
But nowhere can I find a good reference for JTidy ...
I still don't know how I'm gonna do it, maybe write it all myself ....
greetings
Martin Gregorie - 17 Oct 2006 12:37 GMT
> hello, it's quite simple what i need to do:
>
[quoted text clipped - 25 lines]
>
> I still don't know how I'm gonna do it, maybe write it all myself ....
Have you looked at the HTML, HTMLEditorKit and HTMLDocument classes
(in javax.swing.text.html)?
The HTMLEditorKit contains a parser I used as the basis for a URL
checker. It extracts <A> tags from HTML pages, sets up a URL instance
from each href attribute and checks whether it is accessible. Access
failures are reported for manual examination and fixing.
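The tag-extraction part of that approach might look roughly like the sketch below. It is not the original URL checker: the class name LinkExtractor and the method extractLinks are made up for illustration, and the accessibility check is left out. It only shows the Swing parser (ParserDelegator driving an HTMLEditorKit.ParserCallback) collecting href values from <A> tags in HTML that is not well-formed XML.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not the original URL checker.
class LinkExtractor {

    // Collects the href attribute of every <A> start tag in the input.
    static List<String> extractLinks(Reader html) throws IOException {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {          // skip <A NAME=...> anchors
                        links.add(href.toString());
                    }
                }
            }
        };
        // 'true' tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample page with unquoted attributes and an unclosed <BR>.
        String page = "<html><body><table border=1><tr><td>"
                + "<a href=\"http://example.org/\">link</a><br>"
                + "<a name=\"b_betreft\">anchor</a>"
                + "</td></tr></table></body></html>";
        for (String link : extractLinks(new StringReader(page))) {
            System.out.println(link);
        }
    }
}
```

The same callback class also has handleText, handleEndTag and handleSimpleTag methods that can be overridden, which is probably where extraction of other tags would start. How this parser reports non-standard tags such as <betreft> is worth checking against the Javadoc before relying on it.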
Oliver Wong - 17 Oct 2006 14:59 GMT
> hello, it's quite simple what i need to do:
>
[quoted text clipped - 16 lines]
> possible to read a html file as a xml file and do some xpath stuff on
> it ???
This is possible if and only if the HTML file actually is an XML file
(the HTML and XML file formats overlap, but are not identical to each
other). Otherwise, you'll first need something like "XMLTidy" (a
fictional product I just made up) to fix the broken XML -- things like
making sure every open tag is balanced by a closing tag. I noticed that
in your example the <table>, <P> and <BR> tags are never closed, for
instance.
- Oliver
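To make the "if and only if" concrete, here is a minimal sketch using only the JDK's own XML parser and XPath classes. The class name BetreftXPath and the shortened sample input are invented for illustration; the input is a hand-cleaned, well-formed version of the posted fragment (quoted attribute values, every tag closed). The original fragment would be rejected by the XML parser with a SAXParseException, so some tidying step has to come first.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: runs an XPath expression over well-formed markup.
class BetreftXPath {

    // Parses the markup as XML and returns the XPath result as a string.
    static String extract(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Hand-cleaned sample: attributes quoted, <br/> and <a/> self-closed,
        // <p> and <table> closed. The text is abbreviated from the post.
        String xml = "<table border=\"1\" width=\"100%\"><tr>"
                + "<td colspan=\"3\"><betreft><p><a name=\"b_betreft\"/>"
                + "Kinderrechten: implementatie van het VN-verdrag<br/>"
                + "Tweede verslag d.d. 29 september 2000"
                + "<a name=\"e_betreft\"/></p></betreft></td>"
                + "</tr></table>";
        // normalize-space() collapses whitespace in the element's text content.
        System.out.println(extract(xml, "normalize-space(//betreft)"));
    }
}
```

Note that the text on either side of <br/> is joined without a separator, because the string value of an element is just its concatenated text nodes. If the line breaks matter, the child nodes of <betreft> would need to be walked individually instead.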