Hi
I am using the HTMLEditorKit.Parser class to parse an HTML file. However, I have
found this Swing HTML parser extremely difficult to use.
I am trying to parse an HTML file and extract specific information from it
into a table. Consider this snippet of my HTML and the table I would like to
generate from it:
HTML source:
<HTML>
<TITLE></TITLE>
<BODY>
<PRE>
Identifer: ABCDEFG
</PRE>
data: 123456
<PRE>
</PRE>
</BODY>
</HTML>
TABLE:
ABCDEFG 123456
Here is the code I have so far:
import javax.swing.text.*;
import javax.swing.text.html.*;
import java.io.*;
public class HTMLParser extends HTMLEditorKit
{
    // getParser() is protected in HTMLEditorKit; this override makes it public
    public HTMLEditorKit.Parser getParser()
    {
        return super.getParser();
    }
    public static void main (String[] args)
    {
        try
        {
            Reader r = new FileReader("html_file.html");
            HTMLEditorKit.Parser parse = new HTMLParser().getParser();
            // anonymous subclass overriding only the callbacks of interest
            HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback()
            {
                public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
                {
                    if (t==HTML.Tag.PRE)
                    {
                        //print whats between the pre tag
                    }
                }
                public void handleText(char[] data, int pos)
                {
                    //print whats between the pre tags
                }
            };
            // true = ignore any charset declaration in the document
            parse.parse(r, cb, true);
        }
        catch (IOException e)
        {
            System.out.println(e);
        }
    }
}
I would appreciate it very much if someone could solve this problem for me.
I tried the Sun tutorial, but the examples aren't clear enough for me.
Thanks
ZOCOR
Nathan Zumwalt - 30 Sep 2004 16:14 GMT
I've never used this HTML Parser before, but I've done similar things
when scraping HTML off websites. My general solution is to:
1. Get the HTML as text (which you already have).
2. Run it through an HTML to XHTML cleanser (I like JTidy)
3. Parse the XHTML using Java's XML parsers.
4. Use XPath statements to get the values I want.
This probably isn't very efficient for getting small bits of data, but
it works.
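
Steps 3 and 4 above can be sketched with the JDK's built-in XML and XPath classes. This is only an illustration under stated assumptions: the input is taken to be already well-formed XHTML (step 2 is not shown), with lowercase tags, no DOCTYPE and no namespace declaration, which real cleanser output may not match. The class name XPathScrape, the helper names and the XPath expressions are hypothetical and were chosen to fit the sample from the original post.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathScrape {
    // Returns the text after the first ": " in s, or s unchanged if there is none.
    static String valueOf(String s) {
        int i = s.indexOf(": ");
        return i < 0 ? s : s.substring(i + 2);
    }

    // Extracts {identifier, data} from well-formed XHTML (assumed already cleaned).
    static String[] extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xp = XPathFactory.newInstance().newXPath();
            // Text content of the first <pre>, with whitespace collapsed.
            String id = xp.evaluate("normalize-space(//pre[1])", doc);
            // First text node that follows the first <pre> at the same level.
            String data = xp.evaluate(
                    "normalize-space(//pre[1]/following-sibling::text()[1])", doc);
            return new String[] { valueOf(id), valueOf(data) };
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or query the XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written stand-in for cleanser output of the sample page.
        String xhtml = "<html><head><title></title></head><body>\n"
                + "<pre>\nIdentifer: ABCDEFG\n</pre>\n"
                + "data: 123456\n"
                + "<pre>\n</pre>\n"
                + "</body></html>";
        String[] row = extract(xhtml);
        System.out.println(row[0] + " " + row[1]);
    }
}
```

If the cleanser emits a DOCTYPE or an XHTML namespace, the parser setup and the XPath expressions would need adjusting (for example, a namespace-aware builder and prefixed element names).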
//Nathan
Paul Lutus - 30 Sep 2004 17:10 GMT
> Hi
>
> I am using HTMLEditorKit.Parser class to parse a HTML file. However, I
> have found this Swing HTML parser extremely difficult to use.
Problem: "difficult".
> I am trying to parse a HTML file and extracting specific information from
> it into a table.
Problem: "trying".
> Consider the snippet of my HTML and the table I like it
> to generate:
You left out the table, the final goal of your program.
/ ...
> I would appreciate it very much if someone could solve this problem for
> me.
Which problem, "difficult" or "trying"? Children are both difficult and
trying, but this is not a specific complaint. Neither is yours.
Tell us what you wanted, what you got, and how they differ.
> I tried the Sun tutorial, but the examples aren't clear enough
> for me.
Clear enough to do what?

Paul Lutus
http://www.arachnoid.com
John K - 30 Sep 2004 18:04 GMT
TagSoup [http://mercury.ccil.org/~cowan/XML/tagsoup/] might fit the
bill.
-John K
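
For readers who want to stay with the Swing parser from the original post, here is a minimal sketch of a callback that separates text inside <PRE> from text outside it. It assumes the goal is only to collect those text chunks; turning them into table rows is left out. The class name PreTextCollector, its fields and the collect helper are hypothetical. ParserDelegator is the JDK's stock subclass of HTMLEditorKit.Parser, so no HTMLEditorKit subclass is needed to obtain a parser.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class PreTextCollector extends HTMLEditorKit.ParserCallback {
    final List<String> preText = new ArrayList<>();   // text seen inside <PRE>
    final List<String> otherText = new ArrayList<>(); // text seen elsewhere
    private boolean inPre = false;

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.PRE) inPre = true;   // entering a <PRE> block
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.PRE) inPre = false;  // leaving the <PRE> block
    }

    @Override
    public void handleText(char[] data, int pos) {
        String s = new String(data).trim();
        if (s.isEmpty()) return;               // skip whitespace-only chunks
        (inPre ? preText : otherText).add(s);
    }

    // Parses the reader's HTML and returns the filled collector.
    static PreTextCollector collect(Reader r) {
        PreTextCollector cb = new PreTextCollector();
        try {
            new ParserDelegator().parse(r, cb, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cb;
    }

    public static void main(String[] args) {
        String html = "<HTML><TITLE></TITLE><BODY><PRE>\nIdentifer: ABCDEFG\n</PRE>\n"
                + "data: 123456\n<PRE>\n</PRE></BODY></HTML>";
        PreTextCollector cb = collect(new StringReader(html));
        System.out.println("inside PRE:  " + cb.preText);
        System.out.println("outside PRE: " + cb.otherText);
    }
}
```

The key point is the inPre flag: handleText carries no information about the enclosing tag, so the callback has to remember it from handleStartTag and handleEndTag.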