Java Forum / General / September 2004

HTML Parser Help Please

ZOCOR - 30 Sep 2004 10:41 GMT
Hi

I am using the HTMLEditorKit.Parser class to parse an HTML file. However, I have
found this Swing HTML parser extremely difficult to use.

I am trying to parse an HTML file and extract specific information from it
into a table. Consider this snippet of my HTML and the table I would like to
generate from it:

HTML source:

<HTML>
<TITLE></TITLE>
<BODY>
<PRE>
   Identifer: ABCDEFG
</PRE>
   data: 123456
<PRE>
</PRE>
</BODY>
</HTML>

TABLE:

ABCDEFG    123456

Here is the code I have so far:

import javax.swing.text.*;
import javax.swing.text.html.*;
import java.io.*;

public class HTMLParser extends HTMLEditorKit
{
   public HTMLEditorKit.Parser getParser()
   {
       return super.getParser();
   }

   public static void main (String[] args)
   {
       try
       {
           Reader r = new FileReader("html_file.html");
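           HTMLEditorKit.Parser parse = new HTMLParser().getParser();
           // anonymous subclass that overrides only the callbacks of interest
           HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback()
           {
               public void handleStartTag(HTML.Tag t, MutableAttributeSet a,
                                          int pos)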
               {
                   if (t==HTML.Tag.PRE)
                   {
                           //print whats between the pre tag
                   }
               }
               public void handleText(char[] data, int pos)
               {
                   //print whats between the pre tags
               }
           };

           parse.parse(r, cb, true);
       }
       catch (IOException e)
       {
           System.out.println(e);
       }
   }
}

I would appreciate it very much if someone could solve this problem for me.
I tried the Sun tutorial, but the examples aren't clear enough for me.

Thanks

ZOCOR
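
For reference, here is a minimal sketch of how the callback could collect the two values from the snippet above. It assumes every wanted value follows a "label: value" pattern on its own line; the class and method names are made up for illustration. It uses ParserDelegator directly, which is the parser that HTMLEditorKit.getParser() returns by default, so no HTMLEditorKit subclass is needed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper; the "label: value" convention is an assumption
// taken from the sample HTML, not something the parser knows about.
class LabelValueExtractor {

    /** Returns the value part of every "label: value" line found in the text. */
    static List<String> extract(Reader html) {
        final List<String> values = new ArrayList<>();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called for each run of text, inside or outside <PRE>.
                for (String line : new String(data).split("\n")) {
                    int colon = line.indexOf(':');
                    if (colon >= 0) {
                        values.add(line.substring(colon + 1).trim());
                    }
                }
            }
        };
        try {
            // true = ignore any charset declared in the document
            new ParserDelegator().parse(html, cb, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<HTML>\n<TITLE></TITLE>\n<BODY>\n<PRE>\n"
                + "   Identifer: ABCDEFG\n</PRE>\n   data: 123456\n"
                + "<PRE>\n</PRE>\n</BODY>\n</HTML>";
        // Print the values as one table row
        System.out.println(String.join("    ", extract(new StringReader(html))));
    }
}
```

To read from a file instead, pass new FileReader("html_file.html") to extract(). If only the text inside PRE should count, a boolean flag set in handleStartTag and cleared in handleEndTag would be one way to filter.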
Nathan Zumwalt - 30 Sep 2004 16:14 GMT
I've never used this HTML Parser before, but I've done similar things
when scraping HTML off websites.  My general solution is to:

1.  Get the HTML as text (which you already have).
2.  Run it through an HTML to XHTML cleanser (I like JTidy)
3.  Parse the XHTML using Java's XML parsers.
4.  Use XPath statements to get the values I want.

This probably isn't very efficient for getting small bits of data, but
it works.
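
Steps 3 and 4 might look roughly like the sketch below. It assumes the cleanup step has already produced well-formed XHTML with lowercase tags, no DOCTYPE and no namespace declaration (the sample string is hand-written, not real JTidy output), and the class and method names are only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for steps 3 and 4; step 2 (the cleanser) is not shown.
class XhtmlXPathSketch {

    /** Evaluates an XPath expression against well-formed XHTML; returns the trimmed string result. */
    static String select(String xhtml, String expression) {
        try {
            // Step 3: parse the XHTML with the standard XML parser
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            // Step 4: pull out a value with XPath
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc).trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Hand-written stand-in for cleaned-up XHTML
        String xhtml = "<html><head><title></title></head><body>"
                + "<pre>   Identifer: ABCDEFG\n</pre>"
                + "   data: 123456\n"
                + "<pre></pre></body></html>";
        String id = select(xhtml, "substring-after(//pre[1], ':')");
        String data = select(xhtml, "substring-after(/html/body/text()[1], ':')");
        System.out.println(id + "    " + data);
    }
}
```

The XPath expressions depend entirely on what the cleaned document looks like, so they would need adjusting for real output.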

//Nathan

Paul Lutus - 30 Sep 2004 17:10 GMT
> Hi
>
> I am using the HTMLEditorKit.Parser class to parse an HTML file. However, I
> have found this Swing HTML parser extremely difficult to use.

Problem: "difficult".

> I am trying to parse an HTML file and extract specific information from
> it into a table.

Problem: "trying".

> Consider the snippet of my HTML and the table I like it
> to generate:

You left out the table, the final goal of your program.

/ ...

> I would appreciate it very much if someone could solve this problem for
> me.

Which problem, "difficult" or "trying"? Children are both difficult and
trying, but this is not a specific complaint. Neither is yours.

Tell us what you wanted, what you got, and how they differ.

> I tried the Sun tutorial, but the examples aren't clear enough
> for me.

Clear enough to do what?

Paul Lutus
http://www.arachnoid.com

John K - 30 Sep 2004 18:04 GMT
TagSoup [http://mercury.ccil.org/~cowan/XML/tagsoup/] might fit the
bill.

-John K

