
Java Forum / First Aid / August 2005


import org.ccil.mercury.TagSoup

hawat.thufir@gmail.com - 13 Aug 2005 23:08 GMT
The general idea:

"I've never used this HTML Parser before, but I've done similar things
when scraping HTML off websites.  My general solution is to:

1.  Get the HTML as text (which you already have).
2.  Run it through an HTML to XHTML cleanser (I like JTidy)
3.  Parse the XHTML using Java's XML parsers.
4.  Use XPath statements to get the values I want."
<http://groups-beta.google.com/group/comp.lang.java.programmer/msg/d3255204fcb5f19f?dmode=source>
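
Steps 3 and 4 of the quoted recipe can be sketched with nothing but the JDK's own XML APIs. This is a minimal illustration, assuming the HTML has already been cleaned into well-formed XHTML (steps 1 and 2); the class name XhtmlXPathSketch, the method evaluate, and the sample markup are made up for the example and do not come from the thread.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XhtmlXPathSketch {

    // Steps 3 and 4: parse well-formed XHTML text into a DOM tree, then
    // evaluate an XPath expression against it and return its string value.
    static String evaluate(String xhtml, String xpathExpression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Not namespace-aware here, so plain element names work in the XPath.
        factory.setNamespaceAware(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(xpathExpression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input standing in for cleaned-up HTML.
        String xhtml = "<html><head><title>Sample page</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(evaluate(xhtml, "/html/head/title"));
    }
}
```

If the cleaned document declares the XHTML namespace, the parser would need to be namespace-aware and the XPath would need a bound prefix; the sketch sidesteps that by using markup without a namespace.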

I'm trying to create an instance of TagSoup, but don't know how.
Alternatively, I'd import org...TagSoup, but again, I don't know how.
TagSoup: <http://mercury.ccil.org/~cowan/XML/tagsoup/>

I think I'm doing OK with the import for JTidy, but the TagSoup
declaration gets a compile error:

doCompile:
   [javac] Compiling 1 source file to /home/thufir/java/bin
   [javac] /home/thufir/java/src/atreides/parseHTML/ParseExperiment.java:32: package org.ccil.mercury does not exist
   [javac]     static org.ccil.mercury.TagSoup foo;
   [javac]                                ^
   [javac] 1 error

I've tried many variations.

Going by the TagSoup directions, it looks useful for my needs.  Run
from the command line, it certainly appears to output XHTML files,
which is a big step.  However, how is it run as a class from another
Java class?
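
One literal reading of "run it from another Java class" is to launch the same command line from Java. The sketch below does that with ProcessBuilder. It assumes the jar and HTML file names from the shell session in this post, and the names TagSoupCommandSketch, buildCommand, and run are invented for illustration. Separately, javac reports "package ... does not exist" when no class in that package is visible on the compile classpath, so calling the library directly requires both the jar on the classpath and the package name actually used inside the jar.

```java
import java.util.ArrayList;
import java.util.List;

class TagSoupCommandSketch {

    // Builds the same command line that was typed in the shell session:
    //   java -jar <jarPath> --files <htmlFile>
    static List<String> buildCommand(String jarPath, String htmlFile) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-jar");
        command.add(jarPath);
        command.add("--files");
        command.add(htmlFile);
        return command;
    }

    // Starts the child JVM, waits for it, and returns its exit code.
    static int run(List<String> command) throws Exception {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.inheritIO(); // let the child's console output show through
        Process process = builder.start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Assumed relative paths; adjust to where the jar and HTML file live.
        List<String> command = buildCommand("tagsoup-1.0rc3.jar", "google.html");
        System.out.println(command);
        // Only launch the process when explicitly asked to.
        if (args.length > 0 && args[0].equals("--run")) {
            System.out.println("exit code: " + run(command));
        }
    }
}
```

Spawning a second JVM per file is slow and awkward to handle errors from, so this is mainly useful as a stopgap until the library's own API is wired in.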

[thufir@arrakis tagSoup]$
[thufir@arrakis tagSoup]$ date
Sat Aug 13 22:15:59 IST 2005
[thufir@arrakis tagSoup]$ pwd
/home/thufir/Desktop/tagSoup
[thufir@arrakis tagSoup]$ ll
total 12
-rw-rw-r--  1 thufir thufir 7662 Aug 13 22:08 google.html
[thufir@arrakis tagSoup]$ ll
total 60
-rw-rw-r--  1 thufir thufir  7662 Aug 13 22:08 google.html
-rw-rw-r--  1 thufir thufir 42207 Aug 13 22:16 tagsoup-1.0rc3.jar
[thufir@arrakis tagSoup]$ java -jar tagsoup-1.0rc3.jar --files google.html
src: google.html dst: google.xhtml
[thufir@arrakis tagSoup]$ ll
total 76
-rw-rw-r--  1 thufir thufir  7662 Aug 13 22:08 google.html
-rw-rw-r--  1 thufir thufir 10568 Aug 13 22:16 google.xhtml
-rw-rw-r--  1 thufir thufir 42207 Aug 13 22:16 tagsoup-1.0rc3.jar
[thufir@arrakis tagSoup]$ rm *
rm: remove regular file `google.html'? n
rm: remove regular file `google.xhtml'? y
rm: remove regular file `tagsoup-1.0rc3.jar'? y
[thufir@arrakis tagSoup]$ ll
total 12
-rw-rw-r--  1 thufir thufir 7662 Aug 13 22:08 google.html
[thufir@arrakis tagSoup]$ cd /home/thufir/java/
[thufir@arrakis java]$ ant
Buildfile: build.xml

clean:
  [delete] Deleting directory /home/thufir/java/bin

prepare:
   [mkdir] Created dir: /home/thufir/java/bin
   [mkdir] Created dir: /home/thufir/java/bin/lib
    [copy] Copying 1 file to /home/thufir/java/bin/lib

compile:

doCompile:
   [javac] Compiling 1 source file to /home/thufir/java/bin

package:

doPackage:
     [jar] Building jar: /home/thufir/java/bin/ParseExperiment.jar

BUILD SUCCESSFUL
Total time: 5 seconds
[thufir@arrakis java]$ java -jar bin/ParseExperiment.jar
..main
..iterateFiles
files[0]                        google.html
..parseFile
fileName                                google.html
adding path to file name..
fileName                                /home/thufir/Desktop/tagSoup/google.html

Tidy (vers 4th August 2000) Parsing "InputStream"
line 2 column 1 - Warning: <style> lacks "type" attribute
line 5 column 1 - Warning: <script> lacks "type" attribute
line 13 column 1 - Warning: <table> lacks "summary" attribute
line 14 column 529 - Warning: <table> lacks "summary" attribute
line 14 column 1,482 - Warning: <img> lacks "src" attribute
line 14 column 1,582 - Warning: <table> lacks "summary" attribute
line 15 column 1 - Warning: <tr> unknown attribute value "center"
line 15 column 1 - Warning: <tr> unknown attribute value "middle"
line 15 column 106 - Warning: <script> lacks "type" attribute
line 15 column 518 - Warning: <table> lacks "summary" attribute
line 15 column 518 - Warning: trimming empty <p>
line 15 column 618 - Warning: <img> lacks "src" attribute
line 27 column 1 - Warning: trimming empty <p>
line 39 column 1 - Warning: trimming empty <p>
line 51 column 1 - Warning: trimming empty <p>
line 54 column 1 - Warning: trimming empty <p>
line 54 column 11 - Warning: missing <tr>
line 54 column 108 - Warning: <img> lacks "src" attribute
line 54 column 423 - Warning: <table> lacks "summary" attribute
line 55 column 11 - Warning: <table> lacks "summary" attribute

InputStream: Document content looks like HTML 4.01 Transitional
20 warnings/errors were found!

..runTagSoupOnFile..
fileName                                google.html
adding path to file name..
fileName                                /home/thufir/Desktop/tagSoup/google.html
[thufir@arrakis java]$ cat src/atreides/parseHTML/ParseExperiment.java
package atreides.parseHTML;

/*
 *       This class is designed to take HTML files as input and
 *       generate XML files as output.  The class can take
 *       multiple files, generating one XML file for each
 *       corresponding HTML file.
 *
 *       At the moment most everything is hard-coded in.
 *       A FileChooser dialog is in the works for a GUI
 *       front end.
 *
 *       More critical will be processing the XML.  This
 *       is a sort of "screen scraper" to pull data from
 *       a specified internet site, or file.  The data
 *       will probably be inserted into MySql down the road.
 */
public class ParseExperiment
{
        //hard coded, sample HTML files
        static String pathToFiles = "/home/thufir/Desktop/tagSoup";

        static org.w3c.tidy.Tidy converter = new org.w3c.tidy.Tidy();
        //if Tidy is part of the J2SE, why is there a JTidy project
        //with a file of the same name?

        //static instance of TagSoup.jar
        //How is the object instantiated?  TagSoup tConv = new TagSoup(); ?
        //Where are the directions on that?

        /**
         *       @param  args arguments from the command line
         */
        public static void main (String[] args)
        {
                System.out.println("..main");
                iterateFiles();
        }//main

        /*
         *       generates a list of files
         *       which are passed to the run...OnFile methods
         */
        private static void iterateFiles()
        {
                System.out.println("..iterateFiles");

                //get the list of the directory's contents,
                //process each file.  Assume each file is
                //HTML, assume the path to the directory
                //is ok.
                String[] files = new java.io.File(pathToFiles).list();
                for (int i=0; i<files.length; ++i)
                {
                        System.out.println("files[" + i + "]" + "\t\t\t" + files[i]);
                        runTidyOnFile(files[i],i);
                        runTagSoupOnFile(files[i],i);
                }//for
        }//iterateFiles

        /**
         *       @param fileName the file to parse
         *       @param i the number of the file
         */
        private static void runTidyOnFile(String fileName, int i)
        {
                boolean xmlOutput = true;
                System.out.println("..parseFile");

                System.out.println("fileName\t\t\t\t" + fileName);
                System.out.println("adding path to file name..");
                fileName = pathToFiles +  "/" + fileName;
                System.out.println("fileName\t\t\t\t" + fileName);

                converter.setXmlOut(xmlOutput);

                //is the try block set up correctly?  is the
                //correct type of error caught?
                //output is currently to the screen.
                //need to send output to a file so that
                //tag soup can work on it.
                try
                {
                        org.w3c.dom.Document converted =
                                converter.parseDOM(new java.io.FileInputStream(fileName), null);
                } catch (java.io.FileNotFoundException fnee)
                {
                        System.out.println("..catch.." + fnee);
                }//catch
        }//runTidyOnFile

        /**
         *       @param fileName the file to parse
         *       @param i the number of the file
         */
        private static void runTagSoupOnFile(String fileName, int i)
        {
                System.out.println("..runTagSoupOnFile..");
                System.out.println("fileName\t\t\t\t" + fileName);
                System.out.println("adding path to file name..");
                fileName = pathToFiles + "/" + fileName;
                System.out.println("fileName\t\t\t\t" + fileName);

                //how is a tag soup object instantiated?
                //just copy the try/catch block from runTidyOnFile?
                //will tag soup create a file, or must the
                //equivalent of "java -jar tagsoup.jar --files foo.html"
                //get passed to the tag soup instance?
        }//runTagSoupOnFile
}//ParseExperiment
[thufir@arrakis java]$ date
Sat Aug 13 22:18:09 IST 2005
[thufir@arrakis java]$
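
The comments in runTidyOnFile ask whether the try block is set up correctly and how to get the output into a file. A minimal sketch of one arrangement follows: both streams are opened in a try-with-resources block so they are closed even on failure, and the output name is derived from the input name as in the google.html to google.xhtml run above. The Converter interface is a hypothetical stand-in for the real conversion call and is not part of JTidy or TagSoup; ConvertToFileSketch, outputPathFor, and convertFile are likewise invented names, and the pass-through converter in main only copies bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ConvertToFileSketch {

    // Hypothetical stand-in for whatever does the real conversion work.
    // It is not an interface from JTidy or TagSoup.
    interface Converter {
        void convert(InputStream in, OutputStream out) throws IOException;
    }

    // google.html -> google.xhtml, mirroring the names in the shell session.
    static Path outputPathFor(Path htmlFile) {
        String name = htmlFile.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = (dot > 0) ? name.substring(0, dot) : name;
        return htmlFile.resolveSibling(base + ".xhtml");
    }

    // Opens both streams in try-with-resources so they are closed even if
    // the converter throws, and returns the path that was written.
    static Path convertFile(Path htmlFile, Converter converter) throws IOException {
        Path target = outputPathFor(htmlFile);
        try (InputStream in = Files.newInputStream(htmlFile);
             OutputStream out = Files.newOutputStream(target)) {
            converter.convert(in, out);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("convert-sketch");
        Path html = dir.resolve("google.html");
        Files.write(html, "<html></html>".getBytes(StandardCharsets.UTF_8));

        // A pass-through converter that only copies bytes, for demonstration.
        Path written = convertFile(html, (in, out) -> {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        });
        System.out.println(written.getFileName());
    }
}
```

Catching IOException rather than only FileNotFoundException covers read and write failures as well as a missing file, since FileNotFoundException is a subclass of IOException.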

Thanks,

Thufir
hawat.thufir@gmail.com - 13 Aug 2005 23:12 GMT
"

import java.net.URL;
import org.apache.xalan.xsltc.trax.SAX2DOM;
import org.apache.xpath.XPathAPI;
import org.apache.xpath.objects.XObject;
import org.ccil.cowan.tagsoup.Parser;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class example {
    public final static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com");
        Parser p = new Parser();
        SAX2DOM sax2dom = new SAX2DOM();
        p.setContentHandler(sax2dom);
        p.parse(new InputSource(url.openStream()));
        Node doc = sax2dom.getDOM();
        String titlePath = "/html:html/html:head/html:title";
        XObject title = XPathAPI.eval(doc,titlePath);
        System.out.println("Title is '"+title+"'");
    }
}

"
<http://www.hackdiary.com/archives/000041.html>
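
The quoted example depends on Xalan classes (SAX2DOM, XPathAPI). As a rough alternative sketch using only standard JAXP, an identity Transformer can collect SAX events from any XMLReader into a DOM, and javax.xml.xpath can evaluate the same "html:"-prefixed path once the prefix is bound through a NamespaceContext. The JDK's own SAX parser stands in for the reader here so the code runs without extra jars; since TagSoup's Parser is also an XMLReader, the assumption is that it could be passed to toDom in its place. SaxToDomSketch, toDom, evalXhtml, and jdkReader are names invented for this illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxToDomSketch {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Pulls SAX events from any XMLReader and collects them into a DOM tree
    // by running an identity transform into a DOMResult.
    static Node toDom(XMLReader reader, InputSource input) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        identity.transform(new SAXSource(reader, input), result);
        return result.getNode();
    }

    // Evaluates an XPath whose "html" prefix is bound to the XHTML namespace.
    static String evalXhtml(Node doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "html" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                        ? Collections.singletonList("html").iterator()
                        : Collections.<String>emptyIterator();
            }
        });
        return xpath.evaluate(expression, doc);
    }

    // Stand-in reader: the JDK's own namespace-aware SAX parser.
    static XMLReader jdkReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical in-memory document instead of fetching a URL.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Example Domain</title></head><body/></html>";
        Node doc = toDom(jdkReader(), new InputSource(new StringReader(xhtml)));
        System.out.println("Title is '"
                + evalXhtml(doc, "/html:html/html:head/html:title") + "'");
    }
}
```

The JDK parser only accepts well-formed input, so this stand-in demonstrates the SAX-to-DOM and XPath plumbing, not the cleanup of messy HTML that a tolerant parser is for.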

Oh, think I found it.

-Thufir

