The general idea:
"I've never used this HTML Parser before, but I've done similar things
when scraping HTML off websites. My general solution is to:
1. Get the HTML as text (which you already have).
2. Run it through an HTML-to-XHTML cleanser (I like JTidy).
3. Parse the XHTML using Java's XML parsers.
4. Use XPath statements to get the values I want."
<http://groups-beta.google.com/group/comp.lang.java.programmer/msg/d3255204fcb5f19f?dmode=source>
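Steps 3 and 4 of the quoted list can be sketched with nothing but the
JDK. This is only a minimal sketch, assuming the cleanser has already
produced well-formed XHTML; the class name XhtmlXPathSketch, the
evaluate helper, and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of JTidy or TagSoup.
final class XhtmlXPathSketch {

    /** Parses well-formed XHTML text and evaluates an XPath expression as a string. */
    static String evaluate(String xhtml, String expression) throws Exception {
        // Step 3: parse the XHTML with the JDK's DOM parser.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // Step 4: pull a value out with an XPath expression.
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Sample markup without a namespace declaration, so plain paths match.
        String xhtml = "<html><head><title>Sample page</title></head>"
                + "<body><p class=\"price\">42</p></body></html>";
        System.out.println(evaluate(xhtml, "/html/head/title"));
        System.out.println(evaluate(xhtml, "//p[@class='price']"));
    }
}
```

Raw HTML that is not well-formed would make builder.parse throw, which
is why the cleansing step comes first.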
I'm trying to create an instance of TagSoup, but don't know how.
Alternatively, I could import org...TagSoup, but again, I don't know
how.
TagSoup: <http://mercury.ccil.org/~cowan/XML/tagsoup/>
I'm doing OK with the import for JTidy, I think, but the TagSoup
declaration gets errors:
doCompile:
[javac] Compiling 1 source file to /home/thufir/java/bin
[javac] /home/thufir/java/src/atreides/parseHTML/ParseExperiment.java:32: package org.ccil.mercury does not exist
[javac] static org.ccil.mercury.TagSoup foo;
[javac] ^
[javac] 1 error
I've tried many variations.
Going by the TagSoup directions, it looks useful for my needs. Run
from the command line, it certainly outputs XHTML files, which is a
big step. However, how is it run as a class from another Java class?
[thufir@arrakis tagSoup]$
[thufir@arrakis tagSoup]$ date
Sat Aug 13 22:15:59 IST 2005
[thufir@arrakis tagSoup]$ pwd
/home/thufir/Desktop/tagSoup
[thufir@arrakis tagSoup]$ ll
total 12
-rw-rw-r-- 1 thufir thufir 7662 Aug 13 22:08 google.html
[thufir@arrakis tagSoup]$ ll
total 60
-rw-rw-r-- 1 thufir thufir 7662 Aug 13 22:08 google.html
-rw-rw-r-- 1 thufir thufir 42207 Aug 13 22:16 tagsoup-1.0rc3.jar
[thufir@arrakis tagSoup]$ java -jar tagsoup-1.0rc3.jar --files google.html
src: google.html dst: google.xhtml
[thufir@arrakis tagSoup]$ ll
total 76
-rw-rw-r-- 1 thufir thufir 7662 Aug 13 22:08 google.html
-rw-rw-r-- 1 thufir thufir 10568 Aug 13 22:16 google.xhtml
-rw-rw-r-- 1 thufir thufir 42207 Aug 13 22:16 tagsoup-1.0rc3.jar
[thufir@arrakis tagSoup]$ rm *
rm: remove regular file `google.html'? n
rm: remove regular file `google.xhtml'? y
rm: remove regular file `tagsoup-1.0rc3.jar'? y
[thufir@arrakis tagSoup]$ ll
total 12
-rw-rw-r-- 1 thufir thufir 7662 Aug 13 22:08 google.html
[thufir@arrakis tagSoup]$ cd /home/thufir/java/
[thufir@arrakis java]$ ant
Buildfile: build.xml
clean:
[delete] Deleting directory /home/thufir/java/bin
prepare:
[mkdir] Created dir: /home/thufir/java/bin
[mkdir] Created dir: /home/thufir/java/bin/lib
[copy] Copying 1 file to /home/thufir/java/bin/lib
compile:
doCompile:
[javac] Compiling 1 source file to /home/thufir/java/bin
package:
doPackage:
[jar] Building jar: /home/thufir/java/bin/ParseExperiment.jar
BUILD SUCCESSFUL
Total time: 5 seconds
[thufir@arrakis java]$ java -jar bin/ParseExperiment.jar
..main
..iterateFiles
files[0] google.html
..parseFile
fileName google.html
adding path to file name..
fileName
/home/thufir/Desktop/tagSoup/google.html
Tidy (vers 4th August 2000) Parsing "InputStream"
line 2 column 1 - Warning: <style> lacks "type" attribute
line 5 column 1 - Warning: <script> lacks "type" attribute
line 13 column 1 - Warning: <table> lacks "summary" attribute
line 14 column 529 - Warning: <table> lacks "summary" attribute
line 14 column 1,482 - Warning: <img> lacks "src" attribute
line 14 column 1,582 - Warning: <table> lacks "summary" attribute
line 15 column 1 - Warning: <tr> unknown attribute value "center"
line 15 column 1 - Warning: <tr> unknown attribute value "middle"
line 15 column 106 - Warning: <script> lacks "type" attribute
line 15 column 518 - Warning: <table> lacks "summary" attribute
line 15 column 518 - Warning: trimming empty <p>
line 15 column 618 - Warning: <img> lacks "src" attribute
line 27 column 1 - Warning: trimming empty <p>
line 39 column 1 - Warning: trimming empty <p>
line 51 column 1 - Warning: trimming empty <p>
line 54 column 1 - Warning: trimming empty <p>
line 54 column 11 - Warning: missing <tr>
line 54 column 108 - Warning: <img> lacks "src" attribute
line 54 column 423 - Warning: <table> lacks "summary" attribute
line 55 column 11 - Warning: <table> lacks "summary" attribute
InputStream: Document content looks like HTML 4.01 Transitional
20 warnings/errors were found!
..runTagSoupOnFile..
fileName google.html
adding path to file name..
fileName
/home/thufir/Desktop/tagSoup/google.html
[thufir@arrakis java]$ cat src/atreides/parseHTML/ParseExperiment.java
package atreides.parseHTML;

/*
 * This class is designed to take HTML files as input and
 * generate XML files as output.  The class can take
 * multiple files, generating one XML file for each
 * corresponding HTML file.
 *
 * At the moment most everything is hard-coded in.
 * A FileChooser dialog is in the works for a GUI
 * front end.
 *
 * More critical will be processing the XML.  This
 * is a sort of "screen scraper" to pull data from
 * a specified internet site, or file.  The data
 * will probably be inserted into MySQL down the road.
 */
public class ParseExperiment
{
    static String pathToFiles = "/home/thufir/Desktop/tagSoup";
    //hard coded, sample HTML files

    static org.w3c.tidy.Tidy converter = new org.w3c.tidy.Tidy();
    //if Tidy is part of the J2SE, why is there a JTidy project
    //with a file of the same name?

    //static instance of TagSoup.jar
    //How is the object instantiated?  TagSoup tConv = new TagSoup(); ?
    //Where are the directions on that?

    /*
     * @param args arguments from the command line
     */
    public static void main (String[] args)
    {
        System.out.println("..main");
        iterateFiles();
    }//main

    /*
     * generates a list of files
     * which are passed to the parse methods
     */
    private static void iterateFiles()
    {
        System.out.println("..iterateFiles");

        //get the list of the directory's contents,
        //process each file.  Assume each file is
        //HTML, assume the path to the directory
        //is ok.
        String[] files = new java.io.File(pathToFiles).list();
        for (int i=0; i<files.length; ++i)
        {
            System.out.println("files[" + i + "]" + "\t\t\t" + files[i]);
            runTidyOnFile(files[i],i);
            runTagSoupOnFile(files[i],i);
        }//for
    }//iterateFiles

    /*
     * @param fileName the file to parse
     * @param i the number of the file
     */
    private static void runTidyOnFile(String fileName, int i)
    {
        boolean xmlOutput = true;
        System.out.println("..parseFile");

        System.out.println("fileName\t\t\t\t" + fileName);
        System.out.println("adding path to file name..");
        fileName = pathToFiles + "/" + fileName;
        System.out.println("fileName\t\t\t\t" + fileName);

        converter.setXmlOut(xmlOutput);

        //is the try block set up correctly?  is the
        //correct type of error caught?
        //output is currently to the screen.
        //need to send output to a file so that
        //tag soup can work on it.
        try
        {
            org.w3c.dom.Document converted =
                converter.parseDOM(new java.io.FileInputStream(fileName), null);
        } catch (java.io.FileNotFoundException fnee)
        {
            System.out.println("..catch.." + fnee);
        }//catch
    }//runTidyOnFile

    /*
     * @param fileName the file to parse
     * @param i the number of the file
     */
    private static void runTagSoupOnFile(String fileName, int i)
    {
        System.out.println("..runTagSoupOnFile..");
        System.out.println("fileName\t\t\t\t" + fileName);
        System.out.println("adding path to file name..");
        fileName = pathToFiles + "/" + fileName;
        System.out.println("fileName\t\t\t\t" + fileName);

        //how is a tag soup object instantiated?
        //just copy the try/catch block from runTidyOnFile?
        //will tag soup create a file, or must the
        //equivalent of "java -jar tagsoup.jar --files foo.html"
        //get passed to the tag soup instance?
    }//runTagSoupOnFile
}//ParseExperiment
[thufir@arrakis java]$ date
Sat Aug 13 22:18:09 IST 2005
[thufir@arrakis java]$
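The comments in runTidyOnFile note that the output currently goes to
the screen and needs to go to a file. One possible way to do that, as
a rough sketch using only the JDK, is to serialize an
org.w3c.dom.Document with a Transformer. The class DomToFileSketch,
the write helper, and the hand-built sample document are hypothetical,
and whether the Document returned by JTidy's parseDOM can be fed
through a Transformer like this is an assumption I have not checked.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, not part of JTidy or TagSoup.
final class DomToFileSketch {

    /** Serializes a DOM document to the given file as UTF-8 XML. */
    static void write(Document doc, File target) throws Exception {
        // A Transformer created without a stylesheet copies its input unchanged.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        try (OutputStream out = new FileOutputStream(target)) {
            transformer.transform(new DOMSource(doc), new StreamResult(out));
        }
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny stand-in document rather than relying on JTidy here.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element head = doc.createElement("head");
        Element title = doc.createElement("title");
        title.setTextContent("Sample");
        head.appendChild(title);
        html.appendChild(head);
        doc.appendChild(html);

        File target = File.createTempFile("converted", ".xml");
        target.deleteOnExit();
        write(doc, target);

        // Read the file back to show what was written.
        System.out.println(new String(Files.readAllBytes(target.toPath()), StandardCharsets.UTF_8));
    }
}
```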
Thanks,
Thufir
hawat.thufir@gmail.com - 13 Aug 2005 23:12 GMT
"
import java.net.URL;
import org.apache.xalan.xsltc.trax.SAX2DOM;
import org.apache.xpath.XPathAPI;
import org.apache.xpath.objects.XObject;
import org.ccil.cowan.tagsoup.Parser;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class example {
    public final static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com");
        Parser p = new Parser();
        SAX2DOM sax2dom = new SAX2DOM();
        p.setContentHandler(sax2dom);
        p.parse(new InputSource(url.openStream()));
        Node doc = sax2dom.getDOM();
        String titlePath = "/html:html/html:head/html:title";
        XObject title = XPathAPI.eval(doc,titlePath);
        System.out.println("Title is '"+title+"'");
    }
}
"
<http://www.hackdiary.com/archives/000041.html>
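The quoted example leans on Xalan's XPathAPI. For comparison, here is
a rough sketch of the same title lookup using the JDK's own
javax.xml.xpath package instead, which is a swapped-in technique and
not what the linked article does. It assumes the parsed elements sit
in the XHTML namespace, as the html: prefix in the quoted path
suggests. The class NamespacedTitleSketch, the HtmlPrefix resolver,
and the sample markup are hypothetical names for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of TagSoup or Xalan.
final class NamespacedTitleSketch {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Maps only the "html" prefix to the XHTML namespace. */
    static final class HtmlPrefix implements NamespaceContext {
        @Override
        public String getNamespaceURI(String prefix) {
            return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return XHTML_NS.equals(namespaceURI) ? "html" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            String prefix = getPrefix(namespaceURI);
            return prefix == null
                    ? Collections.<String>emptyIterator()
                    : Collections.singletonList(prefix).iterator();
        }
    }

    /** Returns the text of the title element of a namespaced XHTML document. */
    static String title(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Namespace awareness is off by default and is needed for prefixed paths.
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new HtmlPrefix());
        return xpath.evaluate("/html:html/html:head/html:title", doc);
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"" + XHTML_NS + "\">"
                + "<head><title>Example Domain</title></head><body/></html>";
        System.out.println("Title is '" + title(xhtml) + "'");
    }
}
```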
Oh, think I found it.
-Thufir