Hi
I'm trying to extract text from an HTML page using DOM. I ran JTidy
on it first. The HTML itself is not very descriptive: there are no
distinctive tags around the text I need to extract. I was
thinking of doing it by accessing the attributes, but I keep getting a
NullPointerException. This is the HTML:
<div class="mb16">
<div id="r_t0" class="prel">
<a id="r0_t" class="L4" href="http://java.sun.com/">
<b>Java</b> Technology</a></div>
<div class="T1" id="r0_a">Sun's home for <b>Java</b>. Offers
Windows, Solaris, and Linux <b>Java</b> Development Kits (JDKs),
extensions, news, tutorials, and product information.</div>
<div id="r_b0" class="prel T11"><a id="r0_b"
href="http://java.sun.com/">
<img src="http://sp.ask.com/sh/i/icon_bins.gif" border="0"class="bb"
/></a>
<span id="r0_u" class="T10">java.sun.com/</span>
<strong>·</strong> <a class="L5 nw"
href="http://www.askcache.com">
Cached</a> <strong>·</strong>
<a class="L5 L5V" href="javascript:void(0)">Save</a>
</div>
</div>
This is the part I want to skip to in order to extract the text. It's buried in loads
of other HTML. Can anyone please help me do this?
Daniel Pitts - 16 Jan 2007 22:22 GMT
> Hi
> I'm trying to extract text from a html page useing DOM. I used JTidy
[quoted text clipped - 24 lines]
> This is the part I want to skip to to extract text. Its buried in loads
> of other HTML. Cany anyone please help me do this.
The example HTML is a good start, but you should also give us
the code that produces the NPE and the output you expect.
Also, if it's a valid XML document, consider using
XPath; it selects data based on the path to that data (including
selections based on element names, attributes, order, etc.).
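As a rough sketch of the XPath suggestion, the standard javax.xml.xpath API can select nodes by their class attribute. This assumes the tidied page is well-formed XML. The fragment in sample() is a cut-down stand-in for the posted HTML, parsed with the JDK's own parser rather than JTidy, and the names XPathSketch, sample and select are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Hypothetical helper: trimmed text content of every node the expression matches.
    public static List<String> select(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                out.add(hits.item(i).getTextContent().trim());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Assumed stand-in for the tidied page: a small well-formed fragment.
    public static Document sample() {
        String xml = "<html><body><div class=\"mb16\">"
                + "<div class=\"prel\"><a href=\"http://java.sun.com/\"><b>Java</b> Technology</a></div>"
                + "<div class=\"T1\">Sun's home for <b>Java</b>.</div>"
                + "<span class=\"T10\">java.sun.com/</span>"
                + "</div></body></html>";
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = sample();
        // Title link, description and displayed URL inside the mb16 block.
        System.out.println(select(doc, "//div[@class='mb16']/div[@class='prel']/a"));
        System.out.println(select(doc, "//div[@class='mb16']/div[@class='T1']"));
        System.out.println(select(doc, "//div[@class='mb16']//span[@class='T10']"));
    }
}
```

Note that @class='prel' is an exact match, so it would not select a div whose class is "prel T11"; contains(@class, 'prel') is the looser alternative.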
Damo - 16 Jan 2007 22:46 GMT
If I can get at the first div I can get its child nodes. How would one
use XPath to get it?
The code below is what I have:
NodeList sections = document.getElementsByTagName("div");
for(int i=0; i<sections.getLength();i++)
{
    Element section =(Element)sections.item(i);
    Attr attr = (Attr)section.getAttributeNode("class");
    boolean wasSpecified = attr != null && attr.getSpecified();
    String at = attr.getValue();
    if(at=="mb16")
    {
        //I have a recursive method to get the text nodes for here
        //if I can get at the child nodes of that particular div
    }
}
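A note on the loop above: getAttributeNode returns null for any div that has no class attribute, and attr.getValue() is called whether or not wasSpecified is true, which is one likely source of the NullPointerException. A minimal sketch of a guarded version follows. The class and method names are hypothetical, and the small XML string only stands in for the real page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ClassAttributeSketch {

    // Hypothetical helper: every <div> whose class attribute equals className.
    public static List<Element> divsWithClass(Document doc, String className) {
        List<Element> matches = new ArrayList<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            Attr attr = div.getAttributeNode("class"); // null when the div has no class
            // Check for null before touching the value, and compare with equals().
            if (attr != null && className.equals(attr.getValue())) {
                matches.add(div);
            }
        }
        return matches;
    }

    // Assumed stand-in for the tidied page.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<body><div>no class here</div>"
                + "<div class=\"mb16\">wanted</div><div class=\"T1\">other</div></body>");
        for (Element div : divsWithClass(doc, "mb16")) {
            System.out.println(div.getTextContent());
        }
    }
}
```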
Damo - 16 Jan 2007 23:05 GMT
Oh and the output I want is
Java Technology
Sun's home for Java Offers
Windows, Solaris, and Linux Java Development Kits (JDKs),
extensions, news, tutorials, and product information.
java.sun.com/
all stored as 3 different Strings
Damo - 17 Jan 2007 12:22 GMT
I'm now using this code. It finds the div nodes with the class attribute
"prel", but it will not get their child nodes.
if(attr.getValue()=="prel"): Is there something wrong with this line?
NodeList sections = document.getElementsByTagName("div");
System.out.println(sections.getLength());
for(int i=0; i<sections.getLength();i++)
{
    Element section =(Element)sections.item(i);
    Attr attr = (Attr)section.getAttributeNode("class");
    if(attr==null)
    {
        System.out.println("false");
    }
    else
    {
        System.out.println(attr.getValue());
        if(attr.getValue()=="prel")
        {
            NodeList name = section.getChildNodes();
            System.out.println(name.getLength());
            for(int j=0; j<name.getLength();j++)
            {
                Element list = (Element)name.item(j);
                String title = getText(list.getFirstChild());
                System.out.println(title);
            }
        }
    }
}
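A separate point about the inner loop: getChildNodes() returns text nodes (including whitespace between tags, if the parser keeps it) as well as elements, so the cast to Element may throw a ClassCastException once the branch is reached. A rough sketch that checks the node type first is shown here. collectText is only a guess at what a recursive getText might look like, and all names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ChildTextSketch {

    // Text of each child *element* of parent; text and comment children are skipped.
    public static List<String> childElementTexts(Element parent) {
        List<String> texts = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int j = 0; j < children.getLength(); j++) {
            Node child = children.item(j);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                texts.add(collectText(child).trim());
            }
        }
        return texts;
    }

    // Assumed shape of a recursive text collector: concatenates all descendant text nodes.
    public static String collectText(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            return node.getNodeValue();
        }
        StringBuilder sb = new StringBuilder();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            sb.append(collectText(c));
        }
        return sb.toString();
    }

    // Assumed stand-in for the tidied page.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div class=\"prel\">\n<a><b>Java</b> Technology</a>\n</div>");
        System.out.println(childElementTexts(doc.getDocumentElement()));
    }
}
```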
Andrew Thompson - 17 Jan 2007 14:14 GMT
> I'm now using this code. It finds the div nodes with the attribute
> "pre1", but it wil not get its child nodes.
// compares references to the two strings
> if(attr.getValue()=="prel"): Is there something wrong with this line?
// compares contents of strings
if(attr.getValue().equals("prel"))
Andrew T.
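To illustrate the difference between the two comparisons, here is a small self-contained sketch. The new String("prel") merely stands in for a value built at runtime, such as the result of attr.getValue(); the class and method names are made up.

```java
public class StringCompareSketch {

    // Reference comparison: true only if both variables point at the same object.
    public static boolean sameReference(String a, String b) {
        return a == b;
    }

    // Content comparison: true if the character sequences match.
    public static boolean sameContents(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String fromParser = new String("prel"); // stand-in for attr.getValue()
        System.out.println(sameReference(fromParser, "prel")); // false
        System.out.println(sameContents(fromParser, "prel"));  // true
        // Putting the literal first also avoids an NPE if the value is null.
        System.out.println("prel".equals(null));               // false
    }
}
```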