Hi:
I have a little problem. I would like to iterate through the tags of an
HTML document programmatically, and I would rather not do the text
processing by hand.
I have done a little digging and it looks like HTMLDocument and
HTMLDocument.Iterator will do the job.
Except, wherever I look, there is no indication how to load an actual
HTML document downloaded from the Web into an HTMLDocument instance. I
have found that it is somehow done through HTMLDocument.HTMLReader but
nowhere is it explained exactly how.
I would appreciate any help or code samples.
Thanks
MT
Andy Flowers - 18 Apr 2004 21:23 GMT
Take a look at HTMLEditorKit and its read method. Here's a very simple
example that just dumps the element tree of an HTML document.
You could also look at the HTMLDocument.getIterator(..) method for parsing
specific tags.
import javax.swing.text.html.*;
import java.io.*;
import java.net.*;
import javax.swing.text.*;
import java.util.*;
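public class Sampler
{
    // Print an element's name and attribute values, then recurse into its
    // children, indenting one space per nesting level.
    public static void parseElement( Element elem, int offset )
    {
        for( int y = 0; y < offset; y++ )
        {
            System.out.print(" ");
        }
        System.out.println(elem.getName());
        AttributeSet as = elem.getAttributes();
        Enumeration<?> en = as.getAttributeNames();
        while( en.hasMoreElements())
        {
            for( int y = 0; y < offset; y++ )
            {
                System.out.print(" ");
            }
            System.out.println("<" + as.getAttribute(en.nextElement()) + ">");
        }
        for( int x = 0; x < elem.getElementCount(); x++ )
        {
            parseElement( elem.getElement(x), offset+1);
        }
    }

    public static void main(String[] args)
    {
        try
        {
            HTMLEditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = new HTMLDocument();
            // Without this, a page that declares its charset in a <meta> tag
            // may make the parser throw ChangedCharSetException (an IOException).
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
            URL pageontheweb = new URL("http://java.sun.com");
            InputStream is = new BufferedInputStream(pageontheweb.openStream());
            try
            {
                // Parse the stream into the document, starting at offset 0.
                kit.read(is, doc, 0);
            }
            finally
            {
                is.close();
            }
            Element[] elems = doc.getRootElements();
            for( int x = 0; x < elems.length; x++ )
            {
                System.out.println(elems[x].getClass().getName());
                parseElement( elems[x], 0 );
            }
        }
        catch(BadLocationException ex)
        {
            // Report the failure instead of swallowing it silently.
            ex.printStackTrace();
        }
        catch(MalformedURLException ex)
        {
            ex.printStackTrace();
        }
        catch(IOException ex)
        {
            ex.printStackTrace();
        }
    }
}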
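For the getIterator(..) route mentioned above, here is a minimal sketch that
collects the href of every <a> tag. The class name LinkLister, the method
extractLinks and the sample HTML string are made up for illustration; the
document is loaded with the same kit.read call as in Sampler, only from a
Reader instead of a URL stream. As far as I know getIterator only handles
non-block tags such as A and returns null for block tags, so the loop guards
against that.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class, not part of the original example.
class LinkLister
{
    // Parse HTML from the reader and return the href of each <a> tag found.
    static List<String> extractLinks( Reader in )
        throws IOException, BadLocationException
    {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = new HTMLDocument();
        // Assumption: we do not want a <meta> charset declaration to abort
        // the parse, so the directive is ignored.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(in, doc, 0);

        List<String> links = new ArrayList<String>();
        // The iterator may be null for tags it does not support.
        HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
        for( ; it != null && it.isValid(); it.next() )
        {
            AttributeSet attrs = it.getAttributes();
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            // Anchors that only carry a name attribute have no href.
            if( href != null )
            {
                links.add(href.toString());
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception
    {
        String html = "<html><body><a href=\"a.html\">A</a> text "
                    + "<a href=\"b.html\">B</a></body></html>";
        for( String link : extractLinks(new StringReader(html)) )
        {
            System.out.println(link);
        }
    }
}
```

To run it against a real page, pass new InputStreamReader(url.openStream())
instead of the StringReader. The same loop should work for other non-block
tags, for example HTML.Tag.IMG with HTML.Attribute.SRC.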