[CODE]
package ppowell;
/**
* For implementation of Retriever by superclass FlatFileRetriever
*/
import java.io.*;
import java.net.*;
import java.util.Vector; // FOR IMPLEMENTATION OF RETRIEVER
/**
* This class will retrieve all "Safe HTML" tags for display
*
* @version JDK 1.4
* @author Phil Powell
* @package PPOWELL
*/
public class HTMLRetriever extends FlatFileRetriever {
private String fileName = "";
/**
* Constructor
* Constructs superclass constructor and local construction
*
* @access public
* @param String fileName
* @see FlatFileRetriever
*/
public HTMLRetriever(String fileName) { // CONSTRUCTOR
super(fileName);
this.fileName = fileName;
}
/**
* Override getHTML() method in superclass FlatFileRetriever
*
* @access public
* @return String HTML
*/
public String getHTML() { // STRING METHOD
String html = "", stuff = "";
try {
URL url = new URL(this.fileName);
URLConnection conn = url.openConnection();
conn.setDoInput(true);
conn.setDoOutput(true);
conn.setUseCaches(false);
conn.setDefaultUseCaches(false);
conn.setRequestProperty("Content-type", "text/html");
BufferedReader fromURL = new BufferedReader(new
InputStreamReader(conn.getInputStream()));
while ((stuff = fromURL.readLine()) != null) html += stuff + "\n";
fromURL.close();
fromURL = null;
conn = null;
} catch (Exception e) {
e.printStackTrace();
}
return html;
}
}
[/CODE]
This is a primitive HTML scraper class I wrote. It takes a URL as a
String and returns the raw HTML.
The problem is that what it returns is utterly inconsistent:
[CODE]
HTMLRetriever retriever = new HTMLRetriever("http://www.cnn.com");
out.println(retriever.getHTML()); // YOU "MIGHT" GET CNN, YOU MIGHT GET
// null, YOU MIGHT GET THE CACHED RESULT OF 5 MINUTES AGO!
retriever = new
HTMLRetriever("http://www.myjavaserver.com/~ppowell/blah.html");
out.println(retriever.getHTML()); // YOU GET LITERALLY ANYTHING EVEN IF
// blah.html IS CONSISTENTLY STATIC HTML CODING
[/CODE]
I can't figure out what is going on here, so please, someone, help!
This is utterly frustrating, because I'm having to do this from PHP.
Thanx
Phil
Paulus de Boska - 17 Jan 2006 08:38 GMT
getHTML is coded for the ideal situation, which is simply not always
the case. To avoid the null result, it could contain a loop like:
while ( !success && count < 10 )
where you increment the counter in the exception's catch block and wait
for a short random interval, so that it can try again a couple of
times.
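The retry loop described above might be sketched as follows. This is only an illustration, assuming Java 8 or later; the class name RetryingFetch, the method fetchWithRetry and its parameters are hypothetical names, not part of the posted code.

```java
import java.util.Random;
import java.util.concurrent.Callable;

// Hypothetical helper: retries a task a bounded number of times.
final class RetryingFetch {

    /**
     * Runs the task until it succeeds or maxAttempts is reached.
     * After each failed attempt (except the last) it sleeps for a
     * random interval of 1..maxDelayMillis milliseconds.
     */
    static String fetchWithRetry(Callable<String> task, int maxAttempts, int maxDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Random random = new Random();
        Exception lastFailure = null;
        String result = null;
        boolean success = false;
        int count = 0;

        while (!success && count < maxAttempts) {
            try {
                result = task.call();
                success = true;
            } catch (Exception e) {
                lastFailure = e;   // remember why this attempt failed
                count++;           // the counter is incremented in the catch block
                if (count < maxAttempts && maxDelayMillis > 0) {
                    Thread.sleep(random.nextInt(maxDelayMillis) + 1);
                }
            }
        }
        if (!success) {
            throw lastFailure;     // every attempt failed: report the last error
        }
        return result;
    }
}
```

Note that getHTML() as posted catches every exception and returns whatever it has read so far, so it would have to let the exception propagate (or signal failure in some other way) before a loop like this could notice that an attempt failed.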
You should also simplify your code, since you don't want to be POSTing,
just doing a simple GET. You can do that by avoiding URLConnection and
getting the input stream from the URL itself (url.openStream()). You
won't have to call setDoInput/setDoOutput then.
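A minimal sketch of that simpler GET, reading straight from the URL's own stream. The class name SimpleGet is hypothetical, the code assumes Java 7 or later, and it assumes the page is UTF-8 encoded, which the original post does not state.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: plain GET using the stream of the URL itself.
final class SimpleGet {

    /** Reads the whole resource at the given address, line by line. */
    static String getHTML(String address) throws IOException {
        URL url = URI.create(address).toURL();
        StringBuilder html = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) { // UTF-8 is an assumption
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }
}
```

Letting the IOException propagate, rather than printing the stack trace and returning a partial string, lets the caller tell a failed fetch apart from an empty page.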
---
Paul Hamaker, SEMM
http://javalessons.com