Java Forum / First Aid / January 2006

HTMLRetriever class method getHTML() returns null - confused

phillip.s.powell@gmail.com - 17 Jan 2006 07:31 GMT
[CODE]
package ppowell;

/**
* For implementation of Retriever by superclass FlatFileRetriever
*/
import java.io.*;
import java.net.*;
import java.util.Vector;    // FOR IMPLEMENTATION OF RETRIEVER

/**
* This class will retrieve all "Safe HTML" tags for display
*
* @version JDK 1.4
* @author Phil Powell
* @package PPOWELL
*/
public class HTMLRetriever extends FlatFileRetriever {

private String fileName = "";

/**
 * Constructor
 * Constructs superclass constructor and local construction
 *
 * @access public
 * @param String fileName
 * @param int displayColAmount
 * @see FlatFileRetriever
 */
public HTMLRetriever(String fileName) {    // CONSTRUCTOR
 super(fileName);
 this.fileName = fileName;
}

/**
 * Override getHTML() method in superclass FlatFileRetriever
 *
 * @access public
 * @return String HTML
 */
public String getHTML() {                // STRING METHOD
 String html = "", stuff = "";
 try {
  URL url = new URL(this.fileName);
  URLConnection conn = url.openConnection();
  conn.setDoInput(true);
  conn.setDoOutput(true);
  conn.setUseCaches(false);
  conn.setDefaultUseCaches(false);
  conn.setRequestProperty("Content-type", "text/html");
  BufferedReader fromURL = new BufferedReader(new InputStreamReader(conn.getInputStream()));
  while ((stuff = fromURL.readLine()) != null) html += stuff + "\n";
  fromURL.close();
  fromURL = null;
  conn = null;
 } catch (Exception e) {
  e.printStackTrace();
 }
 return html;
}

}
[/CODE]

This is a primitive HTML scraper class I wrote that receives a String
URL parameter and returns the raw HTML code back.

Problem is that the return is utterly inconsistent:

[CODE]
HTMLRetriever retriever = new HTMLRetriever("http://www.cnn.com");
// YOU "MIGHT" GET CNN, YOU MIGHT GET null, YOU MIGHT GET THE CACHED RESULT OF 5 MINUTES AGO!
out.println(retriever.getHTML());
retriever = new HTMLRetriever("http://www.myjavaserver.com/~ppowell/blah.html");
// YOU GET LITERALLY ANYTHING EVEN IF blah.html IS CONSISTENTLY STATIC HTML CODING
out.println(retriever.getHTML());
[/CODE]

I can't figure out why this is happening, so please, someone, help!
This is utterly frustrating because I'm having to do this from PHP.

Thanx
Phil

Paulus de Boska - 17 Jan 2006 08:38 GMT
getHTML is coded for the ideal situation, which is simply not always the
case. To avoid the null result, it could contain a loop like:
while (!success && count < 10)
where you increment the counter in the exception's catch block and wait
for a short random interval, so that it can retry a few times.
You should also simplify your code, since you don't want to be POSTing,
just doing a simple GET. You can do that by skipping URLConnection and
getting the InputStream from the URL itself (url.openStream()). Then you
won't have to call setDoInput/setDoOutput.
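
The two suggestions above could be combined roughly as follows. This is only a sketch, not the original poster's class: the names HtmlFetcher, readAll, fetchOnce and fetchWithRetry are made up for illustration, it assumes Java 7 or later (try-with-resources) rather than the JDK 1.4 mentioned in the post, and it decodes the response with the platform default charset, as the original code does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.util.Random;

// Hypothetical helper class, not part of the ppowell package from the question.
class HtmlFetcher {

    // Reads every line from the reader and joins them with "\n".
    static String readAll(Reader source) throws IOException {
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    // Plain GET: URL.openStream() is shorthand for
    // openConnection().getInputStream(), so no setDoInput/setDoOutput calls.
    static String fetchOnce(String address) throws IOException {
        URL url = new URL(address);
        return readAll(new InputStreamReader(url.openStream()));
    }

    // Retries on IOException, pausing a short random time between attempts.
    // Rethrows the last failure instead of silently returning an empty result.
    static String fetchWithRetry(String address, int maxAttempts, int maxPauseMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1 || maxPauseMillis < 1) {
            throw new IllegalArgumentException("maxAttempts and maxPauseMillis must be >= 1");
        }
        Random random = new Random();
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetchOnce(address);
            } catch (IOException e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    // Sleep between 1 and maxPauseMillis milliseconds before retrying.
                    Thread.sleep(1 + random.nextInt(maxPauseMillis));
                }
            }
        }
        throw lastFailure;
    }
}
```

A call might look like HtmlFetcher.fetchWithRetry("http://www.cnn.com", 10, 500). Letting the final exception propagate, rather than printing the stack trace and returning whatever was read so far, makes a failed fetch distinguishable from an empty page.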

---
Paul Hamaker, SEMM
http://javalessons.com
