
Java Forum / First Aid / January 2006


HTML "scrape" causes loss of query string in URL

phillip.s.powell@gmail.com - 18 Jan 2006 07:43 GMT
URL:
http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2Fm&nickname=Phil

This works just fine if called via browser, however, if using any data
scraper out there, including my own HTMLRetriever.getHTML() method, the
query string mysteriously disappears (and I am unable to determine
where, in the URL object or the BufferedReader object), causing
inaccurate data to be scraped back.

Following is my HTMLRetriever object code:

[CODE]
package ppowell;

/**
* For implementation of Retriever by superclass FlatFileRetriever
*/
import java.io.*;
import java.net.*;
import java.util.Vector;    // FOR IMPLEMENTATION OF RETRIEVER

/**
* This class will retrieve all "Safe HTML" tags for display
*
* @version JDK 1.4
* @author Phil Powell
* @package PPOWELL
*/
public class HTMLRetriever extends FlatFileRetriever {

private String fileName = "";

/**
 * Constructor
 * Constructs superclass constructor and local construction
 *
 * @access public
 * @param String fileName
 * @param int displayColAmount
 * @see FlatFileRetriever
 */
public HTMLRetriever(String fileName) {    // CONSTRUCTOR
 super(fileName);
 this.fileName = fileName;
}

/**
 * Override getHTML() method in superclass FlatFileRetriever
 *
 * @access public
 * @return String HTML
 */
public String getHTML() {                // STRING METHOD
 String html = "", stuff = "";
 try {
  URL url = new URL(this.fileName);
  BufferedReader fromURL = new BufferedReader(new
InputStreamReader(url.openStream()));
  while ((stuff = fromURL.readLine()) != null) html += stuff + "\n";
  fromURL.close();
 } catch (Exception e) {
  e.printStackTrace();
 }
 return html;
}

}
[/CODE]

At least the output is no longer random (Thanx), now it's constant..
constantly being java.lang.NullPointerException (this results only when
the query string is removed from the URL)

Thanx
Phil
James Westby - 18 Jan 2006 12:10 GMT
> URL:
> http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2Fm&nickname=Phil
[quoted text clipped - 6 lines]
>
> Following is my HTMLRetriever object code:

[snip]
> At least the output is no longer random (Thanx), now it's constant..
> constantly being java.lang.NullPointerException (this results only when
> the query string is removed from the URL)
>
> Thanx
> Phil

I have an inkling that what you are trying to do is a little complex for
the Java classes you are trying to use. I may be wrong, but when I
worked on something very similar in the past we were using the Jakarta
Commons HttpClient library. I suggest you check it out if you really want
to be able to scrape pages with any request.

http://jakarta.apache.org/commons/httpclient/

Can you tell me what you are trying to scrape and why?

James
phillip.s.powell@gmail.com - 18 Jan 2006 15:02 GMT
All I'm trying to do is to produce the HTML contents of going to the
following URL:

ChatGlobals.SERVLET_SELF + "/ppowell.ChatServlet?message=" +
URLEncoder.encode("/m", "UTF-8") + "&nickname=" +
URLEncoder.encode(cookie, "UTF-8")

Which can translate to the URL I first mentioned.  What happens is that
the URL resolves within HTMLRetriever except for the query string,
which never does.
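
A minimal sketch of one way to check whether the query string survives inside the URL object, before any network call is made. The class name ChatUrlCheck and the value used for SERVLET_SELF are assumptions (the value is inferred from the URL in the first post); URLEncoder.encode() and URL.getQuery() are standard java.net calls.

```java
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLEncoder;

/** Builds the chat URL described above and reports the query java.net.URL sees. */
final class ChatUrlCheck {

    // Assumed value of ChatGlobals.SERVLET_SELF, inferred from the URL in the first post.
    static final String SERVLET_SELF = "http://www.myjavaserver.com/servlet";

    /** Assemble the servlet URL with both parameters URL-encoded as UTF-8. */
    static String buildChatUrl(String message, String nickname) {
        try {
            return SERVLET_SELF + "/ppowell.ChatServlet?message="
                    + URLEncoder.encode(message, "UTF-8")
                    + "&nickname=" + URLEncoder.encode(nickname, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Parse the string and return the query part as the URL object holds it. */
    static String queryOf(String spec) {
        try {
            return new URL(spec).getQuery();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String spec = buildChatUrl("/m", "Phil");
        System.out.println("spec  = " + spec);
        System.out.println("query = " + queryOf(spec));
    }
}
```

If the query line prints message=%2Fm&nickname=Phil, the query string is intact in the URL object and the loss, if any, happens further along.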

I'm sorry but that link is too complex for me to understand how to use
HttpClient, download, install, anything, and furthermore, I'm on a
remote hosting service.  Could you explain it to me, sorry I simply
have no idea.

Phil

> > URL:
> > http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2Fm&nickname=Phil
[quoted text clipped - 26 lines]
>
> James
James Westby - 18 Jan 2006 15:15 GMT
> All I'm trying to do is to produce the HTML contents of going to the
> following URL:
[quoted text clipped - 13 lines]
>
> Phil
[snip]

So you're trying to get the contents of a page on the same server and
display it? What is the page written in? What are you trying to embed it
in to? What changes do you want to make to it when you include it?

The HttpClient provides a set of classes you can use to get the
functionality of a web browser, so you can basically (not like this but
it gives the idea)

HttpClient client = new Client();

PageRequest request = new PageRequest(ChatGlobals.SERVLET_SELF +
"/ppowell.ChatServlet");

request.setParameter("message", URLEncoder.encode("/m", "UTF-8"));

request.setParameter("nickname", URLEncoder.encode(cookie, "UTF-8"));

WebPage page = client.getPage(request);

...do stuff with page, including writing it out in to the current page
you are displaying.

This allows you to handle things like query strings, redirects etc. that
may be encountered along the way, but that cannot be handled by URL.

Can I ask what ChatGlobals is? Is it trying to emulate some of the stuff
you are missing from PHP?

What are you using to write your Java code?

James
phillip.s.powell@gmail.com - 18 Jan 2006 16:26 GMT
Please see below, thanx

> > All I'm trying to do is to produce the HTML contents of going to the
> > following URL:
[quoted text clipped - 18 lines]
> display it? What is the page written in? What are you trying to embed it
> in to? What changes do you want to make to it when you include it?

Ok let me give you the background.  The original chatroom was written
in PHP with several PHP scripts interacting with one Java servlet,
which interacted with several Java classes.  That was because I'm a PHP
guy, not a Java guy.  I couldn't figure out how to write the front-end
and middle-end scripts in any other language other than my "native"
language of PHP.

The remote hosting service provider dropped support of PHP and went
100% J2EE on me, so now I have to translate all of the PHP scripts into
JSP.  Rewriting them to alter the architecture into a J2EE format is
beyond unrealistic given my limited availability, not to mention
ability, so the fastest approach was for me to simply translate from
PHP to JSP, figuring that while not very "Java guru cool" to do so,
hey, it works and that's the bottom line..

So for now I will have JSP scripts that will interact with one Java
servlet via HTTP.  The JSP pages will call up the servlet's URL while
sending $_GET variables into the query string.  Once this is done the
HTML contents of the resulting call to the Servlet via HTTP will be
displayed within an existing HTML frame.   Imagine if you're in
www.blah.com and you have a window pop up showing the contents of
www.foo.com?message=/m&nickname=Phil that is what I want to do.

> The HttpClient provides a set of classes you can use to get the
> functionality of a web browser, so you can basically (not like this but
[quoted text clipped - 19 lines]
> Can i ask what ChatGlobals is? Is it trying to emulate some of the stuff
> you are missing from PHP?

Yes.  Considering that ChatGlobals is the only way in Java I could
write what we do quite often in PHP: create global variables.
chat_global_vars.php was a global library script that contained
variables that would be used by every other PHP script.  ChatGlobals is
Java's equivalent to that, being a class containing nothing but public
static final String properties.
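
For readers who have not seen the pattern, a rough sketch of what such a constants class can look like. The name ChatGlobalsSketch and both values are assumptions: SERVLET_SELF is inferred from the servlet URL quoted in this thread, and MESSAGES_FILE is a hypothetical second constant named after the messages.txt file mentioned elsewhere in the thread.

```java
/**
 * Sketch of a constants holder in the style of ChatGlobals.
 * It contains nothing but public static final String fields.
 */
final class ChatGlobalsSketch {

    // Assumed value, inferred from the servlet URL quoted in the thread.
    public static final String SERVLET_SELF = "http://www.myjavaserver.com/servlet";

    // Hypothetical second constant, for illustration only.
    public static final String MESSAGES_FILE = "messages.txt";

    // Private constructor: the class is a namespace and is never instantiated.
    private ChatGlobalsSketch() { }
}
```

Any other class can then read ChatGlobalsSketch.SERVLET_SELF without creating an object, which is the closest plain-Java equivalent of a shared PHP include of global variables.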

> What are you using to write your Java code?

Windows Notepad.  Java apps compiled via "javac" on W2K.  JSP and
servlets compiled at remote hosting service upon FTP'ing them there.

Phil

> James
James Westby - 18 Jan 2006 16:59 GMT
> Please see below, thanx

>>[snip]
>>
[quoted text clipped - 16 lines]
> PHP to JSP, figuring that while not very "Java guru cool" to do so,
> hey, it works and that's the bottom line..

Ok, I can understand your reasoning for doing this, and it does make
sense to me. But surely you understand that it isn't an automatic
process to go from PHP to JSP. For one, not every statement you write in
PHP has a directly corresponding JSP statement. There
will probably be a way to do it, but it may be more convoluted, or
require you to think in a slightly different manner, or look in a
slightly different place. The other point is that JSP isn't a standalone
technology, so something you are trying to translate to JSP may be
possible, but people may normally approach it in a
different manner, using one of the other technologies that are
available. This may mean that no one knows how to achieve what you are
trying to do, or that they may try to give you an answer that pushes you
towards a slightly different solution. Obviously you don't have to take
their advice, but you're on your own if you continue to do something that
people aren't going to help you with.

Reading that back it sounds like I'm having a go at you for not doing
what I say; please understand that I'm not doing that, it's more like a
convoluted way of saying "I don't know how to solve your problem."

> So for now I will have JSP scripts that will interact with one Java
> servlet via HTTP.  The JSP pages will call up the servlet's URL while
[quoted text clipped - 3 lines]
> www.blah.com and you have a window pop up showing the contents of
> www.foo.com?message=/m&nickname=Phil that is what I want to do.

Where did the Java servlets come from? Do you have access to their
source? Is it possible for you to interact with them in a different manner?

Does the pop-up window purely show the contents of
www.foo.com?message=/m&nickname=Phil? Or does it have some sort of
addition, for instance an image above it, or all the text turned to green?

You haven't explained why you are downloading the page and then showing
it, rather than just redirecting the browser.

For instance, when I was working on a project that downloaded a webpage
as part of a servlet, it added a browser-like address bar at the top,
and parsed the HTML of the page and chucked out certain parts of it. The
displayed page was then very different to the original. If we just
wanted the address bar then we could have used some technologies that
allow embedding webpages into another (not all Java). If we didn't even
want that, well, then it would have been a bit of a pointless project,
but you get the idea.

>>The HttpClient provides a set of classes you can use to get the
>>functionality of a web browser, so you can basically (not like this but
[quoted text clipped - 16 lines]
>>This allows you to handle things like query strings, redirects etc. that
>>may be encountered along the way, but that cannot be handled by URL.

Did my explanation make any sense? What do you think of this approach?

Can I just clarify your problem?

You attempt to use your HTML retriever class to download web pages.

If you download

www.foo.com

you get the page fine, but if you try and download

www.foo.com?message=/m&nickname=Phil

You get the page, but without the changes that should occur if foo.com
received the ?message=/m&nickname=Phil query.

Is that all correct? Are there any more nuances to the problem?

>>Can i ask what ChatGlobals is? Is it trying to emulate some of the stuff
>>you are missing from PHP?
[quoted text clipped - 5 lines]
> Java's equivalent to that, being a class containing nothing but public
> static final String properties.

The reason I ask is that was what I guessed ChatGlobals was, and it is
normally considered bad practice for various reasons. However, it works,
 so stick with it if you're happy.

>>What are you using to write your Java code?
>
> WIndows Notepad.  Java apps compiled via "javac" on W2K.  JSP and
> servlets compiled at remote hosting service upon FTP'ing them there.

That's cool, just out of interest really. A good way to learn I think,
worked for me anyway. Does it not get annoying uploading your JSP pages
to the server to compile and test them though?

James
phillip.s.powell@gmail.com - 18 Jan 2006 17:43 GMT
Once again, please see below, thanx!

> > Please see below, thanx
>
[quoted text clipped - 38 lines]
> what i say, please understand that I'm not doing that, it's more like a
> convoluted way of saying "I don't know how to solve your problem."

That makes 2 of us :(

> > So for now I will have JSP scripts that will interact with one Java
> > servlet via HTTP.  The JSP pages will call up the servlet's URL while
[quoted text clipped - 6 lines]
> Where did the Java servlets come from? Do you have access to their
> source? Is it possible that you interact with them in a different manner.

I have access to their source, but I am not sure what you mean by that.
I interact with the servlet within the URL as that's the only way I
know how to interact with a servlet.

> Does the pop-up window purely show the contents of
> www.foo.com?message=/m&nickname=Phil? Or does it have some sort of
> addition, for instance an image above it, or all the text turned to green?

The popup window is a frame.  One of the framesets purely shows the
contents of www.foo.com?message=/m&nickname=Phil.  No other nuances
involved.  Just spits back raw what it finds (with some style sheets
added in later to make it look pretty from the client end)

> You haven't explained why you are downloading the page and then showing
> it, rather than just redirecting the browser.

Can't redirect the browser.  The popup window is room.jsp, which is an
HTML frame with 3 framesets, each frameset showing a different URL.
Redirection then would be impossible.

> For instance when I was working on a project that downloaded a webpage
> as part of a servlet, it added a browser-like address bar at the top,
[quoted text clipped - 27 lines]
>
> Did my explanation make any sense? What do you think of this approach?

It would make sense were I able to use HttpClient.  I'm sorry but I
have no idea how to use it, period.  I have no clue as to how to
download it and put it on the remote hosting service for me to be able
to use it.  My remote hosting service only allows me to put individual
Java classes, servlets and JSP scripts, and beans.  That's it.  No JAR,
no WAR, no whatever else.

> Can I just clarify your problem?
>
[quoted text clipped - 12 lines]
>
> Is that all correct? Are there any more nuances to the problem?

Nope that's dead on.

> >>Can i ask what ChatGlobals is? Is it trying to emulate some of the stuff
> >>you are missing from PHP?
[quoted text clipped - 18 lines]
> worked for me anyway. Does it not get annoying uploading your JSP pages
> to the server to compile and test them though?

With one single 128mb memory card in an old machine with Win2K that
doesn't allow you to use Eclipse or other Java-related services, you
get used to it.

Phil

> James
James Westby - 18 Jan 2006 18:08 GMT
>>Where did the Java servlets come from? Do you have access to their
>>source? Is it possible that you interact with them in a different manner.
>
> I have access to their source, but I am not sure what you mean by that.
>  I interact with the servlet within the URL as that's the only way I
> know how to interact with a servlet.

Well, a servlet is just a normal Java class that implements a certain
interface, and so provides methods that are called by the server when
the user accesses the servlet. Unless the implementation was incredibly
poor, however, there would be other classes that actually did all the
work. Is it a possibility that you could access these more directly?
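
A rough sketch of the idea, assuming the servlet delegates to an ordinary class. MessageBoard and DirectCallDemo are hypothetical names invented for illustration; they are not the classes behind ChatServlet, whose source is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a class that does a servlet's real work. */
final class MessageBoard {
    private final List<String> lines = new ArrayList<String>();

    /** Record one chat line as "nickname: text". */
    void add(String nickname, String text) {
        lines.add(nickname + ": " + text);
    }

    /** Render all lines as a simple HTML fragment (no escaping; sketch only). */
    String toHtml() {
        StringBuilder html = new StringBuilder();
        for (String line : lines) {
            html.append("<p>").append(line).append("</p>\n");
        }
        return html.toString();
    }
}

/** A caller, such as a JSP page, using the class with no HTTP round trip. */
final class DirectCallDemo {
    public static void main(String[] args) {
        MessageBoard board = new MessageBoard();
        board.add("Phil", "hello");
        System.out.print(board.toHtml());
    }
}
```

The point of the design is that the servlet and any JSP page can share the same worker object, so nothing has to be scraped over HTTP.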

>>Does the pop-up window purely show the contents of
>>www.foo.com?message=/m&nickname=Phil? Or does it have some sort of
[quoted text clipped - 4 lines]
> involved.  Just spits back raw what it finds (with some style sheets
> added in later to make it look pretty from the client end)

I'm no web programmer, but is that what an IFrame does? (Maybe without
the style sheets).

>>You haven't explained why you are downloading the page and then showing
>>it, rather than just redirecting the browser.
>
> Can't redirect the browser.  The popup window is room.jsp, which is an
> HTML frame with 3 framesets, each frameset showing a different URL.
> Redirection then would be impossible.

That's fair enough, just making sure you had ruled out all of the easy
solutions.
[snip]
>>Did my explanation make any sense? What do you think of this approach?
>
[quoted text clipped - 4 lines]
> Java classes, servlets and JSP scripts, and beans.  That's it.  No JAR,
> no WAR, no whatever else.

There is nothing particularly special about a JAR, so you could upload
it and use it yourself (I think). Technologically I mean, if it is their
policy then that makes things a bit more tricky. I'm not sure there's
anything stopping you unpacking the jar (it's just a fancy zip file
after all) and uploading the classes if necessary.
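
To illustrate the "fancy zip file" point, a small sketch that lists the .class entries of a JAR using only java.util.zip. JarLister is a made-up name, and whether the hosting service accepts the unpacked classes is a separate question this does not answer.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class JarLister {

    /** Return the names of all .class entries found in the given JAR. */
    static List<String> classEntries(File jar) {
        List<String> names = new ArrayList<String>();
        // A JAR uses the zip format, so ZipFile can open it directly.
        try (ZipFile zip = new ZipFile(jar)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory() && entry.getName().endsWith(".class")) {
                    names.add(entry.getName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java JarLister <file.jar>");
            return;
        }
        for (String name : classEntries(new File(args[0]))) {
            System.out.println(name);
        }
    }
}
```

Each listed name is a path inside the archive; unpacking the JAR with any zip tool produces the same files on disk.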

[snip]

> Nope that's dead on.

Looking back at your code again you are calling the URL(String)
constructor. Looking at the source there appear to be 2 constructors,
one that that form uses, and another which is used by

URL(String protocol, String host, int port, String file)

Could you try using that one instead, as the source looks a little
clearer about how it handles query strings.
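
A small sketch for comparing the two constructors without touching the network. UrlConstructorCheck is a made-up name; the calls themselves (getProtocol, getHost, getPort, getFile, getQuery) are standard java.net.URL methods. It relies on two documented behaviours: getFile() returns the path plus the query, and a port of -1 means "use the protocol's default port".

```java
import java.net.MalformedURLException;
import java.net.URL;

final class UrlConstructorCheck {

    /** Parse with the single-String constructor, rethrowing unchecked. */
    static URL parse(String spec) {
        try {
            return new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Rebuild a URL from its parts using the four-argument constructor. */
    static URL rebuild(URL original) {
        try {
            // getFile() already contains "?query" when a query is present;
            // getPort() is -1 when the URL string names no port.
            return new URL(original.getProtocol(), original.getHost(),
                    original.getPort(), original.getFile());
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URL a = parse("http://www.myjavaserver.com/servlet/"
                + "ppowell.ChatServlet?message=%2Fm&nickname=Phil");
        URL b = rebuild(a);
        System.out.println("URL(String)     query = " + a.getQuery());
        System.out.println("URL(p, h, n, f) query = " + b.getQuery());
    }
}
```

If both lines show the same query, the choice of constructor is unlikely to be where the query string goes missing.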

Have you experimented with using different query strings? The one you
are using looks pretty benign, but it's worth a try. I spent a whole day
tracking down a similar bug, where one page would lose the query string
every fifth time or so, but due to several other factors the bug would
end up appearing like it was happening somewhere else. This isn't the
same problem though, unfortunately.

[snip]

> With one single 128mb memory card in an old machine with Win2K that
> doesn't allow you to use Eclipse or other Java-related services, you
> get used to it.

I guess you'd have to. I forget that not everyone has the luxury of
plenty of computing power to just try things out.
phillip.s.powell@gmail.com - 18 Jan 2006 22:10 GMT
[snip]

> Looking back at your code again you are calling the URL(String)
> constructor. Looking at the source there appear to be 2 constructors,
[quoted text clipped - 4 lines]
> Could you try using that one instead, as the source looks a little
> clearer about how it handles query strings.

That looks just like a TCL proc I've seen once, where the query string
is literally tacked onto the file name and handled.  I'll let you know.

Phil

> Have you experimented with using different query strings, the one you
> are using looks pretty benign, but it's worth a try. I spent a whole day
[quoted text clipped - 11 lines]
> I guess you'd have to. I forget that not everyone has the luxury of
> plenty of computing power to just try things out.
phillip.s.powell@gmail.com - 19 Jan 2006 08:29 GMT
Hey, I tried out a few things to HTMLRetriever, and the results were no
different: NULL.

I am at a complete loss because the URL comes across just fine in your
browser:

1) Type
http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2fa+James&nickname=James

(this will add your nickname to messages.txt which will be read by the
2nd URL - and I've already verified that this txt file exists and is
populated)

2) Type
http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2fm&nickname=James

 a) Use your browsers (IE and Firefox) and you'll see it comes up just
fine
 b) Then try using HttpClient, HttpUnit or whatever Java-based data
scraper you have (even try HTMLRetriever if you want) and it will
ALWAYS return null!

I cannot fathom how on earth a URL could return null when scraped but
return with content when called via browser! Why in the world would
that happen?

Phil

[CODE]
package ppowell;

/**
* For implementation of Retriever by superclass FlatFileRetriever
*/
import java.io.*;
import java.net.*;
import java.util.Vector;        // FOR IMPLEMENTATION OF RETRIEVER
import java.util.regex.*;

/**
* This class will retrieve all "Safe HTML" tags for display
*
* @version JDK 1.4
* @author Phil Powell
* @package PPOWELL
*/
public class HTMLRetriever extends FlatFileRetriever {

private String fileName = "";
private static final String qsPatternStr =
"\\?([a-zA-Z0-9\\-_\\.]+=[a-zA-Z0-9\\-_\\.%\\+,\\^~]*&?)+#?.*$";

/**
 * Constructor
 * Constructs superclass constructor and local construction
 *
 * @access public
 * @param String fileName
 * @param int displayColAmount
 * @see FlatFileRetriever
 */
public HTMLRetriever(String fileName) {        // CONSTRUCTOR
 super(fileName);
 this.fileName = fileName;
}

//------------- --* GETTER/SETTER METHODS *-- -----------------
/**
 * Get file including query string and ref
 *
 * @access private
 * @return String file
 */
private String getFile() {                // PRIVATE STRING METHOD
 String file = "";
 try {
  URL url = new URL(this.fileName);
  file = url.getFile();
  if (Pattern.matches(HTMLRetriever.qsPatternStr, this.fileName) &&
!Pattern.matches(HTMLRetriever.qsPatternStr, file))
   file += url.getQuery();
 } catch (Exception e) {} // DO NOTHING
 return file;
}

/**
 * Get host
 *
 * @access private
 * @return String host
 */
private String getHost() {                // PRIVATE STRING METHOD
 try {
  URL url = new URL(this.fileName);
  return url.getHost();
 } catch (Exception e) {
  return "";
 }
}

/**
 * Override getHTML() method in superclass FlatFileRetriever
 *
 * @access public
 * @return String HTML
 */
public String getHTML() {                // STRING METHOD
 String html = "", stuff = "";
 try {
  URL url = this.generateURL();
  HttpURLConnection conn = (HttpURLConnection)url.openConnection();
  conn.setDoInput(true);
  conn.setDoOutput(true);
  conn.setUseCaches(false);
  conn.setDefaultUseCaches(false);
  conn.setRequestProperty("content-type", "text/html");
  BufferedReader fromURL = new BufferedReader(new
InputStreamReader(conn.getInputStream()));
  while ((stuff = fromURL.readLine()) != null) html += stuff + "\n";
  fromURL.close();
  conn.disconnect();
 } catch (Exception e) {
  e.printStackTrace();
 }
 return html;
}

/**
 * Get port
 *
 * @access private
 * @return int port
 */
private int getPort() {                // PRIVATE INT METHOD
 try {
  URL url = new URL(this.fileName);
  return url.getPort();
 } catch (Exception e) {
  return 0;
 }
}

/**
 * Get protocol
 *
 * @access private
 * @return String protocol
 */
private String getProtocol() {                // PRIVATE STRING METHOD
 try {
  URL url = new URL(this.fileName);
  return url.getProtocol();
 } catch (Exception e) {
  return "";
 }
}

//------------- --* END OF GETTER/SETTER METHODS *-- -----------------

/**
 * Generate URL object
 *
 * @access private
 * @return URL url
 */
private URL generateURL() {                // PRIVATE URL METHOD
 try {
  URL url = null;
  if (Pattern.matches(HTMLRetriever.qsPatternStr, this.fileName)) {

   url = new URL(this.getProtocol(), this.getHost(), this.getPort(),
this.getFile());
  } else {
   url = new URL(this.fileName);
  }
  return url;
 } catch (Exception e) {
  return null;
 }
}

}

[/CODE]
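
One detail in the class above that may be worth checking, possibly unrelated to the NullPointerException: Pattern.matches() succeeds only when the whole input matches, so a pattern that starts with \? would not match a full URL string, and generateURL() would then always take the else branch. A hedged sketch of the difference, reusing the same pattern; QueryPatternCheck and its method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class QueryPatternCheck {

    // The pattern string from the class above, unchanged.
    static final String QS_PATTERN =
            "\\?([a-zA-Z0-9\\-_\\.]+=[a-zA-Z0-9\\-_\\.%\\+,\\^~]*&?)+#?.*$";

    /** True only if the entire input matches, which is what Pattern.matches tests. */
    static boolean wholeStringMatches(String s) {
        return Pattern.matches(QS_PATTERN, s);
    }

    /** True if the pattern occurs anywhere in the input. */
    static boolean containsQueryString(String s) {
        return Pattern.compile(QS_PATTERN).matcher(s).find();
    }

    public static void main(String[] args) {
        String url = "http://www.myjavaserver.com/servlet/"
                + "ppowell.ChatServlet?message=%2Fm&nickname=Phil";
        System.out.println("matches: " + wholeStringMatches(url));
        System.out.println("find:    " + containsQueryString(url));
    }
}
```

Using Matcher.find() is one way to test for "contains a query string"; calling getQuery() on the URL object and checking for null is another.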

> [snip]
> >
[quoted text clipped - 27 lines]
> > I guess you'd have to. I forget that not everyone has the luxury of
> > plenty of computing power to just try things out.
James Westby - 19 Jan 2006 10:32 GMT
> Hey, I tried out a few things to HTMLRetriever, and the results were no
> different: NULL.
[quoted text clipped - 22 lines]
>
> Phil

[snip]

Ok, I did that. The first page showed nothing whatsoever. The second
URL gave

500 Servlet Exception

java.lang.NullPointerException
    at
ppowell.MessageProcessor.retrieveMessagesFromFile(MessageProcessor.java:283)
    at ppowell.MessageProcessor.process(MessageProcessor.java:257)
    at ppowell.ChatServlet.doPost(ChatServlet.java:51)
    at ppowell.ChatServlet.doGet(ChatServlet.java:37)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:740)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
    at
com.caucho.server.http.FilterChainServlet.doFilter(FilterChainServlet.java:95)
    at com.caucho.server.http.Invocation.service(Invocation.java:291)
    at
com.caucho.server.http.RunnerRequest.handleRequest(RunnerRequest.java:339)
    at
com.caucho.server.http.RunnerRequest.handleConnection(RunnerRequest.java:268)
    at com.caucho.server.TcpConnection.run(TcpConnection.java:136)
    at java.lang.Thread.run(Thread.java:602)

Resin 2.1.0 (built Tue Mar 26 14:12:50 PST 2002)

Is that what you mean by null? If not what do you mean?

James
phillip.s.powell@gmail.com - 19 Jan 2006 14:04 GMT
Did you enter in the first URL and see nothing? If so, good, you're not
supposed to.

But after that, did you enter the second URL and get a
NullPointerException? That is strange.

I am looking at the possibility now of doing an architectural change at
this point.

Phil

> > Hey, I tried out a few things to HTMLRetriever, and the results were no
> > different: NULL.
[quoted text clipped - 53 lines]
>
> James
phillip.s.powell@gmail.com - 20 Jan 2006 08:21 GMT
I don't know what to do now.  Every modification I could think of to
add to HTMLRetriever has failed.

See for yourself:

[CODE]
package ppowell;

/**
* For implementation of Retriever by superclass FlatFileRetriever
*/
import java.io.*;
import java.net.*;
import java.util.Vector;        // FOR IMPLEMENTATION OF RETRIEVER
import java.util.regex.*;

/**
* This class will retrieve all "Safe HTML" tags for display
*
* @version JDK 1.4
* @author Phil Powell
* @package PPOWELL
*/
public class HTMLRetriever extends FlatFileRetriever {

private String fileName = "";
private static final String qsPatternStr =
"\\?([a-zA-Z0-9\\-_\\.]+=[a-zA-Z0-9\\-_\\.%\\+,\\^~]*&?)+#?.*$";

/**
 * Constructor
 * Constructs superclass constructor and local construction
 *
 * @access public
 * @param String fileName
 * @param int displayColAmount
 * @see FlatFileRetriever
 */
public HTMLRetriever(String fileName) {        // CONSTRUCTOR
 super(fileName);
 this.fileName = fileName;
}

//------------- --* GETTER/SETTER METHODS *-- -----------------
/**
 * Get file including query string and ref
 *
 * @access private
 * @return String file
 */
private String getFile() {                // PRIVATE STRING METHOD
 String file = "";
 try {
  URL url = new URL(this.fileName);
  file = url.getFile();
  if (Pattern.matches(HTMLRetriever.qsPatternStr, this.fileName) &&
!Pattern.matches(HTMLRetriever.qsPatternStr, file))
   file += url.getQuery();
 } catch (Exception e) {} // DO NOTHING
 return file;
}

/**
 * Get host
 *
 * @access private
 * @return String host
 */
private String getHost() {                // PRIVATE STRING METHOD
 try {
  URL url = new URL(this.fileName);
  return url.getHost();
 } catch (Exception e) {
  return "";
 }
}

/**
 * Override getHTML() method in superclass FlatFileRetriever
 *
 * @access public
 * @return String HTML
 */
public String getHTML() {                // STRING METHOD
 String html = "", stuff = "";
 try {
  URL url = this.generateURL();
  HttpURLConnection conn = (HttpURLConnection)url.openConnection();
  conn.setDoInput(true);
  conn.setDoOutput(true);
  conn.setUseCaches(false);
  conn.setDefaultUseCaches(false);
  conn.setRequestProperty("content-type", "text/html");
  conn.setRequestMethod("GET");
  conn.setRequestProperty("Connection", "close");
  conn.connect();
  BufferedReader fromURL = new BufferedReader(new
InputStreamReader(conn.getInputStream()));
  while ((stuff = fromURL.readLine()) != null) html += stuff + "\n";
  fromURL.close();
  conn.disconnect();
 } catch (Exception e) {
  e.printStackTrace();
 }
 return html;
}

/**
 * Get port
 *
 * @access private
 * @return int port
 */
private int getPort() {                // PRIVATE INT METHOD
 try {
  URL url = new URL(this.fileName);
  return url.getPort();
 } catch (Exception e) {
  return 0;
 }
}

/**
 * Get protocol
 *
 * @access private
 * @return String protocol
 */
private String getProtocol() {                // PRIVATE STRING METHOD
 try {
  URL url = new URL(this.fileName);
  return url.getProtocol();
 } catch (Exception e) {
  return "";
 }
}

//------------- --* END OF GETTER/SETTER METHODS *-- -----------------

/**
 * Generate URL object
 *
 * @access private
 * @return URL url
 */
private URL generateURL() {                // PRIVATE URL METHOD
 try {
  URL url = null;
  if (Pattern.matches(HTMLRetriever.qsPatternStr, this.fileName)) {

   url = new URL(this.getProtocol(), this.getHost(), this.getPort(),
this.getFile());
  } else {
   url = new URL(this.fileName);
  }
  return url;
 } catch (Exception e) {
  return null;
 }
}

}

[/CODE]

Even when I wrote a simple PHP script to do an fopen() command
remotely, the results were the same: NULL.  When I called the servlet
URL from the browser, every time I get content.

Phil
> Did you enter in the first URL and see nothing? If so, good, you're not
> supposed to.
[quoted text clipped - 64 lines]
> >
> > James
James Westby - 20 Jan 2006 10:12 GMT
> I don't know what to do now.  Every modification I could think of to
> add to HTMLRetriever has failed.
>
> See for yourself:
>
> [CODE]
[snip]
> [/CODE]
>
[quoted text clipped - 3 lines]
>
> Phil
[snip]

You haven't explained exactly what you mean by null. If you mean the
special "null" value in Java, then that should never be returned by the
getHTML() method shown above, so I would suspect a problem in the calling code.

If you mean a NullPointerException then you need to give the error
message along with the code that throws the exception.

If you just mean no content, then I suggest you say so, as null means
something else in Java, and so merely becomes confusing. If this is the
case then you have a problem with the method just given, either with
your coding, or with the use of the API.

Which of these cases is it?
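
One way to tell "no content" apart from a server-side error on the calling side is to look at the HTTP status code instead of only reading the input stream. With HttpURLConnection, getInputStream() throws an IOException for 4xx/5xx responses, and a catch block that only prints the stack trace leaves the caller holding an empty string. A minimal sketch, assuming plain HTTP; FetchProbe and its method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

final class FetchProbe {

    /** Label a fetch outcome: server/client error, empty body, or usable content. */
    static String classify(int status, String body) {
        if (status >= 400) return "server error " + status;
        if (body == null || body.trim().isEmpty()) return "no content";
        return "ok";
    }

    /** Read a stream fully as text; a null stream yields an empty string. */
    static String readAll(InputStream in) throws IOException {
        if (in == null) return "";
        StringBuilder text = new StringBuilder();
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        } finally {
            reader.close();
        }
        return text.toString();
    }

    /** Fetch a URL and report the status rather than swallowing the failure. */
    static String probe(String spec) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(spec).openConnection();
        try {
            int status = conn.getResponseCode();
            // For 4xx/5xx the body, if any, is on the error stream.
            InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            String body = readAll(in);
            return classify(status, body) + "\n" + body;
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling FetchProbe.probe(...) with the servlet URL would show whether the scraper is receiving a 500 page from the server or a genuinely empty 200 response.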

If the same thing is happening with PHP then I guess there is a more
serious problem. Can you fix it in PHP first? That might give you some
clues, and puts you back in your area of expertise. Didn't you have it
working in PHP before?

From what you have said, the browser will be accessing the servlet from
a different machine to the PHP and the JSP. Is this correct? Have you
therefore looked into, say, a firewall problem?

James
phillip.s.powell@gmail.com - 20 Jan 2006 16:17 GMT
See below

> > I don't know what to do now.  Every modification I could think of to
> > add to HTMLRetriever has failed.
[quoted text clipped - 18 lines]
> If you mean a NullPointerException then you need to give the error
> message along with the code that throws the exception.

I guess I mean a NullPointerException as that is constantly being
thrown.  I was unable to obtain any specific error message, however.
Whenever the page would pull up, nothing would appear, when called by
an outside script (e.g. my PHP script on another site), then I would
get the NullPointerException thrown and thus a more specific version of
"nothing".

> If you just mean no content, then I suggest you say so, as null means
> something else in Java, and so merely becomes confusing. If this is the
[quoted text clipped - 7 lines]
> clues, and puts you back in your area of expertise. Didn't you have it
> working in PHP before.

It was working on myjavaserver.com before in PHP, however,
myjavaserver.com dropped all PHP support, thus, my having to convert
everything from PHP to JSP, otherwise I wouldn't have bothered.  I do
not have to fix the PHP script as this script, along with HTMLRetriever
and anything else that calls <b>just that one URL</b> will produce a
NullPointerException every time.  I verified this by placing other URLs
into my PHP script, into HTMLRetriever, etc. and every other URL I
placed into these scrapers worked perfectly every time, except when the
specific URL came up
http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2fm&nickname=Phil

>  From what you have said the browser will be accessing the servlet from
> a different machine to the PHP and the JSP, is this correct? Have you
> therefore looked in to say a firewall problem?

No.  The browser will be accessing the JSP scripts which will access
the servlet, all on the same machine.  Sorry if I didn't make that more
clear.

Phil

> James
James Westby - 20 Jan 2006 17:22 GMT
> See below
>
[quoted text clipped - 30 lines]
> get the NullPointerException thrown and thus a more specific version of
> "nothing".

How's this for an error message?

500 Servlet Exception

java.lang.NullPointerException
    at
ppowell.MessageProcessor.retrieveMessagesFromFile(MessageProcessor.java:289)
    at ppowell.MessageProcessor.process(MessageProcessor.java:263)
    at ppowell.ChatServlet.doPost(ChatServlet.java:51)
    at ppowell.ChatServlet.doGet(ChatServlet.java:37)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:740)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
    at
com.caucho.server.http.FilterChainServlet.doFilter(FilterChainServlet.java:95)
    at com.caucho.server.http.Invocation.service(Invocation.java:291)
    at
com.caucho.server.http.RunnerRequest.handleRequest(RunnerRequest.java:339)
    at
com.caucho.server.http.RunnerRequest.handleConnection(RunnerRequest.java:268)
    at com.caucho.server.TcpConnection.run(TcpConnection.java:136)
    at java.lang.Thread.run(Thread.java:602)

This stack trace appears to be from the servlet, not the JSP page, is
that correct? You have said before that you think the query string is
being dropped when you access the page from the JSP. In the servlet
there will be a call to

public java.lang.String getParameter(java.lang.String name)

which returns

the value of a request parameter as a String, or null if the parameter
does not exist.

and so if the query string is being dropped, and the servlet doesn't
have null checking here (probably a bad move), then this could be the
cause of the NullPointerException. I would advise you first to check
this out, as it's pointless trying to deal with a problem if you
don't know what the cause is.
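
As a rough illustration of the null checking mentioned above, here is a minimal sketch. It assumes a plain Map standing in for the servlet request so that it runs without the servlet API; ParamCheck and paramOrDefault are made-up names and are not part of the code in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a plain Map stands in for the servlet request parameters.
// ParamCheck and paramOrDefault are hypothetical names for illustration.
class ParamCheck {

    // Mirrors the contract of getParameter(): the first value for the
    // name, or null when the parameter does not exist.
    static String getParameter(Map<String, String[]> params, String name) {
        String[] values = params.get(name);
        return (values == null || values.length == 0) ? null : values[0];
    }

    // Returns the fallback instead of letting a null reach later code.
    static String paramOrDefault(Map<String, String[]> params, String name, String fallback) {
        String value = getParameter(params, name);
        return value != null ? value : fallback;
    }

    public static void main(String[] args) {
        Map<String, String[]> withQuery = new HashMap<>();
        withQuery.put("message", new String[] {"/m"});
        withQuery.put("nickname", new String[] {"Phil"});

        // Simulates a request whose query string was dropped.
        Map<String, String[]> noQuery = new HashMap<>();

        System.out.println(paramOrDefault(withQuery, "message", "(missing)"));
        System.out.println(paramOrDefault(noQuery, "message", "(missing)"));
    }
}
```

With a check like this, a dropped query string shows up as a visible "(missing)" value rather than as a NullPointerException further down the call chain.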

A useful test might be to write a test JSP page that just prints out the
values of the parameters that you send it, something like (un-compiled
and un-tested)

    // Assumes the page imports java.util.Map and java.util.Iterator.
    Map m = request.getParameterMap();
    Iterator it = m.keySet().iterator();

    while (it.hasNext()) {
        String s = (String) it.next();
        out.print(s + " = ");
        // A parameter name can carry more than one value.
        String[] vals = (String[]) m.get(s);
        for (int i = 0; i < vals.length; i++) {
            out.print(vals[i] + " ");
        }
        out.println("");
    }

Then use your HTMLRetriever to get this page, passing it different query
strings to test the class out.

If it is the problem, then you need to sort out the JSP to not drop the
query string (as you know). As I have said before I have two
suggestions. These are not guaranteed to work, they might even make
things worse.

1) Look in my previous posts for a bit about using a different
constructor for URL. Did you try this? If you did I guess it didn't work.

2) I'm not even sure that the URL class can really handle what you are
doing, but I have very little experience with it. I would recommend the
HttpClient from Apache-Jakarta, but if you have issues with this then
that's your call. You might consider modifying your test page to just
build the URL that you are attempting to get and then print out the
different bits to the page, to see if they are as you expect.
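
Printing out the different bits of the URL could look something like the sketch below. UrlParts is a made-up helper name, and building the URL object this way only parses the address; no network request is made.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Sketch only: UrlParts is a hypothetical helper, not code from this thread.
class UrlParts {

    // Builds a java.net.URL and lists its components, one per line.
    // Nothing is fetched; the address is only parsed.
    static String describe(String address) {
        URL url;
        try {
            url = URI.create(address).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("not a usable URL: " + address, e);
        }
        return "protocol=" + url.getProtocol() + "\n"
             + "host=" + url.getHost() + "\n"
             + "path=" + url.getPath() + "\n"
             + "query=" + url.getQuery() + "\n"
             + "file=" + url.getFile();
    }

    public static void main(String[] args) {
        System.out.println(describe(
            "http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2Fm&nickname=Phil"));
    }
}
```

If the query line comes back as null or empty here, the query string is being lost before any connection is opened; if it is intact, the problem is more likely further along.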

>>If you just mean no content, then I suggest you say so, as null means
>>something else in Java, and so merely becomes confusing. If this is the
[quoted text clipped - 18 lines]
> specific URL came up
> http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2fm&nickname=Phil

So it is *just* that one URL? That would be very strange but it is
possible. Have you tested other URLs with query strings? With encoded
entities in them?
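
On the encoded entities: the URL appears in this thread with both %2F and %2f. The hex digits in a percent-escape are case-insensitive, so both forms should decode to the same value. A minimal sketch to check that, assuming Java 10 or later for the Charset overload of URLDecoder.decode; QueryDecode is a made-up name.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: QueryDecode is a hypothetical helper, not code from this thread.
class QueryDecode {

    // Splits a raw query string on '&' and '=' and decodes each piece.
    // Repeated names keep only the last value, which is enough for a test.
    static Map<String, String> parse(String query) {
        Map<String, String> result = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return result;
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue; // skip stray separators such as "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            result.put(URLDecoder.decode(name, StandardCharsets.UTF_8),
                       URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("message=%2Fm&nickname=Phil"));
        System.out.println(parse("message=%2fm&nickname=Phil"));
    }
}
```

Both calls in main should print the same map, which would suggest the case of the escape is not what makes this one URL behave differently.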

If you tell me that it is just that one page I will believe you, but I
wouldn't expect that behaviour; my first assumption would be that you
just haven't tested enough.

>> From what you have said the browser will be accessing the servlet from
>>a different machine to the PHP and the JSP, is this correct? Have you
[quoted text clipped - 3 lines]
> the servlet, all on the same machine.  Sorry if I didn't make that more
> clear.

So when you said PHP doing an fopen(), you meant on the JSP page, not
the servlet page? Does it work when you try the servlet page? I guess so,
as that is what was originally working when you could use PHP.

You need to be more explicit about what does and doesn't work, as at the
moment it's hard to know where you are actually experiencing problems,
let alone suggest possible fixes for them. It sounds like your problem
is purely in the JSP's accessing of the servlet, is that correct?

I have given you some suggestions of things to try, as I have done in
many posts now, and I will keep giving the same suggestions until you
tell me that they don't work, hopefully with some evidence that shows
what goes wrong, as that will be where the clues are, or a valid reason
that you cannot use my suggestions.

James
James Westby - 20 Jan 2006 17:49 GMT
[snip]

> How's this for an error message?
>
[quoted text clipped - 74 lines]
> build the URL that you are attempting to get then just printing out the
> different bits to the page, see if they are as you expect.

[snip]

I just used your HTML retriever code to get a page which shows the query
string and it returned the correct string, so maybe dropping the query
string isn't the real reason. So those two suggestions don't really
work. I think therefore the problem lies in the servlet, as that is what
is throwing the exception.

You were accessing the same servlet from PHP, using *exactly* the same
URL? You can access the servlet directly using the URL you are trying to
access from within the JSP page, and that works correctly? If so then it
is a problem in your JSP page, otherwise it is a bug in the servlet that
you have exposed by calling the different URL that you are using.

James
phillip.s.powell@gmail.com - 23 Jan 2006 08:44 GMT
Yeah it turns out there was a bug, but not in the servlet.  In fact, it
was a bizarre set of circumstances where it was found out that
HttpServletRequest request object NEVER existed!! I can't explain it
but for some strange reason "request" simply vanished without a trace
causing a NullPointerException when doing this:

// WHEN "request" is passed from JSP page as parameter into MessageProcessor class
this.message = this.request.getParameter("message");

In the outlying JSP page "room.jsp", you have HTML framesets that call
chat_messages.jsp, chat_nicknames.jsp and chat_submit_message.jsp.  In
"chat_messages.jsp", I never could use that ChatServlet servlet so I
gave up and bypassed it, calling MessageProcessor class object instance
directly from the JSP script chat_messages.jsp

However, in room.jsp, if you didn't include the query string
"?message=%2fm&nickname=Phil" in the URL of the frame source that
loads chat_messages.jsp, then chat_messages.jsp has no
"request" and thus neither does MessageProcessor, and there's
your NullPointerException.
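
One way to tell a missing request object apart from a missing parameter is to fail early with a message that names the cause. The following is only a sketch under assumptions: ParamSource is a made-up stand-in for HttpServletRequest so the example runs without the servlet API, and MessageProcessorSketch is not the real MessageProcessor from this thread.

```java
import java.util.Objects;

// Sketch only: not the MessageProcessor class discussed in this thread.
class MessageProcessorSketch {

    // Hypothetical stand-in for the one request method used here.
    interface ParamSource {
        String getParameter(String name);
    }

    private final ParamSource request;

    MessageProcessorSketch(ParamSource request) {
        // Fails at construction time, with a clear message, if no request was passed in.
        this.request = Objects.requireNonNull(request, "request must not be null");
    }

    String message() {
        String message = request.getParameter("message");
        if (message == null) {
            // The request exists but the query string did not carry the parameter.
            throw new IllegalStateException("missing 'message' parameter");
        }
        return message;
    }

    public static void main(String[] args) {
        MessageProcessorSketch ok =
            new MessageProcessorSketch(name -> "message".equals(name) ? "/m" : null);
        System.out.println(ok.message());
    }
}
```

With guards like these, the two failures produce different exceptions with different messages, so the error itself says which one happened.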

Sorry I can't explain it any better, but that was literally the
problem.  Works now (although there are considerable bugs that I still
need to fix and will probably yell for more help on!)

Thanx!

Phil

