Java Forum / First Aid / January 2006
HTML "scrape" causes loss of query string in URL
phillip.s.powell@gmail.com - 18 Jan 2006 07:43 GMT
URL: http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2Fm&nickname=Phil
This works just fine if called via a browser. However, with any data scraper out there, including my own HTMLRetriever.getHTML() method, the query string mysteriously disappears (and I am unable to determine where: in the URL object or in the BufferedReader object), causing inaccurate data to be scraped back.
Following is my HTMLRetriever object code:
[CODE]
package ppowell;

/**
 * For implementation of Retriever by superclass FlatFileRetriever
 */
import java.io.*;
import java.net.*;
import java.util.Vector; // FOR IMPLEMENTATION OF RETRIEVER

/**
 * This class will retrieve all "Safe HTML" tags for display
 *
 * @version JDK 1.4
 * @author Phil Powell
 * @package PPOWELL
 */
public class HTMLRetriever extends FlatFileRetriever {

    private String fileName = "";

    /**
     * Constructor
     * Constructs superclass constructor and local construction
     *
     * @access public
     * @param String fileName
     * @param int displayColAmount
     * @see FlatFileRetriever
     */
    public HTMLRetriever(String fileName) { // CONSTRUCTOR
        super(fileName);
        this.fileName = fileName;
    }

    /**
     * Override getHTML() method in superclass FlatFileRetriever
     *
     * @access public
     * @return String HTML
     */
    public String getHTML() { // STRING METHOD
        String html = "", stuff = "";
        try {
            URL url = new URL(this.fileName);
            BufferedReader fromURL = new BufferedReader(new InputStreamReader(url.openStream()));
            while ((stuff = fromURL.readLine()) != null) html += stuff + "\n";
            fromURL.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return html;
    }

}
[/CODE]
At least the output is no longer random (Thanx), now it's constant.. constantly being java.lang.NullPointerException (this results only when the query string is removed from the URL)
Thanx Phil
James Westby - 18 Jan 2006 12:10 GMT
> URL:
> http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2Fm&nickname=Phil
[quoted text clipped - 6 lines]
>
> Following is my HTMLRetriever object code:
[snip]
> At least the output is no longer random (Thanx), now it's constant..
> constantly being java.lang.NullPointerException (this results only when
> the query string is removed from the URL)
>
> Thanx
> Phil
I have an inkling that what you are trying to do is a little too complex for the Java classes you are trying to use. I may be wrong, but when I worked on something very similar in the past we were using the Jakarta Commons HttpClient library. I suggest you check it out if you really want to be able to scrape pages with any request.
http://jakarta.apache.org/commons/httpclient/
Can you tell me what you are trying to scrape and why?
James
phillip.s.powell@gmail.com - 18 Jan 2006 15:02 GMT
All I'm trying to do is to produce the HTML contents of going to the following URL:
ChatGlobals.SERVLET_SELF + "/ppowell.ChatServlet?message=" + URLEncoder.encode("/m", "UTF-8") + "&nickname=" + URLEncoder.encode(cookie, "UTF-8")
Which can translate to the URL I first mentioned. What happens is that the URL resolves within HTMLRetriever except for the query string, which never does.
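[Editor's sketch] For reference, this is roughly how that expression expands. ChatGlobals.SERVLET_SELF and the cookie value are not shown anywhere in the thread, so the base address and the nickname used here are assumptions, and the class name is made up for illustration:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class ChatUrlBuilder {
    // Hypothetical stand-in for ChatGlobals.SERVLET_SELF, which the thread never shows.
    static final String SERVLET_BASE = "http://www.myjavaserver.com/servlet";

    /** Builds the servlet address with both parameter values form-encoded as UTF-8. */
    static String build(String message, String nickname) {
        try {
            return SERVLET_BASE + "/ppowell.ChatServlet?message="
                    + URLEncoder.encode(message, "UTF-8")
                    + "&nickname=" + URLEncoder.encode(nickname, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prints the address from the first post when given "/m" and "Phil".
        System.out.println(build("/m", "Phil"));
    }
}
```

Only the parameter values are encoded; the "?", "=" and "&" separators are left alone, which is what keeps the query string parseable on the servlet side.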
I'm sorry but that link is too complex for me to understand how to use HttpClient, download, install, anything, and furthermore, I'm on a remote hosting service. Could you explain it to me, sorry I simply have no idea.
Phil
> > URL:
> > http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2Fm&nickname=Phil
[quoted text clipped - 26 lines]
> >
> > James
James Westby - 18 Jan 2006 15:15 GMT
> All I'm trying to do is to produce the HTML contents of going to the
> following URL:
[quoted text clipped - 13 lines]
>
> Phil
[snip]
So you're trying to get the contents of a page on the same server and display it? What is the page written in? What are you trying to embed it in to? What changes do you want to make to it when you include it?
The HttpClient provides a set of classes you can use to get the functionality of a web browser, so you can basically (not like this but it gives the idea)
HttpClient client = new Client();
PageRequest request = new PageRequest(ChatGlobals.SERVLET_SELF + "/ppowell.ChatServlet");
request.setParameter("message", URLEncoder.encode("/m", "UTF-8"));
request.setParameter("nickname", URLEncoder.encode(cookie, "UTF-8"));
WebPage page = client.getPage(request);
...do stuff with page, including writing it out in to the current page you are displaying.
This allows you to handle things like query strings, redirects etc. that may be encountered along the way, but that cannot be handled by URL.
Can I ask what ChatGlobals is? Is it trying to emulate some of the stuff you are missing from PHP?
What are you using to write your Java code?
James
phillip.s.powell@gmail.com - 18 Jan 2006 16:26 GMT
Please see below, thanx
> > All I'm trying to do is to produce the HTML contents of going to the
> > following URL:
[quoted text clipped - 18 lines]
> display it? What is the page written in? What are you trying to embed it
> in to? What changes do you want to make to it when you include it?
Ok let me give you the background. The original chatroom was written in PHP with several PHP scripts interacting with one Java servlet, which interacted with several Java classes. That was because I'm a PHP guy, not a Java guy. I couldn't figure out how to write the front-end and middle-end scripts in any language other than my "native" language of PHP.
The remote hosting service provider dropped support of PHP and went 100% J2EE on me, so now I have to translate all of the PHP scripts into JSP. Rewriting them to alter the architecture into a J2EE format is beyond unrealistic given my limited availability, not to mention ability, so the fastest approach was for me to simply translate from PHP to JSP, figuring that while not very "Java guru cool" to do so, hey, it works and that's the bottom line..
So for now I will have JSP scripts that will interact with one Java servlet via HTTP. The JSP pages will call up the servlet's URL while sending $_GET variables into the query string. Once this is done the HTML contents of the resulting call to the Servlet via HTTP will be displayed within an existing HTML frame. Imagine if you're in www.blah.com and you have a window pop up showing the contents of www.foo.com?message=/m&nickname=Phil that is what I want to do.
> The HttpClient provides a set of classes you can use to get the
> functionality of a web browser, so you can basically (not like this but
[quoted text clipped - 19 lines]
> Can i ask what ChatGlobals is? Is it trying to emulate some of the stuff
> you are missing from PHP?
Yes. ChatGlobals is the only way I could find in Java to do what we do quite often in PHP: create global variables. chat_global_vars.php was a global library script that contained variables that would be used by every other PHP script. ChatGlobals is Java's equivalent to that, being a class containing nothing but public static final String properties.
> What are you using to write your Java code?
Windows Notepad. Java apps compiled via "javac" on W2K. JSP and servlets compiled at remote hosting service upon FTP'ing them there.
Phil
> James
James Westby - 18 Jan 2006 16:59 GMT
> Please see below, thanx
>>[snip]
>>
[quoted text clipped - 16 lines]
> PHP to JSP, figuring that while not very "Java guru cool" to do so,
> hey, it works and that's the bottom line..
Ok, I can understand your reasoning for doing this, and it does make sense to me. But surely you understand that it isn't an automatic process to go from PHP to JSP. For one, not every statement you write in PHP necessarily has a directly corresponding JSP statement. There will probably be a way to do it, but it may be more convoluted, or require you to think in a slightly different manner, or look in a slightly different place. The other point is that JSP isn't a standalone technology, so something you are trying to translate to JSP may be possible, but people may normally approach it in a different manner, using one of the other technologies that are available. This may mean that no one knows how to achieve what you are trying to do, or that they may try to give you an answer that pushes you towards a slightly different solution. Obviously you don't have to take their advice, but you're on your own if you continue to do something that people aren't going to help you with.
Reading that back, it sounds like I'm having a go at you for not doing what I say. Please understand that I'm not doing that; it's more like a convoluted way of saying "I don't know how to solve your problem."
> So for now I will have JSP scripts that will interact with one Java
> servlet via HTTP. The JSP pages will call up the servlet's URL while
[quoted text clipped - 3 lines]
> www.blah.com and you have a window pop up showing the contents of
> www.foo.com?message=/m&nickname=Phil that is what I want to do.
Where did the Java servlets come from? Do you have access to their source? Is it possible for you to interact with them in a different manner?
Does the pop-up window purely show the contents of www.foo.com?message=/m&nickname=Phil? Or does it have some sort of addition, for instance an image above it, or all the text turned green?
You haven't explained why you are downloading the page and then showing it, rather than just redirecting the browser.
For instance, when I was working on a project that downloaded a webpage as part of a servlet, it added a browser-like address bar at the top, parsed the HTML of the page and chucked out certain parts of it. The displayed page was then very different to the original. If we just wanted the address bar then we could have used some technologies that allow embedding webpages into another (not all Java). If we didn't even want that, well, then it would have been a bit of a pointless project, but you get the idea.
>>The HttpClient provides a set of classes you can use to get the
>>functionality of a web browser, so you can basically (not like this but
[quoted text clipped - 16 lines]
>>This allows you to handle things like query strings, redirects etc. that
>>may be encountered along the way, but that cannot be handled by URL.
Did my explanation make any sense? What do you think of this approach?
Can I just clarify your problem?
You attempt to use your HTML retriever class to download web pages.
If you download
www.foo.com
you get the page fine, but if you try and download
www.foo.com?message=/m&nickname=Phil
You get the page, but without the changes that should occur if foo.com received the ?message=/m&nickname=Phil query.
Is that all correct? Are there any more nuances to the problem?
>>Can i ask what ChatGlobals is? Is it trying to emulate some of the stuff
>>you are missing from PHP?
[quoted text clipped - 5 lines]
> Java's equivalent to that, being a class containing nothing but public
> static final String properties.
The reason I ask is that this was what I guessed ChatGlobals was, and it is normally considered bad practice for various reasons. However, it works, so stick with it if you're happy.
>>What are you using to write your Java code?
>
> WIndows Notepad. Java apps compiled via "javac" on W2K. JSP and
> servlets compiled at remote hosting service upon FTP'ing them there.
That's cool, I was just asking out of interest really. A good way to learn I think, worked for me anyway. Does it not get annoying uploading your JSP pages to the server to compile and test them though?
James
phillip.s.powell@gmail.com - 18 Jan 2006 17:43 GMT
Once again, please see below, thanx!
> > Please see below, thanx
>
[quoted text clipped - 38 lines]
> what i say, please understand that I'm not doing that, it's more like a
> convoluted way of saying "I don't know how to solve your problem."
That makes 2 of us :(
> > So for now I will have JSP scripts that will interact with one Java
> > servlet via HTTP. The JSP pages will call up the servlet's URL while
[quoted text clipped - 6 lines]
> Where did the Java servlets come from? Do you have access to their
> source? Is it possible that you interact with them in a different manner.
I have access to their source, but I am not sure what you mean by that. I interact with the servlet within the URL as that's the only way I know how to interact with a servlet.
> Does the pop-up window purely show the contents of
> www.foo.com?message=/m&nickname=Phil? Or does it have some sort of
> addition, for instance an image above it, of all the text turned to green?
The popup window is a frame. One of the framesets purely shows the contents of www.foo.com?message=/m&nickname=Phil. No other nuances involved. Just spits back raw what it finds (with some style sheets added in later to make it look pretty from the client end)
> You haven't explained why you are downloading the page and then showing
> it, rather than just redirecting the browser.
Can't redirect the browser. The popup window is room.jsp, which is an HTML frame with 3 framesets, each frameset showing a different URL. Redirection then would be impossible.
> For instance when I was working on a project that downloaded a webpage
> as part of a servlet, it added a browser-like address bar at the top,
[quoted text clipped - 27 lines]
>
> Did my explanation make any sense? What do you think of this approach?
It would make sense were I able to use HttpClient. I'm sorry but I have no idea how to use it, period. I have no clue as to how to download it and put it on the remote hosting service for me to be able to use it. My remote hosting service only allows me to put individual Java classes, servlets and JSP scripts, and beans. That's it. No JAR, no WAR, no whatever else.
> Can I just clarify your problem?
>
[quoted text clipped - 12 lines]
>
> Is that all correct? Are there any more nuances to the problem?
Nope that's dead on.
> >>Can i ask what ChatGlobals is? Is it trying to emulate some of the stuff
> >>you are missing from PHP?
[quoted text clipped - 18 lines]
> worked for me anyway. Does it not get annoying uploading your JSP pages
> to the server to compile and test them though?
With one single 128mb memory card in an old machine with Win2K that doesn't allow you to use Eclipse or other Java-related services, you get used to it.
Phil
> James
James Westby - 18 Jan 2006 18:08 GMT
>>Where did the Java servlets come from? Do you have access to their
>>source? Is it possible that you interact with them in a different manner.
>
> I have access to their source, but I am not sure what you mean by that.
> I interact with the servlet within the URL as that's the only way I
> know how to interact with a servlet.
Well, a servlet is just a normal Java class that implements a certain interface, and so provides methods that are called by the server when the user accesses the servlet. Unless the implementation was incredibly poor, however, there would be other classes that actually did all the work. Is it a possibility that you could access these more directly?
>>Does the pop-up window purely show the contents of
>>www.foo.com?message=/m&nickname=Phil? Or does it have some sort of
[quoted text clipped - 4 lines]
> involved. Just spits back raw what it finds (with some style sheets
> added in later to make it look pretty from the client end)
I'm no web programmer, but isn't that what an IFrame does? (Maybe without the style sheets).
>>You haven't explained why you are downloading the page and then showing
>>it, rather than just redirecting the browser.
>
> Can't redirect the browser. The popup window is room.jsp, which is an
> HTML frame with 3 framesets, each frameset showing a different URL.
> Redirection then would be impossible.
That's fair enough, just making sure you had ruled out all of the easy solutions.
[snip]
>>Did my explanation make any sense? What do you think of this approach?
>
[quoted text clipped - 4 lines]
> Java classes, servlets and JSP scripts, and beans. That's it. No JAR,
> no WAR, no whatever else.
There is nothing particularly special about a JAR, so you could upload it and use it yourself (I think). Technologically I mean; if it is their policy then that makes things a bit more tricky. I'm not sure there's anything stopping you unpacking the jar (it's just a fancy zip file after all) and uploading the classes if necessary.
[snip]
> Nope that's dead on.
Looking back at your code again, you are calling the URL(String) constructor. Looking at the source, there appear to be 2 underlying constructors: one that this form uses, and another which is used by
URL(String protocol, String host, int port, String file)
Could you try using that one instead, as the source looks a little clearer about how it handles query strings.
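[Editor's sketch] A quick offline check of the two constructor forms, with no network access involved. The class and method names are made up for illustration. It relies on the documented behaviour that URL.getFile() is the path plus the query, and that a port of -1 means "use the protocol default":

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlConstructorCheck {
    /** Parses a complete address with the single-argument constructor. */
    static URL fromString(String address) {
        try {
            return new URL(address);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Rebuilds the same address from its parts with the four-argument constructor. */
    static URL fromParts(URL parsed) {
        try {
            // getFile() already carries "?query" when a query is present;
            // getPort() is -1 when the address names no port.
            return new URL(parsed.getProtocol(), parsed.getHost(),
                    parsed.getPort(), parsed.getFile());
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URL a = fromString(
                "http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2Fm&nickname=Phil");
        URL b = fromParts(a);
        System.out.println("single-arg query: " + a.getQuery());
        System.out.println("four-arg query:   " + b.getQuery());
    }
}
```

If both forms report the same query, as I would expect, then the URL object itself is probably not where the query string goes missing, and the place to look is the request or the server side.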
Have you experimented with using different query strings? The one you are using looks pretty benign, but it's worth a try. I spent a whole day tracking down a similar bug, where one page would lose the query string every fifth time or so, but due to several other factors the bug would end up appearing like it was happening somewhere else. This isn't the same problem though, unfortunately.
[snip]
> With one single 128mb memory card in an old machine with Win2K that
> doesn't allow you to use Eclipse or other Java-related services, you
> get used to it.
I guess you'd have to. I forget that not everyone has the luxury of plenty of computing power to just try things out.
phillip.s.powell@gmail.com - 18 Jan 2006 22:10 GMT
[snip]
> Looking back at your code again you are calling the URL(String)
> constructor. Looking at the source there appear to be 2 constructors,
[quoted text clipped - 4 lines]
> Could you try using that one instead, as the source looks a little
> clearer about how it handles query strings.
That looks just like a TCL proc I've seen once, where the query string is literally tacked onto the file name and handled. I'll let you know.
Phil
> Have you experimented with using different query strings, the one you
> are using looks pretty benign, but it's worth a try. I spent a whole day
[quoted text clipped - 11 lines]
> I guess you'd have to. I forget that not everyone has the luxury of
> plenty of computing power to just try things out.
phillip.s.powell@gmail.com - 19 Jan 2006 08:29 GMT
Hey, I tried out a few things to HTMLRetriever, and the results were no different: NULL.
I am at a complete loss because the URL comes across just fine in your browser:
1) Type http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2fa+James&nickname=James (this will add your nickname to messages.txt which will be read by the 2nd URL - and I've already verified that this txt file exists and is populated)
2) Type http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2fm&nickname=James
a) Use your browsers (IE and Firefox) and you'll see it comes up just fine
b) Then try using HttpClient, HttpUnit or whatever Java-based data scraper you have (even try HTMLRetriever if you want) and it will ALWAYS return null!
I cannot fathom how on earth a URL could return null when scraped but return with content when called via browser! Why in the world would that happen?
Phil
[CODE]
package ppowell;

/**
 * For implementation of Retriever by superclass FlatFileRetriever
 */
import java.io.*;
import java.net.*;
import java.util.Vector; // FOR IMPLEMENTATION OF RETRIEVER
import java.util.regex.*;

/**
 * This class will retrieve all "Safe HTML" tags for display
 *
 * @version JDK 1.4
 * @author Phil Powell
 * @package PPOWELL
 */
public class HTMLRetriever extends FlatFileRetriever {

    private String fileName = "";
    private static final String qsPatternStr = "\\?([a-zA-Z0-9\\-_\\.]+=[a-zA-Z0-9\\-_\\.%\\+,\\^~]*&?)+#?.*$";

    /**
     * Constructor
     * Constructs superclass constructor and local construction
     *
     * @access public
     * @param String fileName
     * @param int displayColAmount
     * @see FlatFileRetriever
     */
    public HTMLRetriever(String fileName) { // CONSTRUCTOR
        super(fileName);
        this.fileName = fileName;
    }

    //------------- --* GETTER/SETTER METHODS *-- -----------------

    /**
     * Get file including query string and ref
     *
     * @access private
     * @return String file
     */
    private String getFile() { // PRIVATE STRING METHOD
        String file = "";
        try {
            URL url = new URL(this.fileName);
            file = url.getFile();
            if (Pattern.matches(HTMLRetriever.qsPatternStr, this.fileName) &&
                !Pattern.matches(HTMLRetriever.qsPatternStr, file)) file += url.getQuery();
        } catch (Exception e) {} // DO NOTHING
        return file;
    }

    /**
     * Get host
     *
     * @access private
     * @return String host
     */
    private String getHost() { // PRIVATE STRING METHOD
        try {
            URL url = new URL(this.fileName);
            return url.getHost();
        } catch (Exception e) {
            return "";
        }
    }

    /**
     * Override getHTML() method in superclass FlatFileRetriever
     *
     * @access public
     * @return String HTML
     */
    public String getHTML() { // STRING METHOD
        String html = "", stuff = "";
        try {
            URL url = this.generateURL();
            HttpURLConnection conn = (HttpURLConnection)url.openConnection();
            conn.setDoInput(true);
            conn.setDoOutput(true);
            conn.setUseCaches(false);
            conn.setDefaultUseCaches(false);
            conn.setRequestProperty("content-type", "text/html");
            BufferedReader fromURL = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            while ((stuff = fromURL.readLine()) != null) html += stuff + "\n";
            fromURL.close();
            conn.disconnect(); // HttpURLConnection has no close() method
        } catch (Exception e) {
            e.printStackTrace();
        }
        return html;
    }

    /**
     * Get port
     *
     * @access private
     * @return int port
     */
    private int getPort() { // PRIVATE INT METHOD
        try {
            URL url = new URL(this.fileName);
            return url.getPort();
        } catch (Exception e) {
            return 0;
        }
    }

    /**
     * Get protocol
     *
     * @access private
     * @return String protocol
     */
    private String getProtocol() { // PRIVATE STRING METHOD
        try {
            URL url = new URL(this.fileName);
            return url.getProtocol();
        } catch (Exception e) {
            return "";
        }
    }

    //------------- --* END OF GETTER/SETTER METHODS *-- -----------------

    /**
     * Generate URL object
     *
     * @access private
     * @return URL url
     */
    private URL generateURL() { // PRIVATE URL METHOD
        try {
            URL url = null;
            if (Pattern.matches(HTMLRetriever.qsPatternStr, this.fileName)) {
                url = new URL(this.getProtocol(), this.getHost(), this.getPort(), this.getFile());
            } else {
                url = new URL(this.fileName);
            }
            return url;
        } catch (Exception e) {
            return null;
        }
    }

}
[/CODE]
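[Editor's sketch] One thing worth checking in the class above: Pattern.matches(regex, input) only succeeds when the whole input matches, and qsPatternStr begins with a literal "?". As far as I can tell, a full address such as "http://..." can therefore never match, so generateURL() would always fall through to new URL(this.fileName). The snippet compares a whole-input match with a substring search using the same expression; the class and method names are invented:

```java
import java.util.regex.Pattern;

class QueryPatternCheck {
    // Same expression as qsPatternStr in the posted class.
    static final String QS_PATTERN =
            "\\?([a-zA-Z0-9\\-_\\.]+=[a-zA-Z0-9\\-_\\.%\\+,\\^~]*&?)+#?.*$";

    /** Whole-input match, which is what Pattern.matches(...) performs. */
    static boolean matchesWhole(String input) {
        return Pattern.matches(QS_PATTERN, input);
    }

    /** Substring search, which is what "does this address carry a query string" needs. */
    static boolean containsQuery(String input) {
        return Pattern.compile(QS_PATTERN).matcher(input).find();
    }

    public static void main(String[] args) {
        String address =
                "http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2Fm&nickname=Phil";
        System.out.println("matches whole address: " + matchesWhole(address));
        System.out.println("finds query inside it: " + containsQuery(address));
    }
}
```

This does not explain the empty result on its own, since the fallback branch still builds a URL that keeps its query string, but it does suggest the four-argument constructor was never actually exercised.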
> [snip]
> >
[quoted text clipped - 27 lines]
> > I guess you'd have to. I forget that not everyone has the luxury of
> > plenty of computing power to just try things out.
James Westby - 19 Jan 2006 10:32 GMT
> Hey, I tried out a few things to HTMLRetriever, and the results were no
> different: NULL.
[quoted text clipped - 22 lines]
>
> Phil
[snip]
Ok, I did that. The first page showed nothing whatsoever. The second URL gave
500 Servlet Exception
java.lang.NullPointerException
    at ppowell.MessageProcessor.retrieveMessagesFromFile(MessageProcessor.java:283)
    at ppowell.MessageProcessor.process(MessageProcessor.java:257)
    at ppowell.ChatServlet.doPost(ChatServlet.java:51)
    at ppowell.ChatServlet.doGet(ChatServlet.java:37)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:740)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
    at com.caucho.server.http.FilterChainServlet.doFilter(FilterChainServlet.java:95)
    at com.caucho.server.http.Invocation.service(Invocation.java:291)
    at com.caucho.server.http.RunnerRequest.handleRequest(RunnerRequest.java:339)
    at com.caucho.server.http.RunnerRequest.handleConnection(RunnerRequest.java:268)
    at com.caucho.server.TcpConnection.run(TcpConnection.java:136)
    at java.lang.Thread.run(Thread.java:602)
Resin 2.1.0 (built Tue Mar 26 14:12:50 PST 2002)
Is that what you mean by null? If not what do you mean?
James
phillip.s.powell@gmail.com - 19 Jan 2006 14:04 GMT
Did you enter in the first URL and see nothing? If so, good, you're not supposed to.
But after that, did you enter the second URL and get a NullPointerException? That is strange.
I am looking at the possibility now of doing an architectural change at this point.
Phil
> > Hey, I tried out a few things to HTMLRetriever, and the results were no
> > different: NULL.
[quoted text clipped - 53 lines]
>
> James
phillip.s.powell@gmail.com - 20 Jan 2006 08:21 GMT
I don't know what to do now. Every modification I could think of to add to HTMLRetriever has failed.
See for yourself:
[CODE]
package ppowell;

/**
 * For implementation of Retriever by superclass FlatFileRetriever
 */
import java.io.*;
import java.net.*;
import java.util.Vector; // FOR IMPLEMENTATION OF RETRIEVER
import java.util.regex.*;

/**
 * This class will retrieve all "Safe HTML" tags for display
 *
 * @version JDK 1.4
 * @author Phil Powell
 * @package PPOWELL
 */
public class HTMLRetriever extends FlatFileRetriever {

    private String fileName = "";
    private static final String qsPatternStr = "\\?([a-zA-Z0-9\\-_\\.]+=[a-zA-Z0-9\\-_\\.%\\+,\\^~]*&?)+#?.*$";

    /**
     * Constructor
     * Constructs superclass constructor and local construction
     *
     * @access public
     * @param String fileName
     * @param int displayColAmount
     * @see FlatFileRetriever
     */
    public HTMLRetriever(String fileName) { // CONSTRUCTOR
        super(fileName);
        this.fileName = fileName;
    }

    //------------- --* GETTER/SETTER METHODS *-- -----------------

    /**
     * Get file including query string and ref
     *
     * @access private
     * @return String file
     */
    private String getFile() { // PRIVATE STRING METHOD
        String file = "";
        try {
            URL url = new URL(this.fileName);
            file = url.getFile();
            if (Pattern.matches(HTMLRetriever.qsPatternStr, this.fileName) &&
                !Pattern.matches(HTMLRetriever.qsPatternStr, file)) file += url.getQuery();
        } catch (Exception e) {} // DO NOTHING
        return file;
    }

    /**
     * Get host
     *
     * @access private
     * @return String host
     */
    private String getHost() { // PRIVATE STRING METHOD
        try {
            URL url = new URL(this.fileName);
            return url.getHost();
        } catch (Exception e) {
            return "";
        }
    }

    /**
     * Override getHTML() method in superclass FlatFileRetriever
     *
     * @access public
     * @return String HTML
     */
    public String getHTML() { // STRING METHOD
        String html = "", stuff = "";
        try {
            URL url = this.generateURL();
            HttpURLConnection conn = (HttpURLConnection)url.openConnection();
            conn.setDoInput(true);
            conn.setDoOutput(true);
            conn.setUseCaches(false);
            conn.setDefaultUseCaches(false);
            conn.setRequestProperty("content-type", "text/html");
            conn.setRequestMethod("GET");
            conn.setRequestProperty("Connection", "close");
            conn.connect();
            BufferedReader fromURL = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            while ((stuff = fromURL.readLine()) != null) html += stuff + "\n";
            fromURL.close();
            conn.disconnect();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return html;
    }

    /**
     * Get port
     *
     * @access private
     * @return int port
     */
    private int getPort() { // PRIVATE INT METHOD
        try {
            URL url = new URL(this.fileName);
            return url.getPort();
        } catch (Exception e) {
            return 0;
        }
    }

    /**
     * Get protocol
     *
     * @access private
     * @return String protocol
     */
    private String getProtocol() { // PRIVATE STRING METHOD
        try {
            URL url = new URL(this.fileName);
            return url.getProtocol();
        } catch (Exception e) {
            return "";
        }
    }

    //------------- --* END OF GETTER/SETTER METHODS *-- -----------------

    /**
     * Generate URL object
     *
     * @access private
     * @return URL url
     */
    private URL generateURL() { // PRIVATE URL METHOD
        try {
            URL url = null;
            if (Pattern.matches(HTMLRetriever.qsPatternStr, this.fileName)) {
                url = new URL(this.getProtocol(), this.getHost(), this.getPort(), this.getFile());
            } else {
                url = new URL(this.fileName);
            }
            return url;
        } catch (Exception e) {
            return null;
        }
    }

}
[/CODE]
Even when I wrote a simple PHP script to do an fopen() command remotely, the results were the same: NULL. When I called the servlet URL from the browser, every time I get content.
Phil
> Did you enter in the first URL and see nothing? If so, good, you're not
> supposed to.
[quoted text clipped - 64 lines]
> >
> > James
James Westby - 20 Jan 2006 10:12 GMT
> I don't know what to do now. Every modification I could think of to
> add to HTMLRetriever has failed.
>
> See for yourself:
>
> [CODE]
[snip]
> [/CODE]
>
[quoted text clipped - 3 lines]
>
> Phil
[snip]
You haven't explained exactly what you mean by null. If you mean the special value "null" in Java, then that should never result from calling the getHTML() method shown above, so I would suspect a problem in the calling code.
If you mean a NullPointerException then you need to give the error message along with the code that throws the exception.
If you just mean no content, then I suggest you say so, as null means something else in Java, and so merely becomes confusing. If this is the case then you have a problem with the method just given, either with your coding, or with the use of the API.
Which of these cases is it?
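[Editor's sketch] One way to tell these cases apart from the calling side is to look at the HTTP status before reading the body. As far as I know, HttpURLConnection.getInputStream() throws an IOException for responses with status 400 and above, which the posted catch block would turn into an empty string; the server's error page is then available from getErrorStream(). The class and method names here are invented, and the UTF-8 charset is an assumption:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

class FetchDiagnostics {
    /** Names the outcome so that "nothing came back" is never ambiguous. */
    static String describe(int status, String body) {
        if (status >= 400) return "HTTP error " + status;
        if (body == null || body.trim().length() == 0) return "HTTP " + status + " with empty body";
        return "HTTP " + status + " with " + body.length() + " characters";
    }

    /** Fetches the address and reports the status plus whatever body the server sent. */
    static String fetchAndDescribe(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        try {
            int status = conn.getResponseCode();
            // For 4xx/5xx the body (for example a servlet stack trace) is on the error stream.
            InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            StringBuilder body = new StringBuilder();
            if (in != null) { // getErrorStream() may return null when no body was sent
                BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
                String line;
                while ((line = reader.readLine()) != null) body.append(line).append('\n');
                reader.close();
            }
            return describe(status, body.toString()) + "\n" + body;
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling FetchDiagnostics.fetchAndDescribe(...) with the servlet address would then report either a status line with content, an empty 200 response, or the error page itself, rather than a bare empty string.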
If the same thing is happening with PHP then I guess there is a more serious problem. Can you fix it in PHP first? That might give you some clues, and it puts you back in your area of expertise. Didn't you have it working in PHP before?
From what you have said the browser will be accessing the servlet from a different machine to the PHP and the JSP, is this correct? Have you therefore looked in to say a firewall problem?
James
phillip.s.powell@gmail.com - 20 Jan 2006 16:17 GMT
See below
> > I don't know what to do now. Every modification I could think of to
> > add to HTMLRetriever has failed.
[quoted text clipped - 18 lines]
> If you mean a NullPointerException then you need to give the error
> message along with the code that throws the exception.
I guess I mean a NullPointerException as that is constantly being thrown. I was unable to obtain any specific error message, however. Whenever the page would pull up, nothing would appear, when called by an outside script (e.g. my PHP script on another site), then I would get the NullPointerException thrown and thus a more specific version of "nothing".
> If you just mean no content, then I suggest you say so, as null means
> something else in Java, and so merely becomes confusing. If this is the
[quoted text clipped - 7 lines]
> clues, and puts you back in your area of expertise. Didn't you have it
> working in PHP before.
It was working on myjavaserver.com before in PHP, however, myjavaserver.com dropped all PHP support, thus, my having to convert everything from PHP to JSP, otherwise I wouldn't have bothered. I do not have to fix the PHP script as this script, along with HTMLRetriever and anything else that calls <b>just that one URL</b> will produce a NullPointerException every time. I verified this by placing other URLs into my PHP script, into HTMLRetriever, etc. and every other URL I placed into these scrapers worked perfectly every time, except when the specific URL came up http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2fm&nickname=Phil
> From what you have said the browser will be accessing the servlet from
> a different machine to the PHP and the JSP, is this correct? Have you
> therefore looked in to say a firewall problem?
No. The browser will be accessing the JSP scripts which will access the servlet, all on the same machine. Sorry if I didn't make that more clear.
Phil
> James
James Westby - 20 Jan 2006 17:22 GMT
> See below
>
[quoted text clipped - 30 lines]
> get the NullPointerException thrown and thus a more specific version of
> "nothing".
How's this for an error message?
500 Servlet Exception
java.lang.NullPointerException
    at ppowell.MessageProcessor.retrieveMessagesFromFile(MessageProcessor.java:289)
    at ppowell.MessageProcessor.process(MessageProcessor.java:263)
    at ppowell.ChatServlet.doPost(ChatServlet.java:51)
    at ppowell.ChatServlet.doGet(ChatServlet.java:37)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:740)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
    at com.caucho.server.http.FilterChainServlet.doFilter(FilterChainServlet.java:95)
    at com.caucho.server.http.Invocation.service(Invocation.java:291)
    at com.caucho.server.http.RunnerRequest.handleRequest(RunnerRequest.java:339)
    at com.caucho.server.http.RunnerRequest.handleConnection(RunnerRequest.java:268)
    at com.caucho.server.TcpConnection.run(TcpConnection.java:136)
    at java.lang.Thread.run(Thread.java:602)
This stack trace appears to be from the servlet, not the JSP page, is that correct? You have said before that you think the query string is being dropped when you access the page from the JSP. In the servlet there will be a call to
public java.lang.String getParameter(java.lang.String name)
which returns
the value of a request parameter as a String, or null if the parameter does not exist.
and so if the query string is being dropped, and the servlet doesn't have null checking here (probably a bad move), then this could be the cause of the NullPointerException. I would advise you to check this out first, as it's pointless trying to deal with a problem if you don't know what the cause is.
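The kind of null check meant here could look roughly like the sketch below. It is only an illustration: `ParamCheck`, `paramOrDefault` and the plain `Map` standing in for the request parameters are made-up names, not code from MessageProcessor or the servlet API.

```java
import java.util.HashMap;
import java.util.Map;

class ParamCheck {
    // Returns the named parameter, or the fallback when the parameter is
    // missing (null), which is how getParameter() reports an absent parameter.
    static String paramOrDefault(Map<String, String> params, String name, String fallback) {
        String value = params.get(name);
        return (value != null) ? value : fallback;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the parameters of one request
        Map<String, String> params = new HashMap<String, String>();
        params.put("nickname", "Phil");

        // "message" was never sent, as if the query string had been dropped
        String message = paramOrDefault(params, "message", "");
        System.out.println("message length = " + message.length());
        System.out.println("nickname = " + paramOrDefault(params, "nickname", "anonymous"));
    }
}
```

With a check like this, a missing parameter shows up as an empty value that the code can handle, rather than as a NullPointerException further down the call chain.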
A useful test might be to write a test JSP page that just prints out the values of the parameters that you send it, something like (un-compiled and un-tested)
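// Needs java.util.Map and java.util.Iterator imported in the JSP page directive.
Map m = request.getParameterMap();
Iterator it = m.keySet().iterator();
while (it.hasNext()) {
    String s = (String) it.next();
    out.print(s + " = ");
    // Each parameter name can map to several values
    String[] vals = (String[]) m.get(s);
    for (int i = 0; i < vals.length; i++) {
        out.print(vals[i] + " ");
    }
    out.println("");
}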
Then use your HTMLRetriever to get this page, passing it different query strings to test the class out.
If it is the problem, then you need to sort out the JSP to not drop the query string (as you know). As I have said before I have two suggestions. These are not guaranteed to work, they might even make things worse.
1) Look in my previous posts for a bit about using a different constructor for URL. Did you try this? If you did I guess it didn't work.
2) I'm not even sure that the URL class can really handle what you are doing, but I have very little experience with it. I would recommend the HttpClient from Apache-Jakarta, but if you have issues with this then that's your call. You might consider modifying your test page to just build the URL that you are attempting to get, then print the different bits out to the page and see if they are as you expect.
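Printing out "the different bits" of the URL could be sketched as below. This is an assumption about what such a test might look like, not tested against the actual server: `UrlParts` and `describe` are made-up names, and the URL is built through `java.net.URI` only to avoid the `new URL(String)` constructor that recent JDKs mark as deprecated.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class UrlParts {
    // Builds a URL from the given string and lists its components,
    // one per line, so a dropped query string would be visible at once.
    static String describe(String spec) {
        try {
            URL url = URI.create(spec).toURL();
            return "protocol=" + url.getProtocol() + "\n"
                 + "host=" + url.getHost() + "\n"
                 + "path=" + url.getPath() + "\n"
                 + "query=" + url.getQuery();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a usable URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(
            "http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2Fm&nickname=Phil"));
    }
}
```

If the query line comes back intact here, the URL object itself is not where the query string is being lost.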
>>If you just mean no content, then I suggest you say so, as null means
>>something else in Java, and so merely becomes confusing. If this is the
[quoted text clipped - 18 lines]
> specific URL came up
> http://www.myjavaserver.com/servlet/ppowell.ChatServlet?message=%2fm&nickname=Phil

So it is *just* that one URL? That would be very strange but it is possible. Have you tested other URLs with query strings? With encoded entities in them?

If you tell me that it is just that one page I will believe you, but I wouldn't expect that behaviour; my first assumption would be that you just haven't tested enough.

>> From what you have said the browser will be accessing the servlet from
>>a different machine to the PHP and the JSP, is this correct? Have you
[quoted text clipped - 3 lines]
> the servlet, all on the same machine. Sorry if I didn't make that more
> clear.

So when you said PHP doing an fopen(), you meant on the JSP page, not the servlet page? Does it work when you try the servlet page? I guess so, as that is what was originally working when you could use PHP.
You need to be more explicit about what does and doesn't work, as at the moment it's hard to know where you are actually experiencing problems, let alone suggest possible fixes for them. It sounds like your problem is purely in the JSP's accessing of the servlet, is that correct?
I have given you some suggestions of things to try, as I have done in many posts now, and I will keep giving the same suggestions until you tell me that they don't work, hopefully with some evidence that shows what goes wrong, as that will be where the clues are, or a valid reason that you cannot use my suggestions.
James
James Westby - 20 Jan 2006 17:49 GMT
[snip]

> How's this for an error message?
>
[quoted text clipped - 74 lines]
> build the URL that you are attempting to get then just printing out the
> different bits to the page, see if they are as you expect.
[snip]
I just used your HTML retriever code to get a page which shows the query string and it returned the correct string, so maybe dropping the query string isn't the real reason. So those two suggestions don't really work. I think therefore the problem lies in the servlet, as that is what is throwing the exception.
You were accessing the same servlet from PHP, using *exactly* the same URL? You can access the servlet directly using the URL you are trying to access from within the JSP page, and that works correctly? If so then it is a problem in your JSP page, otherwise it is a bug in the servlet that you have exposed by calling the different URL that you are using.
James
phillip.s.powell@gmail.com - 23 Jan 2006 08:44 GMT
Yeah, it turns out there was a bug, but not in the servlet. In fact, it was a bizarre set of circumstances in which the HttpServletRequest "request" object NEVER existed!! I can't explain it, but for some strange reason "request" simply vanished without a trace, causing a NullPointerException when doing this:
this.message = this.request.getParameter("message"); // WHEN "request" is passed from JSP page as parameter into MessageProcessor class
In the outlying JSP page "room.jsp", you have HTML framesets that call chat_messages.jsp, chat_nicknames.jsp and chat_submit_message.jsp. In "chat_messages.jsp", I never could use that ChatServlet servlet so I gave up and bypassed it, calling MessageProcessor class object instance directly from the JSP script chat_messages.jsp
However, in room.jsp, if you didn't include the query string "?message=%2fm&nickname=Phil" in the URL of the frame source that loads chat_messages.jsp, then chat_messages.jsp has no "request", and thus neither does MessageProcessor, and there's your NullPointerException.
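For anyone hitting the same thing, a guard along these lines would at least make the failure obvious at the point where the object is handed over. This is a sketch under assumptions, not the real MessageProcessor: `RequestGuard`, `ParamSource` and `Processor` are made-up names, and `ParamSource` stands in for the single HttpServletRequest method used, so that the example compiles without the servlet API.

```java
class RequestGuard {
    /** Hypothetical stand-in for the one HttpServletRequest method used here. */
    interface ParamSource {
        String getParameter(String name);
    }

    /** Hypothetical, cut-down class in the role of MessageProcessor. */
    static class Processor {
        private final String message;

        Processor(ParamSource request) {
            // Fail straight away with a clear message if no request was passed in
            if (request == null) {
                throw new IllegalArgumentException("request must not be null");
            }
            // A missing "message" parameter becomes an empty string, not null
            String m = request.getParameter("message");
            this.message = (m != null) ? m : "";
        }

        String getMessage() {
            return message;
        }
    }

    public static void main(String[] args) {
        try {
            new Processor(null);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The check separates the two cases that both end in a NullPointerException otherwise: no request object at all, and a request that simply lacks the "message" parameter.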
Sorry I can't explain it any better, but that was literally the problem. Works now (although there are considerable bugs that I still need to fix and will probably yell for more help on!)
Thanx!
Phil