Hi All
Perhaps someone knows the answer to this problem. I open a connection
to a URL and read lines one at a time from the URL using a
InputStreamReader and a BufferedReader:
    // Open connection to URL
    URLConnection conn = pageURL.openConnection();
    conn.setReadTimeout(timeout);
    conn.setConnectTimeout(timeout);
    conn.setUseCaches(false);
    InputStream pageStream = conn.getInputStream();
    // No charset argument here, so the platform default encoding is used
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(pageStream));
    String line;
    StringBuffer pageBuffer = new StringBuffer();
    // readLine() strips the line terminator, so lines are joined with no separator
    while ((line = reader.readLine()) != null)
    {
        System.out.println(line);
        pageBuffer.append(line);
    }
    return pageBuffer.toString();
However, the actual text I get back from the URL is different from that
saved out of a browser from the same URL. In particular, the browser
saves £ characters, whereas the lines read in Java are missing
these characters altogether. Some other characters have also been
deleted from the Java lines. I have tried passing different character
encodings as the second argument of the InputStreamReader, but this has
virtually no effect, except for UTF-16, which returns a large number
of "?" characters in the stream. The content type header of the page
says it is ISO-8859-1, but passing this encoding name to the
InputStreamReader changes nothing in the Java code: the £ symbol is
still missing.
In the browser, if I change the character encoding to "UTF-8" then the
£ symbol is still properly displayed in the browser. In other words,
it looks like I am receiving different data from the server depending
upon whether I use the browser or the code. I'm not sure whether it has
anything to do with the encoding; I'm just guessing.
Thanks,
Nubs.
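For reference on the encoding side: the pound sign is the single byte 0xA3 in ISO-8859-1 but the two bytes 0xC2 0xA3 in UTF-8, so the charset handed to InputStreamReader matters whenever the raw byte is what the server sends. Below is a minimal sketch, assuming the charset is taken from a Content-Type value. The class and method names (CharsetSketch, charsetFrom, decode) are made up for illustration, and the header parsing is deliberately naive.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetSketch {

    // Pull "charset=..." out of a Content-Type value such as
    // "text/html; charset=ISO-8859-1"; use the fallback if it is
    // absent or names a charset this JVM does not know.
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        return fallback; // malformed or unsupported name
                    }
                }
            }
        }
        return fallback;
    }

    // Decode raw bytes with an explicit charset instead of the platform default.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        byte[] latin1Pound = { (byte) 0xA3 };
        Charset cs = charsetFrom("text/html; charset=ISO-8859-1",
                                 StandardCharsets.UTF_8);
        // Print the code point rather than the character, so the result
        // does not depend on the console's own encoding.
        System.out.println(cs + " -> U+"
                + Integer.toHexString(decode(latin1Pound, cs).charAt(0)));
        // The same byte is malformed as UTF-8 and decodes to U+FFFD.
        System.out.println("UTF-8 -> U+"
                + Integer.toHexString(decode(latin1Pound, StandardCharsets.UTF_8).charAt(0)));
    }
}
```

With a live connection, the value passed to charsetFrom would typically come from conn.getContentType(). Note that System.out.println re-encodes text for the console, so a character can be present in the String and still not show up on screen.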
Andrew Thompson - 04 Oct 2006 16:29 GMT
...
> Perhaps someone knows the answer to this problem. I open a connection
> to a URL ...
What URL (specifically)?
> ...However, the actual text I get back from the URL is different from that
> saved out of a browser ...
What browser (make, version, OS - specifically)?
Is the saved text identical to the text shown when
you 'view source' in the 'a browser'?
Andrew T.
little_mm@ntlworld.com - 04 Oct 2006 16:33 GMT
Thanks for the response Andrew.
URL: http://www.net-a-porter.com/Shop/Shop/Shoes/All?pageNumber=0
Browser: Mozilla Firefox, but same effect in IE6, OS: Windows XP.
Yes, I think view source and save page are identical, although I
haven't checked byte-for-byte.
Nubs.
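A byte-for-byte check is easy to script. Here is a rough sketch, assuming Java 9 or later (for Arrays.mismatch) and that both versions of the page have been saved to files. The class and method names (ByteCompareSketch, firstDifference, hexAround) are invented for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

class ByteCompareSketch {

    // Offset of the first differing byte, or -1 if the arrays are identical.
    // If one array is a prefix of the other, the shorter length is returned.
    static int firstDifference(byte[] a, byte[] b) {
        return Arrays.mismatch(a, b);
    }

    // Hex dump of the bytes within `radius` positions of `offset`.
    static String hexAround(byte[] data, int offset, int radius) {
        int from = Math.max(0, offset - radius);
        int to = Math.min(data.length, offset + radius + 1);
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < to; i++) {
            sb.append(String.format("%02X ", data[i] & 0xFF));
        }
        return sb.toString().trim();
    }

    // Usage: java ByteCompareSketch browser-save.html java-save.html
    public static void main(String[] args) throws IOException {
        byte[] a = Files.readAllBytes(Paths.get(args[0]));
        byte[] b = Files.readAllBytes(Paths.get(args[1]));
        int at = firstDifference(a, b);
        if (at < 0) {
            System.out.println("identical");
        } else {
            System.out.println("first difference at byte " + at);
            System.out.println(args[0] + ": " + hexAround(a, at, 8));
            System.out.println(args[1] + ": " + hexAround(b, at, 8));
        }
    }
}
```

Looking at the hex around the first difference shows whether the pound sign arrives as 0xA3, as 0xC2 0xA3, as an HTML entity, or not at all. For a fair comparison the Java side should write the raw bytes from the InputStream, not the decoded lines.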
Chris Uppal - 04 Oct 2006 17:18 GMT
> Perhaps someone knows the answer to this problem. I open a connection
> to a URL and read lines one at a time from the URL using a
> InputStreamReader and a BufferedReader:
[...]
> However, the actual text I get back from the URL is different from that
> saved out of a browser from the same URL. Particularly, the browser
> saves £ characters, whereas the lines read in Java are missing
> these characters altogether. Also, some of the characters have actually
> been deleted in the Java lines.
Maybe the website is using something like the Accept-Language: field in the
request to decide what currency (etc) to send back. I don't know what the Java
HTTP client will send in that field by default, but it is unlikely to be
'en-GB' which is what my browser would send.
I just tried it myself, but -- most unfortunately -- the site has just stopped
responding. I /do/ hope my little experiment didn't kill it...
-- chris
little_mm@ntlworld.com - 04 Oct 2006 17:22 GMT
> > Perhaps someone knows the answer to this problem. I open a connection
> > to a URL and read lines one at a time from the URL using a
[quoted text clipped - 15 lines]
>
> -- chris
Hi Chris - thanks for the response. So, question: how do you mimic the
browser's HTTP requests precisely, so that a website generally behaves
in the same way? For example, how do you change the Accept-Language
field?
Thanks,
Nubs.
Tor Iver Wilhelmsen - 04 Oct 2006 19:35 GMT
> Hi Chris - thanks for the response. So, question: how do you mimic the
> browser's HTTP requests precisely, so that a website generally behaves
> in the same way? For example, how do you change the Accept-Language
> field?
Look at URLConnection.setRequestProperty().
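A minimal sketch of how that might look, assuming the headers are set before the connection is actually opened. The class name BrowserLikeRequest and the header values are illustrative only; they are not what any particular browser sends, and whether a given server reacts to them is a separate question.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

class BrowserLikeRequest {

    // Creates (but does not open) a connection with a few request
    // headers overridden. The values below are placeholders.
    static URLConnection open(String url) {
        try {
            URLConnection conn = URI.create(url).toURL().openConnection();
            // Request properties must be set before connecting, i.e.
            // before getInputStream() or connect() is called.
            conn.setRequestProperty("Accept-Language", "en-GB,en;q=0.8");
            conn.setRequestProperty("User-Agent", "Mozilla/5.0 (illustrative)");
            conn.setRequestProperty("Accept", "text/html");
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection conn = open("http://example.com/");
        // Nothing has been sent yet; this only echoes what was set.
        System.out.println(conn.getRequestProperty("Accept-Language"));
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

Calling setRequestProperty after the connection has been opened throws IllegalStateException, so the calls belong right after openConnection().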
little_mm@ntlworld.com - 04 Oct 2006 22:12 GMT
> > Hi Chris - thanks for the response. So, question: how do you mimic the
> > browser's HTTP requests precisely, so that a website generally behaves
> > in the same way? For example, how do you change the Accept-Language
> > field?
>
> Look at URLConnection.setRequestProperty().
OK, many thanks Iver.
Chris Uppal - 05 Oct 2006 11:39 GMT
> Hi Chris - thanks for the response. So, question: how do you mimic the
> browser's HTTP requests precisely, so that a website generally behaves
> in the same way?
I see that Tor has already answered. I want to add that their server is back
up this morning, and I've just tried again (it stayed up this time!). The bad
news is that changing the Accept-Language field to, say, "da" made no
difference -- it still sent back a page where the price of the first boot was
£ <some jaw-droppingly large number>. So that was a red herring, I'm
afraid.
-- chris