> On Oct 22, 1:11 am, mic...@gmail.com wrote:
> > I am trying to read the text of a website using a URL object and a data
> > stream
> > It works well on CNN.com for example, but doesn't work well on: http://www.collegehumor.com:80/video:1674301
>
> What makes you think it does not work?
The fact that instead of normal HTML text I'm getting gibberish like this:
<?s?6²¿w¦???E?9¿$J´-e?I|/N|¶?^s???$$1¦??l«???·?IQ²?v??¼d?X`???~8?tr????e??\~~????hm]??>????S??÷7??1?MB?4?B?H×?>jD?e??@×???;÷v?'S??J@X&vV??¬?³d?6??»#|¿x?h
¯?,£?¶?o??n¨??8cq?¾Y-?F|y7?2??????3??,?)o=·m
?RL?l¨?e6?I©7
> > How should I interpret the stream I'm getting?
>
> As HTML?
>
> I don't get exactly what you want to do, but have you considered
> Jakarta HttpClient?
Thanks for the tip - will give it a shot
Arne Vajhøj - 22 Oct 2006 02:10 GMT
> The fact that instead of normal HTML text I'm getting gibberish like this:
> <?s?6²¿w¦???E?9¿$J´-e?I|/N|¶?^s???$$1¦??l«???·?IQ²?v??¼d?X`???~8?tr????e??\~~????hm]??>????S??÷7??1?MB?4?B?H×?>jD?e??@×???;÷v?'S??J@X&vV??¬?³d?6??»#|¿x?h
> ¯?,£?¶?o??n¨??8cq?¾Y-?F|y7?2??????3??,?)o=·m
> ?RL?l¨?e6?I©7
It looks as if that URL is returning its content gzip-compressed.
Try wrapping the InputStream in a GZIPInputStream.
Arne
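A minimal sketch of the GZIPInputStream suggestion, assuming the server reports the compression in its Content-Encoding header and the page text is UTF-8. The class and method names (GzipAwareFetch, decode, readAll, fetch) are made up for illustration and are not from the original post.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class GzipAwareFetch {

    // Wrap the raw stream in a GZIPInputStream only when the server declared gzip.
    // contentEncoding may be null if the header is absent.
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if ("gzip".equalsIgnoreCase(contentEncoding) || "x-gzip".equalsIgnoreCase(contentEncoding)) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Read the stream to the end and decode it as text.
    // Assumption: the page is UTF-8; a real client should check the Content-Type charset.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Open the URL and return the (possibly decompressed) body as text.
    static String fetch(String address) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        try (InputStream in = decode(conn.getInputStream(), conn.getContentEncoding())) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java GzipAwareFetch <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

Checking the header first means the same code still works for sites like CNN.com that send plain HTML.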
William Brogden - 22 Oct 2006 16:03 GMT
>> On Oct 22, 1:11 am, mic...@gmail.com wrote:
>> > I am trying to read the text of a website using a URL object and a
[quoted text clipped - 8 lines]
> ¯?,£?¶?o??n¨??8cq?¾Y-?F|y7?2??????3??,?)o=·m
> ?RL?l¨?e6?I©7
As another poster already said, this is gzip-encoded.
When I do this sort of thing I just read the whole data stream into a byte[],
then look at the headers to see what the encoding is once I have
the whole message.
I found that it is necessary to search for the gzip signature bytes
(0x1f 0x8b) to locate the start of the gzip stream after the headers.
Bill
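A rough sketch of the byte[] approach described above, assuming the whole message (headers plus body) is already in memory. The names GzipSniffer, findGzipStart and gunzipFrom are hypothetical. The third byte checked (0x08, the deflate method) is an extra guard I added and is not mentioned in the post.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class GzipSniffer {

    // Return the index of the first gzip signature (0x1f 0x8b, followed by
    // compression method 0x08 = deflate), or -1 if none is found.
    static int findGzipStart(byte[] data) {
        for (int i = 0; i + 2 < data.length; i++) {
            if ((data[i] & 0xff) == 0x1f
                    && (data[i + 1] & 0xff) == 0x8b
                    && data[i + 2] == 0x08) {
                return i;
            }
        }
        return -1;
    }

    // Decompress the gzip stream that starts at the given offset.
    static byte[] gunzipFrom(byte[] data, int offset) throws IOException {
        try (InputStream in = new GZIPInputStream(
                new ByteArrayInputStream(data, offset, data.length - offset))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

Scanning for the signature is a heuristic: it relies on the text headers never containing those byte values, which holds for ordinary ASCII headers.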