Java Forum / General / October 2006

Getting text from a URL

mic123@gmail.com - 22 Oct 2006 00:11 GMT
I am trying to read the text of a website using a URL object and a data
stream.
It works well on CNN.com, for example, but doesn't work well on:
http://www.collegehumor.com:80/video:1674301

How should I interpret the stream I'm getting?

I'm using the following code:

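import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;

// Class name is a placeholder; the original post showed only the body of main.
public class ReadUrl {
  public static void main(String[] args) {
     URL u;
     InputStream is = null;
     DataInputStream dis;
     String s;

     try {
        u = new URL("http://www.collegehumor.com:80/video:1674301");
        is = u.openStream();         // throws an IOException
        dis = new DataInputStream(new BufferedInputStream(is));
        // DataInputStream.readLine() is deprecated; it treats each byte as a char
        while ((s = dis.readLine()) != null) {
           System.out.println(s);
        }
     } catch (MalformedURLException mue) {
        mue.printStackTrace();       // report the problem instead of hiding it
     } catch (IOException ioe) {
        ioe.printStackTrace();
     } finally {
        try {
           if (is != null) {         // openStream() may have failed
              is.close();
           }
        } catch (IOException ioe) {
           // nothing useful to do if close fails
        }
     } // end of 'finally' clause
  }  // end of main
}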
Régis Décamps - 22 Oct 2006 00:40 GMT
On Oct 22, 1:11 am, mic...@gmail.com wrote:
> I am trying to read the text of a website using a URL object and a data
> stream
> It works well on CNN.com for example, but doesn't work well on: http://www.collegehumor.com:80/video:1674301

What makes you think it does not work?

> How should I interpret the stream I'm getting?

As HTML?

I don't get exactly what you want to do, but have you considered
Jakarta HttpClient?
Régis

mic123@gmail.com - 22 Oct 2006 01:05 GMT
> On Oct 22, 1:11 am, mic...@gmail.com wrote:
> > I am trying to read the text of a website using a URL object and a data
> > stream
> > It works well on CNN.com for example, but doesn't work well on: http://www.collegehumor.com:80/video:1674301
>
> What makes you think it does not work?
The fact that instead of normal HTML text I'm getting gibberish like this:
<?s?6²¿w¦???E?9¿$J´-e?I|/N|¶?^s???$$1¦??l«???· ?IQ²?v??¼d?X`???~8?tr????e??\~~????hm]??>????S??÷7??1?MB?4?B?H×?>jD?e??@×???;÷v?'S??J@X&vV??¬?³d?6??»#|¿x?h
¯?,£?¶?o??n¨??8cq?¾Y-?F|y7?2??????3??,?)o=·m
?RL?l¨?e6?I©7

> > How should I interpret the stream I'm getting?
>
> As HTML?
>
> I don't get exactly what you want to do, but have you considered
> Jakarta HttpClient?
Thanks for the tip - will give it a shot
Arne Vajhøj - 22 Oct 2006 02:10 GMT
> The fact that instead of normal HTML text I'm getting gibberish like this:
> <?s?6²¿w¦???E?9¿$J´-e?I|/N|¶?^s???$$1¦??l«???· ?IQ²?v??¼d?X`???~8?tr????e??\~~????hm]??>????S??÷7??1?MB?4?B?H×?>jD?e??@×???;÷v?'S??J@X&vV??¬?³d?6??»#|¿x?h
> ¯?,£?¶?o??n¨??8cq?¾Y-?F|y7?2??????3??,?)o=·m
> ?RL?l¨?e6?I©7

It looks as if that URL is returning its content GZIP'ed.

Try wrapping the InputStream in a GZIPInputStream.
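
A minimal sketch of that idea, assuming the server announces the compression in a Content-Encoding response header. The class and method names (GzipAwareFetch, unwrap, readAll, fetch, gzip) are made up for illustration, and UTF-8 is assumed for the page text. The main method exercises the decoding in memory, so it needs no network access.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; the names are illustrative, not from the thread.
class GzipAwareFetch {

    // Wraps the raw stream in a GZIPInputStream only when the declared
    // content encoding is "gzip"; otherwise returns the stream unchanged.
    static InputStream unwrap(InputStream raw, String contentEncoding) {
        try {
            if (contentEncoding != null
                    && contentEncoding.trim().equalsIgnoreCase("gzip")) {
                return new GZIPInputStream(raw);
            }
            return raw;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole stream and decodes it as UTF-8 (an assumption;
    // a real page may declare a different charset).
    static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fetches a page over HTTP, honouring the Content-Encoding header.
    static String fetch(String address) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(address).openConnection();
            try (InputStream in =
                    unwrap(conn.getInputStream(), conn.getContentEncoding())) {
                return readAll(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: gzip-compresses a string in memory.
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Simulate a gzip-encoded response body without touching the network.
        byte[] compressed = gzip("<html>hello</html>");
        InputStream in = unwrap(new ByteArrayInputStream(compressed), "gzip");
        System.out.println(readAll(in)); // prints <html>hello</html>
    }
}
```

For a real page, fetch("http://...") would replace the in-memory demo. Checking the header first means the same code still works for sites such as CNN.com that return uncompressed HTML.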

Arne
William Brogden - 22 Oct 2006 16:03 GMT
>> On Oct 22, 1:11 am, mic...@gmail.com wrote:
>> > I am trying to read the text of a website using a URL object and a  
[quoted text clipped - 8 lines]
> ¯?,£?¶?o??n¨??8cq?¾Y-?F|y7?2??????3??,?)o=·m
> ?RL?l¨?e6?I©7

As another poster already said, this is gzip encoded.

When I do this sort of thing I just grab the data stream to a byte[] -
then take a look at the headers to see what the encoding is when I have
the whole message.

I found that it is necessary to search for the GZIP signature bytes
to locate the start of the gzip stream after the headers.
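
A rough sketch of that approach, assuming the whole message (headers plus body) has already been read into a byte[]. The gzip signature is the byte pair 0x1f 0x8b. The class and method names here are invented for the example, UTF-8 is assumed for the text, and the scan is a heuristic: it relies on the headers being plain ASCII, so the signature cannot appear before the body.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; the names are illustrative, not from the thread.
class GzipSignature {

    // Returns the offset of the first gzip signature (0x1f 0x8b), or -1.
    static int indexOfGzipMagic(byte[] data) {
        for (int i = 0; i + 1 < data.length; i++) {
            if ((data[i] & 0xff) == 0x1f && (data[i + 1] & 0xff) == 0x8b) {
                return i;
            }
        }
        return -1;
    }

    // Decompresses from the signature onward; if no signature is found,
    // the bytes are returned as text unchanged.
    static String decode(byte[] data) {
        int start = indexOfGzipMagic(data);
        try {
            InputStream in = new ByteArrayInputStream(data);
            if (start >= 0) {
                in = new GZIPInputStream(
                        new ByteArrayInputStream(data, start, data.length - start));
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: gzip-compresses a string in memory.
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: joins two byte arrays.
    static byte[] concat(byte[] a, byte[] b) {
        byte[] joined = new byte[a.length + b.length];
        System.arraycopy(a, 0, joined, 0, a.length);
        System.arraycopy(b, 0, joined, a.length, b.length);
        return joined;
    }

    public static void main(String[] args) {
        // Simulated raw message: ASCII headers followed by a gzip body.
        byte[] headers = "HTTP/1.1 200 OK\r\nContent-Encoding: gzip\r\n\r\n"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] message = concat(headers, gzip("<html>hello</html>"));
        System.out.println(decode(message)); // prints <html>hello</html>
    }
}
```

Note that with URL.openStream() the headers are already stripped, so the signature, if present, would sit at offset 0.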

Bill
Tor Iver Wilhelmsen - 22 Oct 2006 09:41 GMT
> http://www.collegehumor.com:80/video:1674301
>
> How should I interpret the stream I'm getting?

I guess it's a video stream, so you should read it as binary and pass
it to a media library if you want to show it.

>          while ((s = dis.readLine()) != null) {

Last I checked, video formats were not line-oriented.
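
Rather than guessing, the Content-Type response header (available through URLConnection.getContentType()) says whether the body is text or binary. A small sketch of such a check follows; the class name, the method name, and the list of types treated as text are my own assumptions, not a complete rule.

```java
import java.util.Locale;

// Hypothetical helper class; the names are illustrative, not from the thread.
class ContentTypeCheck {

    // Returns true for media types that are reasonable to read as lines of text.
    static boolean isTextual(String contentType) {
        if (contentType == null) {
            return false; // unknown type: treat as binary to be safe
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String type = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return type.startsWith("text/")
                || type.equals("application/xhtml+xml")
                || type.equals("application/xml");
    }

    public static void main(String[] args) {
        System.out.println(isTextual("text/html; charset=UTF-8")); // prints true
        System.out.println(isTextual("video/x-flv"));              // prints false
    }
}
```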
Andrew Thompson - 22 Oct 2006 10:41 GMT
> I am trying to read the text of a website using a URL object and a data
> stream
> It works well on CNN.com for example, but doesn't work well on:
> http://www.collegehumor.com:80/video:1674301

This source loads and displays (crudely) the web page
at that address.

<sscce>
import javax.swing.*;
import java.net.URL;

public class ShowURL {
 public static void main(String[] args) {
   String address = null;
   if (args.length==0) {
     address = JOptionPane.showInputDialog(null, "URL?");
   } else {
     address = args[0];
   }
   JEditorPane jep = null;
   try {
     URL url = new URL(address);
     jep = new JEditorPane(url);
   } catch(Exception e) {
     jep = new JEditorPane();
     jep.setText( e.toString() );
   }
   JScrollPane jsp = new JScrollPane(jep);
   jsp.setPreferredSize(new java.awt.Dimension(400,300));
   JOptionPane.showMessageDialog(null, jsp);
 }
}
</sscce>

..so the data is readable, and it is a web-page.

Andrew T.

