Java Forum / General / April 2007

HTTP connection doesn't work on digg?

Russell Glasser - 16 Apr 2007 13:48 GMT
I'm trying to familiarize myself with the method of connecting to web
sites with Java.  I've written a simple program to connect to a page
at a given URL, but I've noticed it behaves differently for different
sites.

Here's some code which I have stripped of most of the extra stuff just
to highlight the problem:

----

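    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public void connTest (String addr)
    {
        try {
            System.out.println("Trying to connect to "+addr);
            URL u = new URL(addr);
            HttpURLConnection conn = (HttpURLConnection) u.openConnection();
            conn.connect();
            // This is the call that hangs for digg.com.
            InputStream is = conn.getInputStream();
            System.out.println("Input stream is open...");
            is.close();
            conn.disconnect();
        } catch (Exception e) {
            // Deliberately bare-bones: the cause of the failure is not reported.
            System.out.println ("Something's wrong");
        }
    }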

----

Then to invoke it, I try:

    connTest("http://www.google.com");
    connTest("http://www.digg.com");

Here's the output:

Trying to connect to http://www.google.com
Input stream is open...
Trying to connect to http://www.digg.com

The first method call takes a few seconds, but then gives me what I
asked for (and then I can go ahead and print out all the html with a
reader).  The second method call just hangs.  As soon as it hits the
line "InputStream is = conn.getInputStream();" it's stuck.  The same
thing happens if I try to get any other property, such as
getResponseCode.

I've tried this with several web sites, and Digg is the only widely
used site that gives me this problem, even though I can open it in a
browser just fine.  Am I doing something wrong?

--
Russell Glasser
If you did not like the writing style in this message, then you will
almost certainly not enjoy my blog, which is at:
http://kazimskorner.blogspot.com
Gordon Beaton - 16 Apr 2007 14:19 GMT
> I'm trying to familiarize myself with the method of connecting to
> web sites with Java. I've written a simple program to connect to a
> page at a given URL, but I've noticed it behaves differently for
> different sites.

To debug protocol issues like this, use a tool like Wireshark to see
exactly what the browser is doing and compare it to what your program
does.

I think you need to set a User-Agent in the request before connecting:

 conn.setRequestProperty("User-Agent", "something useful");

You can use Google to find lists of valid User-Agent strings for
various browsers and OS platforms, or cut and paste one from the dump
you get from Wireshark when running your regular browser.
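
A minimal sketch of that suggestion, assuming nothing beyond the standard
java.net classes. The class name UserAgentExample, the helper names, and the
agent string are made up for illustration; use a string that honestly
describes your own client. The header has to be set before the connection is
actually opened:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class UserAgentExample {
    // Hypothetical identifier, not a real browser string.
    static final String USER_AGENT = "ExampleFetcher/0.1 (test client)";

    // Builds the request without sending it; openConnection() does no network I/O.
    static HttpURLConnection prepare(String addr) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(addr).openConnection();
        // Request properties must be set before connect() or getInputStream().
        conn.setRequestProperty("User-Agent", USER_AGENT);
        return conn;
    }

    // Sends the request and returns the HTTP status code.
    static int fetchStatus(String addr) throws IOException {
        HttpURLConnection conn = prepare(addr);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling UserAgentExample.fetchStatus("http://www.digg.com") would then send
the request with the header attached. Whether a given site accepts a
particular agent string is something you have to find out by trying.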

/gordon

--
Gordon Beaton - 16 Apr 2007 14:28 GMT
> You can use Google to find lists of valid User-Agent strings for
> various browsers and OS platforms, or cut and paste one from the dump
> you get from Wireshark when running your regular browser.

Also, try connecting here:
http://www.webprodevelopment.com/Web_Toys/What_Is_My_User_Agent/

/gordon

--
Russell Glasser - 16 Apr 2007 22:59 GMT
> > I'm trying to familiarize myself with the method of connecting to
> > web sites with Java. I've written a simple program to connect to a
[quoted text clipped - 16 lines]
>
> --

Thanks Gordon, your suggestion was right on.  I added a User-Agent
string and it works now.

Russell
Ian Wilson - 16 Apr 2007 14:20 GMT
> I'm trying to familiarize myself with the method of connecting to web
> sites with Java.  I've written a simple program to connect to a page
[quoted text clipped - 45 lines]
> used site that gives me this problem.  But I can open it in a browser
> just fine.  Am I doing something wrong?

1) You may not be waiting long enough. It can take minutes for DNS
resolution to give up or for connection attempts to fail. Try
connTest("http://imaginary.example.com") and see what result you get and
how long it takes.

2) Your exception handling discards all the useful information in the
exception. I'd at least print e.getMessage() or a stack trace.

3) Popular free services (like Google) often take measures to prevent
use of their normal HTTP service by anything other than a human clicking
a web-browser. Sometimes they have an API and a registration process for
software authors. Maybe Digg is even more intolerant than Google of what
they perceive as inappropriate use?
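
Points 1 and 2 can be sketched together. This is only an illustration: the
class name TimeoutExample and the 10-second limit are arbitrary assumptions.
setConnectTimeout and setReadTimeout are standard URLConnection methods
(Java 5 and later); as far as I know they bound the TCP connect and each
socket read, but not the DNS lookup itself. With a read timeout set, a server
that never answers produces a SocketTimeoutException instead of an
indefinite hang:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URL;

class TimeoutExample {
    // Builds the request with both timeouts set; nothing is sent yet.
    static HttpURLConnection prepare(String addr, int timeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(addr).openConnection();
        conn.setConnectTimeout(timeoutMillis); // limit on establishing the TCP connection
        conn.setReadTimeout(timeoutMillis);    // limit on waiting for data from the server
        return conn;
    }

    static void connTest(String addr) {
        System.out.println("Trying to connect to " + addr);
        try {
            HttpURLConnection conn = prepare(addr, 10_000); // 10 s, an arbitrary choice
            try (InputStream is = conn.getInputStream()) {
                System.out.println("Input stream is open, status " + conn.getResponseCode());
            } finally {
                conn.disconnect();
            }
        } catch (SocketTimeoutException e) {
            // Reached when the connect or a read exceeds the limit.
            System.out.println("Timed out: " + e.getMessage());
        } catch (IOException e) {
            // Keep the diagnostic information instead of discarding it.
            e.printStackTrace();
        }
    }
}
```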
Russell Glasser - 16 Apr 2007 23:06 GMT
> 1) You may not be waiting long enough. It can take minutes for DNS
> resolution to give up or for connection attempts to fail. Try
> connTest("http://imaginary.example.com") and see what result you get and
> how long it takes.

That's interesting.  My problem is solved now so it doesn't matter,
but how do IE and Firefox handle this delay?  When I type an invalid
address into a browser, I get an error very quickly.

> 2) Your exception handling discards all the useful information in the
> exception. I'd at least print e.getMessage() or a stack trace.

Yes, I know that.  I handle different error types in my real program,
but I wanted to post something bare-bones online.

Anyway, the code wasn't reaching the "catch" block -- it was just
stopping dead.  So having a descriptive message wouldn't have helped.

> 3) Popular free services (like Google) often take measures to prevent
> use of their normal HTTP service by anything other than a human clicking
> a web-browser. Sometimes they have an API and a registration process for
> software authors. Maybe Digg is even more intolerant than Google of what
> they perceive as inappropriate use?

That is interesting, and probably explains why I had to add a
User-Agent to my code.

One worry recently occurred to me. I'm planning to do a long-term data
analysis project for grad school, where I'm basically sucking data off
of popular web 2.0 sites like Digg and then doing data mining on them
to learn about interesting trends.

I wonder, how much risk is there that someone at these sites will
notice some non-standard usage, and then decide to block me?
Gordon Beaton - 17 Apr 2007 08:06 GMT
> One worry recently occurred to me. I'm planning to do a long-term
> data analysis project for grad school, where I'm basically sucking
[quoted text clipped - 3 lines]
> I wonder, how much risk is there that someone at these sites will
> notice some non-standard usage, and then decide to block me?

If your tool behaves like a standard webcrawler, there should be no
issues. However that means respecting things like robots.txt and other
mechanisms webcrawlers are expected to obey.

Some information here:
http://www.robotstxt.org/wc/robots.html
http://www.robotstxt.org/wc/guidelines.html
http://en.wikipedia.org/wiki/Robots_Exclusion_Standard
http://en.wikipedia.org/wiki/Spider_trap
http://en.wikipedia.org/wiki/Web_Crawler
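
As a rough illustration of what respecting robots.txt means in code, here is
a deliberately simplified check. It is a sketch, not a complete
implementation of the exclusion standard: it only looks at Disallow lines in
groups addressed to "User-agent: *" and does plain prefix matching, ignoring
Allow lines and wildcards. The class and method names are made up for the
example, and fetching the robots.txt file itself is left out:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RobotsCheck {
    // Collects the Disallow prefixes that apply to "User-agent: *".
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean applies = false;  // inside a group that names "*"?
        boolean inHeader = false; // was the previous record line a User-agent line?
        for (String raw : robotsTxt.split("\\r?\\n")) {
            int hash = raw.indexOf('#'); // strip trailing comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue; // blank or malformed line
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines share one group of rules.
                applies = inHeader ? (applies || value.equals("*")) : value.equals("*");
                inHeader = true;
            } else {
                inHeader = false;
                if (applies && field.equals("disallow") && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    // True if no collected Disallow prefix matches the start of the path.
    static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }
}
```

A crawler would fetch /robots.txt once per site, pass its text to
isAllowed() before each request, and skip any path for which it returns
false.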

/gordon

--
Ian Wilson - 17 Apr 2007 10:23 GMT
> One worry recently occurred to me. I'm planning to do a long-term data
> analysis project for grad school, where I'm basically sucking data off
[quoted text clipped - 3 lines]
> I wonder, how much risk is there that someone at these sites will
> notice some non-standard usage, and then decide to block me?

I think the polite thing to do would be to tell them what you plan to do
and ask them if they have any objections.

I guess you already read their terms and conditions of use? It sounds to
me like you should be using their RSS feeds rather than "sucking" HTML
pages.

"8 with the exception of accessing RSS feeds, you will not use any
robot, spider, scraper or other automated means to access the Site for
any purpose without our express written permission. Additionally, you
agree that you will not: (i) take any action that imposes, or may impose
in our sole discretion an unreasonable or disproportionately large load
on our infrastructure; (ii) interfere or attempt to interfere with the
proper working of the Site or any activities conducted on the Site; or
(iii) bypass any measures we may use to prevent or restrict access to
the Site;"
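
If you do go the RSS route, the feed is plain XML and can be read with the
parser that ships with the JDK. The sketch below assumes an RSS 2.0 style
document with item elements that each contain a title; the class name
RssTitles is invented for the example, and the layout of any particular
site's feed is something you would need to check yourself:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class RssTitles {
    // Returns the text of the first <title> inside every <item> element.
    static List<String> itemTitles(String rssXml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(rssXml)));
        List<String> titles = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList titleNodes = item.getElementsByTagName("title");
            if (titleNodes.getLength() > 0) {
                titles.add(titleNodes.item(0).getTextContent().trim());
            }
        }
        return titles;
    }
}
```

The XML text would come from an ordinary HTTP request for the feed URL; the
channel's own title is skipped because only titles nested in an item are
collected.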

Have you briefed your school's legal team yet? :-)


©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.