> 1) You may not be waiting long enough. It can take minutes for DNS
> resolution to give up or for connection attempts to fail. Try
> connTest("http://imaginary.example.com") and see what result you get and
> how long it takes.
That's interesting. My problem is solved now so it doesn't matter,
but how do IE and Firefox handle this delay? When I type an invalid
address into a browser, I get an error very quickly.
> 2) Your exception handling discards all the useful information in the
> exception. I'd at least print e.getMessage() or a stack trace.
Yes, I know that. I handle different error types in my real program,
but I wanted to post something bare-bones online.
Anyway, the code wasn't reaching the "catch" block -- it was just
stopping dead. So having a descriptive message wouldn't have helped.
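A call that appears to stop dead without reaching the "catch" block is often just blocking because no timeout is set. Here is a minimal sketch of setting explicit timeouts, assuming Java 5 or later. The class name, the helper openWithTimeouts and the 5-second values are placeholders for illustration, not taken from the code posted earlier. Note that the DNS lookup is done by the system resolver and may not be bounded by these values.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class TimeoutSketch {
    // Hypothetical helper: prepares (but does not yet open) an HTTP connection
    // with explicit timeouts, so that a dead host should fail with a
    // SocketTimeoutException instead of blocking indefinitely.
    static HttpURLConnection openWithTimeouts(String address, int connectMillis, int readMillis)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(connectMillis); // limit on establishing the connection
        conn.setReadTimeout(readMillis);       // limit on waiting for data once connected
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Nothing is sent over the network here; the values are only configured.
        HttpURLConnection conn = openWithTimeouts("http://imaginary.example.com", 5000, 5000);
        System.out.println("connect timeout: " + conn.getConnectTimeout() + " ms");
        System.out.println("read timeout: " + conn.getReadTimeout() + " ms");
    }
}
```

With timeouts in place, calling conn.connect() or conn.getInputStream() on an unresponsive host should end in an IOException that the "catch" block can report.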
> 3) Popular free services (like Google) often take measures to prevent
> use of their normal HTTP service by anything other than a human clicking
> a web-browser. Sometimes they have an API and a registration process for
> software authors. Maybe Digg is even more intolerant than Google of what
> they perceive as inappropriate use?
That is interesting, and probably explains why I had to add a
User-Agent header to my code.
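For reference, setting the agent looks roughly like the sketch below. By default Java identifies itself with a "Java/<version>" agent string, which some sites appear to reject. The class name, the helper openWithAgent and the agent string itself are made up for illustration; a real crawler would normally use a name and contact address of its own.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

class UserAgentSketch {
    // Hypothetical identifying string: a tool name plus a way to reach its operator.
    static final String AGENT = "GradSchoolResearchBot/0.1 (+mailto:student@example.edu)";

    // Prepares an HTTP connection that will send the User-Agent header above.
    // No network traffic happens until the connection is actually used.
    static HttpURLConnection openWithAgent(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestProperty("User-Agent", AGENT);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn = openWithAgent("http://www.example.com/");
        System.out.println("User-Agent: " + conn.getRequestProperty("User-Agent"));
    }
}
```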
One worry recently occurred to me. I'm planning to do a long-term data
analysis project for grad school, where I'm basically sucking data off
of popular web 2.0 sites like Digg and then doing data mining on them
to learn about interesting trends.
I wonder, how much risk is there that someone at these sites will
notice some non-standard usage, and then decide to block me?
Gordon Beaton - 17 Apr 2007 08:06 GMT
> One worry recently occurred to me. I'm planning to do a long-term
> data analysis project for grad school, where I'm basically sucking
[quoted text clipped - 3 lines]
> I wonder, how much risk is there that someone at these sites will
> notice some non-standard usage, and then decide to block me?
If your tool behaves like a standard webcrawler, there should be no
issues. However that means respecting things like robots.txt and other
mechanisms webcrawlers are expected to obey.
Some information here:
http://www.robotstxt.org/wc/robots.html
http://www.robotstxt.org/wc/guidelines.html
http://en.wikipedia.org/wiki/Robots_Exclusion_Standard
http://en.wikipedia.org/wiki/Spider_trap
http://en.wikipedia.org/wiki/Web_Crawler
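As a rough illustration of what respecting robots.txt involves, the sketch below collects the Disallow prefixes that apply to all crawlers and checks a path against them. It is deliberately simplified: it ignores Allow lines, wildcards and agent-specific groups, and the class and method names are invented for this example. A real crawler should follow the full rules described in the links above.

```java
import java.util.ArrayList;
import java.util.List;

class RobotsSketch {
    // Collects the Disallow prefixes from groups that apply to every
    // crawler ("User-agent: *"). Groups for other agents are skipped.
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean groupApplies = false;
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceFirst("#.*$", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) continue; // blank or malformed line
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines share one group; a User-agent
                // line after any other field starts a new group.
                if (!lastWasAgent) groupApplies = false;
                if (value.equals("*")) groupApplies = true;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field.equals("disallow") && groupApplies && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    // A path is allowed if it starts with none of the disallowed prefixes.
    static boolean isAllowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n";
        List<String> rules = disallowedForAll(robots);
        System.out.println(isAllowed("/index.html", rules));
        System.out.println(isAllowed("/private/data", rules));
    }
}
```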
/gordon
--
Ian Wilson - 17 Apr 2007 10:23 GMT
> One worry recently occurred to me. I'm planning to do a long-term data
> analysis project for grad school, where I'm basically sucking data off
[quoted text clipped - 3 lines]
> I wonder, how much risk is there that someone at these sites will
> notice some non-standard usage, and then decide to block me?
I think the polite thing to do would be to tell them what you plan to do
and ask them if they have any objections.
I guess you already read their terms and conditions of use? It sounds to
me like you should be using their RSS feeds rather than "sucking" HTML
pages.
"8. with the exception of accessing RSS feeds, you will not use any
robot, spider, scraper or other automated means to access the Site for
any purpose without our express written permission. Additionally, you
agree that you will not: (i) take any action that imposes, or may impose
in our sole discretion an unreasonable or disproportionately large load
on our infrastructure; (ii) interfere or attempt to interfere with the
proper working of the Site or any activities conducted on the Site; or
(iii) bypass any measures we may use to prevent or restrict access to
the Site;"
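Reading a feed instead of scraping HTML is also simpler to code. The sketch below pulls the item titles out of an RSS 2.0 document using the XML parser that ships with the JDK. It works on a string so it can be tried without touching any site; the class name, the method name and the sample feed are invented for illustration and say nothing about the layout of Digg's actual feeds.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RssSketch {
    // Returns the <title> text of every <item> in an RSS document.
    // The channel's own <title> is not inside an <item>, so it is skipped.
    static List<String> itemTitles(String rssXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(rssXml)));
        List<String> titles = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList titleNodes = item.getElementsByTagName("title");
            if (titleNodes.getLength() > 0) {
                titles.add(titleNodes.item(0).getTextContent().trim());
            }
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample feed, for illustration only.
        String feed = "<rss version=\"2.0\"><channel><title>Sample feed</title>"
                + "<item><title>First story</title></item>"
                + "<item><title>Second story</title></item>"
                + "</channel></rss>";
        for (String title : itemTitles(feed)) {
            System.out.println(title);
        }
    }
}
```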
Have you briefed your school's legal team yet? :-)