I'm working on a project that involves searching with Google. I have
been getting an HTTP 403 error with the following code:
import java.net.*;
import java.io.*;

public class GoogleSearchTest
{
    public static void main(String[] args) throws Exception {
        URL hp = new URL("http://www.google.com/search?q=babelfish");
        URLConnection hpCon = hp.openConnection();
        hpCon.connect();
        InputStream input = hpCon.getInputStream(); // error traces to here
        /*
        This code is all irrelevant to my problem because
        the input stream is refused
        String content = "";
        int c;
        while((c = input.read()) != -1)
            content += (char)c;
        */
    }
}
I know that an HTTP 403 error means that the server understood the
request, yet refused it. As you can probably tell, I have very little
network programming experience, so maybe more experienced programmers
could help alter my approach, or explain a better one? Thanks.
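A minimal sketch of how the 403 can be observed directly, offered as an illustration and not as part of the original post: casting the connection to HttpURLConnection and calling getResponseCode() returns the status code, whereas getInputStream() throws an IOException for a 4xx response. The class name StatusCodeCheck, its helper methods, and the throwaway local server (built on the JDK's com.sun.net.httpserver so the example runs offline, standing in for the real site) are all assumptions made for this sketch.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.net.URL;

public class StatusCodeCheck {

    /** Sends a GET to the URL and returns the HTTP status code, even for 4xx/5xx. */
    public static int fetchStatus(URL url) throws IOException {
        // Proxy.NO_PROXY keeps the demo independent of any system proxy settings.
        HttpURLConnection con = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
        try {
            return con.getResponseCode(); // sends the request and reads the status line
        } finally {
            con.disconnect();
        }
    }

    /** Starts a hypothetical local server that answers every request with 403. */
    public static HttpServer startForbiddenServer() throws IOException {
        // Port 0 asks the operating system for any free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            exchange.sendResponseHeaders(403, -1); // -1 means no response body
            exchange.close();
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = startForbiddenServer();
        try {
            int port = server.getAddress().getPort();
            URL url = URI.create("http://127.0.0.1:" + port + "/search?q=babelfish").toURL();
            System.out.println("status = " + fetchStatus(url));
        } finally {
            server.stop(0);
        }
    }
}
```

When the status is 400 or above, HttpURLConnection.getErrorStream() gives access to whatever body the server sent with the refusal, which often explains the reason.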
Patricia Shanahan - 18 May 2006 02:13 GMT
> Im working on a project that involves searching with google. I have
> been getting an http 403 error with the following code:
...
Google offers a Java API, see http://www.google.com/apis/. It is much
easier than trying to get and parse a web page.
Note that they limit automated searching to 1000 queries per day,
non-commercial, and require a license key with each request.
Patricia
alexandre_paterson@yahoo.fr - 18 May 2006 02:19 GMT
...
> I know that http 403 error means that the server understood the
> request, yet refused it. As you can probably tell I have very little
> network programming experience, so maybe more experienced programmers
> could help alter my approach, or explain a better one? Thanks.
A better approach would be to use Google's APIs, as Patricia pointed
out.
However, this is not always an option (the API didn't help
for, e.g., groups.google.com last time I checked [but this was
a long time ago, I admit]).
Faking your user agent string will allow you to bypass the 403
(and it probably would be a breach of Google's terms).
--
(Don't pay attention to my .sig) Text file size: 1509 bytes
SHA1: bbfa3226005c2d4d04e3d72d49bfb1eb17e67f12
MD5: 38dfd87012a2754059a88341d66e2ef4
mfasoccer@gmail.com - 18 May 2006 02:26 GMT
> Faking your user agent string will allow you to bypass the 403
Could anyone provide a sample of how to fake my agent string?
alexandre_paterson@yahoo.fr - 18 May 2006 02:56 GMT
In your example, you insert one line:
URLConnection hpCon = hp.openConnection();
hpCon.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; "
    + "Windows NT 5.0; en-US; rv:1.7.8) Gecko/20050511");
hpCon.connect();
and that may work.
But you still should respect Google's terms...
Andrea Desole - 18 May 2006 08:38 GMT
> In your example, you insert one line:
>
[quoted text clipped - 4 lines]
>
> and that may work.
I'm not sure this is enough.
You probably have to set the http.agent property:
http://java.sun.com/j2se/1.5.0/docs/guide/net/properties.html
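A minimal sketch of the http.agent approach, assuming a made-up agent string ("MySearchClient/0.1") and a hypothetical class name (AgentProperty). The networking properties page linked above notes that "Java/<version>" is appended to whatever value is supplied, so this sets a default prefix and does not fully replace the header. In the JDK implementations I am aware of, the property appears to be read once when the HTTP handler is first loaded, so it is safest to set it before opening any connection.

```java
public class AgentProperty {
    // Hypothetical agent string chosen for this sketch; substitute your own.
    static final String AGENT = "MySearchClient/0.1";

    /** Sets the JVM-wide default User-Agent prefix; call before the first HTTP request. */
    public static void configureAgent() {
        System.setProperty("http.agent", AGENT);
    }

    public static void main(String[] args) {
        configureAgent();
        System.out.println("http.agent = " + System.getProperty("http.agent"));
    }
}
```

The same property can be supplied on the command line with -Dhttp.agent=..., which avoids any ordering concerns inside the program.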
Robert Klemme - 18 May 2006 08:55 GMT
>> In your example, you insert one line:
>>
[quoted text clipped - 9 lines]
>
> http://java.sun.com/j2se/1.5.0/docs/guide/net/properties.html
Additional hint: better to use a decent HTTP client such as Apache's, as the
standard library classes are quite limited.
Regards
robert
mfasoccer@gmail.com - 18 May 2006 12:06 GMT
> URLConnection hpCon = hp.openConnection();
> hpCon.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U;
> Windows NT 5.0; en-US; rv:1.7.8) Gecko/20050511");
> hpCon.connect();
it works, thanks.
VisionSet - 18 May 2006 14:16 GMT
> > URLConnection hpCon = hp.openConnection();
> > hpCon.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U;
> > Windows NT 5.0; en-US; rv:1.7.8) Gecko/20050511");
> > hpCon.connect();
> >
> it works, thanks.
But you'll still get the same restriction of 1000 hits per day however you
do it.
--
Mike W
mfasoccer@gmail.com - 18 May 2006 22:48 GMT
> But you'll still get the same restriction of 1000 hits per day however you
> do it.
Does this mean that even regular searches that are executed through
their website with an actual browser are also limited to 1000 hits per
day?
Thomas Weidenfeller - 19 May 2006 08:24 GMT
>> But you'll still get the same restriction of 1000 hits per day however you
>> do it.
>
> Does this mean that even regular searches that are executed through
> their website with an actual browser are also limited to 1000 hits per
> day?
You are not doing a regular search via a browser. You are trying to do
some automated querying. Google's ToS prohibits this; see
http://www.google.com/terms_of_service.html. Whatever you are trying to
do, your idea is flawed, since it is based on the concept of violating
the terms-of-service of the service you are using.
And do you really think you are the first one who had the glorious idea
to "work around" the API limitation (read: violate the ToS) by
simulating a browser?
The irony is that you even use a Google mail address to plan and
announce your intended violation of Google's ToS in public. What a great
idea.

--
The comp.lang.java.gui FAQ:
ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq
http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/
jeremiah johnson - 19 May 2006 10:30 GMT
>> But you'll still get the same restriction of 1000 hits per day however you
>> do it.
>
> Does this mean that even regular searches that are executed through
> their website with an actual browser are also limited to 1000 hits per
> day?
Google is *extremely* good at detecting automated queries. Just get
your program working, query Google a few hundred times, then try to
visit google.com in your browser. You will very likely see a message
that they have detected you.
Someone at my employer tried this the other day. A few hundred
automated queries later and the entire Fortune 50 company had to go
through a CAPTCHA each time we wanted to use Google. 180,000+ people.
Bent C Dalager - 19 May 2006 10:49 GMT
>Someone at my employer tried this the other day. A few hundred
>automated queries later and the entire Fortune 50 company had to go
>through a CAPTCHA each time we wanted to use Google. 180,000+ people.
How good are their CAPTCHAs? Is there a way to see them without first
getting oneself banned?
Cheers
Bent D

--
Bent Dalager - bcd@pvv.org - http://www.pvv.org/~bcd
powered by emacs
ashesh - 21 May 2006 06:41 GMT
Hi! Does anyone know anything about Hibernet? If you do, then please tell me
about it.
IchBin - 21 May 2006 07:13 GMT
> hi!! have any one have idea about Hibernet,if u do then plz tell me
> about this.
Do a Google search on "hibernet java",
then look at the first article.
Thanks in Advance...
IchBin, Pocono Lake, Pa, USA
http://weconsultants.servebeer.com/JHackerAppManager
__________________________________________________________________________
'If there is one, Knowledge is the "Fountain of Youth"'
-William E. Taylor, Regular Guy (1952-)
Luke Webber - 23 May 2006 01:20 GMT
> hi!! have any one have idea about Hibernet,if u do then plz tell me
> about this.
I think you're looking for Hibernate, the Java ORM...
http://www.hibernate.org/
Cheers,
Luke
Oliver Wong - 25 May 2006 21:27 GMT
>>Someone at my employer tried this the other day. A few hundred
>>automated queries later and the entire Fortune 50 company had to go
>>through a CAPTCHA each time we wanted to use Google. 180,000+ people.
>
> How good are their CAPTCHAs? Is there a way to see them without first
> getting oneself banned?
When I google for "google captcha", I get
http://www.spy.org.uk/spyblog/2005/06/stupid_google_virusspyware_cap.html
which has a screenshot of their captcha test.
- Oliver
Roedy Green - 26 May 2006 22:52 GMT
>How good are their CAPTCHAs? Is there a way to see them without first
>getting oneself banned?
It would not take too much cleverness. All they have to do is monitor
hits per hour from a given IP. If it suddenly jumps up, and if the
hits have a stereotyped rigidity of format and timing, they have you
nailed.
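The hits-per-hour idea can be sketched as a sliding-window counter keyed by IP address. This is only an illustration of the general technique described above, not a description of how Google actually detects automated traffic; the class name HitMonitor, its parameters, and the thresholds are all hypothetical, and the sketch covers only the rate check, not the format or timing regularity.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class HitMonitor {
    private final long windowMillis; // length of the sliding window
    private final int maxHits;       // hits allowed per IP inside one window
    private final Map<String, Deque<Long>> hitsByIp = new HashMap<>();

    public HitMonitor(long windowMillis, int maxHits) {
        this.windowMillis = windowMillis;
        this.maxHits = maxHits;
    }

    /** Records one hit and reports whether this IP now exceeds maxHits in the window. */
    public boolean recordHit(String ip, long nowMillis) {
        Deque<Long> hits = hitsByIp.computeIfAbsent(ip, k -> new ArrayDeque<>());
        hits.addLast(nowMillis);
        // Drop timestamps that have fallen out of the sliding window.
        while (!hits.isEmpty() && hits.peekFirst() <= nowMillis - windowMillis) {
            hits.removeFirst();
        }
        return hits.size() > maxHits;
    }

    public static void main(String[] args) {
        // Hypothetical policy: at most 100 hits per hour from one address.
        HitMonitor monitor = new HitMonitor(3_600_000L, 100);
        for (int i = 0; i < 150; i++) {
            long timestamp = i * 1000L; // one simulated request per second
            if (monitor.recordHit("192.0.2.1", timestamp)) {
                System.out.println("flagged at request number " + (i + 1));
                break;
            }
        }
    }
}
```

Timestamps are passed in explicitly so the logic can be exercised without waiting for real time to elapse.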

--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
Roedy Green - 26 May 2006 22:49 GMT
On 17 May 2006 18:01:13 -0700, "mfasoccer@gmail.com"
<mfasoccer@gmail.com> wrote, quoted or indirectly quoted someone who
said :
>I know that http 403 error means that the server understood the
>request, yet refused it.
Here is what I would do. I don't know if this is the problem though.
Use a sniffer to watch the same query given by a browser. See
http://mindprod.com/jgloss/sniffer.html
Pad your request header out with additional fields the browser sends,
e.g. info on what encodings are acceptable in reply.
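A minimal sketch of adding extra request header fields with URLConnection.setRequestProperty. The header values below are placeholders invented for this example; the real ones would come from what the sniffer shows the browser sending. The class name RequestHeaders and the helper prepare are likewise hypothetical. The sketch only prepares the request and never sends it.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;

public class RequestHeaders {

    /** Prepares (but does not send) a request carrying a few extra header fields. */
    public static URLConnection prepare(String address) throws IOException {
        // openConnection() creates the connection object without contacting the server.
        URLConnection con = URI.create(address).toURL().openConnection();
        // Placeholder values; copy the real ones from the captured browser request.
        con.setRequestProperty("Accept", "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8");
        con.setRequestProperty("Accept-Language", "en-US,en;q=0.5");
        // URLConnection does not decompress responses itself: if "gzip" is advertised
        // and the server compresses, wrap the stream in java.util.zip.GZIPInputStream.
        con.setRequestProperty("Accept-Encoding", "gzip");
        return con;
    }

    public static void main(String[] args) throws IOException {
        URLConnection con = prepare("http://www.google.com/search?q=babelfish");
        // Shows the header fields that would be sent; nothing goes over the network here.
        System.out.println(con.getRequestProperties());
    }
}
```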
