
Java Forum / General / May 2006


Searching Google in Java

mfasoccer@gmail.com - 18 May 2006 02:01 GMT
I'm working on a project that involves searching with Google. I have
been getting an HTTP 403 error with the following code:

import java.net.*;
import java.io.*;

public class GoogleSearchTest
{
    public static void main(String[] args) throws Exception{
        URL hp = new URL("http://www.google.com/search?q=babelfish");
        URLConnection hpCon = hp.openConnection();
        hpCon.connect();
        InputStream input = hpCon.getInputStream(); // error traces to here

        /*
           This code is all irrelevant to my problem because
           the input stream is refused
        String content = "";
        int c;
        while((c = input.read()) != -1)
            content += (char)c;
        */
    }
}

I know that an HTTP 403 error means that the server understood the
request, yet refused it. As you can probably tell, I have very little
network programming experience, so maybe more experienced programmers
could help alter my approach, or explain a better one? Thanks.
Patricia Shanahan - 18 May 2006 02:13 GMT
> I'm working on a project that involves searching with Google. I have
> been getting an HTTP 403 error with the following code:
...

Google offers a Java API, see http://www.google.com/apis/. It is much
easier than trying to get and parse a web page.

Note that they limit automated searching to 1000 queries per day, for
non-commercial use only, and require a license key with each request.

Patricia
alexandre_paterson@yahoo.fr - 18 May 2006 02:19 GMT
...
> I know that an HTTP 403 error means that the server understood the
> request, yet refused it. As you can probably tell, I have very little
> network programming experience, so maybe more experienced programmers
> could help alter my approach, or explain a better one? Thanks.

A better approach would be to use Google's APIs, as Patricia pointed
out.

However, this is not always an option (the API didn't help
for, e.g., groups.google.com last time I checked [but this was
a long time ago, I admit]).

Faking your user agent string will allow you to bypass the 403
(though it would probably be a breach of Google's terms).

--
(Don't pay attention to my .sig)    Text file size: 1509 bytes
SHA1: bbfa3226005c2d4d04e3d72d49bfb1eb17e67f12
MD5: 38dfd87012a2754059a88341d66e2ef4
mfasoccer@gmail.com - 18 May 2006 02:26 GMT
> Faking your user agent string will allow you to bypass the 403

Could anyone provide a sample of how to fake my user agent string?
alexandre_paterson@yahoo.fr - 18 May 2006 02:56 GMT
In your example, you insert one line:

URLConnection hpCon = hp.openConnection();
hpCon.setRequestProperty("User-Agent",
    "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.8) Gecko/20050511");
hpCon.connect();

and that may work.

But you still should respect Google's terms...
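
The effect of that one line can be tried without sending anything to Google. The following is a minimal sketch, not a description of how Google actually decides: it assumes a server that answers 403 whenever the User-Agent is missing or is Java's default (which starts with "Java/"). The local stand-in server and the names UserAgentDemo, startPickyServer and fetchStatus are hypothetical and not part of the original code.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

public class UserAgentDemo {

    /** Starts a local stand-in server that rejects Java's default User-Agent with 403. */
    static HttpServer startPickyServer() throws IOException {
        // Port 0 asks the OS for any free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/search", exchange -> {
            String agent = exchange.getRequestHeaders().getFirst("User-Agent");
            // Assumption for this demo only: "automated" means no agent or a "Java/..." agent.
            boolean looksAutomated = agent == null || agent.startsWith("Java/");
            exchange.sendResponseHeaders(looksAutomated ? 403 : 200, -1); // -1: no body
            exchange.close();
        });
        server.start();
        return server;
    }

    /** Sends a GET and returns the status code; pass null to keep the default User-Agent. */
    static int fetchStatus(URL url, String userAgent) throws IOException {
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        if (userAgent != null) {
            // Request properties must be set before the connection is opened.
            con.setRequestProperty("User-Agent", userAgent);
        }
        try {
            return con.getResponseCode(); // connects implicitly, does not throw on 403
        } finally {
            con.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startPickyServer();
        try {
            int port = server.getAddress().getPort();
            URL url = new URL("http://127.0.0.1:" + port + "/search?q=babelfish");
            String browserAgent =
                "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.8) Gecko/20050511";
            System.out.println("default agent: " + fetchStatus(url, null));
            System.out.println("browser agent: " + fetchStatus(url, browserAgent));
        } finally {
            server.stop(0);
        }
    }
}
```

Calling getResponseCode() instead of getInputStream() also makes the 403 visible as a plain number rather than as an IOException.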
Andrea Desole - 18 May 2006 08:38 GMT
> In your example, you insert one line:
>
[quoted text clipped - 4 lines]
>
> and that may work.

I'm not sure this is enough.
You probably have to set the http.agent property:

http://java.sun.com/j2se/1.5.0/docs/guide/net/properties.html
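
For reference, a minimal sketch of the http.agent route follows. According to the networking properties page linked above, the JVM appends "Java/<version>" to whatever value is supplied, so this sets a prefix rather than the whole header. The class name, the helper useAgent and the agent string "MyTool/0.1" are hypothetical. The sketch also assumes the property is set before the program opens its first HTTP connection, since the value may only be read once.

```java
public class HttpAgentProperty {

    /** Sets the JVM-wide User-Agent prefix used by the built-in HTTP handler. */
    static void useAgent(String agent) {
        // Equivalent to starting the JVM with -Dhttp.agent=<agent>.
        System.setProperty("http.agent", agent);
    }

    public static void main(String[] args) {
        useAgent("MyTool/0.1"); // hypothetical agent name
        System.out.println("http.agent = " + System.getProperty("http.agent"));
    }
}
```

Unlike setRequestProperty, which affects a single connection, this setting applies to every connection the JVM makes through java.net.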
Robert Klemme - 18 May 2006 08:55 GMT
>> In your example, you insert one line:
>>
[quoted text clipped - 9 lines]
>
> http://java.sun.com/j2se/1.5.0/docs/guide/net/properties.html

Additional hint: you are better off using a decent HTTP client such as
Apache's, as the standard library classes are quite limited.

Regards

    robert
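
Robert's suggestion refers to Apache's HttpClient library, which needs an extra dependency. As a dependency-free stand-in, the sketch below uses the JDK's own java.net.http.HttpClient instead (added in Java 11, well after this thread), which offers similar conveniences: a request builder, timeouts, and direct access to the status code. The class name ClientSketch, its methods, and the agent string "MyTool/0.1" are hypothetical, and the endpoint is a placeholder to be replaced with a service you are permitted to query.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

public class ClientSketch {

    /** Builds a GET request for baseUrl?q=<query> with an explicit User-Agent. */
    static HttpRequest buildSearchRequest(String baseUrl, String query, String userAgent) {
        // Form-encode the query so spaces and special characters are safe in the URL.
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(baseUrl + "?q=" + encoded))
                .header("User-Agent", userAgent)
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    /** Sends the request and returns the body; any non-200 status becomes an IOException. */
    static String fetch(HttpRequest request) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("HTTP " + response.statusCode() + " from " + request.uri());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: a search endpoint you are allowed to query (placeholder).
        HttpRequest request = buildSearchRequest(args[0], "babelfish", "MyTool/0.1");
        System.out.println(fetch(request));
    }
}
```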
mfasoccer@gmail.com - 18 May 2006 12:06 GMT
> URLConnection hpCon = hp.openConnection();
> hpCon.setRequestProperty("User-Agent",
>     "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.8) Gecko/20050511");
> hpCon.connect();

it works, thanks.
VisionSet - 18 May 2006 14:16 GMT
> > URLConnection hpCon = hp.openConnection();
> > hpCon.setRequestProperty("User-Agent",
> >     "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.8) Gecko/20050511");
> > hpCon.connect();
> >
> it works, thanks.

But you'll still get the same restriction of 1000 hits per day however you
do it.

--
Mike W
mfasoccer@gmail.com - 18 May 2006 22:48 GMT
> But you'll still get the same restriction of 1000 hits per day however you
> do it.

Does this mean that even regular searches that are executed through
their website with an actual browser are also limited to 1000 hits per
day?
Thomas Weidenfeller - 19 May 2006 08:24 GMT
>> But you'll still get the same restriction of 1000 hits per day however you
>> do it.
>
> Does this mean that even regular searches that are executed through
> their website with an actual browser are also limited to 1000 hits per
> day?

You are not doing a regular search via a browser. You are trying to do
some automated querying. Google's ToS prohibits this:
http://www.google.com/terms_of_service.html. Whatever you are trying to
do, your idea is flawed, since it is based on the concept of violating
the terms-of-service of the service you are using.

And do you really think you are the first one who had the glorious idea
to "work around" the API limitation (read: violate the ToS) by
simulating a browser?

The irony is that you even use a Google mail address to plan and
announce your intended violation of Google's ToS in public. What a great
idea.

Signature

The comp.lang.java.gui FAQ:
ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq
http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/

jeremiah johnson - 19 May 2006 10:30 GMT
>> But you'll still get the same restriction of 1000 hits per day however you
>> do it.
>
> Does this mean that even regular searches that are executed through
> their website with an actual browser are also limited to 1000 hits per
> day?

Google is *extremely* good at detecting automated queries.  Just get
your program working, query Google a few hundred times, then try to
visit google.com in your browser.  You will very likely see a message
that they have detected you.

Someone at my employer tried this the other day.  A few hundred
automated queries later and the entire Fortune 50 company had to go
through a CAPTCHA each time we wanted to use Google.  180,000+ people.
Bent C Dalager - 19 May 2006 10:49 GMT
>Someone at my employer tried this the other day.  A few hundred
>automated queries later and the entire Fortune 50 company had to go
>through a CAPTCHA each time we wanted to use Google.  180,000+ people.

How good are their CAPTCHAs? Is there a way to see them without first
getting oneself banned?

Cheers
    Bent D
Signature

Bent Dalager - bcd@pvv.org - http://www.pvv.org/~bcd
                                   powered by emacs

ashesh - 21 May 2006 06:41 GMT
Hi! Does anyone know anything about Hibernet? If you do, please tell me
about it.
IchBin - 21 May 2006 07:13 GMT
> Hi! Does anyone know anything about Hibernet? If you do, please tell me
> about it.

Do a Google search on "hibernet java",
then look at the first article.

Thanks in Advance...
IchBin, Pocono Lake, Pa, USA
http://weconsultants.servebeer.com/JHackerAppManager
__________________________________________________________________________

'If there is one, Knowledge is the "Fountain of Youth"'
-William E. Taylor,  Regular Guy (1952-)
Luke Webber - 23 May 2006 01:20 GMT
> Hi! Does anyone know anything about Hibernet? If you do, please tell me
> about it.

I think you're looking for Hibernate, the Java ORM...

http://www.hibernate.org/

Cheers,
Luke
Oliver Wong - 25 May 2006 21:27 GMT
>>Someone at my employer tried this the other day.  A few hundred
>>automated queries later and the entire Fortune 50 company had to go
>>through a CAPTCHA each time we wanted to use Google.  180,000+ people.
>
> How good are their CAPTCHAs? Is there a way to see them without first
> getting oneself banned?

When I google for "google captcha", I get
http://www.spy.org.uk/spyblog/2005/06/stupid_google_virusspyware_cap.html
which has a screenshot of their captcha test.

   - Oliver
Roedy Green - 26 May 2006 22:52 GMT
>How good are their CAPTCHAs? Is there a way to see them without first
>getting oneself banned?

It would not take too much cleverness.  All they have to do is monitor
hits per hour from a given IP.  If it suddenly jumps up, and if the
hits have a stereotyped rigidity of format and timing, they have you
nailed.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
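
The per-IP monitoring Roedy describes can be sketched as a sliding-window counter. This is only an illustration of the idea under stated assumptions, not a description of what Google actually does: the class name HitRateMonitor, the window length, and the threshold are all hypothetical, and the "rigidity of format and timing" signal he mentions is not modeled here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class HitRateMonitor {
    private final long windowMillis;
    private final int maxHitsPerWindow;
    // Timestamps of recent hits per IP, oldest first.
    private final Map<String, Deque<Long>> hitsByIp = new HashMap<>();

    HitRateMonitor(long windowMillis, int maxHitsPerWindow) {
        this.windowMillis = windowMillis;
        this.maxHitsPerWindow = maxHitsPerWindow;
    }

    /** Records a hit and returns true if this IP now exceeds the limit within the window. */
    boolean recordHit(String ip, long nowMillis) {
        Deque<Long> hits = hitsByIp.computeIfAbsent(ip, k -> new ArrayDeque<>());
        // Drop hits that have fallen out of the sliding window.
        while (!hits.isEmpty() && nowMillis - hits.peekFirst() >= windowMillis) {
            hits.pollFirst();
        }
        hits.addLast(nowMillis);
        return hits.size() > maxHitsPerWindow;
    }

    public static void main(String[] args) {
        // Hypothetical policy: at most 3 hits per second from one IP.
        HitRateMonitor monitor = new HitRateMonitor(1000, 3);
        for (long t = 0; t < 5; t++) {
            System.out.println("t=" + t + " flagged=" + monitor.recordHit("10.0.0.1", t));
        }
    }
}
```

A real service would also need to evict idle IPs and handle concurrent access; both are left out to keep the sketch short.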

Roedy Green - 26 May 2006 22:49 GMT
On 17 May 2006 18:01:13 -0700, "mfasoccer@gmail.com"
<mfasoccer@gmail.com> wrote, quoted or indirectly quoted someone who
said :

>I know that an HTTP 403 error means that the server understood the
>request, yet refused it.

Here is what I would do.  I don't know if this is the problem though.

Use a sniffer to watch the same query given by a browser.  See
http://mindprod.com/jgloss/sniffer.html

Pad your request header out with additional fields the browser sends,
e.g. info on what encodings are acceptable in reply.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.



