
Java Forum / General / January 2007

Automating Searches

nowwho@gmail.com - 03 Jan 2007 19:59 GMT
Hey,

     New to Java! Trying to automate searching Google, Yahoo, MSN, AOL
and Ask by sending queries to those engines using a Java program and
storing the returned URLs in a MySQL database. The program will open a
text file, read the first line as the query, connect to each of the
search engines, and store the URLs in a table called "Results_Table" which
has the following columns:

Search_Eng - This would record the search engine name
Query - This would record the query text
Returned_URL - This is the URL that the search engine returned
URL_Num - This is the position of the URL in the search engine's
results.

Is it possible to do this and store the first 100 URLs the query
returns from each search engine?

Thanks!
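
A minimal sketch of the table described above, using only the standard
java.sql interfaces. The post names the columns but not their types, so the
VARCHAR sizes and the INT type are assumptions, and the class and method
names (ResultsTable, store) are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical helper for the "Results_Table" described in the post.
// Column types and sizes are assumptions; the post only names the columns.
final class ResultsTable {

    static final String CREATE_SQL =
        "CREATE TABLE IF NOT EXISTS Results_Table ("
        + "Search_Eng VARCHAR(32) NOT NULL, "
        + "Query VARCHAR(255) NOT NULL, "
        + "Returned_URL VARCHAR(2048) NOT NULL, "
        + "URL_Num INT NOT NULL)";

    static final String INSERT_SQL =
        "INSERT INTO Results_Table (Search_Eng, Query, Returned_URL, URL_Num) "
        + "VALUES (?, ?, ?, ?)";

    // Stores the URLs in the order given; URL_Num starts at 1.
    // The caller supplies an open JDBC connection (for example from DriverManager).
    static void store(Connection conn, String engine, String query, List<String> urls)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            int position = 1;
            for (String url : urls) {
                ps.setString(1, engine);
                ps.setString(2, query);
                ps.setString(3, url);
                ps.setInt(4, position++);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}
```

Using a PreparedStatement rather than string concatenation keeps odd
characters in the query text or URLs from breaking the SQL.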
Andrew Thompson - 03 Jan 2007 20:19 GMT
...
>       New to Java! Trying to automate searching Google,

See the Google search API, but be prepared to pay,
for anything beyond the nominal numbers of queries
the Google API permits for free.

>...Yahoo, MSN, AOL and Ask ...

Dunno..  Aren't most of them using data from
Google, in any case?

>...by sending queries to those engines using a java program and
> storing the returned URL's in a MYSQL database.

Why would your users prefer to query your DB rather than
query Google directly (for up to the moment data)?

>...The program will open a
> text file, upload the first line as the query, connect to each of the
> search engines,
....
> Is it possible to do this and store the first 100 URL's the query
> returns from each search engine?

Certainly - through whatever public API the search
engine offers - talk to their tech. departments and
they'll most probably instruct you how to get the
data as XML (or something else similarly
portable and easily parsed).

Andrew T.
Daniel Pitts - 03 Jan 2007 20:41 GMT
> ...
> >       New to Java! Trying to automate searching Google,
[quoted text clipped - 28 lines]
>
> Andrew T.

Also, make sure you read the terms of use for all those services.

Although, I do wonder why you would want to store search results in a
database.
It's not that hard to make a data scraper, and just use the website
directly. But Google DOES give you an API to do it more easily.
John Ersatznom - 04 Jan 2007 13:08 GMT
> Although, I do wonder why you would want to store search results in a
> database.
> It's not that hard to make a data scraper, and just use the website
> directly. But Google DOES give you an API to do it more easily.

Yeah, but using that API (at least very much) is expensive. If you scrape
the results after submitting a normal query URL, and you avoid a) diving
too deeply into the results and b) doing new queries too often, you can
probably fly under the radar. Unless you're coming from a datacenter
somewhere, they won't know you from Adam doing manual searches in Firefox.

To top it off, Java makes transparently caching pages (and with 1.6
implementing cookies) easier too. Add in a deliberate request of the
front page before doing the search query, some random delays, and a
spoofed user-agent, and I'm guessing the only way Google could figure
out you weren't just a surfer using Mozilla 4.0 (compatible; MSIE 4.0)
would be by using a tool like EtherSniffer to analyze your incoming
requests and discovering that Java sends the HTTP headers in an
idiosyncratic sequence. And they won't do that unless your IP generates
an eyebrow-raising amount of traffic.

And for Google that "eyebrow-raising" threshold is set very high indeed;
"normal" traffic for Google is millions of searches per day and there
are frequently dozens per day from each of many individual IP addresses
as well as untold numbers of one-offs and the like.

And, of course, as long as you don't generate more traffic faster than
you could by typing in all those queries manually, I don't see any moral
qualms with this. At worst it's equivalent to adblocking the sponsored
links on the results page with a commonly-available Firefox extension.
All you've done is automate some tedium at your end without having any
discernible effect at theirs versus not automating the tedium. So unless
you do believe in victimless crimes or don't believe in the identity of
indiscernibles ... :)
Andrew Thompson - 04 Jan 2007 13:54 GMT
...
> And, of course, as long as you don't generate more traffic faster than
> you could by typing in all those queries manually, I don't see any moral
> qualms with this.

One might also argue that you were free to build
your own web-crawler, parse the pages it finds
for the content and links*, store the data in searchable
form, then rate and rank it according to whatever
criteria best suit you[1].  * Oh, of course then
'repeat for each link', & repeat each 7(?) days.

Setting up the software and hardware capable of
achieving that task might cost a lot of money (I
guess).  OTOH, you can pay a fee to someone
who has already gone to the effort, and has the
expertise.

Just because it is technically possible** to rip
Google off, does not make it right.

** + all the other idiotic reasons people generally
put forward to justify such theft, starting with..
- 'they don't have a right - it is free data!'.  No it isn't -
 the web pages themselves are free, but the search
 engines hope to add value by sorting and filtering.

Also, Google is no 'monopoly'.  As has been pointed
out in this (AFAIR) thread.  You don't like Google's
prices?  Go to the competition..

[1] And then, can you make it publicly available,
so I can rip your data, and resell it to my paying
clients?

Andrew T.
nowwho - 04 Jan 2007 18:28 GMT
Hey,
Thanks for the information so far. I didn't realise there was so much
legal stuff involved; it's for a one-off educational project. I didn't
think it would amount to spamming. The program would only be run about
50 times in total. There is a set number of queries, and a set number
of results returned. As it's an educational project, I never thought of
the legal side!
Andrew Thompson - 04 Jan 2007 19:29 GMT
> Hey,
> Thanks for the information so far. I didn't realise there was so much
> legal stuff involved, it's for a one-off educational project.

You 'ivory tower' types are *so* naive.  It's cute.  ;-)

>...Didn't
> think it would amount to spamming.

I am not sure I would use that term for it.

Spamming is generally pushing an advertising
related message out to people who do not want it.

This (when done the 'wrong way') simply amounts
to a bit of theft of the resources of others.

& for my part, while I might hassle the thieves,
I'll bludgeon the spammers.

>...The program would only be run about
> 50 times in total.

I think you might be well placed to use the 'legal
and free' APIs currently offered!  Surely even the
small numbers of queries Google offers for free
would cover your requirement?

(In any case, from what I understand, Google simply
refuses further requests for the day if the limit
is struck - no hard feelings, and back tomorrow..)

>...There is a set number of queries, and a set number
> of results returned. As it's an educational project I never thought of
> the legal side!

Don't forget that there can be a few 'legalities' to the
educational side of things.  Be careful of tripping over
using someone else's code without proper attribution
or accreditation (plagiarism/academic misconduct).
There was a classic thread on these groups from
a chap by the name of RoboCop - he got to find
out the hard way.

Andrew T.
nowwho - 04 Jan 2007 19:59 GMT
> I am not sure I would use that term for it.

Fair enough, computers and technology aren't my main interest of study.

> I think you might be well placed to use the 'legal
> and free' API's currently offered!  Surely even the
> small numbers of queries Google offers for free
> would cover your requirement?

More than likely, but I would still need advice on how to incorporate
these into a Java program.

> Don't forget that there can be a few 'legalities' to the
> educational side of things.  Be careful of tripping over
[quoted text clipped - 3 lines]
> a chap by the name of RoboCop - he got to find
> out the hard way.

The use of other people's code is allowed; however, ALL work and ALL
sources of information used in any way for the project have to
be detailed. We were well warned about the consequences of plagiarism.
All websites accessed for the project, along with any copyright date,
must be included along with the date that the website was accessed
etc...
John Ersatznom - 06 Jan 2007 09:15 GMT
>>I think you might be well placed to use the 'legal
>>and free' API's currently offered!  Surely even the
[quoted text clipped - 7 lines]
> must be included along with the date that the website was accessed
> etc...

Oh what a tangled web we weave...what happened to the days when you
could just tinker and innovate without fear of lawyers or similar? Hmm?
Of course, wholesale copying of other stuff without permission and
misattributing it as your own original work is simply bad, but it's
because it's fraud and misrepresentation, not because it's copying, IMO.
Wheel-reinventing is supposed to be a bad thing. Let some attorneys get
involved and soon everyone is expecting you to get their permission to
copy anything. Then to *use* anything. Then to breathe or take a leak,
no doubt.

I think it's worth pointing out that unless you've signed something in
writing, you aren't in a binding agreement with Google about anything
(or anyone else) and only copyright, trademark, and patent law has any
true legal force. No matter what TOS boilerplate is on whose website.
Hell, they can't even prove that you *read* it, in any meaningful way,
even if your IP retrieved the page one day.

Of course the de facto law in the US isn't so rosy, thanks to a braindead
court system and a legislature that's long since been ritually auctioned
with great fanfare biannually to the highest bidder. I'd suggest a saner
country. Many in Europe and, I think, even Canada actually still have
sane legal systems, standards for when someone's actually entered into a
binding contract, standards of evidence to get subpoenas, warrants, and
judgments, and whatnot. Australia's as bad as the US or worse though. I
wonder how long it is before individuals have to jurisdiction-shop by
travel agent and $500 one-way airfare express just to do ordinary
victimless activities without legal repercussions and $50,000 in bogus
fines for phantom file sharing someone else on the neighborhood's cable
company internet service may or may not actually have done...
nowwho - 06 Jan 2007 11:51 GMT
> >>I think you might be well placed to use the 'legal
> >>and free' API's currently offered!  Surely even the
[quoted text clipped - 37 lines]
> fines for phantom file sharing someone else on the neighborhood's cable
> company internet service may or may not actually have done...

While the legal information is handy and can (more than likely will) be
included in the report, are there any suggestions on how to tackle the
coding of the problem, or suggestions as to where I can look for further
information?
Chris Uppal - 06 Jan 2007 15:48 GMT
> While the legal information is handy and can (more than likely will) be
> included in the report, are there any suggestions on how to tackle the
> coding of the problem, or suggestions as to where I can look for further
> information?

Unfortunately, it appears that Google suspended their Search API last month
(http://code.google.com/apis/soapsearch/), so you will probably have to use
some sort of screen scraping.

If you want to do it in Java (rather than, say, by using command-line tools
such as wget or curl) then you'll need an HTTP client package.  Java comes with
one (start with java.net.URL), but it has been said here that Google blocks
access via that, so you may be better off using a different, and more general,
package such as the Jakarta HTTP client
   http://jakarta.apache.org/commons/httpclient/
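
As a rough sketch of the "start with java.net.URL" route, the code below
builds a query URL and downloads the response using only JDK classes. The
class name QueryFetcher, the base URL passed in by the caller, and the
parameter name "q" are all assumptions for illustration; each engine defines
its own URL format, and its terms of use decide whether automated requests
are permitted at all.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not tied to any particular search engine.
final class QueryFetcher {

    // Builds a GET URL of the form base + "?q=" + encoded query.
    // The parameter name "q" is an assumption.
    static String buildUrl(String base, String query) {
        return base + "?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    // Downloads the response body as text, assuming it is UTF-8.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }
}
```

URL-encoding the query matters because the first line of the text file may
contain spaces, ampersands, or other characters that are not legal in a URL.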

Then, once you have worked out how to download data, you will need to parse it
to find the links you want.  Parsing HTML with anything like reliability is not
easy (but you may not need much reliability in this case); you may find this
page of HTML parsers useful.
   http://www.java-source.net/open-source/html-parsers
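
For the "may not need much reliability" case, a regular expression can pull
href values out of a downloaded page. This is only a sketch: a regex is not
a real HTML parser and will miss or mangle unusual markup, and the class
name LinkExtractor and the rule of keeping only absolute http(s) links are
assumptions, not anything the thread prescribes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: crude link extraction, not a full HTML parser.
final class LinkExtractor {

    // Matches href="..." or href='...' inside an <a ...> tag.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns up to 'limit' absolute http(s) links in document order.
    static List<String> extract(String html, int limit) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find() && urls.size() < limit) {
            String url = m.group(2);
            if (url.startsWith("http://") || url.startsWith("https://")) {
                urls.add(url);
            }
        }
        return urls;
    }
}
```

A results page also links to the engine's own pages (help, next page, and
so on), so real use would need an extra filter that this sketch leaves out.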

   -- chris
Chris Uppal - 06 Jan 2007 15:47 GMT
> > The use of other people's code is allowed, however ALL work and ALL
> > sources of information used in any way required for the project have to
[quoted text clipped - 5 lines]
> Oh what a tangled web we weave...what happened to the days when you
> could just tinker and innovate without fear of lawyers or similar?

I think the OP's problem here is not so much the legality (or otherwise) of
"borrowing" Google's data, but that this is work in an academic context where
all sources /must/ be declared for reasons of honesty in scholarship.

   -- chris
John Ersatznom - 05 Jan 2007 20:39 GMT
> Hey,
> Thanks for the information so far. I didn't realise there was so much
[quoted text clipped - 3 lines]
> of results returned. As it's an educational project I never thought of
> the legal side!

It's not spamming -- I don't know what the other guy was smoking when he
wrote the post you're replying to. There is NO DIFFERENCE discernible to
Google if you

a) do 10 searches during the day by typing in a Firefox window while
doing research or
b) have your computer do the searches with less/no typing on your part

Google is being "ripped off" iff you do something like:

a) use huge amounts of their bandwidth -- well in excess of a normal
user doing a bit of heavy research say, generating large numbers of
searches or delving very deeply into the result set. Fetching 10
first-pages-of-results one for each of 10 queries, whether done by one
mouse click or ten typed-in queries, has little impact on them, and of
course the one mouse click case makes it actually 10 queries instead of
11 because you mistyped one and had to do it again :)
b) or use google search results to populate your own rival "search
engine" site with revenue-generating ads or what-have-you, either by
scraping google's database or by just putting up a page with a script
that takes people's queries and passes them to google, then takes the
result page and replaces google's sponsored links with umpteen flashing
banner ads. Then you're using google's work output to actually compete
against google, rather than simply using google for research. That makes
a crucial difference.

Using code to drive Google lightly and for personal/educational/research
reasons rather than commercial ones doesn't seem to be evil to me,
especially if they cannot in practise distinguish it from "normal" use
anyway, as it isn't producing excessive traffic or being used to compete
against google in some way.

In fact, where do you draw the line? Firefox with manually-typed queries
is OK. Then we have Firefox with a MRU for queries; Firefox with query
guessing or autocompletion based on your current activities and
interests; Firefox with a plugin to take the result set too and
transform it e.g. to show 50 rather than 10 hits or to weed out
"supplemental results" that are usually MFA sites that really ARE
ripping off google; Firefox with a plugin to run the query of your
choice and bookmark the results every few days; ... Firefox with a
plugin to gradually build up a database of hits for various queries by
occasionally fetching the nth page of results for one of them, but you
don't publish these anywhere, just use them personally ...

I think the two things that mark a transition to being evil are causing
them excessive traffic and competing with them using their own data in
some way. (Also generating content-free MFA pages to generate revenue
via AdSense ads and SEOing them, but that's more using AdSense than
using the search engine proper, though the SEO will impact the latter
and pollute the results.)

I don't see any way to derive some kind of moral law that makes typing
something morally superior to doing it with one click, and actually
scheduling an automatic (infrequent) job or whatever actually sinful.
There's no inherent virtue in inefficiency, and computers exist to
enable automating tasks. Hyperlinks automate looking up and finding that
dusty reference or whatever; librarians may complain that they rot young
brains but the actual upshot is a gain in productivity, rather than some
kind of evil decadence setting in.
nowwho - 05 Jan 2007 22:46 GMT
> > Hey,
> > Thanks for the information so far. I didn't realise there was so much
[quoted text clipped - 29 lines]
> against google, rather than simply using google for research. That makes
> a crucial difference.

The point of the exercise is to get the URLs returned into an offline
database. It's an exercise purely to pull back the URLs from the
different search engines.

> Using code to drive Google lightly and for personal/educational/research
> reasons rather than commercial ones doesn't seem to be evil to me,
> especially if they cannot in practise distinguish it from "normal" use
> anyway, as it isn't producing excessive traffic or being used to compete
> against google in some way.

I don't think it's a question of good or evil; I think people are
worried that the code could be used for commercial reasons.

> In fact, where do you draw the line? Firefox with manually-typed queries
> is OK. Then we have Firefox with a MRU for queries; Firefox with query
[quoted text clipped - 14 lines]
> using the search engine proper, though the SEO will impact the latter
> and pollute the results.)

This is an educational project, and as computing is not my main area
of study I don't know what MFA and SEO are. Can these be explained?

> I don't see any way to derive some kind of moral law that makes typing
> something morally superior to doing it with one click, and actually
[quoted text clipped - 4 lines]
> brains but the actual upshot is a gain in productivity, rather than some
> kind of evil decadence setting in.

Any help with using the Google API or other suggestions would be a
great help. I also assume that Google's API won't work with the other
search engines, so would I have to write a different class for each
search engine?
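
One possible way to structure the per-engine question, as a sketch only: put
a small interface in front of the engines so the rest of the program does
not care which one it is talking to. The names SearchEngine and
TemplateEngine are hypothetical, and the URL template in the usage note is a
placeholder, not any real engine's query format.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Common view of an engine, whatever its URL format or API looks like.
interface SearchEngine {
    String name();                              // value for the Search_Eng column
    String queryUrl(String query, int offset);  // URL of one page of results
}

// Generic implementation driven by a URL template with {q} and {start}
// markers. An engine that needs special handling (for example a real API
// client) would get its own class implementing SearchEngine instead.
final class TemplateEngine implements SearchEngine {
    private final String name;
    private final String template;

    TemplateEngine(String name, String template) {
        this.name = name;
        this.template = template;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public String queryUrl(String query, int offset) {
        return template
            .replace("{q}", URLEncoder.encode(query, StandardCharsets.UTF_8))
            .replace("{start}", Integer.toString(offset));
    }
}
```

With this shape, the main loop can walk a list of SearchEngine objects, e.g.
new TemplateEngine("ExampleSearch", "http://search.example.com/find?q={q}&start={start}"),
and only the engines that do not fit a simple template need a class of their
own.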
Chris Uppal - 04 Jan 2007 19:13 GMT
> Add in a deliberate request of the
> front page before doing the search query, some random delays, and a
[quoted text clipped - 4 lines]
> idiosyncratic sequence. And they won't do that unless your IP generates
> an eyebrow-raising amount of traffic.

Google can and does have more intelligence than that.

The simplest thing to look for is the originating IP address of the request (at
the TCP/IP level).  A suspicious pattern of requests from one IP (e.g. too many
in one time period), and Google will stop serving queries from that IP address.
(The originating IP /can/ be spoofed, but not too many Java programmers will
have the necessary skills, and in any case it is hardly worth the effort.)  That
criterion can also give false positives: for instance, if an organisation is
working behind a NAT and one person from that organisation is detected
abusing Google's services, the entire organisation will be blocked.  Does
Google care ?  Why should it ?
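
The "too many in one time period" check can be pictured as a sliding-window
counter keyed by IP address. The sketch below is purely illustrative:
nothing in this thread says how Google actually implements it, and the class
name PerIpWindow, the window length, and the limit are made up.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a per-IP sliding-window request counter.
final class PerIpWindow {
    private final long windowMillis;
    private final int maxRequests;
    private final Map<String, Deque<Long>> seen = new HashMap<>();

    PerIpWindow(long windowMillis, int maxRequests) {
        this.windowMillis = windowMillis;
        this.maxRequests = maxRequests;
    }

    // Returns true and records the request if the IP is under the limit
    // for the current window; returns false (recording nothing) otherwise.
    boolean allow(String ip, long nowMillis) {
        Deque<Long> times = seen.computeIfAbsent(ip, k -> new ArrayDeque<>());
        // Drop timestamps that have fallen out of the window.
        while (!times.isEmpty() && nowMillis - times.peekFirst() >= windowMillis) {
            times.pollFirst();
        }
        if (times.size() >= maxRequests) {
            return false;
        }
        times.addLast(nowMillis);
        return true;
    }
}
```

Because the key is the IP address, everyone behind the same NAT shares one
counter, which is exactly the false-positive case described above.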

Then, too, Google has available /all/ the data which enters its data-centres;
from low-level fingerprinting of IP packets, up through checking HTTP headers,
extending all the way to historical and cross-site access patterns (I would be
very surprised if they didn't use a custom TCP/IP stack implementation for
their HTTP servers).  How much of that information it actually uses (or even
collects) I don't know -- but I'd guess that it collects most of it, and uses
as much as it feels it has to in order to prevent abuse.

And they do actively work to prevent abuse.  There are many kinds of possible
abuse, and I imagine Google work to prevent most of them, but I doubt if there
are many things they dislike more than people attempting to steal their data.

   -- chris
John Ersatznom - 05 Jan 2007 20:59 GMT
> And they do actively work to prevent abuse.  There are many kinds of possible
> abuse, and I imagine Google work to prevent most of them, but I doubt if there
> are many things they dislike more than people attempting to steal their data.

All of this depends on what constitutes "stealing" their data. Copying
it and publishing it? Sort of -- it's some kind of infringement but not
really "theft".

Merely doing with one mouse click or zero what you'd do anyway with
twenty keypresses? I don't see how the amount of clacking emanating from
someone's workstation at location A is in any way relevant to Google as
long as a) a single user isn't suddenly hogging their resources and b)
the user is using the results "normally" rather than to compete with
Google or whatever.

The red flags that would make them look into their logfiles would be a)
excessive bandwidth use and b) a Google clone or whatever springing up
all of a sudden and competing for their revenue streams.

Personal use of the search results isn't anything they can fault. Nor
however a person chooses to generate the requests (so long as they
aren't excessively frequent) or however they choose to filter and use
the results so long as they don't use them commercially.

I see no logical reason for them to care whether the 3 requests a given
IP gave them in a given day came from 30 typed characters and 3 mouse
clicks, 3 mouse clicks, or 0 mouse clicks at the requesting end, as long
as they don't consider 3 requests in one day from one source to be
excessive and as long as they aren't using those results in a way that
competes somehow with Google.

Unless, of course, the real intent is to enforce terms that let them use
a business model based on charging ordinary users a premium merely to
avoid tedium. I hope that isn't their intent; it would violate their
famous motto. A tiered "typed queries are free, bookmarked are a dime
each, and cron jobs require a monthly $59.99 subscription fee and
special account" service where it actually costs them exactly the same
amount (next to nil) to provide for all three use cases seems not merely
silly, but tantamount to fraudulent. A tiered "more than xx queries a
day requires a premium $10/month account" thing with xx in the dozens or
hundreds might not be considered evil -- after all, generating that many
queries actually scales up the amount serving you is costing them per
day. And of course disallowing commercial use of the results (other than
incidentally, like researching a purchase or new hire -- more selling
the results themselves in some manner) without a licensing arrangement
where Google gets a percentage. That's only fair.
Chris Uppal - 06 Jan 2007 15:48 GMT
[me:]
> > And they do actively work to prevent abuse.  There are many kinds of
> > possible abuse, and I imagine Google work to prevent most of them, but
[quoted text clipped - 4 lines]
> it and publishing it? Sort of -- it's some kind of infringement but not
> really "theft".

I don't particularly want to focus on what word(s) best fit the malefaction.
I'll stick with the general purpose "abuse" (which doesn't necessarily even
imply illegality).

> Merely doing with one mouse click or zero what you'd do anyway with
> twenty keypresses? I don't see how the amount of clacking emanating from
> someone's workstation at location A is in any way relevant to Google as
> long as a) a single user isn't suddenly hogging their resources and b)
> the user is using the results "normally" rather than to compete with
> Google or whatever.

Here you are mentioning only one aspect of the abuse (as it might appear to
Google) -- namely overuse of their resources.  And I doubt if they are too
worried about that (within reason, of course).  But almost /any/ automated
scanning of their database is an abuse in another sense: they make that data
available to people (not machines) in order to make money off it.  Their (only,
as far as I know) source of cash is directly or indirectly from the advertising
they include with the search results.  If you don't see the advertising then
you are using their resources and data without paying for them.  How could they
/not/ want to minimise that ?

> The red flags that would make them look into their logfiles would be a)
> excessive bandwidth use and b) a Google clone or whatever springing up
> all of a sudden and competing for their revenue streams.

Or anything else that suggests that the search results are not being read by a
human...

Of course, they own the servers, they pay the (probably massive) network costs
and other data-centre costs, so it's up to them what they consider "fair".  If
they choose to object to people called "Chris" using their services, then
that's up to them -- I have no real right to complain -- they can be as
arbitrary as they like.  Naturally, since they want to make money, they can't
be too very arbitrary (and aren't), but by the same token, they do have good
reasons to (try to) protect their services from freeloaders.

   -- chris
Lew - 07 Jan 2007 02:02 GMT
> Of course, they own the servers, they pay the (probably massive) network costs
> and other data-centre costs, so it's up to them what they consider "fair".  If
[quoted text clipped - 3 lines]
> be too very arbitrary (and aren't), but by the same token, they do have good
> reasons to (try to) protect their services from freeloaders.

I am not sure if name-bigotry is covered, but in many countries discrimination
in the provision of goods or services for certain factors like race, religion,
national origin, physical or mental disabilities and some other like
attributes is illegal. The legal principle rests in part on whether a trait is
innate, like national origin, or voluntary, like whether to wear a beard (for
most). This in no wise invalidates points others have made in this thread
except to point out that legal niceties punch exceptions into many broad
generalizations about these topics.

The legal question of data ownership carries many perilous implications. Does
Google own the information, or merely its representation? Is that
representation limited to its appearance on the screen, or does its specific
storage in their databases qualify? What about the source whence came Google's
data - when they scraped information off foo.com to include it in their data,
did they violate foo.com's owner's intellectual property rights? If I scraped
foo.com and came up with similar information to Google's in a similar data
structure (because data structures are "obvious" to a competent software
engineer), have I violated any of Google's IP rights?

Larger jurisprudential question: what degree of data openness or private
ownership best benefits society?

Concomitant question: what constitutes fair use of another's data?

- Lew
Andrew Thompson - 07 Jan 2007 02:52 GMT
> ...What about the source whence came Google's
> data - when they scraped information off foo.com to include it in their data,
> did they violate foo.com's owner's intellectual property rights?

I assume they figure that complying with a 'robots.txt'*
gives them some justification that they were 'invited'
(or at the very least, not excluded or banned) from
the site in question.

* <http://www.robotstxt.org/>

Andrew T.
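
For anyone writing their own bot, a very simplified robots.txt check might
look like the sketch below. It is an assumption-laden illustration, not a
complete implementation of the protocol: it honours only the
"User-agent: *" group, treats Disallow values as plain path prefixes, and
ignores Allow lines and wildcards. The class name RobotsCheck is
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified robots.txt check: "User-agent: *" group and Disallow prefixes only.
final class RobotsCheck {

    // Collects the Disallow prefixes that apply to "User-agent: *".
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inStarGroup = false;
        boolean lastWasAgent = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw;
            int hash = line.indexOf('#');          // strip comments
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                          // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Consecutive User-agent lines share one group of rules.
                if (!lastWasAgent) {
                    inStarGroup = false;
                }
                if (value.equals("*")) {
                    inStarGroup = true;
                }
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value means "nothing is disallowed".
                if (inStarGroup && field.equals("disallow") && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    // True if no collected Disallow prefix matches the start of the path.
    static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A bot would download /robots.txt from the site once, then call isAllowed
with the path of each URL before requesting it.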
Andrew Thompson - 07 Jan 2007 05:37 GMT
> > ...What about the source whence came Google's
> > data - when they scraped information off foo.com to include it in their data,
> > did they violate foo.com's owner's intellectual property rights?
>
> I assume they figure that complying with a 'robots.txt'* ...

E.G. <http://www.google.com/robots.txt>

Andrew T.
John Ersatznom - 08 Jan 2007 07:35 GMT
>>>...What about the source whence came Google's
>>>data - when they scraped information off foo.com to include it in their data,
[quoted text clipped - 3 lines]
>
> E.G. <http://www.google.com/robots.txt>

Unfortunately, one de facto effect of this protocol is that a lot of
sites configure it to deny any automated access and then carve out a few
narrow exemptions for Google and a handful of other big names in search,
on the grounds that nobody else actually drives traffic and business to
their site in any real quantity. The logical outcome is to shut out
smaller search engines and private web-use automation, however. The
former means the current crop of big-name search engines now have a lock
on the market. The latter is simply dumb, since letting people automate
aspects of their web use makes the web (and your site) more useful to them.

Some potentially useful web services are especially likely to be badly
affected. Price comparators, for one. If you run an ecommerce site with
nine competitors, and they all let a price comparator site's bot have
access, and you do likewise, then 90% of the time it will forward people
to a competitor. Obviously as an ecommerce vendor you want to block
price comparator bots! Unfortunately, this is not beneficial to society,
since you are outnumbered by your market, and your market is harmed by
stifling access to information, and the additional ENTIRE market of
online price comparison is threatened if everyone behaves the same.

So there are strong incentives to ignore robots.txt directives for
search engine startups, price comparison engines and suchlike, and
personal automation. Of course, accessing the file but then ignoring a
directive in it is detectable by the site admin who will block your IP,
and the ability to change IPs readily is much more available to the
bigger sites that don't need it than to the smaller sites and
individuals, so that means small-time bots have to not even access it
(and have to fly under the radar -- not too much bandwidth and "look
human").

The good side is that robots.txt does force non-bigname bots to run very
quietly and not use much bandwidth at all or otherwise call attention to
themselves, which serves part of the purpose anyway (one function of
robot directives is to help site admins prevent overuse of their bandwidth).
John Ersatznom - 08 Jan 2007 07:27 GMT
> Larger jurisprudential question: what degree of data openness or private
> ownership best benefits society?

Complete openness, except for national security matters, and those have
to be things like non-stale battle plans that are of use to the enemy if
they get it in a timely fashion. Any other security-based secrecy is
security-through-obscurity; prefer a massive, well-understood defense to
one that depends on the enemy being totally incompetent at espionage.

So-called "intellectual property" may be the single biggest
legal/judicial mistake in history -- far from promoting innovation, all
it seems to do is promote monopolies and lock-in. Check out
againstmonopoly.org sometime. Bad patents are a recurring theme there
and at techdirt, slashdot and other tech sites, but they're just the tip
of the iceberg.

> Concomitant question: what constitutes fair use of another's data?

Any private, educational, or nonprofit use should IMO. Of course if I
had my druthers any use at all would. The only things "protectable"
would be personal information, which people would be able to insist
(with legal clout) companies like ChoicePoint delete or at least verify.
And, eventually, the person's actual mind itself, once the technology to
download or otherwise access it with the right tools is available. If I
don't want spammers pestering me at some email address I think I have
that right, but if I publish something nonpersonal by choice I don't
feel I should then try to dictate how others use it.
John Ersatznom - 08 Jan 2007 07:18 GMT
> Here you are mentioning only one aspect of the abuse (as it might appear to
> Google) -- namely overuse of their resources.  And I doubt if they are too
[quoted text clipped - 5 lines]
> you are using their resources and data without paying for them.  How could they
> /not/ want to minimise that ?

If accessing a site in such a way as to not see advertising is "wrong",
then using adblock plugins for your browser must be wrong. Using
Ad-Aware to wipe out those foo.doubleclick.com tracking cookies must be
wrong. Putting "127.0.0.1 ad.doubleclick.com" in your hosts file must be
wrong. Hell, walking into the kitchen to fix yourself a snack when your
TV show goes to an ad must be wrong! Maybe even avoiding spam or
deleting it unread...

There is such a thing as taking something too far.

> Of course, they own the servers, they pay the (probably massive) network costs
> and other data-centre costs, so it's up to them what they consider "fair".  If
[quoted text clipped - 3 lines]
> be too very arbitrary (and aren't), but by the same token, they do have good
> reasons to (try to) protect their services from freeloaders.

That's completely aside from any legal issues, and down to any business being
able to pick its customers selectively. And, of course, their ability to
do so is limited to the extent that they can detect whatever they don't
like. If they don't like people named "Chris" a Chris can use a phony
name and they won't know the difference unless they start demanding ID
verification to grant access, and they won't do that because it would be
a quick way to self-destruct in the search-engine business.

Automating some of your search usage is similarly something you can do
below their radar, but in doing so you will clearly have to avoid any
high levels of usage that would bother them and get their attention. But
below that threshold, it's also a case of "what they don't know can't
hurt them"...
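
As a rough illustration of keeping usage low, here is a minimal Java sketch of a throttle that enforces a minimum gap between successive requests. The class name Throttle and the 5-second interval are assumptions for the example, not a figure any search engine publishes.

```java
// Hypothetical helper: spaces out requests by a fixed minimum interval.
class Throttle {
    private final long minIntervalMillis;
    private long nextAllowedMillis = 0;   // earliest time the next request may start

    Throttle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Reserves the next request slot and returns how many milliseconds
    // the caller must wait, given the current time, before using it.
    synchronized long reserve(long nowMillis) {
        long start = Math.max(nowMillis, nextAllowedMillis);
        nextAllowedMillis = start + minIntervalMillis;
        return start - nowMillis;
    }

    // Blocks until the reserved slot arrives.
    void acquire() throws InterruptedException {
        long wait = reserve(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Throttle throttle = new Throttle(5000);   // assumed interval: 5 seconds
        for (int i = 0; i < 3; i++) {
            throttle.acquire();
            System.out.println("would send query " + i + " now");
        }
    }
}
```

Calling acquire() before each query keeps the request rate at or below one per interval no matter how fast the rest of the program runs.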
Luc The Perverse - 04 Jan 2007 06:39 GMT
>>...Yahoo, MSN, AOL and Ask ...
>
> Dunno..  Aren't most of them using data from
> Google, in any case?

Um . . . Certainly Yahoo and MSN are not.

--
LTP

:)
Andrew Thompson - 04 Jan 2007 10:24 GMT
> >>...Yahoo, MSN, AOL and Ask ...
> >
> > Dunno..  Aren't most of them using data from
> > Google, in any case?
>
> Um . . . Certainly Yahoo and MSN are not.

OK - I see lots of hits for MSN bots in my server logs,
but not one for Yahoo.  What does its bot identify itself
as?

Andrew T.
TechBookReport - 04 Jan 2007 10:31 GMT
>>>> ...Yahoo, MSN, AOL and Ask ...
>>> Dunno..  Aren't most of them using data from
[quoted text clipped - 6 lines]
>
> Andrew T.

Look for: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/)

--
TechBookReport Java - http://www.techbookreport.com/JavaIndex.html

Andrew Thompson - 04 Jan 2007 11:05 GMT
..
> > ..I see lots of hits for MSN bots in my server logs,
> > but not one for Yahoo.
...
> Look for: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/)

OK - I see them now..

Yahoo! - 9246
msn - 21457
goog - 7638
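
Tallies like these can be produced with a few lines of Java. The sketch below counts log lines containing each marker string, ignoring case. The class name BotHitCounter is made up, and the markers "msnbot" and "Googlebot" are assumptions about what to search for; only the "Yahoo! Slurp" string comes from this thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for tallying bot hits in a web server access log.
class BotHitCounter {
    // Returns, for each marker, the number of lines that contain it
    // (case-insensitive substring match), in the order markers were given.
    static Map<String, Integer> count(Iterable<String> logLines, List<String> markers) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String marker : markers) {
            counts.put(marker, 0);
        }
        for (String line : logLines) {
            String lower = line.toLowerCase(Locale.ROOT);
            for (String marker : markers) {
                if (lower.contains(marker.toLowerCase(Locale.ROOT))) {
                    counts.merge(marker, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BotHitCounter <access-log>");
            return;
        }
        // Assumed marker strings; adjust to whatever the logs actually contain.
        List<String> markers = Arrays.asList("Yahoo! Slurp", "msnbot", "Googlebot");
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (Map.Entry<String, Integer> e : count(lines, markers).entrySet()) {
            System.out.println(e.getKey() + " - " + e.getValue());
        }
    }
}
```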

I was surprised I did not find them on the first search..
Must have been something stupid I did.. (shrugs)

BTW - nice to see you 'about the place' again..
I think of you whenever somebody asks after books,
but a quick, very tentative, search failed to turn up a URL
for your site.  I'll bookmark it.

Andrew T.
NoNickName - 05 Jan 2007 13:37 GMT
> ..

> BTW - nice to see you 'about the place' again..

Thanks. Been busy with end of year deadlines recently. Should be around
a bit more often now though.

--
TechBookReport Java - http://www.techbookreport.com/JavaIndex.html


