Java Forum / General / January 2007
Automating Searches
nowwho@gmail.com - 03 Jan 2007 19:59 GMT
Hey,
New to Java! Trying to automate searching Google, Yahoo, MSN, AOL and Ask by sending queries to those engines using a Java program and storing the returned URLs in a MySQL database. The program will open a text file, read the first line as the query, connect to each of the search engines, and store URLs in a table called "Results_Table" which has the following columns:
Search_Eng - This would record the search engine name
Query - This would record the query text
Returned_URL - This is the URL that the search engine returned
URL_Num - This is the number of the URL's position in the search engine's results.
Is it possible to do this and store the first 100 URLs the query returns from each search engine?
Thanks!
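For readers following along, here is a minimal sketch of how one row of the table described above might be written from Java with JDBC. The column names come from the post; the column types, the class name `ResultRow` and the helper `bind` are assumptions made purely for illustration. Running the insert for real would also need a MySQL JDBC driver and an open connection, neither of which is shown.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** One row of Results_Table; the fields mirror the four columns named in the post. */
final class ResultRow {
    // Parameterised INSERT. Assumes the first three columns are text and URL_Num is an integer.
    static final String INSERT_SQL =
        "INSERT INTO Results_Table (Search_Eng, Query, Returned_URL, URL_Num) VALUES (?, ?, ?, ?)";

    final String searchEng;
    final String query;
    final String returnedUrl;
    final int urlNum; // 1-based position of the URL in the engine's results

    ResultRow(String searchEng, String query, String returnedUrl, int urlNum) {
        if (urlNum < 1) {
            throw new IllegalArgumentException("urlNum must be >= 1");
        }
        this.searchEng = searchEng;
        this.query = query;
        this.returnedUrl = returnedUrl;
        this.urlNum = urlNum;
    }

    /** Copies this row's values into the placeholders of a statement prepared from INSERT_SQL. */
    void bind(PreparedStatement ps) throws SQLException {
        ps.setString(1, searchEng);
        ps.setString(2, query);
        ps.setString(3, returnedUrl);
        ps.setInt(4, urlNum);
    }
}
```

With an open `java.sql.Connection`, the row would be stored by preparing `ResultRow.INSERT_SQL`, calling `bind` on the statement, and then calling `executeUpdate()`.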
Andrew Thompson - 03 Jan 2007 20:19 GMT ...
> New to Java! Trying to automate searching Google,
See the Google search API, but be prepared to pay for anything beyond the nominal number of queries the Google API permits for free.
>...Yahoo, MSN, AOL and Ask ...
Dunno.. Aren't most of them using data from Google, in any case?
>...by sending queries to those engines using a Java program and
> storing the returned URLs in a MySQL database.
Why would your users prefer to query your DB rather than query Google directly (for up-to-the-moment data)?
>...The program will open a
> text file, upload the first line as the query, connect to each of the
> search engines, ....
> Is it possible to do this and store the first 100 URLs the query
> returns from each search engine?
Certainly - through whatever public API the search engine offers - talk to their tech. departments and they'll most probably instruct you how to get the data as XML (or something else conveniently portable and easily parsable).
Andrew T.
Daniel Pitts - 03 Jan 2007 20:41 GMT
> ...
> > New to Java! Trying to automate searching Google,
[quoted text clipped - 28 lines]
> > Andrew T.
Also, make sure you read the terms of use for all those services.
Although, I do wonder why you would want to store search results in a database. It's not that hard to make a data scraper and just use the website directly. But Google DOES give you an API to do it more easily.
John Ersatznom - 04 Jan 2007 13:08 GMT
> Although, I do wonder why you would want to store search results in a
> database.
> It's not that hard to make a data scraper, and just use the website
> directly. But Google DOES give you an API to do it more easily.
Yeah, but using that API (at least very much) is expensive. If you scrape the results after submitting a normal query URL, and avoid a) diving too deeply into the results and b) doing new queries too often, you can probably fly under the radar; unless you're coming from a datacenter somewhere, they won't know you from Adam doing manual searches in Firefox.
To top it off, Java makes transparently caching pages (and with 1.6 implementing cookies) easier too. Add in a deliberate request of the front page before doing the search query, some random delays, and a spoofed user-agent, and I'm guessing the only way Google could figure out you weren't just a surfer using Mozilla 4.0 (compatible; MSIE 4.0) would be by using a tool like EtherSniffer to analyze your incoming requests and discovering that Java sends the HTTP headers in an idiosyncratic sequence. And they won't do that unless your IP generates an eyebrow-raising amount of traffic.
And for Google that "eyebrow-raising" threshold is set very high indeed; "normal" traffic for Google is millions of searches per day and there are frequently dozens per day from each of many individual IP addresses as well as untold numbers of one-offs and the like.
And, of course, as long as you don't generate more traffic faster than you could by typing in all those queries manually, I don't see any moral qualms with this. At worst it's equivalent to adblocking the sponsored links on the results page with a commonly-available Firefox extension. All you've done is automate some tedium at your end without having any discernible effect at theirs versus not automating the tedium. So unless you do believe in victimless crimes or don't believe in the identity of indiscernibles ... :)
Andrew Thompson - 04 Jan 2007 13:54 GMT ...
> And, of course, as long as you don't generate more traffic faster than
> you could by typing in all those queries manually, I don't see any moral
> qualms with this.
One might also argue that you were free to build your own web-crawler, parse the pages it finds for the content and links*, store the data in searchable form, then rate and rank it according to whatever criteria best suit you[1].
* Oh, of course then 'repeat for each link', & repeat every 7(?) days.
Setting up the software and hardware capable of achieving that task might cost a lot of money (I guess). OTOH, you can pay a fee to someone who has already gone to the effort and has the expertise.
Just because it is technically possible** to rip Google off, does not make it right.
** + all the other idiotic reasons people generally put forward to justify such theft, starting with.. - 'they don't have a right - it is free data!'. No it isn't - the web pages themselves are free, but the search engines hope to add value by sorting and filtering.
Also, Google is no 'monopoly'. As has been pointed out in this (AFAIR) thread. You don't like Google's prices? Go to the competition..
[1] And then, can you make it publicly available, so I can rip your data, and resell it to my paying clients?
Andrew T.
nowwho - 04 Jan 2007 18:28 GMT
Hey,
Thanks for the information so far. I didn't realise there was so much legal stuff involved; it's for a one-off educational project. Didn't think it would amount to spamming. The program would only be run about 50 times in total. There is a set number of queries, and a set number of results returned. As it's an educational project I never thought of the legal side!
Andrew Thompson - 04 Jan 2007 19:29 GMT
> Hey,
> Thanks for the information so far. I didn't realise there was so much
> legal stuff involved, it's for a one-off educational project.
You 'ivory tower' types are *so* naive. It's cute. ;-)
>...Didn't
> think it would amount to spamming.
I am not sure I would use that term for it.
Spamming is generally pushing an advertising related message out to people who do not want it.
This (when done the 'wrong way') simply amounts to a bit of theft of the resources of others.
& for my part, while I might hassle the thieves, I'll bludgeon the spammers.
>...The program would only be run about
> 50 times in total.
I think you might be well placed to use the 'legal and free' APIs currently offered! Surely even the small number of queries Google offers for free would cover your requirement?
(In any case, from what I understand, Google simply refuses further requests for the day if the limit is struck - no hard feelings, and back tomorrow..)
>...There is a set number of queries, and a set number
> of results returned. As it's an educational project I never thought of
> the legal side!
Don't forget that there can be a few 'legalities' to the educational side of things. Be careful of tripping over using someone else's code without proper attribution or accreditation.. Plagiarism/academic misconduct. There was a classic thread on these groups from a chap by the name of RoboCop - he got to find out the hard way.
Andrew T.
nowwho - 04 Jan 2007 19:59 GMT
> I am not sure I would use that term for it.
Fair enough, computers and technology aren't my main area of study.
> I think you might be well placed to use the 'legal
> and free' APIs currently offered! Surely even the
> small numbers of queries Google offers for free
> would cover your requirement?
More than likely, but I would still need advice on how to incorporate these into a Java program.
> Don't forget that there can be a few 'legalities' to the
> educational side of things. Be careful of tripping over
[quoted text clipped - 3 lines]
> a chap by the name of RoboCop - he got to find
> out the hard way.
The use of other people's code is allowed; however, ALL work and ALL sources of information used in any way for the project have to be detailed. We were well warned about the consequences of plagiarism. All websites accessed for the project must be listed, along with any copyright date and the date that the website was accessed, etc...
John Ersatznom - 06 Jan 2007 09:15 GMT
>>I think you might be well placed to use the 'legal
>>and free' APIs currently offered! Surely even the
[quoted text clipped - 7 lines]
> must be included along with the date that the website was accessed
> etc...
Oh what a tangled web we weave...what happened to the days when you could just tinker and innovate without fear of lawyers or similar? Hmm? Of course, wholesale copying of other stuff without permission and misattributing it as your own original work is simply bad, but it's because it's fraud and misrepresentation, not because it's copying, IMO. Wheel-reinventing is supposed to be a bad thing. Let some attorneys get involved and soon everyone is expecting you to get their permission to copy anything. Then to *use* anything. Then to breathe or take a leak, no doubt.
I think it's worth pointing out that unless you've signed something in writing, you aren't in a binding agreement with Google about anything (or anyone else) and only copyright, trademark, and patent law have any true legal force. No matter what TOC boilerplate is on whose website. Hell, they can't even prove that you *read* it, in any meaningful way, even if your IP retrieved the page one day.
Of course the de facto law in the US isn't so rosy, thanks to a braindead court system and a legislature that's long since been ritually auctioned with great fanfare biennially to the highest bidder. I'd suggest a saner country. Many in Europe and, I think, even Canada actually still have sane legal systems, standards for when someone's actually entered into a binding contract, standards of evidence to get subpoenas, warrants, and judgments, and whatnot. Australia's as bad as the US or worse though. I wonder how long it is before individuals have to jurisdiction-shop by travel agent and $500 one-way airfare express just to do ordinary victimless activities without legal repercussions and $50,000 in bogus fines for phantom file sharing someone else on the neighborhood's cable company internet service may or may not actually have done...
nowwho - 06 Jan 2007 11:51 GMT
> >>I think you might be well placed to use the 'legal
> >>and free' APIs currently offered! Surely even the
[quoted text clipped - 37 lines]
> fines for phantom file sharing someone else on the neighborhood's cable
> company internet service may or may not actually have done...
While the legal information is handy and can (more than likely will) be included in the report, are there any suggestions on how to tackle the coding of the problem, or suggestions as to where I can look for further information?
Chris Uppal - 06 Jan 2007 15:48 GMT
> While the legal information is handy and can (more than likely will) be
> included in the report, are there any suggestions on how to tackle the
> coding of the problem or suggestions as to where I can look for further
> information?
Unfortunately, it appears that Google suspended their Search API last month (http://code.google.com/apis/soapsearch/), so you will probably have to use some sort of screen scraping.
If you want to do it in Java (rather than, say, by using command-line tools such as wget or curl) then you'll need an HTTP client package. Java comes with one (start with java.net.URL), but it has been said here that Google blocks access via that, so you may be better off using a different, and more general, package such as the Jakarta HTTP client http://jakarta.apache.org/commons/httpclient/
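As a rough illustration of the "start with java.net.URL" route mentioned above, the sketch below builds a query URL and downloads the response body with the classes that ship with Java. The class name `PageFetcher`, the host `search.example.com` and the parameter name `q` are placeholders, not any real engine's interface, and the code assumes the server answers in UTF-8. Whether a given search engine accepts such requests, and whether its terms of use permit them, is a separate matter.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

final class PageFetcher {
    /** Appends the query, URL-encoded, to the base URL, e.g. .../search?q=java+tutorial */
    static String buildQueryUrl(String baseUrl, String paramName, String query) {
        try {
            return baseUrl + "?" + paramName + "=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    /** Downloads the page at the given URL and returns its body as one string. */
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(10000); // milliseconds
        conn.setReadTimeout(10000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }
}
```

A call such as `PageFetcher.fetch(PageFetcher.buildQueryUrl("http://search.example.com/search", "q", "java tutorial"))` would return the raw HTML, ready to be handed to a parser.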
Then, once you have worked out how to download data, you will need to parse it to find the links you want. Parsing HTML with anything like reliability is not easy (but you may not need much reliability in this case); you may find this page of HTML parsers useful. http://www.java-source.net/open-source/html-parsers
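Where "not much reliability" is acceptable, a regular expression can stand in for a real HTML parser. The sketch below is exactly that kind of shortcut: the class `LinkExtractor` and its method are made up for illustration, it only recognises quoted `href` attributes inside `<a>` tags, and it keeps only absolute http/https links. It makes no attempt to tell result links from navigation or advertising links, which would depend on each engine's page layout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkExtractor {
    // Matches href="..." or href='...' inside an <a ...> tag. Deliberately naive.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns up to 'limit' absolute links, in the order they appear in the page. */
    static List<String> extractLinks(String html, int limit) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (links.size() < limit && m.find()) {
            String url = m.group(2);
            // Skip relative links, anchors, javascript: pseudo-links and so on.
            if (url.startsWith("http://") || url.startsWith("https://")) {
                links.add(url);
            }
        }
        return links;
    }
}
```

The position of each URL in the returned list (plus one) would give the URL_Num value the original question asks for, assuming the page lists results in rank order.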
-- chris
Chris Uppal - 06 Jan 2007 15:47 GMT
> > The use of other people's code is allowed; however, ALL work and ALL
> > sources of information used in any way required for the project have to
[quoted text clipped - 5 lines]
> Oh what a tangled web we weave...what happened to the days when you
> could just tinker and innovate without fear of lawyers or similar?
I think the OP's problem here is not so much the legality (or otherwise) of "borrowing" Google's data, but that this is work in an academic context where all sources /must/ be declared for reasons of honesty in scholarship.
-- chris
John Ersatznom - 05 Jan 2007 20:39 GMT
> Hey,
> Thanks for the information so far. I didn't realise there was so much
[quoted text clipped - 3 lines]
> of results returned. As it's an educational project I never thought of
> the legal side!
It's not spamming -- I don't know what the other guy was smoking when he wrote the post you're replying to. There is NO DIFFERENCE discernible to Google if you
a) do 10 searches during the day by typing in a Firefox window while doing research or
b) have your computer do the searches with less/no typing on your part
Google is being "ripped off" iff you do something like:
a) use huge amounts of their bandwidth -- well in excess of a normal user doing a bit of heavy research say, generating large numbers of searches or delving very deeply into the result set. Fetching 10 first-pages-of-results one for each of 10 queries, whether done by one mouse click or ten typed-in queries, has little impact on them, and of course the one mouse click case makes it actually 10 queries instead of 11 because you mistyped one and had to do it again :)
b) or use google search results to populate your own rival "search engine" site with revenue-generating ads or what-have-you, either by scraping google's database or by just putting up a page with a script that takes people's queries and passes them to google, then takes the result page and replaces google's sponsored links with umpteen flashing banner ads. Then you're using google's work output to actually compete against google, rather than simply using google for research. That makes a crucial difference.
Using code to drive Google lightly and for personal/educational/research reasons rather than commercial ones doesn't seem to be evil to me, especially if they cannot in practise distinguish it from "normal" use anyway, as it isn't producing excessive traffic or being used to compete against google in some way.
In fact, where do you draw the line? Firefox with manually-typed queries is OK. Then we have Firefox with a MRU for queries; Firefox with query guessing or autocompletion based on your current activities and interests; Firefox with a plugin to take the result set too and transform it e.g. to show 50 rather than 10 hits or to weed out "supplemental results" that are usually MFA sites that really ARE ripping off google; Firefox with a plugin to run the query of your choice and bookmark the results every few days; ... Firefox with a plugin to gradually build up a database of hits for various queries by occasionally fetching the nth page of results for one of them, but you don't publish these anywhere, just use them personally ...
I think the two things that mark a transition to being evil are causing them excessive traffic and competing with them using their own data in some way. (Also generating content-free MFA pages to generate revenue via AdSense ads and SEOing them, but that's more using AdSense than using the search engine proper, though the SEO will impact the latter and pollute the results.)
I don't see any way to derive some kind of moral law that makes typing something morally superior to doing it with one click, and actually scheduling an automatic (infrequent) job or whatever actually sinful. There's no inherent virtue in inefficiency, and computers exist to enable automating tasks. Hyperlinks automate looking up and finding that dusty reference or whatever; librarians may complain that they rot young brains but the actual upshot is a gain in productivity, rather than some kind of evil decadence setting in.
nowwho - 05 Jan 2007 22:46 GMT
> > Hey,
> > Thanks for the information so far. I didn't realise there was so much
[quoted text clipped - 29 lines]
> against google, rather than simply using google for research. That makes
> a crucial difference.
The point of the exercise is to get the returned URLs into an offline database. It's an exercise purely to pull back the URLs from the different search engines.
> Using code to drive Google lightly and for personal/educational/research
> reasons rather than commercial ones doesn't seem to be evil to me,
> especially if they cannot in practise distinguish it from "normal" use
> anyway, as it isn't producing excessive traffic or being used to compete
> against google in some way.
I don't think it's a question of good or evil; I think people are worried that the code could be used for commercial reasons.
> In fact, where do you draw the line? Firefox with manually-typed queries
> is OK. Then we have Firefox with a MRU for queries; Firefox with query
[quoted text clipped - 14 lines]
> using the search engine proper, though the SEO will impact the latter
> and pollute the results.)
This is an educational project, and as computing is not my main area of study I don't know what MFA and SEO are. Can these be explained?
> I don't see any way to derive some kind of moral law that makes typing
> something morally superior to doing it with one click, and actually
[quoted text clipped - 4 lines]
> brains but the actual upshot is a gain in productivity, rather than some
> kind of evil decadence setting in.
Any help with using the Google API, or other suggestions, would be a great help. I also assume that Google's API won't work with the other search engines, so would I have to write a different class for each search engine?
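On the "different class for each search engine" question, one common Java arrangement is a shared interface with one implementation (or one configured instance) per engine, so the rest of the program does not care which engine it is talking to. The sketch below is only an illustration: `SearchEngine`, `SimpleEngine`, the host name and the parameter names are all hypothetical, and real engines may differ in more than their URL format (result markup, paging rules, terms of use).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** What the rest of the program needs to know about any engine. */
interface SearchEngine {
    /** Value to store in the Search_Eng column. */
    String name();

    /** URL of the results page that starts at the given result offset. */
    String queryUrl(String query, int firstResult);
}

/** An engine whose URL is just base?queryParam=...&offsetParam=... (an assumption). */
final class SimpleEngine implements SearchEngine {
    private final String name;
    private final String baseUrl;
    private final String queryParam;
    private final String offsetParam;

    SimpleEngine(String name, String baseUrl, String queryParam, String offsetParam) {
        this.name = name;
        this.baseUrl = baseUrl;
        this.queryParam = queryParam;
        this.offsetParam = offsetParam;
    }

    public String name() {
        return name;
    }

    public String queryUrl(String query, int firstResult) {
        try {
            return baseUrl + "?" + queryParam + "=" + URLEncoder.encode(query, "UTF-8")
                + "&" + offsetParam + "=" + firstResult;
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }
}
```

Engines that need different handling would get their own class implementing `SearchEngine`; the main loop would then iterate over a list of `SearchEngine` objects and treat them all alike.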
Chris Uppal - 04 Jan 2007 19:13 GMT
> Add in a deliberate request of the
> front page before doing the search query, some random delays, and a
[quoted text clipped - 4 lines]
> idiosyncratic sequence. And they won't do that unless your IP generates
> an eyebrow-raising amount of traffic.
Google can and does have more intelligence than that.
The simplest thing to look for is the originating IP address of the request (at the TCP/IP level). A suspicious pattern of requests from one IP (e.g. too many in one time period), and Google will stop serving queries from that IP address. (The originating IP /can/ be spoofed, but not too many Java programmers will have the necessary skills, and in any case is hardly worth the effort.) That criterion can also give false positives; for instance if an organisation is working behind a NAT, so if one person from that organisation is detected abusing Google's services, the entire organisation will be blocked. Does Google care ? Why should it ?
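To make the "too many in one time period" idea concrete, here is a minimal sliding-window counter keyed by IP address. It is purely a hypothetical sketch of the general technique, not a description of what Google actually runs; the class name `IpRateWatcher` and its parameters are invented for the example. Timestamps are passed in by the caller so the behaviour is easy to follow.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class IpRateWatcher {
    private final int maxRequests;   // allowance per window
    private final long windowMillis; // length of the window
    private final Map<String, Deque<Long>> recent = new HashMap<String, Deque<Long>>();

    IpRateWatcher(int maxRequests, long windowMillis) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
    }

    /** Returns true and records the request if the IP is within its allowance, else false. */
    boolean allow(String ip, long nowMillis) {
        Deque<Long> times = recent.get(ip);
        if (times == null) {
            times = new ArrayDeque<Long>();
            recent.put(ip, times);
        }
        // Forget requests that have fallen out of the window.
        while (!times.isEmpty() && nowMillis - times.peekFirst() >= windowMillis) {
            times.pollFirst();
        }
        if (times.size() >= maxRequests) {
            return false;
        }
        times.addLast(nowMillis);
        return true;
    }
}
```

Because the only key is the IP address, everyone behind the same NAT shares one allowance, which is the false-positive case described in the paragraph above.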
Then, too, Google has available /all/ the data which enters its data-centres; from low-level fingerprinting of IP packets, up through checking HTTP headers, extending all the way to historical and cross-site access patterns (I would be very surprised if they didn't use a custom TCP/IP stack implementation for their HTTP servers). How much of that information it actually uses (or even collects) I don't know -- but I'd guess that it collects most of it, and uses as much as it feels it has to in order to prevent abuse.
And they do actively work to prevent abuse. There are many kinds of possible abuse, and I imagine Google work to prevent most of them, but I doubt if there are many things they dislike more than people attempting to steal their data.
-- chris
John Ersatznom - 05 Jan 2007 20:59 GMT
> And they do actively work to prevent abuse. There are many kinds of possible
> abuse, and I imagine Google work to prevent most of them, but I doubt if there
> are many things they dislike more than people attempting to steal their data.
All of this depends on what constitutes "stealing" their data. Copying it and publishing it? Sort of -- it's some kind of infringement but not really "theft".
Merely doing with one mouse click or zero what you'd do anyway with twenty keypresses? I don't see how the amount of clacking emanating from someone's workstation at location A is in any way relevant to Google as long as a) a single user isn't suddenly hogging their resources and b) the user is using the results "normally" rather than to compete with Google or whatever.
The red flags that would make them look into their logfiles would be a) excessive bandwidth use and b) a Google clone or whatever springing up all of a sudden and competing for their revenue streams.
Personal use of the search results isn't anything they can fault. Nor however a person chooses to generate the requests (so long as they aren't excessively frequent) or however they choose to filter and use the results so long as they don't use them commercially.
I see no logical reason for them to care whether the 3 requests a given IP gave them in a given day came from 30 typed characters and 3 mouse clicks, 3 mouse clicks, or 0 mouse clicks at the requesting end, as long as they don't consider 3 requests in one day from one source to be excessive and as long as they aren't using those results in a way that competes somehow with Google.
Unless, of course, the real intent is to enforce terms that let them use a business model based on charging ordinary users a premium merely to avoid tedium. I hope that isn't their intent; it would violate their famous motto. A tiered "typed queries are free, bookmarked are a dime each, and cron jobs require a monthly $59.99 subscription fee and special account" service where it actually costs them exactly the same amount (next to nil) to provide for all three use cases seems not merely silly, but tantamount to fraudulent. A tiered "more than xx queries a day requires a premium $10/month account" thing with xx in the dozens or hundreds might not be considered evil -- after all, generating that many queries actually scales up the amount serving you is costing them per day. And of course disallowing commercial use of the results (other than incidentally, like researching a purchase or new hire -- more selling the results themselves in some manner) without a licensing arrangement where Google gets a percentage. That's only fair.
Chris Uppal - 06 Jan 2007 15:48 GMT
[me:]
> > And they do actively work to prevent abuse. There are many kinds of
> > possible abuse, and I imagine Google work to prevent most of them, but
[quoted text clipped - 4 lines]
> it and publishing it? Sort of -- it's some kind of infringement but not
> really "theft".
I don't particularly want to focus on what word(s) best fit the malefaction. I'll stick with the general purpose "abuse" (which doesn't necessarily even imply illegality).
> Merely doing with one mouse click or zero what you'd do anyway with
> twenty keypresses? I don't see how the amount of clacking emanating from
> someone's workstation at location A is in any way relevant to Google as
> long as a) a single user isn't suddenly hogging their resources and b)
> the user is using the results "normally" rather than to compete with
> Google or whatever.
Here you are mentioning only one aspect of the abuse (as it might appear to Google) -- namely overuse of their resources. And I doubt if they are too worried about that (within reason, of course). But almost /any/ automated scanning of their database is an abuse in another sense: they make that data available to people (not machines) in order to make money off it. Their (only, as far as I know) source of cash is directly or indirectly from the advertising they include with the search results. If you don't see the advertising then you are using their resources and data without paying for them. How could they /not/ want to minimise that ?
> The red flags that would make them look into their logfiles would be a)
> excessive bandwidth use and b) a Google clone or whatever springing up
> all of a sudden and competing for their revenue streams.
Or anything else that suggests that the search results are not being read by a human...
Of course, they own the servers, they pay the (probably massive) network costs and other data-centre costs, so it's up to them what they consider "fair". If they choose to object to people called "Chris" using their services, then that's up to them -- I have no real right to complain -- they can be as arbitrary as they like. Naturally, since they want to make money, they can't be too very arbitrary (and aren't), but by the same token, they do have good reasons to (try to) protect their services from freeloaders.
-- chris
Lew - 07 Jan 2007 02:02 GMT
> Of course, they own the servers, they pay the (probably massive) network costs
> and other data-centre costs, so it's up to them what they consider "fair". If
[quoted text clipped - 3 lines]
> be too very arbitrary (and aren't), but by the same token, they do have good
> reasons to (try to) protect their services from freeloaders.
I am not sure if name-bigotry is covered, but in many countries discrimination in the provision of goods or services for certain factors like race, religion, national origin, physical or mental disabilities and some other like attributes is illegal. The legal principle rests in part on whether a trait is innate, like national origin, or voluntary, like whether to wear a beard (for most). This in no wise invalidates points others have made in this thread except to point out that legal niceties punch exceptions into many broad generalizations about these topics.
The legal question of data ownership carries many perilous implications. Does Google own the information, or merely its representation? Is that representation limited to its appearance on the screen, or does its specific storage in their databases qualify? What about the source whence came Google's data - when they scraped information off foo.com to include it in their data, did they violate foo.com's owner's intellectual property rights? If I scraped foo.com and came up with similar information to Google's in a similar data structure (because data structures are "obvious" to a competent software engineer), have I violated any of Google's IP rights?
Larger jurisprudential question: what degree of data openness or private ownership best benefits society?
Concomitant question: what constitutes fair use of another's data?
- Lew
Andrew Thompson - 07 Jan 2007 02:52 GMT
> ...What about the source whence came Google's
> data - when they scraped information off foo.com to include it in their data,
> did they violate foo.com's owner's intellectual property rights?
I assume they figure that complying with a 'robots.txt'* gives them some justification that they were 'invited' (or at the very least, not excluded or banned) from the site in question.
* <http://www.robotstxt.org/>
Andrew T.
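For anyone unfamiliar with the robots.txt convention mentioned above, the following is a much simplified sketch of how a polite crawler might consult it before fetching a path. It is an illustration under stated simplifications rather than a complete implementation: the class `RobotsRules` is invented for this example, it honours only `User-agent` and `Disallow` lines, it merges the `*` group with any group naming the given agent instead of preferring the more specific one, and it ignores `Allow` lines and wildcard patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class RobotsRules {
    private final List<String> disallowed = new ArrayList<String>();

    /** Collects Disallow prefixes from groups addressed to "*" or to the given agent name. */
    RobotsRules(String robotsTxt, String agent) {
        boolean applies = false;      // does the current group apply to this agent?
        boolean inAgentLines = false; // are we inside a run of User-agent lines?
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw;
            int hash = line.indexOf('#'); // strip comments
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!inAgentLines) {
                    applies = false; // a new group starts here
                }
                inAgentLines = true;
                if (value.equals("*") || value.equalsIgnoreCase(agent)) {
                    applies = true;
                }
            } else {
                inAgentLines = false;
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    disallowed.add(value);
                }
            }
        }
    }

    /** True if no collected Disallow prefix matches the start of the path. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler would download `/robots.txt` from the target host once, build a `RobotsRules` from it, and call `isAllowed` before each request to that host.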
Andrew Thompson - 07 Jan 2007 05:37 GMT
> > ...What about the source whence came Google's
> > data - when they scraped information off foo.com to include it in their data,
> > did they violate foo.com's owner's intellectual property rights?
>
> I assume they figure that complying with a 'robots.txt'* ...
E.G. <http://www.google.com/robots.txt>
Andrew T.
John Ersatznom - 08 Jan 2007 07:35 GMT
>>>...What about the source whence came Google's
>>>data - when they scraped information off foo.com to include it in their data,
[quoted text clipped - 3 lines]
>
> E.G. <http://www.google.com/robots.txt>
Unfortunately, one de facto effect of this protocol is that a lot of sites configure it to deny any automated access and then carve out a few narrow exemptions for Google and a handful of other big names in search, on the grounds that nobody else actually drives traffic and business to their site in any real quantity. The logical outcome is to shut out smaller search engines and private web-use automation, however. The former means the current crop of big-name search engines now have a lock on the market. The latter is simply dumb, since letting people automate aspects of their web use makes the web (and your site) more useful to them.
Some potentially useful web services are especially likely to be badly affected. Price comparators, for one. If you run an ecommerce site with nine competitors, and they all let a price comparator site's bot have access, and you do likewise, then 90% of the time it will forward people to a competitor. Obviously as an ecommerce vendor you want to block price comparator bots! Unfortunately, this is not beneficial to society, since you are outnumbered by your market, and your market is harmed by stifling access to information, and the additional ENTIRE market of online price comparison is threatened if everyone behaves the same.
So there are strong incentives to ignore robots.txt directives for search engine startups, price comparison engines and suchlike, and personal automation. Of course, accessing the file but then ignoring a directive in it is detectable by the site admin who will block your IP, and the ability to change IPs readily is much more available to the bigger sites that don't need it than to the smaller sites and individuals, so that means small-time bots have to not even access it (and have to fly under the radar -- not too much bandwidth and "look human").
The good side is that robots.txt does force non-bigname bots to run very quietly and not use much bandwidth at all or otherwise call attention to themselves, which serves part of the purpose anyway (one function of robot directives is to help site admins prevent overuse of their bandwidth).
John Ersatznom - 08 Jan 2007 07:27 GMT
> Larger jurisprudential question: what degree of data openness or private
> ownership best benefits society?
Complete openness, except for national security matters, and those have to be things like non-stale battle plans that are of use to the enemy if they get it in a timely fashion. Any other security-based secrecy is security-through-obscurity; prefer a massive, well-understood defense to one that depends on the enemy being totally incompetent at espionage.
So-called "intellectual property" may be the single biggest legal/judicial mistake in history -- far from promoting innovation, all it seems to do is promote monopolies and lock-in. Check out againstmonopoly.org sometime. Bad patents are a recurring theme there and at techdirt, slashdot and other tech sites, but they're just the tip of the iceberg.
> Concomitant question: what constitutes fair use of another's data?
Any private, educational, or nonprofit use should IMO. Of course if I had my druthers any use at all would. The only things "protectable" would be personal information, which people would be able to insist (with legal clout) companies like ChoicePoint delete or at least verify. And, eventually, the person's actual mind itself, once the technology to download or otherwise access it with the right tools is available. If I don't want spammers pestering me at some email address I think I have that right, but if I publish something nonpersonal by choice I don't feel I should then try to dictate how others use it.
John Ersatznom - 08 Jan 2007 07:18 GMT
> Here you are mentioning only one aspect of the abuse (as it might appear to
> Google) -- namely overuse of their resources. And I doubt if they are too
[quoted text clipped - 5 lines]
> you are using their resources and data without paying for them. How could they
> /not/ want to minimise that ?
If accessing a site in such a way as to not see advertising is "wrong", then using adblock plugins for your browser must be wrong. Using Ad-Aware to wipe out those foo.doubleclick.com tracking cookies must be wrong. Putting "127.0.0.1 foo.doubleclick.com" in your hosts file must be wrong. Hell, walking into the kitchen to fix yourself a snack when your TV show goes to an ad must be wrong! Maybe even avoiding spam or deleting it unread...
There is such a thing as taking something too far.
> Of course, they own the servers, they pay the (probably massive) network costs
> and other data-centre costs, so it's up to them what they consider "fair". If
[quoted text clipped - 3 lines]
> be too very arbitrary (and aren't), but by the same token, they do have good
> reasons to (try to) protect their services from freeloaders.
That's completely aside from any legal issues, and down to any business being able to pick its customers selectively. And, of course, their ability to do so is limited to the extent that they can detect whatever they don't like. If they don't like people named "Chris" a Chris can use a phony name and they won't know the difference unless they start demanding ID verification to grant access, and they won't do that because it would be a quick way to self-destruct in the search-engine business.
Automating some of your search usage is similarly something you can fly below their radar, but in doing so you will clearly have to avoid any high levels of usage that would bother them and get their attention. But below that threshold, it's also a case of "what they don't know can't hurt them"...
Luc The Perverse - 04 Jan 2007 06:39 GMT
>>...Yahoo, MSN, AOL and Ask ...
>
> Dunno.. Aren't most of them using data from
> Google, in any case?
Um . . . Certainly Yahoo and MSN are not.
-- LTP
:)
Andrew Thompson - 04 Jan 2007 10:24 GMT
> >>...Yahoo, MSN, AOL and Ask ...
> >
> > Dunno.. Aren't most of them using data from
> > Google, in any case?
>
> Um . . . Certainly Yahoo and MSN are not.
OK - I see lots of hits for MSN bots in my server logs, but not one for Yahoo. What does its bot identify itself as?
Andrew T.
TechBookReport - 04 Jan 2007 10:31 GMT
>>>> ...Yahoo, MSN, AOL and Ask ...
>>> Dunno.. Aren't most of them using data from
[quoted text clipped - 6 lines]
>
> Andrew T.
Look for: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/)
 Signature TechBookReport Java - http://www.techbookreport.com/JavaIndex.html
Andrew Thompson - 04 Jan 2007 11:05 GMT
..
> > ..I see lots of hits for MSN bots in my server logs,
> > but not one for Yahoo. ...
> Look for: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/)
OK - I see them now..
Yahoo! - 9246
msn - 21457
goog - 7638
I was surprised I did not find them on the first search.. Must have been something stupid I did.. (shrugs)
BTW - nice to see you 'about the place' again.. I think of you whenever somebody asks after books, but a quick, very tentative, search failed to lay an URL on your site. I'll bookmark it.
Andrew T.
NoNickName - 05 Jan 2007 13:37 GMT
> ..
> BTW - nice to see you 'about the place' again..
Thanks. Been busy with end of year deadlines recently. Should be around a bit more often now though.
 Signature TechBookReport Java - http://www.techbookreport.com/JavaIndex.html