Java Forum / General / July 2007

extract data from web page

jobs239@hotmail.com - 09 Jul 2007 21:55 GMT
I want to type a word into the search box of a Google-like website,
then extract the results from the result page and store them in an Excel file.
How can I do the search and extract the data programmatically?
Roedy Green - 09 Jul 2007 22:15 GMT
>I want to type a word in the search box of google like website and
>then extract results from the result page and store in an excel file.
>How can I programmatically  do the search and extract data?

See http://mindprod.com/products.html#COMMON11, which includes code to GET or
POST to retrieve a web page.

You then want to convert the entities to Unicode. See
http://mindprod.com/products1.html#ENTITIES

Then you want to strip out the <tags>, which you can do in a single line,
given this static import:

import static com.mindprod.entities.StripEntities.*;

return stripNbsp( stripEntities( stripHTMLTags( key.trim() ) ) );
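For readers without the mindprod library, the same idea can be roughed out with the standard library alone. This is only a minimal sketch, not the com.mindprod.entities implementation: the class name StripSketch, its method names, and the five-entry entity table are all assumptions made for illustration, and a regular expression is no substitute for a real HTML parser on messy pages.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT the com.mindprod.entities code.
class StripSketch {
    // Deliberately tiny entity table (an assumption); real HTML defines many more.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&nbsp;", " ");
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
        // Decoded last so that "&amp;lt;" ends up as "&lt;" rather than "<".
        ENTITIES.put("&amp;", "&");
    }

    // Remove anything that looks like <...>. Crude: comments and scripts need more care.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    // Replace the known entities in insertion order.
    static String decodeEntities(String text) {
        String out = text;
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    // Same order as the one-liner above: trim, strip tags, then decode entities.
    static String clean(String raw) {
        return decodeEntities(stripTags(raw.trim())).trim();
    }

    public static void main(String[] args) {
        // prints: Tom & Jerry <1940>
        System.out.println(clean("  <b>Tom &amp; Jerry</b>&nbsp;<i>&lt;1940&gt;</i> "));
    }
}
```

Tags are stripped before entities are decoded so that a decoded "&lt;" is never mistaken for the start of a tag.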

--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
Roedy Green - 10 Jul 2007 08:10 GMT
On Mon, 09 Jul 2007 21:15:53 GMT, Roedy Green
<see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :

>see http://mindprod.com/products.html#COMMON11 includes code to GET or
>POST to retrieve a web page.

That is refactored into its own package now:
http://mindprod.com/products.html#HTTP
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
Stefan Ram - 10 Jul 2007 13:03 GMT
>http://mindprod.com/products.html#HTTP

 Recently I started to code a class to read a web page,
 and I used »java.net.HttpURLConnection«.
 Is anything wrong with this approach?

 The class »com.mindprod.http.Read« contains the string
 »8859_1«. Isn't this a preconception, given that web pages
 might use other encodings? Or maybe I have not understood the
 intended use yet.
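A minimal sketch of the java.net.HttpURLConnection approach follows, with the encoding concern made visible by keeping "fetch the bytes" separate from "decode the bytes". The class name PageFetchSketch, its method names, and the ten-second timeouts are assumptions for illustration; this is not the code of com.mindprod.http.Read.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;

// Hypothetical helper class, named here only for illustration.
class PageFetchSketch {
    // Fetch the raw response body with a plain GET; no decoding happens here.
    static byte[] fetchBytes(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10_000); // assumed timeouts, tune as needed
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return buffer.toByteArray();
        } finally {
            conn.disconnect();
        }
    }

    // Decode with a charset chosen by the caller instead of a hard-coded one.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }
}
```

Decoding the UTF-8 bytes of "café" as ISO-8859-1 yields "cafÃ©", which is the kind of damage a fixed »8859_1« would cause on a UTF-8 page.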
Roedy Green - 12 Jul 2007 08:38 GMT
>  The class »com.mindprod.http.Read« contains the string
>  »8859_1«. Isn't this a preconception, given that web pages
>  might use other encodings? Or may be I have not understand the
>  intended use yet.

That should be improved. I will have a look. The header probably
contains info on what encoding to use.
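As a rough illustration of reading the encoding from the header: an HTTP response usually carries a Content-Type value such as "text/html; charset=UTF-8", which URLConnection.getContentType() returns as a string. The sketch below pulls the charset parameter out of such a value. The class name CharsetSketch and the method charsetOf are hypothetical, and the fallback charset is left to the caller; pages that declare their encoding only in a <meta> tag are not handled.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper; not part of any library mentioned in this thread.
class CharsetSketch {
    // Return the charset named in a Content-Type value, or the fallback
    // when the header is missing, has no charset parameter, or names an
    // encoding this JVM does not know.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Drop optional quotes around the value, e.g. charset="utf-8".
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

With a connection in hand, the call would look like charsetOf(conn.getContentType(), StandardCharsets.ISO_8859_1), and the result would be used to decode the response bytes.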
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
yusong198412@163.com - 20 Jul 2007 04:13 GMT
We can do it for you. Our company provides a web data extraction
service. For more information, please contact us:
http://www.knowlesys.com
Andrew Thompson - 09 Jul 2007 23:06 GMT
>I want to type a word in the search box of google like website and
>then extract results from the result page and store in an excel file.

Is that allowed?  Google has historically objected to
such access to the data they collect and present.

Signature

Andrew Thompson
http://www.athompson.info/andrew/

Twisted - 10 Jul 2007 04:27 GMT
> jobs...@hotmail.com wrote:
> >I want to type a word in the search box of google like website and
> >then extract results from the result page and store in an excel file.
>
> Is that allowed?  Google has historically objected to
> such access to the data they collect and present.

How is it any different in principle from viewing the page manually
and making a mental note of what you saw there, or bookmarking all the
results with right clicks in the browser, or some such?

Anyway, what Google doesn't know can't hurt it. Just don't republish
it without their permission, or generate too heavy a load on their
servers with automated traffic. Make sure it accesses and downloads no
faster than a human user would, and use the results locally/privately
only. Then a) you're not doing anything morally wrong and b) Google
doesn't know you're doing this thing that isn't morally wrong, but
that they might decide they don't like.
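The "no faster than a human user would" part can be sketched as a simple throttle that enforces a minimum gap between requests. Everything here is an assumption made for illustration: the class name PoliteThrottle, its method names, and whatever gap the caller picks. It says nothing about what any particular site permits.

```java
// Hypothetical helper: enforces a minimum pause between successive requests.
class PoliteThrottle {
    private final long minGapMillis;
    private long lastRequestMillis;
    private boolean first = true;

    PoliteThrottle(long minGapMillis) {
        this.minGapMillis = minGapMillis;
    }

    // How long to wait, given when the previous request was made.
    static long delayNeeded(long lastMillis, long nowMillis, long minGapMillis) {
        return Math.max(0L, minGapMillis - (nowMillis - lastMillis));
    }

    // Call before each request; sleeps only if the previous one was too recent.
    void awaitTurn() throws InterruptedException {
        if (!first) {
            long delay = delayNeeded(lastRequestMillis, System.currentTimeMillis(), minGapMillis);
            if (delay > 0) {
                Thread.sleep(delay);
            }
        }
        first = false;
        lastRequestMillis = System.currentTimeMillis();
    }
}
```

A caller would create one instance, for example new PoliteThrottle(5000) for an assumed five-second gap, and invoke awaitTurn() before every page download.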
Andrew Thompson - 20 Jul 2007 11:11 GMT
>I want to type a word in the search box of google like website and
>then extract results from the result page and store in an excel file.
>How can I programmatically  do the search and extract data?

Google does not condone such programmatic access to
search results, to the best of my knowledge.

Signature

Andrew Thompson
http://www.athompson.info/andrew/

Twisted - 21 Jul 2007 10:13 GMT
> jobs...@hotmail.com wrote:
> >I want to type a word in the search box of google like website and
[quoted text clipped - 3 lines]
> Google does not condone such programmatic access to
> search results, to the best of my knowledge.

I don't recall the OP asking for either your or Google's opinion on
that, but simply how to do it.

Or are we reaching the point now where there will be ubiquitous
enforcement of the wishes of all large corporations and a refusal by
most people to divulge any information that might enable someone to
act in any way contrary to same? If so, I'm packing my bags and moving
to someplace that is still sane. (Anyone know anywhere where society
still keeps big business in its place and supports the individual when
it comes down to choosing between an individual and a corporation, and
where the law isn't ludicrously business-centric and anti-consumer?)
Andrew Thompson - 21 Jul 2007 12:36 GMT
>> jobs...@hotmail.com wrote:
>> >I want to type a word in the search box of google like website and
[quoted text clipped - 4 lines]
>I don't recall the OP asking for either your or Google's opinion on
>that, but simply how to do it.

I don't recall asking for your opinion either, Twisted,
but given this is a discussion forum, it is not amazing
you would add it.  

..Welcome to the c.l.j.p. *discussion* forum*.

(* This is not a help desk)

Signature

Andrew Thompson
http://www.athompson.info/andrew/

nebulous99@gmail.com - 21 Jul 2007 13:45 GMT
> >> jobs...@hotmail.com wrote:
> >> >I want to type a word in the search box of google like website and
[quoted text clipped - 6 lines]
>
> I don't recall asking for your opinion either, Twisted,

The point being that your response to the OP was unuseful to the OP,
and appears to be a case of you playing at being rent-a-cop instead of
attempting to be helpful to someone with a coding question.
Andrew Thompson - 21 Jul 2007 15:37 GMT
>> >> jobs...@hotmail.com wrote:
>> >> >I want to type a word in the search box of google like website and
[quoted text clipped - 3 lines]
>
>The point being that your response to the OP was ..

..blah, blah, blah.   Try to get interesting.

Signature

Andrew Thompson
http://www.athompson.info/andrew/

Lew - 21 Jul 2007 16:53 GMT
>>>>> jobs...@hotmail.com wrote:
>>>>>> I want to type a word in the search box of google like website and
[quoted text clipped - 3 lines]
>
> .blah, blah, blah.   Try to get interesting.

Andrew's point could be /very/ helpful to the OP if it prevents jail time or a
massive judgment against them.

Signature

Lew

nebulous99@gmail.com - 21 Jul 2007 17:05 GMT
> > nebulou...@gmail.com wrote:
> >>>>> jobs...@hotmail.com wrote:
[quoted text clipped - 4 lines]
>
> > .blah, blah, blah.   Try to get interesting.

Pot, kettle, and all that.

> Andrew's point could be /very/ helpful to the OP if it prevents jail time or a
> massive judgment against them.

Unless the OP signed something, or does something dumb like scrape and
republish a huge amount of copyrighted stuff without permission, a
massive judgment seems unlikely, let alone jail time. Actual hacking
or commercial copyright infringement might lead to jail time. Merely
browsing a site with the browser software of his choice, without
either producing abnormally large traffic levels to the server (which
would get his usage noticed and might be treated as a DoS attack) or
republishing anything (which would get his usage noticed and might be
copyright infringement), certainly should do neither if the OP is in a
sane and just country. So unless he's in China or something...what
Google doesn't know won't hurt him. Or hurt Google.
anal_aviator - 25 Jul 2007 00:04 GMT
>>> nebulou...@gmail.com wrote:
>>>>>>> jobs...@hotmail.com wrote:
[quoted text clipped - 22 lines]
> sane and just country. So unless he's in China or something...what
> Google doesn't know won't hurt him. Or hurt Google.

don't sweat it,

Andrew pops his ugly head up now and again, whenever there's anything
unhelpful to be said, he's also quite a Lawyer is our Andrew, I believe he
consulted on the OJ case .

just consider him the news group pet, feed him or Kick him it's up to you.
G. Garrett Campbell - 26 Jul 2007 08:03 GMT
Isn't a web browser a program?
Isn't the response just a page fetched from a URL?

One would just need to construct a URL with an appropriate POST from the
search string and read the response.
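Constructing the request from the search string might look like the sketch below. It assumes the site accepts the term as a GET query parameter, which many search forms do; a POST would send the same encoded pair in the request body instead. The host search.example, the parameter name q, and the class SearchUrlSketch are all hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for building a search URL from a user-supplied term.
class SearchUrlSketch {
    // URL-encode the term so spaces, '&' and similar characters survive intact.
    static String searchUrl(String base, String paramName, String term) {
        try {
            return base + "?" + paramName + "=" + URLEncoder.encode(term, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }
}
```

For instance, searchUrl("http://search.example/find", "q", "java & html") gives http://search.example/find?q=java+%26+html, which can then be opened like any other URL.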

For personal use, how can that be different than a web browser?

>>>> nebulou...@gmail.com wrote:
>>>>>>>> jobs...@hotmail.com wrote:
[quoted text clipped - 31 lines]
>
> just consider him the news group pet, feed him or Kick him it's up to you.

