Java Forum / General / June 2006


program works on my computer but not on live site?

cmills28@yahoo.com - 27 Jun 2006 02:24 GMT
I'm trying to build a small screen scraper to extract info from a
certain website, entering a search into a search engine & then parsing
the results.  I saved a few test pages on my computer to use, which
work fine -- everything I'm trying to extract is found & parsed
correctly.  But when I connect to the live site, things blow up --
results are wrong; in some fields I'm extracting I get nothing, other
fields I get partial results but with html tags in it, which should be
stripped out.

This is just a small utility running from the command line, which
connects to the site, and pulls the appropriate data.

I'm using Java5, plus the regular expressions & string functions.  But
like I said, the program works on my test cases on my local computer,
but not on live data.

Suggestions?

Thanks in advance!
Chris
hiwa - 27 Jun 2006 03:30 GMT
cmills28@yahoo.com :

> I'm trying to build a small screen scraper to extract info from a
> certain website, entering a search into a search engine & then parsing
[quoted text clipped - 16 lines]
> Thanks in advance!
> Chris
The line 2389 and 3921 of your code have bugs.
christopher@dailycrossword.com - 27 Jun 2006 04:38 GMT
Some websites don't like spiders or bots, and they inspect the request
headers your connection sends (that's also how they compile stats, etc.).
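
If the site is keying on request headers, one thing to try is sending headers that look like a browser's. By default, java.net.HttpURLConnection identifies itself with a User-Agent of the form "Java/<version>". The sketch below is only an illustration: the class name BrowserLikeRequest and the User-Agent string are made up, and whether the site serves a different page based on this header is an assumption you would need to check. Also check the site's terms of use before automating queries.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class BrowserLikeRequest {
    // Hypothetical value for illustration; copy the real string your browser sends.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleScraper/0.1)";

    // Prepares (but does not yet send) a request with explicit headers.
    // Expects an http:// or https:// URL.
    static HttpURLConnection open(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setRequestProperty("Accept", "text/html");
        return conn;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BrowserLikeRequest <url>");
            return;
        }
        HttpURLConnection conn = open(args[0]);
        // The request is actually sent when the response is first read.
        System.out.println(conn.getResponseCode() + " " + conn.getContentType());
        conn.disconnect();
    }
}
```

Comparing the page fetched this way with the page fetched using the default headers would show whether the headers make any difference.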

> cmills28@yahoo.com :
>
> > I'm trying to build a small screen scraper to extract info from a
> > certain website, entering a search into a search engine & then parsing

> > Suggestions?
> >
> > Thanks in advance!
> > Chris
> The line 2389 and 3921 of your code have bugs.
haha --
Chris Uppal - 27 Jun 2006 11:31 GMT
> This is just a small utility running from the command line, which
> connects to the site, and pulls the appropriate data.
>
> I'm using Java5, plus the regular expressions & string functions.  But
> like I said, the program works on my test cases on my local computer,
> but not on live data.

So save the text that they return /before/ parsing it (to a file) and then look
at the file to see what's going wrong.  Some possibilities:

- they are looking at the headers and choosing to send back a different web page
from the one you expect.

- your regular expression code is not capable of parsing the HTML they use.

Both are likely, BTW.  Search sites don't like automated queries ('cos no one
will read the advertisements).   And it's almost impossible to build a reliable
HTML parser out of regexps.

   -- chris
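
The "save it before parsing" step suggested above could look roughly like the following. This is a minimal sketch, not the original poster's code: the class name RawPageDump and its command-line arguments are made up for illustration, and it uses java.nio.file, which needs Java 7 or later rather than the Java 5 mentioned in the thread. It writes the response bytes to disk untouched, so the file can be compared with the test pages saved from a browser.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

final class RawPageDump {
    // Copies the stream to the target file byte for byte and returns the byte count.
    // No decoding or parsing happens here, so the file shows exactly what was sent.
    static long dump(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java RawPageDump <url> <output-file>");
            return;
        }
        URLConnection conn = URI.create(args[0]).toURL().openConnection();
        try (InputStream in = conn.getInputStream()) {
            long n = dump(in, Paths.get(args[1]));
            System.out.println("wrote " + n + " bytes to " + args[1]);
        }
    }
}
```

Running the existing parser against the dumped file, instead of the live connection, separates the two possibilities: if it fails on the file too, the live HTML differs from the saved test pages; if it succeeds, the problem is in how the response is being read.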