I'm trying to build a small screen scraper to extract info from a
certain website, entering a search into a search engine & then parsing
the results. I saved a few test pages on my computer to use, which
work fine -- everything I'm trying to extract is found & parsed
correctly. But when I connect to the live site, things blow up and the
results are wrong: for some fields I'm extracting I get nothing, and for
others I get partial results that still contain HTML tags, which should
have been stripped out.
This is just a small utility running from the command line, which
connects to the site, and pulls the appropriate data.
I'm using Java5, plus the regular expressions & string functions. But
like I said, the program works on my test cases on my local computer,
but not on live data.
Suggestions?
Thanks in advance!
Chris
hiwa - 27 Jun 2006 03:30 GMT
cmills28@yahoo.com :
> I'm trying to build a small screen scraper to extract info from a
> certain website, entering a search into a search engine & then parsing
> Thanks in advance!
> Chris
The line 2389 and 3921 of your code have bugs.
christopher@dailycrossword.com - 27 Jun 2006 04:38 GMT
Some websites don't like spiders or bots, and they read the request
headers your connection sends (that's also how they compile stats, etc.).
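For what it's worth, here is a rough sketch of setting those headers yourself with `HttpURLConnection`. The class name, the User-Agent string and the Accept value are all made up for illustration; I have no idea what the site in question actually checks, so treat the values as placeholders.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class HeaderSketch {
    // Hypothetical value; the site may expect something else entirely.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleScraper/0.1)";

    // Prepares (but does not yet send) a GET request with explicit headers.
    static HttpURLConnection open(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setRequestProperty("Accept", "text/html");
        return conn;
    }
}
```

Nothing goes over the wire until you call `connect()` or `getInputStream()` on the returned connection, so the headers can be adjusted up to that point.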
> cmills28@yahoo.com :
>
> > I'm trying to build a small screen scraper to extract info from a
> > certain website, entering a search into a search engine & then parsing
> > Suggestions?
> >
> > Thanks in advance!
> > Chris
> The line 2389 and 3921 of your code have bugs.
haha --
Chris Uppal - 27 Jun 2006 11:31 GMT
> This is just a small utility running from the command line, which
> connects to the site, and pulls the appropriate data.
>
> I'm using Java5, plus the regular expressions & string functions. But
> like I said, the program works on my test cases on my local computer,
> but not on live data.
So save the text that they return /before/ parsing it (to a file) and then look
at the file to see what's going wrong. Some possibilities:
- they are looking at the headers and choosing to send back a different web page
from the one you expect.
- your regular expression code is not capable of parsing the HTML they use.
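Saving the raw response could look something like the sketch below. It assumes plain `java.io` and `java.net.URL` (so it should be fine on Java 5); the class and method names are invented for this example, and the bytes are written untouched so the file shows exactly what the server sent.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

class RawPageDump {
    // Reads the stream to the end and returns the bytes unchanged.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    // Writes the bytes to a file so they can be inspected or diffed later.
    static void save(byte[] data, String fileName) throws IOException {
        OutputStream out = new FileOutputStream(fileName);
        try {
            out.write(data);
        } finally {
            out.close();
        }
    }

    // Usage: java RawPageDump <url> <output-file>
    public static void main(String[] args) throws IOException {
        InputStream in = new URL(args[0]).openStream();
        try {
            save(readAll(in), args[1]);
        } finally {
            in.close();
        }
    }
}
```

Diffing that file against one of the pages saved from the browser should show fairly quickly whether the server is sending something different.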
Both are likely, BTW. Search sites don't like automated queries ('cos no one
will read the advertisements). And it's almost impossible to build a reliable
HTML parser out of regexps.
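To illustrate the regexp point with a toy case (the markup and patterns here are invented, not taken from the site in question): a pattern written against a tidy saved page can find nothing once the quoting or attributes change, and a loosened pattern can then return the inner tags along with the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFragility {
    // Hypothetical pattern, written against a tidy saved page.
    static final Pattern STRICT = Pattern.compile("<td class=\"price\">(.*?)</td>");
    // Looser variant: any attributes, and '.' may cross line breaks.
    static final Pattern LOOSE = Pattern.compile("<td[^>]*>(.*?)</td>", Pattern.DOTALL);

    // Returns the first captured group, or null when the pattern finds nothing.
    static String extract(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String saved = "<td class=\"price\">42</td>";
        String live = "<td class='price' align=\"right\">\n<b>42</b>\n</td>";
        System.out.println(extract(STRICT, saved));       // 42
        System.out.println(extract(STRICT, live));        // null
        System.out.println(extract(LOOSE, live).trim());  // <b>42</b>
    }
}
```

That is the "nothing in some fields, tags in others" pattern of failure: each fix to the regexp handles one variation in the markup and tends to expose the next.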
-- chris