Hey, is it possible to do the following?
1. Enter a search term on ask.com (manually) and hit search
2. Once the results page is shown, view the page source and save it to
the hard disk (manually)
3. Use a Java program with an embedded HTML parser to extract the
returned URLs
4. Once the URLs are extracted, store them automatically in
a MySQL database.
The database has a single table with the following columns:
Query - Stores the search query used, as a string.
SearchEngine - Stores the name of the search engine, as a string (e.g. Ask)
ReturnedURL - Stores the returned URL, as a string (taken from
the parsed page source)
URLNo - Stores the position of the returned URL, as an int (i.e. the first
URL is number 1 and so on)
Chris - 20 Mar 2007 03:35 GMT
> Hey, is it possible to do the following?
>
[quoted text clipped - 12 lines]
> URLNo - Stores the position of the returned URL, as an int (i.e. the first
> URL is number 1 and so on)
Yes, it is possible, and there are lots of ways to do it. The trick is to
find a reliable way to recognize the parts of the page you care about
(here, the result links).
I would start by reading the page into a String or char array, and then
seeing if I could write regular expressions to recognize things. See
java.util.regex.
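As a rough sketch of that regex approach, something like the following could pull absolute links out of the saved page. The class name UrlExtractor and the pattern are my own illustration, not anything specific to Ask: the pattern matches every quoted http(s) href, so it will also pick up navigation and ad links, and you would need to tighten it after looking at the actual page source. The file is assumed to be UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlExtractor {
    // Matches href="..." or href='...' holding an absolute http(s) URL.
    // Group 1 is the opening quote, group 2 is the URL itself.
    // Illustrative pattern only; it is not tuned to any search engine's markup.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*([\"'])(https?://[^\"'\\s>]+)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the matched URLs in the order they appear in the page.
    public static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java UrlExtractor <saved-page.html>");
            return;
        }
        // Read the whole saved page into a String (assumes UTF-8).
        String html = new String(
                Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        List<String> urls = extractUrls(html);
        for (int i = 0; i < urls.size(); i++) {
            // Position starts at 1, matching the URLNo column in the question.
            System.out.println((i + 1) + "\t" + urls.get(i));
        }
    }
}
```

The URLs come back exactly as written in the source, so something like &amp; inside an href is left undecoded.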
Don't use Xerces. It is an XML parser and will choke on any ill-formed HTML.
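For the storage step in the original question, a minimal JDBC sketch might look like the one below. The column names come from the question; everything else is a placeholder I made up for illustration: the class name UrlStore, the table name search_results, the database name searchdb, and the login details. It also assumes the MySQL JDBC driver (Connector/J) is on the classpath at run time.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.List;

public class UrlStore {
    // Table name "search_results" is hypothetical; the four columns are
    // the ones listed in the question.
    public static final String INSERT_SQL =
            "INSERT INTO search_results (Query, SearchEngine, ReturnedURL, URLNo) "
            + "VALUES (?, ?, ?, ?)";

    // Inserts one row per URL; URLNo is the 1-based position in the list.
    public static void store(Connection conn, String query, String engine,
                             List<String> urls) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            int position = 1;
            for (String url : urls) {
                ps.setString(1, query);
                ps.setString(2, engine);
                ps.setString(3, url);
                ps.setInt(4, position++);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }

    public static void main(String[] args) throws SQLException {
        // Placeholder connection details; replace with your own.
        String jdbcUrl = "jdbc:mysql://localhost:3306/searchdb";
        try (Connection conn =
                     DriverManager.getConnection(jdbcUrl, "user", "password")) {
            store(conn, "example query", "Ask",
                    Arrays.asList("http://example.com/a", "http://example.org/b"));
        }
    }
}
```

Using a PreparedStatement rather than building the SQL by string concatenation means quotes or other odd characters in a query or URL cannot break the statement.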