
Java Forum / General / March 2007


HTML parsing using Java and Xerces

Camk - 19 Mar 2007 15:38 GMT
Hey, is it possible to do the following?

1. Enter a search term in ask.com (manually) and hit search.
2. Once the results page is shown, view the page source and save it to
the hard disk (manually).
3. Use a Java program with an embedded HTML parser to extract the
returned URLs.
4. Once the URLs are extracted, store them automatically in
a MySQL database.
The database has a single table with the following columns:
Query - Stores the search query that was used, as a string.
SearchEngine - Stores the name of the search engine, as a string (e.g. Ask).
ReturnedURL - Stores the returned URL, as a string (taken from
the parsed page source).
URLNo - Stores an int giving the position of the returned URL (i.e. the first
URL is number 1 and so on).
Chris - 20 Mar 2007 03:35 GMT
> Hey, Is it possible to do the following?
>
[quoted text clipped - 12 lines]
> URLNo - Stores an int the position of the Returned URL (i.e. the first
> URL is number 1 and so on)

Yes, it is possible. Lots of ways to do it. The trick is to find a
reliable way to recognize the various entities in the page.

I would start by reading the saved page into a String or char array, and
then seeing if I could write regular expressions that recognize the
result links. See java.util.regex.
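
A rough sketch of that regex approach, for illustration only. The class name LinkExtractor and the href pattern are assumptions, not anything specific to ask.com; a real results page will also contain navigation and advertising links, so the pattern would probably need tightening after looking at the saved source.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls absolute http(s) links out of a saved HTML page.
class LinkExtractor {

    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    // Group 1 is the quote character, group 2 is the URL between the quotes.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group(2);
            // Keep only absolute links; relative ones are usually site navigation.
            if (url.startsWith("http://") || url.startsWith("https://")) {
                links.add(url);
            }
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the page source saved in step 2.
        // UTF-8 is an assumption about how the page was saved.
        String html = new String(
                Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        int position = 1;
        for (String url : extractLinks(html)) {
            System.out.println(position++ + "\t" + url);
        }
    }
}
```

The list comes back in page order, so an element's index plus one can serve as the URLNo value.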

Don't use Xerces. It is an XML parser, so it will choke on any HTML
that isn't well-formed.
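
For step 4 of the question, storing the URLs could look something like the sketch below, using plain JDBC. The column names come from the original post; the table name "results" and the class and method names are made up for illustration, and the code assumes a java.sql.Connection to the MySQL database has already been opened.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns extracted URLs into rows and inserts them via JDBC.
class ResultStore {

    // One row of the table described in the question.
    static final class Row {
        final String query;
        final String searchEngine;
        final String returnedUrl;
        final int urlNo;

        Row(String query, String searchEngine, String returnedUrl, int urlNo) {
            this.query = query;
            this.searchEngine = searchEngine;
            this.returnedUrl = returnedUrl;
            this.urlNo = urlNo;
        }
    }

    // Numbers the URLs from 1 in the order they appeared on the page.
    static List<Row> toRows(String query, String searchEngine, List<String> urls) {
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            rows.add(new Row(query, searchEngine, urls.get(i), i + 1));
        }
        return rows;
    }

    // "results" is an assumed table name; the post does not give one.
    static void insertAll(Connection conn, List<Row> rows) throws SQLException {
        String sql = "INSERT INTO results (Query, SearchEngine, ReturnedURL, URLNo) "
                + "VALUES (?, ?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (Row r : rows) {
                ps.setString(1, r.query);
                ps.setString(2, r.searchEngine);
                ps.setString(3, r.returnedUrl);
                ps.setInt(4, r.urlNo);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}
```

The connection would typically come from java.sql.DriverManager.getConnection with a MySQL JDBC driver on the classpath. A PreparedStatement is used here so that quotes or other odd characters in a URL cannot break the SQL.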

