
Java Forum / General / May 2007

Web scraping

raybonds@gmail.com - 03 May 2007 14:05 GMT
I am trying to extract data from a website and store it.  Would
someone pose different ways to approach this problem or even
literature that I could read to help?
Lulu58e2 - 03 May 2007 15:16 GMT
On May 3, 7:05 am, raybo...@gmail.com wrote:
> I am trying to extract data from a website and store it.  Would
> someone pose different ways to approach this problem or even
> literature that I could read to help?

This is pretty quick in Groovy using the following:

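// NekoHTML's SAX parser tolerates malformed HTML; by default it reports element names in upper case
def parser = new org.cyberneko.html.parsers.SAXParser()
parser.setFeature('http://xml.org/sax/features/namespaces', false)
def html = new XmlSlurper(parser).parse('http://www.somepage.html')
// Example path only; adjust the element indices to the structure of the actual page
html.BODY.DIV[2].P[4].LI[2].TABLE[0].TR.each { /* do something */ }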

Thomas Fritsch - 03 May 2007 16:16 GMT
> I am trying to extract data from a website and store it.  Would
> someone pose different ways to approach this problem or even
> literature that I could read to help?
Linux has the command-line tool "wget" for downloading websites.
See http://www.google.com/search?q=wget
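
If you would rather stay in Java than shell out to wget, a single page can be saved with the standard library alone. The following is only a minimal sketch: the class name PageDownloader, the placeholder URL and the default file name page.html are assumptions made for illustration, and it fetches one resource rather than mirroring a whole site the way wget can.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: saves a single resource to disk (Java 11+).
final class PageDownloader {

    /** Copies the resource at source to target, replacing any existing file; returns the byte count. */
    static long download(URI source, Path target) throws IOException {
        try (InputStream in = source.toURL().openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL and file name; substitute a page you are allowed to fetch.
        URI source = URI.create(args.length > 0 ? args[0] : "http://www.example.com/");
        Path target = Path.of(args.length > 1 ? args[1] : "page.html");
        System.out.println("Saved " + download(source, target) + " bytes to " + target);
    }
}
```

The download method works with any URL scheme the JDK supports, so it can be tried against a local file: URI before pointing it at a live site.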

Thomas

Tris Orendorff - 03 May 2007 20:47 GMT
> I am trying to extract data from a website and store it.  Would
> someone pose different ways to approach this problem or even
> literature that I could read to help?

Here's the info from a spider I have used a few times:

/**
* This class implements a reusable spider. To use this
* class you must have a class set up to receive
* the information found by the spider. That class must
* implement the ISpiderReportable interface. Written by
* Jeff Heaton. Jeff Heaton is the author of "Programming
* Spiders, Bots, and Aggregators" by Sybex. Jeff can be
* contacted through his web site at http://www.jeffheaton.com.
*
* @author Jeff Heaton(http://www.jeffheaton.com)
* @version 1.0
*/
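
To make the spider-plus-callback idea concrete, here is a rough sketch of how such a design can look. It is not Jeff Heaton's code or API: MiniSpider, Listener, extractLinks and crawl are hypothetical names, the fetcher is passed in as a function so the example runs without network access, and the regex-based link matcher is a deliberate simplification that a real spider would replace with an HTML parser.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a spider that reports each page to a callback.
final class MiniSpider {

    /** Hypothetical callback that receives every page the spider visits. */
    interface Listener {
        void pageFound(URI url, String html);
    }

    // Naive href matcher: adequate for a sketch, not for arbitrary real-world HTML.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the links found in html, resolved against the page's own URL. */
    static List<URI> extractLinks(URI base, String html) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()));
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URIs.
            }
        }
        return links;
    }

    /** Breadth-first crawl restricted to the start URL's host and to at most maxPages pages. */
    static void crawl(URI start, Function<URI, String> fetcher, Listener listener, int maxPages) {
        Deque<URI> queue = new ArrayDeque<>();
        Set<URI> seen = new HashSet<>();
        queue.add(start);
        seen.add(start);
        int visited = 0;
        while (!queue.isEmpty() && visited < maxPages) {
            URI url = queue.remove();
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // fetch failed or page missing
            }
            visited++;
            listener.pageFound(url, html);
            for (URI link : extractLinks(url, html)) {
                boolean sameHost = start.getHost() != null
                        && start.getHost().equalsIgnoreCase(link.getHost());
                if (sameHost && seen.add(link)) {
                    queue.add(link);
                }
            }
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for a website, so the demo needs no network access.
        Map<URI, String> site = Map.of(
                URI.create("http://example.test/"),
                "<a href=\"a.html\">A</a> <a href=\"http://other.test/\">elsewhere</a>",
                URI.create("http://example.test/a.html"),
                "<a href=\"/\">home</a>");
        crawl(URI.create("http://example.test/"), site::get,
                (url, html) -> System.out.println("Visited " + url), 10);
    }
}
```

Passing the fetcher in as a function keeps the crawl logic separate from the download logic, so rate limiting, robots.txt handling or a real HTTP client can be added without touching the traversal.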

Tris Orendorff
[Q: What kind of modem did Jimi Hendrix use?
A: A purple Hayes.]

Ian Wilson - 04 May 2007 10:14 GMT
> I am trying to extract data from a website and store it.  Would
> someone pose different ways to approach this problem or even
> literature that I could read to help?

1. Use the site's API or RSS feed instead, if available.
2. Check the site's terms and conditions of use.
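
As an illustration of point 1, an RSS feed is plain XML, so the JDK's built-in DOM parser is enough to pull out the item titles. This is only a sketch: the class name RssTitles and the sample feed are invented for the example, and it assumes a plain RSS 2.0 feed with unnamespaced item and title elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: lists the <title> of every <item> in an RSS 2.0 feed.
final class RssTitles {

    static List<String> itemTitles(InputSource feed) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Refuse DOCTYPE declarations so untrusted feeds cannot pull in external entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(feed);

        List<String> titles = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList title = item.getElementsByTagName("title");
            if (title.getLength() > 0) {
                titles.add(title.item(0).getTextContent().trim());
            }
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        // Invented sample feed; a real one would be read from the site's feed URL.
        String sample = "<rss version=\"2.0\"><channel><title>Example feed</title>"
                + "<item><title>First post</title></item>"
                + "<item><title>Second post</title></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(new InputSource(new StringReader(sample))));
    }
}
```

Because a feed is structured data published by the site itself, this kind of code tends to survive page redesigns that would break a scraper tied to a particular HTML layout.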

