I am trying to extract data from a website and store it. Would
someone pose different ways to approach this problem or even
literature that I could read to help?
Lulu58e2 - 03 May 2007 15:16 GMT
On May 3, 7:05 am, raybo...@gmail.com wrote:
> I am trying to extract data from a website and store it. Would
> someone pose different ways to approach this problem or even
> literature that I could read to help?
This is pretty quick in Groovy using the following:
def parser = new org.cyberneko.html.parsers.SAXParser()
parser.setFeature('http://xml.org/sax/features/namespaces', false)
def HTML = new XmlSlurper(parser).parse('http://www.somepage.html')
HTML.BODY.DIV[2].P[4].LI[2].TABLE[0].TR.each { row ->
    /* do something with each row */
} // as an example
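For comparison, a rough plain-Java sketch of the same idea (walk a path down to the table rows, then read the cells) could look like the following. It is not a port of the Groovy snippet: it assumes the page is already well-formed XHTML without a namespace, which real-world HTML often is not (hence NekoHTML above). The names TableRows, extract and rowPath are made up for illustration. Note that XPath positions are 1-based, whereas the GPath indices above are 0-based.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: assumes well-formed XHTML with no XML namespace.
final class TableRows {

    /** Returns the trimmed text of every td under each row selected by rowPath. */
    static List<List<String>> extract(String xhtml, String rowPath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // rowPath must select elements, e.g. "/html/body/div[1]/table[1]/tr"
        NodeList rows = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(rowPath, doc, XPathConstants.NODESET);

        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < rows.getLength(); i++) {
            // Collects all descendant td elements, including those of nested tables.
            NodeList cells = ((Element) rows.item(i)).getElementsByTagName("td");
            List<String> row = new ArrayList<>();
            for (int j = 0; j < cells.getLength(); j++) {
                row.add(cells.item(j).getTextContent().trim());
            }
            result.add(row);
        }
        return result;
    }
}
```

Calling TableRows.extract(pageText, "/html/body/div[1]/table[1]/tr") gives one list of cell strings per row, which you can then write to a file or a database.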
Thomas Fritsch - 03 May 2007 16:16 GMT
> I am trying to extract data from a website and store it. Would
> someone pose different ways to approach this problem or even
> literature that I could read to help?
Linux has the command-line tool "wget" for downloading websites.
See http://www.google.com/search?q=wget
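If you would rather do the download step from Java than call wget, a minimal sketch might look like this. It only fetches one URL and stores the raw bytes; it does not follow links, retry, or set a user agent. The class and method names (PageSaver, save, download) are my own, not from any library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for storing a fetched page on disk.
final class PageSaver {

    /** Copies the whole stream into target, replacing any existing file; returns bytes written. */
    static long save(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Opens the URL and stores the raw response body in target. */
    static long download(String url, Path target) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return save(in, target);
        }
    }
}
```

Keeping save() separate from download() means the storing part can be tried out with any InputStream, without touching the network.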

Signature
Thomas
Tris Orendorff - 03 May 2007 20:47 GMT
> I am trying to extract data from a website and store it. Would
> someone pose different ways to approach this problem or even
> literature that I could read to help?
Here's the info from a spider I have used a few times:
/**
* This class implements a reusable spider. To use this
* class you must have a class set up to receive
* the information found by the spider. That class must
* implement the ISpiderReportable interface. Written by
* Jeff Heaton. Jeff Heaton is the author of "Programming
* Spiders, Bots, and Aggregators" by Sybex. Jeff can be
* contacted through his web site at http://www.jeffheaton.com.
*
* @author Jeff Heaton(http://www.jeffheaton.com)
* @version 1.0
*/
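To give a feel for the "spider reports to your class" idea, here is a small sketch of my own. It is NOT Jeff Heaton's code, and MiniSpider.Listener is a hypothetical stand-in rather than the real ISpiderReportable interface. The fetch function is passed in as a parameter (an assumption made so the sketch can run against in-memory pages), and links are found with a simple regex that will miss a lot of real-world markup.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical breadth-first spider; stays on the host of the start URL.
final class MiniSpider {

    /** Hypothetical callback: your class receives each page the spider fetches. */
    interface Listener {
        void pageFound(String url, String html);
    }

    // Matches double-quoted href values, dropping any #fragment part.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"#]+)", Pattern.CASE_INSENSITIVE);

    /** Visits at most maxPages pages and returns the URLs in visiting order. */
    static List<String> crawl(String start, Function<String, String> fetch,
                              Listener listener, int maxPages) {
        String host = URI.create(start).getHost();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        List<String> visited = new ArrayList<>();
        seen.add(start);
        queue.add(start);

        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            String html = fetch.apply(url);
            if (html == null) {
                continue; // page could not be fetched
            }
            visited.add(url);
            listener.pageFound(url, html);

            Matcher m = HREF.matcher(html);
            while (m.find()) {
                URI link;
                try {
                    link = URI.create(url).resolve(m.group(1).trim());
                } catch (IllegalArgumentException e) {
                    continue; // skip links that are not valid URIs
                }
                // Queue each same-host link only once.
                if (host != null && host.equals(link.getHost()) && seen.add(link.toString())) {
                    queue.add(link.toString());
                }
            }
        }
        return visited;
    }
}
```

In real use, fetch would download the page text for a URL and the listener would store whatever data you are after.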

Signature
Tris Orendorff
[Q: What kind of modem did Jimi Hendrix use?
A: A purple Hayes.]
Ian Wilson - 04 May 2007 10:14 GMT
> I am trying to extract data from a website and store it. Would
> someone pose different ways to approach this problem or even
> literature that I could read to help?
1. Use the site's API or RSS feed instead, if available.
2. Check the site's terms and conditions of use.
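As an illustration of point 1, reading an RSS feed is usually much less fragile than picking a page apart. A minimal sketch, assuming a plain RSS 2.0 feed where each <item> has a <title> and a <link>, and using only the JDK's XML parser (the class and method names are hypothetical):

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: assumes RSS 2.0 style <item><title/><link/></item> elements.
final class RssItems {

    /** Maps each item's title to its link, in feed order. */
    static Map<String, String> linksByTitle(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(rssXml)));

        Map<String, String> result = new LinkedHashMap<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            result.put(childText(item, "title"), childText(item, "link"));
        }
        return result;
    }

    /** Text of the first descendant with the given tag, or "" if there is none. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }
}
```

Items with the same title overwrite each other in this sketch; key on the link instead if that matters for your data.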