Hi. I want to spider just a few websites, and not each entire site, just 1 or 2
levels deep. Can I use JSP or HttpServlets for this? Does anyone know of
a tutorial, code sample, or book that explains this? I usually use JSP and
HttpServlets for processing requests, but here I want to get the data from a
different website.
Or do I have to spider using Perl, then store the results in a database and
retrieve them using JSP/HttpServlets? Thank you.
Roedy Green - 24 Dec 2005 08:42 GMT
>Hi. I want to spider just a few websites, not the entire site, just 1 or 2
>levels deep. So can I use JSP or httpservlets for this? Does anyone know of
>some tutorial/code/book that explains this? I usually use JSP and
>httpservlets for processing requests, but I want to get the data from a
>different website.
See http://mindprod.com/applets/fileio.htm
for how to do a GET.
Then you have to find the links to spider, e.g.
with the pattern
<a href="xxxx"
You can crudely use indexOf( "<a href=" )
or you can use a regex if you want to catch squirrelly stuff like
extra spaces or extra parameters.
See http://mindprod.com/jgloss/regex.html
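As a rough illustration of the regex approach, here is a minimal sketch using java.util.regex. The class name LinkExtractor and the pattern itself are my own assumptions, not taken from the pages linked above. It tolerates extra whitespace, other attributes before href, and either quote style, but it is still crude: it ignores unquoted hrefs and will also pick up attributes such as data-href.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class LinkExtractor {
    // <a, any other attributes, then href = "value" or href = 'value'.
    // Group 1 is the opening quote, group 2 is the link itself.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the raw href values found in the HTML, in document order. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2).trim());
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"a.html\">A</a> "
                + "<A  class='x' HREF = 'b.html?x=1'>B</A></p>";
        System.out.println(extractLinks(html));
    }
}
```

The values come back exactly as written in the page, so relative links still need to be resolved against the page's own URL before they are fetched.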
You add the links to a queue of links to be spidered.
See http://mindprod.com/queue.html
Then you spawn up to N threads that each grab the next queue item and
spider it.
See http://mindprod.com/projects/htmlbrokenlink.html
for more details.
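The queue-plus-threads idea, limited to a couple of levels, might look something like the sketch below. This is only one way to do it, not the code from the pages linked above: it lets a fixed thread pool play the role of the queue and its N worker threads, and it works through the site one level at a time so the depth limit is easy to enforce. The class name DepthLimitedSpider and the fetchLinks callback are hypothetical; in real use the callback would download a page and return the links on it, while here main feeds it an in-memory map so the sketch runs without a network.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical class, for illustration only.
public class DepthLimitedSpider {
    /**
     * Returns every URL discovered within maxDepth link hops of startUrl.
     * Pages at depths 0 .. maxDepth-1 are fetched; URLs first seen at
     * depth maxDepth are recorded but not fetched.
     * fetchLinks is an assumed callback: URL in, links on that page out.
     */
    public static Set<String> crawl(String startUrl, int maxDepth, int threads,
                                    Function<String, List<String>> fetchLinks)
            throws Exception {
        Set<String> seen = ConcurrentHashMap.newKeySet();
        seen.add(startUrl);
        List<String> frontier = Collections.singletonList(startUrl);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
                // Queue one fetch task per URL at this level; the pool's
                // worker threads pick them up as they become free.
                List<Future<List<String>>> pending = new ArrayList<>();
                for (String url : frontier) {
                    pending.add(pool.submit(() -> fetchLinks.apply(url)));
                }
                // Collect results and keep only links not seen before.
                List<String> next = new ArrayList<>();
                for (Future<List<String>> result : pending) {
                    for (String link : result.get()) {
                        if (seen.add(link)) {
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
        } finally {
            pool.shutdown();
        }
        return seen;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a real site: page -> links on that page.
        Map<String, List<String>> site = new HashMap<>();
        site.put("/index", Arrays.asList("/a", "/b"));
        site.put("/a", Arrays.asList("/a1", "/index"));
        site.put("/a1", Arrays.asList("/deep"));
        Set<String> found = crawl("/index", 2, 4,
                url -> site.getOrDefault(url, Collections.emptyList()));
        System.out.println(found); // "/deep" is 3 hops away, so it is not included
    }
}
```

The seen set is what stops the spider from chasing the same link twice when pages link back to each other.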

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
John C. Bollinger - 26 Dec 2005 03:15 GMT
> Hi. I want to spider just a few websites, not the entire site, just 1 or 2
> levels deep. So can I use JSP or httpservlets for this? Does anyone know of
[quoted text clipped - 4 lines]
> Or do I have to spider using perl, then store it in a database and retrieve
> it using JSP/httpservlets? Thank you.
JSP and servlets are mechanisms for generating dynamic responses to HTTP
requests. They are most often used for serving HTML pages. They have
no special mechanism beyond any other Java code for making
general-purpose HTTP requests or doing anything with the results of
such a request.
Even though JSP and servlets specifically would be inappropriate choices
for a web spider, that does not mean that Java in general is wrong for
the task. To the contrary, the Java platform library has good support
for a wide variety of network- and web-oriented tasks, and there are a
multitude of third-party libraries that build further on that foundation.
Look at the URL, URLConnection, and HttpURLConnection classes in the
java.net package to start, and perhaps at DOM (package org.w3c.dom) for
document analysis. You might also find the Jakarta Commons HttpClient
library useful: http://jakarta.apache.org/commons/httpclient/
There are many other resources available.
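To give an idea of what the java.net route looks like, here is a minimal sketch of a GET with URL and HttpURLConnection. The class name PageFetcher, the 10-second timeouts, and the decision to decode the body as UTF-8 are all assumptions made for the example; a real spider would want to look at the Content-Type header for the charset and decide how to treat redirects and error statuses.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
public class PageFetcher {
    /** Fetches the page at the given http(s) address and returns its body as text. */
    public static String fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10_000); // milliseconds; arbitrary choice
        conn.setReadTimeout(10_000);
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("HTTP " + status + " for " + address);
            }
            try (InputStream in = conn.getInputStream()) {
                return readAll(in);
            }
        } finally {
            conn.disconnect();
        }
    }

    /** Reads the whole stream and decodes it as UTF-8 (an assumption). */
    public static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Pass the address on the command line, e.g. java PageFetcher http://example.com/
        System.out.println(fetch(args[0]));
    }
}
```

None of this needs a servlet container; it runs as an ordinary Java program, and the text it returns can be handed to whatever does the link extraction.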
As for displaying pages previously retrieved by your spider, chances are
that a fairly simple servlet could handle the job admirably. There
might be reasons to do it with JSP / custom tags instead, but that
approach wouldn't be my first inclination.

John Bollinger
jobollin@indiana.edu