Java Forum / General / December 2005

JSP or httpservlet for Java spider?

Greg Peters - 24 Dec 2005 05:25 GMT
Hi. I want to spider just a few websites, not the entire site, just 1 or 2
levels deep. So can I use JSP or httpservlets for this? Does anyone know of
some tutorial/code/book that explains this? I usually use JSP and
httpservlets for processing requests, but I want to get the data from a
different website.

Or do I have to spider using perl, then store it in a database and retrieve
it using JSP/httpservlets? Thank you.
Roedy Green - 24 Dec 2005 08:42 GMT
>Hi. I want to spider just a few websites, not the entire site, just 1 or 2
>levels deep. So can I use JSP or httpservlets for this? Does anyone know of
>some tutorial/code/book that explains this? I usually use JSP and
>httpservlets for processing requests, but I want to get the data from a
>different website.

see http://mindprod.com/applets/fileio.htm
for how to do GET.

Then you have to find the links to spider, e.g. anything matching the
pattern
<a href="xxxx"

You can crudely use indexOf( "<a href=" ),
or you can use a regex if you want to catch squirrelly stuff like
extra spaces or parms.

See http://mindprod.com/jgloss/regex.html
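
As a rough illustration of the regex approach, here is a minimal sketch using java.util.regex. The class name LinkExtractor and the pattern itself are assumptions made for this example, not code from the linked pages. The pattern tolerates extra whitespace, other attributes before href, and either quote style, but it is still crude: it will also match things like data-href, and a real HTML parser would be more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkExtractor {
    // <a ...href = "value"> with optional whitespace, either quote style, any letter case.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the raw href values in document order (not resolved against a base URL). */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2)); // group 1 is the quote character, group 2 the value
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"a.html\">A</a> <A class='x'  HREF = 'b.html'>B</A></p>";
        System.out.println(extractLinks(html)); // prints [a.html, b.html]
    }
}
```

The values returned are whatever appears in the page, so relative links still have to be resolved against the page's own URL before they are queued.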

You add the links to a queue of links to be spidered.
See http://mindprod.com/queue.html

Then you spawn up to N threads, each of which grabs the next queue item
and spiders it.

See http://mindprod.com/projects/htmlbrokenlink.html
for more details.
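
The queue-and-threads idea could be sketched roughly as follows. This is a simplification, not the design from the linked project: instead of worker threads polling a shared queue, it processes one depth level at a time on a fixed-size thread pool, which makes the "1 or 2 levels deep" limit easy to enforce. MiniSpider, crawl and the linksOf hook are hypothetical names; linksOf stands in for "fetch this URL and return the links found in it", and the example fakes a site with an in-memory map so it runs without a network.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class MiniSpider {
    /**
     * Returns every URL discovered within maxDepth link hops of start.
     * Pages at the final depth are recorded but not fetched.
     */
    static Set<String> crawl(String start, int maxDepth, int threads,
                             Function<String, List<String>> linksOf) {
        Set<String> seen = new LinkedHashSet<>(); // only touched by the calling thread
        seen.add(start);
        List<String> frontier = List.of(start);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> linksOf.apply(url)); // one fetch task per URL
                }
                List<String> next = new ArrayList<>();
                for (Future<List<String>> result : pool.invokeAll(tasks)) {
                    for (String link : result.get()) {
                        if (seen.add(link)) next.add(link); // queue unseen links only
                    }
                }
                frontier = next;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("crawl interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("fetch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return seen;
    }

    public static void main(String[] args) {
        // A fake four-page site standing in for real HTTP fetches.
        Map<String, List<String>> fakeSite = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/c", "/"),
                "/b", List.of("/c"),
                "/c", List.of("/d"));
        Set<String> found = crawl("/", 2, 4, url -> fakeSite.getOrDefault(url, List.of()));
        System.out.println(found); // prints [/, /a, /b, /c]
    }
}
```

The seen set is what stops the spider from looping when pages link back to each other, as "/a" does to "/" above.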

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

John C. Bollinger - 26 Dec 2005 03:15 GMT
> Hi. I want to spider just a few websites, not the entire site, just 1 or 2
> levels deep. So can I use JSP or httpservlets for this? Does anyone know of
[quoted text clipped - 4 lines]
> Or do I have to spider using perl, then store it in a database and retrieve
> it using JSP/httpservlets? Thank you.

JSP and servlets are mechanisms for generating dynamic responses to HTTP
requests.  They are most often used for serving HTML pages.  They have
no special mechanism beyond any other Java code for making
general-purpose HTTP requests or doing anything with the results of
such a request.

Even though JSP and servlets specifically would be inappropriate choices
for a web spider, that does not mean that Java in general is wrong for
the task.  To the contrary, the Java platform library has good support
for a wide variety of network- and web-oriented tasks, and there are a
multitude of 3rd party libraries that build further on that foundation.
Look at the URL, URLConnection, and HttpURLConnection classes in the
java.net package to start, and perhaps at DOM (package org.w3c.dom) for
document analysis.  You might also find the Jakarta HTTP Client library
useful: http://jakarta.apache.org/commons/httpclient/  There are many
other resources available.
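
As a starting point with those java.net classes, a plain GET might look something like the sketch below. PageFetcher, the timeout values, the User-Agent string and the example.com address are all assumptions for illustration, and the body is decoded as UTF-8 for simplicity even though a real page may declare a different charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class PageFetcher {
    /** Reads a stream to the end and decodes it as UTF-8 (a simplifying assumption). */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Performs an HTTP GET and returns the response body; throws on any status other than 200. */
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10_000); // milliseconds, arbitrary choice
        conn.setReadTimeout(10_000);
        conn.setRequestProperty("User-Agent", "MiniSpider/0.1"); // hypothetical agent string
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("HTTP " + status + " for " + address);
            }
            try (InputStream in = conn.getInputStream()) {
                return readAll(in);
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass a URL on the command line; example.com is only a placeholder.
        System.out.println(fetch(args.length > 0 ? args[0] : "http://example.com/"));
    }
}
```

None of this depends on the servlet API, so it can run as a standalone program or background job, separate from whatever servlet later displays the stored pages.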

As for displaying pages previously retrieved by your spider, chances are
that a fairly simple servlet could handle the job admirably.  There
might be reasons to do it with JSP / custom tags instead, but that
approach wouldn't be my first inclination.

John Bollinger
jobollin@indiana.edu



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.