Arachnid
Arachnid is a Java-based web spider framework. It includes a simple HTML parser object that parses an input stream containing HTML content. Simple web spiders can be created by subclassing Arachnid and adding a few lines of code that are called after each page of a website is parsed.
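The subclass-and-hook pattern described above can be sketched roughly as follows. This is only an illustration of the general idea, not Arachnid's actual API: the class names `SketchSpider` and `LinkCollector`, the method names `parsePage` and `handlePage`, and the regex-based link extraction are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical base class; the names are illustrative, not Arachnid's API. */
abstract class SketchSpider {
    // Naive href matcher, good enough for a sketch; a real parser handles far more cases.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** Reads one page from the stream, extracts its links, then calls the hook. */
    final void parsePage(String url, InputStream html) {
        String text;
        try {
            text = new String(html.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(text);
        while (m.find()) {
            links.add(m.group(1));
        }
        handlePage(url, links);
    }

    /** Subclasses put their few lines of per-page logic here. */
    protected abstract void handlePage(String url, List<String> links);
}

/** Example subclass (also hypothetical): records every link it is shown. */
final class LinkCollector extends SketchSpider {
    final List<String> seen = new ArrayList<>();

    @Override
    protected void handlePage(String url, List<String> links) {
        seen.addAll(links);
    }
}
```

In this sketch the base class owns the parsing, and the subclass only supplies the per-page callback, which is the division of labour the description suggests.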
Heritrix
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Java Web Crawler
Java Web Crawler is a simple Web crawling utility written in Java. It supports the robots exclusion standard.
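As a rough illustration of what supporting the robots exclusion standard involves, the sketch below parses a robots.txt body and answers whether a path may be fetched. It is not taken from Java Web Crawler: the class `RobotsRules` and its methods are hypothetical, and the sketch deliberately honours only `Disallow` prefixes in groups addressed to all agents (`*`), ignoring `Allow` lines and wildcard patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of Java Web Crawler: a minimal robots.txt check. */
final class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    /** Collects Disallow prefixes from the groups that apply to all agents ("*"). */
    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean appliesToUs = false;   // does the current group include "User-agent: *"?
        boolean inAgentLines = false;  // are we in a run of consecutive User-agent lines?
        for (String raw : robotsTxt.split("\\R")) {
            // Strip comments and surrounding whitespace.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!inAgentLines) {
                    appliesToUs = false; // a new group starts here
                }
                appliesToUs |= value.equals("*");
                inAgentLines = true;
            } else {
                inAgentLines = false;
                // An empty Disallow value means "nothing is disallowed".
                if (field.equals("disallow") && appliesToUs && !value.isEmpty()) {
                    rules.disallowed.add(value);
                }
            }
        }
        return rules;
    }

    /** Returns false if the path starts with any collected Disallow prefix. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler using such a helper would fetch `/robots.txt` once per host, parse it, and call `isAllowed` before requesting each page on that host.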
JSpider
JSpider is a highly configurable and customizable web spider engine, written in 100% pure Java and developed under the LGPL open-source license.
WebEater
A 100% pure Java program for web site retrieval and offline viewing.
WebLech
WebLech is a fully featured web site download/mirror tool in Java, which supports many features required to download websites and emulate standard web-browser behaviour as much as possible. WebLech is multithreaded and will feature a GUI console.
WebSPHINX
WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for Web crawlers that browse and process Web pages automatically.