swetha wrote:
> I'm working on research projects in which I need to extract specified
> content from an HTML page using Java code. There are many URLs in the
[quoted text clipped - 4 lines]
> thanks,
> Swetha.
Hi!
I found this article about programming a web crawler:
http://www.devarticles.com/c/a/Java/Crawling-the-Web-with-Java/
It might help you, because it does exactly what you want.
Take a closer look at the methods, especially retrieveLinks().
Hope this helps,
Boris
> I'm working on research projects in which I need to extract specified
> content from an HTML page using Java code. There are many URLs in the
> HTML page, and I have to write the code in such a way that it follows
> each link and then extracts the contents from that HTML page.
> In this way it must go through all the links. Can anyone help me with
> this? If anyone has code for it, please share it with me.
Have a look at the following articles; I think they will help you get
started:
http://jcsnippets.atspace.com/java/network-stuff/how-to-save-a-webpage.html
http://jcsnippets.atspace.com/java/regular-expressions/regular-expressions-find-href.html
These show how to save a web page and how to extract the links from it. To
extract other information, you will need a different regular expression.
Once you have a list of links, repeat the same process for each linked
page.
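
As a rough illustration of that loop, here is a minimal sketch. It is not the
code from the articles above. The class name LinkCrawlerSketch, its method
names, the href pattern and the maxPages limit are all made up for this
example. It assumes pages are UTF-8, follows only links that are written as
absolute URLs (relative links are not resolved), and a regular expression
will miss links in unusual markup.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class LinkCrawlerSketch {

    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns every href value found in the given HTML, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Downloads a page as text; returns null if the URL is bad or the read fails.
    // Assumes UTF-8, which is not true for every page.
    static String fetchPage(String url) {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException | IllegalArgumentException e) {
            return null;
        }
    }

    // Breadth-first crawl: fetch a page, extract its links, queue the new ones.
    // Stops once maxPages URLs have been visited. Returns the URLs in visit order.
    static Set<String> crawl(String startUrl, Function<String, String> fetch, int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // already seen this URL
            }
            String html = fetch.apply(url);
            if (html == null) {
                continue; // could not download, nothing to extract
            }
            for (String link : extractLinks(html)) {
                if (!visited.contains(link)) {
                    queue.add(link);
                }
            }
        }
        return visited;
    }
}
```

Calling LinkCrawlerSketch.crawl(startUrl, LinkCrawlerSketch::fetchPage, 50)
would visit up to 50 pages starting from startUrl. The fetch step is passed
in as a function so that you can swap in your own download or
content-extraction code, or test the loop against pages held in memory.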
Best regards,
JayCee
--
http://jcsnippets.atspace.com/
a collection of source code, tips and tricks
Wibble - 11 Jun 2006 16:46 GMT
>> I'm working on research projects in which I need to extract specified
>> content from an HTML page using Java code. There are many URLs in the
[quoted text clipped - 22 lines]
> http://jcsnippets.atspace.com/
> a collection of source code, tips and tricks
We use HtmlUnit for testing servlets and JSPs, but it's also a pretty
good screen scraper, and it is JavaScript-aware.
http://htmlunit.sourceforge.net/
Wibble - 11 Jun 2006 16:48 GMT
>>> I'm working on research projects in which I need to extract specified
>>> content from an HTML page using Java code. There are many URLs in the
[quoted text clipped - 27 lines]
>
> http://htmlunit.sourceforge.net/
Oops, I actually meant HttpUnit:
http://httpunit.sourceforge.net/