Java Forum / General / June 2006

Parsing an HTML page

swetha - 08 Jun 2006 21:18 GMT
I'm working on a research project in which I need to extract specified
content from an HTML page using Java code. The page contains many URLs,
and I have to write the code so that it follows each link and then
extracts the contents from the linked HTML page. In this way it must
work through all the links. Can anyone help me with this? If anyone has
code for it, please share it with me.
Thanks,
Swetha.
Boris Werner - 08 Jun 2006 22:00 GMT
swetha wrote:
> I'm working on research projects in which i need to extract specified
> content from a html page using Java code. There are many URLs in the
[quoted text clipped - 4 lines]
> thanks,
> Swetha.

Hi!

I found this article about programming a web crawler:

http://www.devarticles.com/c/a/Java/Crawling-the-Web-with-Java/

It might help you, because it does exactly what you want.
Just have a closer look at the methods (especially retrieveLinks()).

Hope this helps

Boris
jcsnippets.atspace.com - 09 Jun 2006 13:01 GMT
> I'm working on research projects in which i need to extract specified
> content from a html page using Java code. There are many URLs in the
> html page and i have to write the code in such a way that it goes
> through the link and then extract the contents from that html page.
> In this way it must see through all the links.Can anyone help me in
> this? If anyone has a code for it please say it to me.

Have a look at the following articles; I think they will help you get
started:
http://jcsnippets.atspace.com/java/network-stuff/how-to-save-a-webpage.html
http://jcsnippets.atspace.com/java/regular-expressions/regular-expressions-find-href.html

These show how to save a web page and how to extract the links from that
page. If you'd like to extract other information, you will need a
different regular expression.
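
For illustration, here is a minimal sketch of the link-extraction step
using java.util.regex. The class name LinkExtractor and the pattern are
assumptions made up for this example, not code from the articles above,
and a regular expression like this only copes with simple, well-formed
anchor tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not taken from the linked articles.
final class LinkExtractor {
    // Matches href="..." or href='...' inside an <a> tag, ignoring case.
    // Group 1 is the quote character, group 2 is the link target.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns every href value found in the given HTML, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"a.html\">A</a> <A HREF='http://example.com/b'>B</A></p>";
        System.out.println(extractLinks(html)); // prints [a.html, http://example.com/b]
    }
}
```

Unquoted href values and links built by JavaScript are not matched by
this pattern; for pages like that, a real HTML parser is the safer choice.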

Once you have a list of links, repeat the same process for each linked
page.
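
The "repeat the process" step could be sketched roughly as below: a
breadth-first loop with a set of already-visited URLs, so that pages
linking back to each other do not cause an endless loop. Everything named
here (SimpleCrawler, crawl, fetch, the maxPages limit, the example.test
URLs) is hypothetical and chosen only for this example. The fetch method
assumes Java 9 or later and UTF-8 pages.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a small crawler; names and limits are assumptions.
final class SimpleCrawler {
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Visits URLs breadth-first from startUrl and stops after maxPages URLs.
    // fetcher maps a URL to its HTML, or returns null if it cannot be loaded.
    // Returns the URLs that were attempted, in visiting order.
    static Set<String> crawl(String startUrl, int maxPages,
                             Function<String, String> fetcher) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // this URL was already handled
            }
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // page could not be loaded
            }
            // Extract whatever content you need from html at this point.
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                try {
                    // Turn relative links into absolute ones.
                    String next = URI.create(url).resolve(m.group(2).trim()).toString();
                    if (!visited.contains(next)) {
                        queue.add(next);
                    }
                } catch (IllegalArgumentException e) {
                    // Malformed link: skip it.
                }
            }
        }
        return visited;
    }

    // Downloads a page as text; returns null on any failure.
    static String fetch(String url) {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException | IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for a web site, so the demo needs no network.
        Map<String, String> site = new HashMap<>();
        site.put("http://example.test/index.html",
                "<a href=\"a.html\">A</a> <a href='b.html'>B</a>");
        site.put("http://example.test/a.html", "<a href=\"index.html\">home</a>");
        site.put("http://example.test/b.html", "no links");
        System.out.println(crawl("http://example.test/index.html", 10, site::get));
    }
}
```

To crawl real pages, pass SimpleCrawler::fetch as the fetcher instead of
the in-memory map. A real crawler should also stay within the sites it is
meant to cover, respect robots.txt, and pause between requests.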

Best regards,

JayCee
--
http://jcsnippets.atspace.com/
a collection of source code, tips and tricks
Wibble - 11 Jun 2006 16:46 GMT
We use HtmlUnit for testing servlets and JSPs, but it's also a pretty
good screen scraper, and it is JavaScript-aware.

http://htmlunit.sourceforge.net/
Wibble - 11 Jun 2006 16:48 GMT
Oops, actually HttpUnit

http://httpunit.sourceforge.net/

