Java Forum / General / July 2005

Extracting bolds and italics from HTML

Ezee - 26 Jul 2005 15:33 GMT
Hi,

I am trying to build a topic-focused web crawler. For this, I have
to run some calculations on the content of each URL before adding
that URL to my database.
I found a very useful word count program on the Sun Java forum, but
its problem is that it also includes the HTML tags in the count.
Can anybody please tell me whether there is any Java API or online
help available for

i) A program which counts words in an HTML file but doesn't include
HTML tags.
ii) A program which counts only bolds and italics in an HTML file.

Thanks in anticipation :)
Harald - 26 Jul 2005 19:27 GMT
> Hi,
>
[quoted text clipped - 8 lines]
> i) A program which counts words in an HTML file but doesn't include
> HTML tags.

With http://www.ebi.ac.uk/~kirsch/monq-doc/monq/programs/Grep.html
you can do things like

java monq.programs.Grep '<[^>]+>' '' '[A-Za-z]+' '%0\n' <yourhtml.html

on the command line to fetch all words that do not belong to a
tag. The mechanism behind it is
http://www.ebi.ac.uk/~kirsch/monq-doc/monq/jfa/Nfa.html which you can
use programmatically.
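
For readers who want the same idea without the command line, here is a minimal sketch that reuses the two patterns from the Grep call above. It does not use the monq Nfa classes; it swaps in java.util.regex from the standard library instead. The class name HtmlWordCount and the method countWords are made up for illustration, and the sketch assumes reasonably clean HTML (no '>' inside attribute values, no script or comment blocks whose content should be skipped).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of monq: counts words outside of HTML tags.
class HtmlWordCount {
    // Same patterns as on the Grep command line: tags are dropped,
    // runs of ASCII letters count as words.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    static int countWords(String html) {
        // Replace each tag with a space so that the words on either
        // side of a tag do not fuse into one.
        String text = TAG.matcher(html).replaceAll(" ");
        Matcher m = WORD.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>bold</b> world</p></body></html>";
        System.out.println(countWords(html)); // prints 3
    }
}
```

Entities such as &amp;nbsp; would be counted as words by this sketch; a real HTML parser is the safer choice for messy pages.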

> ii) A program which counts only Bolds and Italics in HTML file.

This would require looking for `<b>' and `<em>' tags, which can easily
be added as pattern/action pairs to the Nfa doing the word counting.
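
A rough sketch of the tag counting, again with java.util.regex rather than the monq Nfa, might look like the following. The class and method names are hypothetical. Besides `<b>' and `<em>' it also counts `<strong>' and `<i>', which is an assumption about what "bolds and italics" should cover; only opening tags are counted, not the words inside them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of monq: counts opening bold/italic tags.
class BoldItalicCount {
    // The tag name must be followed by '>' or by whitespace plus attributes,
    // so <body>, <br>, <img> and <embed> are not counted by mistake.
    private static final Pattern BOLD =
        Pattern.compile("<(b|strong)(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);
    private static final Pattern ITALIC =
        Pattern.compile("<(i|em)(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    private static int count(Pattern p, String html) {
        Matcher m = p.matcher(html);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    static int countBold(String html) {
        return count(BOLD, html);
    }

    static int countItalic(String html) {
        return count(ITALIC, html);
    }

    public static void main(String[] args) {
        String html = "<body><b>one</b> <em>two</em> <i>three</i><br></body>";
        System.out.println("bold: " + countBold(html));
        System.out.println("italic: " + countItalic(html));
    }
}
```

Bold or italic text produced through CSS (for example a style attribute) is not detected by this approach.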

I am off to the pub now, otherwise I would've written the class, max
20 lines:-) To download the software see signature.

 Harald.

Signature

---------------------+---------------------------------------------
Harald Kirsch (@home)|
Java Text Crunching: http://www.ebi.ac.uk/Rebholz-srv/whatizit/software


