Java Forum / General / July 2005


Finding repeated words in text documents: which algorithm?

Daniele Menozzi - 17 Jul 2005 09:37 GMT
Hi all, I have this problem: I have some documents (10, 20, 30...) and I have
to find the words that are repeated most often.
Can you suggest an algorithm that can be used in this case?

Thank you :)
    Daniele
JosephWu - 17 Jul 2005 12:47 GMT
On Sun, 17 Jul 2005 20:37:32 +1200, Daniele Menozzi <shine@me.com> wrote:

> Hi all, I have this problem: I have some documents (10,20,30..) and I  
> have
[quoted text clipped - 3 lines]
> Thank you :)
>     Daniele

huffman???
Stefan Schulz - 17 Jul 2005 13:28 GMT
> On Sun, 17 Jul 2005 20:37:32 +1200, Daniele Menozzi <shine@me.com> wrote:
>
[quoted text clipped - 7 lines]
>
> huffman???

Huffman coding, IIRC, needs the frequencies as its input.

Why not just make a list of all the words occurring in your documents, and
whenever you encounter a word, increment its frequency by one?

Signature

You can't run away forever,
But there's nothing wrong with getting a good head start.
          --- Jim Steinman, "Rock and Roll Dreams Come Through"
         

Bob Withers - 17 Jul 2005 13:26 GMT
> Hi all, I have this problem: I have some documents (10,20,30..) and I have
> to find the words that repeats most of all.
> Can you suggest me some Algorithm that can be used in this case?
>
> Thank you :)
>     Daniele

Here are some links that may help:

http://tinyurl.com/86g7n

http://tinyurl.com/dv4vd

Bob
Daniele Menozzi - 17 Jul 2005 20:00 GMT
> Here are some links that may help:
>
[quoted text clipped - 3 lines]
>
> Bob

These links are great! Thank you so much :)

Daniele
Hemal  Pandya - 17 Jul 2005 16:55 GMT
> Hi all, I have this problem: I have some documents (10,20,30..) and I have
> to find the words that repeats most of all.
> Can you suggest me some Algorithm that can be used in this case?

initialize collection word-frequency (empty)
for each document
  for each word in the document
    if the word exists in word-frequency
      bump its frequency by 1
    else
      add the word to word-frequency with frequency 1

initialize top-frequency to 0, top-word to null
for each word in word-frequency
  if its frequency is greater than top-frequency
    assign the frequency to top-frequency, the word to top-word

the top-word is the word that repeats most

> Thank you :)
>     Daniele
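The pseudocode above could be sketched in Java roughly as follows. This is only an illustration: the class and method names (WordFrequency, countWords, mostFrequent) are made up here, and the definition of a "word" (a run of letters or digits, compared case-insensitively) is an assumption the thread does not settle.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class WordFrequency {

    // Counts how often each word occurs across all documents.
    // Assumption: words are runs of letters/digits, compared case-insensitively.
    static Map<String, Integer> countWords(List<String> documents) {
        Map<String, Integer> frequency = new HashMap<>();
        for (String document : documents) {
            String[] words = document.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            for (String word : words) {
                if (word.isEmpty()) {
                    continue; // split yields "" when the text starts with a separator
                }
                // Adds the word with count 1, or bumps the existing count.
                frequency.merge(word, 1, Integer::sum);
            }
        }
        return frequency;
    }

    // Linear scan for the word with the highest count; null if there are no words.
    // Ties are broken arbitrarily (by map iteration order).
    static String mostFrequent(Map<String, Integer> frequency) {
        String topWord = null;
        int topFrequency = 0;
        for (Map.Entry<String, Integer> entry : frequency.entrySet()) {
            if (entry.getValue() > topFrequency) {
                topFrequency = entry.getValue();
                topWord = entry.getKey();
            }
        }
        return topWord;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList(
                "the cat sat on the mat",
                "The dog chased the cat");
        Map<String, Integer> frequency = countWords(documents);
        String top = mostFrequent(frequency);
        System.out.println(top + " occurs " + frequency.get(top) + " times");
    }
}
```

In practice each document would be read from a file first; the strings here stand in for file contents.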
George Cherry - 17 Jul 2005 19:16 GMT
>> Hi all, I have this problem: I have some documents (10,20,30..) and I
>> have
[quoted text clipped - 18 lines]
>> Thank you :)
>> Daniele

Maybe the the op meant successive repetitions as in
"the the" at the beginning of this sentence??? My
spelling checker detects this btw and warns me.

George
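If the question really is about successive repetitions such as "the the", a minimal sketch could look like this. The class and method names are hypothetical, and it assumes that punctuation between the two words is ignored, so "the. The" would also be reported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class AdjacentRepeats {

    // Returns every word that immediately follows an identical word.
    // Comparison is case-insensitive; a triple such as "the the the"
    // is reported twice, once per repeated occurrence.
    static List<String> findDoubledWords(String text) {
        List<String> doubled = new ArrayList<>();
        String previous = null;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (word.isEmpty()) {
                continue; // leading separator produces an empty token
            }
            if (word.equals(previous)) {
                doubled.add(word);
            }
            previous = word;
        }
        return doubled;
    }

    public static void main(String[] args) {
        System.out.println(findDoubledWords("Maybe the the op meant successive repetitions"));
    }
}
```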
Roedy Green - 17 Jul 2005 20:44 GMT
>Hi all, I have this problem: I have some documents (10,20,30..) and I have
>to find the words that repeats most of all.
>Can you suggest me some Algorithm that can be used in this case?


Here are the two most commonly used:

1. sort and look for adjacent duplicates.

2. build a HashSet.  If word is already in there, you have a dup.
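Both approaches can be sketched in a few lines of Java. The names below (DuplicateWords, duplicatesBySorting, duplicatesByHashSet) are invented for illustration, and the input is assumed to be a list of words that has already been split out of the documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class DuplicateWords {

    // Approach 1: sort a copy, then equal words end up next to each other.
    static Set<String> duplicatesBySorting(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        Collections.sort(sorted);
        Set<String> duplicates = new TreeSet<>();
        for (int i = 1; i < sorted.size(); i++) {
            if (sorted.get(i).equals(sorted.get(i - 1))) {
                duplicates.add(sorted.get(i));
            }
        }
        return duplicates;
    }

    // Approach 2: Set.add returns false when the word is already present.
    static Set<String> duplicatesByHashSet(List<String> words) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new TreeSet<>();
        for (String word : words) {
            if (!seen.add(word)) {
                duplicates.add(word);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("b", "a", "b", "c", "a", "b");
        System.out.println(duplicatesBySorting(words));
        System.out.println(duplicatesByHashSet(words));
    }
}
```

Sorting costs O(n log n) comparisons, while the HashSet version is a single pass at the price of extra memory. Note that both only say whether a word is repeated, not how often.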

Signature

Bush crime family lost/embezzled $3 trillion from Pentagon.
Complicit Bush-friendly media keeps mum. Rumsfeld confesses on video.
http://www.infowars.com/articles/us/mckinney_grills_rumsfeld.htm

Canadian Mind Products, Roedy Green.
See http://mindprod.com/iraq.html photos of Bush's war crimes

Wibble - 18 Jul 2005 04:05 GMT
>>Hi all, I have this problem: I have some documents (10,20,30..) and I have
>>to find the words that repeats most of all.
[quoted text clipped - 6 lines]
>
> 2. build a HashSet.  If word is already in there, you have a dup.

Find all the words that aren't the most duplicated.  The remaining one
is your answer.
Roedy Green - 18 Jul 2005 07:31 GMT
>> 1. sort and look for adjacent duplicates.
>>
>> 2. build a HashSet.  If word is already in there, you have a dup.
>>
>Find all the words that aren't the most duplicated.  The remaining one
>is your answer.

In that case use a HashMap and add to the count for every hit.  Then
sort by hit count, and you can read off your least and most
duplicated words, or just do a linear scan looking for the one you
want.
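A rough sketch of the HashMap-then-sort idea, with hypothetical names (RankedWords, rankByCount) and the assumption that ties are broken alphabetically so the result is deterministic:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RankedWords {

    // Counts every word, then returns the (word, count) entries sorted by
    // count, highest first. Ties are ordered alphabetically (an assumption).
    static List<Map.Entry<String, Integer>> rankByCount(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum); // add to the count for every hit
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return ranked;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("b", "a", "b", "c", "a", "b");
        List<Map.Entry<String, Integer>> ranked = rankByCount(words);
        System.out.println("most duplicated:  " + ranked.get(0));
        System.out.println("least duplicated: " + ranked.get(ranked.size() - 1));
    }
}
```

If only the single most frequent word is needed, the sort can be skipped in favour of the linear scan mentioned above.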




©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.