Hi all, I have this problem: I have some documents (10, 20, 30, ...) and I have
to find the words that repeat the most.
Can you suggest an algorithm that can be used in this case?
Thank you :)
Daniele
JosephWu - 17 Jul 2005 12:47 GMT
On Sun, 17 Jul 2005 20:37:32 +1200, Daniele Menozzi <shine@me.com> wrote:
> Hi all, I have this problem: I have some documents (10,20,30..) and I
> have
[quoted text clipped - 3 lines]
> Thank you :)
> Daniele
huffman???
Stefan Schulz - 17 Jul 2005 13:28 GMT
> On Sun, 17 Jul 2005 20:37:32 +1200, Daniele Menozzi <shine@me.com> wrote:
>
[quoted text clipped - 7 lines]
>
> huffman???
Huffman coding, IIRC, needs the frequencies as its input, so it does not find them for you.
Why not just make a list of all the words occurring in your documents, and
whenever you encounter a word, increment its frequency by one?

Signature
You can't run away forever,
But there's nothing wrong with getting a good head start.
--- Jim Steinman, "Rock and Roll Dreams Come Through"
Bob Withers - 17 Jul 2005 13:26 GMT
> Hi all, I have this problem: I have some documents (10,20,30..) and I have
> to find the words that repeats most of all.
> Can you suggest me some Algorithm that can be used in this case?
>
> Thank you :)
> Daniele
Here are some links that may help:
http://tinyurl.com/86g7n
http://tinyurl.com/dv4vd
Bob
Daniele Menozzi - 17 Jul 2005 20:00 GMT
> Here are some links that may help:
>
[quoted text clipped - 3 lines]
>
> Bob
These links are great! Thank you so much :)
Daniele
Hemal Pandya - 17 Jul 2005 16:55 GMT
> Hi all, I have this problem: I have some documents (10,20,30..) and I have
> to find the words that repeats most of all.
> Can you suggest me some Algorithm that can be used in this case?
initialize collection word-frequency
for each document
    for each word in the document
        if the word exists in word-frequency
            bump frequency
        else
            add the word to word-frequency with frequency 1
initialize top-frequency to 0, top-word to null
for each word in word-frequency
    if its frequency is greater than top-frequency
        assign the frequency to top-frequency, word to top-word
the top-word is the word that repeats most
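The pseudocode above could be sketched in Java roughly as follows. This is only an illustration: the class name `TopWord`, the method `mostFrequent`, and the choice to lowercase the text and split on non-letter characters are assumptions, since the thread never says what counts as a word or how the documents are read.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class TopWord {
    // Counts every word across all documents, then scans for the highest count.
    // Assumption: a "word" is a run of letters, compared case-insensitively.
    static String mostFrequent(List<String> documents) {
        Map<String, Integer> wordFrequency = new HashMap<>();
        for (String document : documents) {
            for (String word : document.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (word.isEmpty()) {
                    continue; // split can yield an empty leading token
                }
                // Adds the word with frequency 1, or bumps the existing frequency.
                wordFrequency.merge(word, 1, Integer::sum);
            }
        }

        String topWord = null;
        int topFrequency = 0;
        for (Map.Entry<String, Integer> entry : wordFrequency.entrySet()) {
            if (entry.getValue() > topFrequency) {
                topFrequency = entry.getValue();
                topWord = entry.getKey();
            }
        }
        return topWord; // null when no words were found
    }

    public static void main(String[] args) {
        System.out.println(mostFrequent(List.of("the cat and the dog", "the end")));
    }
}
```

If several words share the highest count, this sketch returns whichever one the map happens to yield first, so ties are not resolved in any particular order.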
> Thank you :)
> Daniele
George Cherry - 17 Jul 2005 19:16 GMT
>> Hi all, I have this problem: I have some documents (10,20,30..) and I
>> have
[quoted text clipped - 18 lines]
>> Thank you :)
>> Daniele
Maybe the the op meant successive repetitions as in
"the the" at the beginning of this sentence??? My
spelling checker detects this btw and warns me.
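If successive repetitions really are what is wanted, a minimal sketch might look like the one below. The class name `DoubledWords`, the whitespace-based splitting, and the case-insensitive comparison are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class DoubledWords {
    // Returns each word that is immediately preceded by the same word, e.g. "the the".
    // Assumption: words are separated by whitespace and compared ignoring case.
    static List<String> find(String text) {
        List<String> doubled = new ArrayList<>();
        String previous = null;
        for (String word : text.trim().split("\\s+")) {
            if (previous != null && !word.isEmpty() && word.equalsIgnoreCase(previous)) {
                doubled.add(word);
            }
            previous = word;
        }
        return doubled;
    }

    public static void main(String[] args) {
        System.out.println(find("Maybe the the op meant successive repetitions"));
    }
}
```

Punctuation stays attached to the words here, so "the, the" would not be reported as a repetition.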
George
Roedy Green - 17 Jul 2005 20:44 GMT
>Hi all, I have this problem: I have some documents (10,20,30..) and I have
>to find the words that repeats most of all.
>Can you suggest me some Algorithm that can be used in this case?
Here are the two most commonly used:
1. sort and look for adjacent duplicates.
2. build a HashSet. If word is already in there, you have a dup.
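Both approaches can be sketched in a few lines of Java. The class and method names below are made up for illustration, and the input is assumed to be already split into words. Note that either method only tells you which words occur more than once, not how often.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class Duplicates {
    // Method 1: sort a copy, so equal words end up next to each other.
    static Set<String> bySorting(String[] words) {
        String[] sorted = words.clone();
        Arrays.sort(sorted);
        Set<String> dups = new TreeSet<>();
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i].equals(sorted[i - 1])) {
                dups.add(sorted[i]);
            }
        }
        return dups;
    }

    // Method 2: Set.add returns false when the word was already present.
    static Set<String> byHashSet(String[] words) {
        Set<String> seen = new HashSet<>();
        Set<String> dups = new TreeSet<>();
        for (String word : words) {
            if (!seen.add(word)) {
                dups.add(word);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        String[] words = {"b", "a", "b", "c", "a", "b"};
        System.out.println(bySorting(words));
        System.out.println(byHashSet(words));
    }
}
```

Sorting costs O(n log n) comparisons, while the HashSet version is expected linear time at the price of extra memory for the set.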

Signature
Bush crime family lost/embezzled $3 trillion from Pentagon.
Complicit Bush-friendly media keeps mum. Rumsfeld confesses on video.
http://www.infowars.com/articles/us/mckinney_grills_rumsfeld.htm
Canadian Mind Products, Roedy Green.
See http://mindprod.com/iraq.html photos of Bush's war crimes
Wibble - 18 Jul 2005 04:05 GMT
>>Hi all, I have this problem: I have some documents (10,20,30..) and I have
>>to find the words that repeats most of all.
[quoted text clipped - 6 lines]
>
> 2. build a HashSet. If word is already in there, you have a dup.
Find all the words that aren't the most duplicated. The remaining one
is your answer.
Roedy Green - 18 Jul 2005 07:31 GMT
>> 1. sort and look for adjacent duplicates.
>>
>> 2. build a HashSet. If word is already in there, you have a dup.
>>
>Find all the words that aren't the most duplicated. The remaining one
>is your answer.
In that case use a HashMap and add to the count for every hit. Then
sort the hit counts in order, and you can find your least and most
duplicated words, or just do a linear scan looking for the one you
want.
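A rough sketch of the HashMap-then-sort idea, assuming the words have already been extracted into a list. The names `RankedCounts` and `rank` are hypothetical, and the alphabetical tie-break is an addition of this sketch so that the order is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RankedCounts {
    // Counts each word, then returns the entries sorted from most to least frequent.
    static List<Map.Entry<String, Integer>> rank(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum); // add to the count for every hit
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        // Descending by count; ties are broken alphabetically (an assumption of this sketch).
        ranked.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return ranked;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> entry : rank(List.of("b", "a", "b", "c", "a", "b"))) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

The first element of the returned list is the most duplicated word and the last is the least. If only one of them is needed, the linear scan avoids the cost of sorting.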
