Java Forum / Virtual Machine / January 2007

Keyword extractor source code: where can I find it?

giugy - 11 Jan 2007 15:59 GMT
Hi, sorry for my English, I don't speak it very well.

Does anyone know where I can find the source code of a keyword
extractor written in Java? I mean a program that analyzes a text and
extracts its keywords, that is, the words that occur most often in it
(for example, the word "hello" occurs forty times, the word "thanks"
thirty times).

I need to see source code written in Java in order to understand how
it works.

Thanks, bye
glen herrmannsfeldt - 12 Jan 2007 01:56 GMT
> Does anyone know where I can find the source code of a keyword
> extractor written in Java? I mean a program that analyzes a text and
> extracts its keywords, that is, the words that occur most often in it
> (for example, the word "hello" occurs forty times, the word "thanks"
> thirty times).

> I need to see source code written in Java in order to understand how
> it works.

It is very easy to write in Java.

First read a line and extract words using StringTokenizer. Then
use a Hashtable to find out if you have seen that word before.
If so, increment a counter. If not, add it to the Hashtable with
a count of 1. I store a long[] in the Hashtable for convenience
in incrementing, but others will do something different.

One trick, though. After you extract words with StringTokenizer and
find they are not in the table, create a new String to store the
reference in the hash table. If you don't, it will take up too much
memory, as the whole line of characters is stored for each word.

After you finish reading the file, go through the Hashtable,
extract words and counts, and print them out.

It should not take long at all to write.
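
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a finished keyword extractor: the class name WordFrequency and the method count are hypothetical, the lines are passed in as an array instead of being read from a file, and words are split on whitespace only. The new String(word) copy mattered on JVMs of that time, where a substring shared the character array of the whole line; on JVMs from Java 7 update 6 onward substring copies its characters, so the extra copy is harmless but probably unnecessary.

```java
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.StringTokenizer;

class WordFrequency {
    // Counts the words in the given lines. Each value is a one-element
    // long[] so that the count can be incremented in place.
    static Hashtable<String, long[]> count(String[] lines) {
        Hashtable<String, long[]> counts = new Hashtable<String, long[]>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                String word = tokens.nextToken();
                long[] counter = counts.get(word);
                if (counter != null) {
                    counter[0]++; // word seen before
                } else {
                    // Store a copy of the token as the key, so the table
                    // does not keep the whole line alive (see note above).
                    counts.put(new String(word), new long[] { 1 });
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Hashtable<String, long[]> counts = count(new String[] {
            "java function java", "library function java" });
        // Print "count word" pairs; Hashtable order is unspecified.
        for (Enumeration<String> e = counts.keys(); e.hasMoreElements();) {
            String word = e.nextElement();
            System.out.println(counts.get(word)[0] + " " + word);
        }
    }
}
```

The output lines come out in no particular order, so they still need sorting by count before the most frequent words can be picked out.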

-- glen
giugy - 16 Jan 2007 17:20 GMT
Yes, I have found some code like this:

import java.io.*;
import java.util.*;

class Counter implements Comparable {
 private String word;
 private int count;
 public Counter(String word) {
   this.word = word;
   count = 1;
 }
 public void increment() { count++; }
 public String toString() {
   return "\n" + word + " [" + count + "]";
 }
 public boolean equals(Object obj) {
   return obj instanceof Counter &&
     ((Counter)obj).word.equals(word);
 }
 public int hashCode() {
   return word.hashCode();
 }
 public int compareTo(Object o) {
   return word.compareTo(((Counter)o).word);
 }
}

class CounterSet extends AbstractSet {
 private Map set = new TreeMap();
 public void addOrIncrement(String s) {
   Counter c = new Counter(s);
   if (set.containsKey(c))
     ((Counter)set.get(c)).increment();
   else
     set.put(c, c);
 }
 public Iterator iterator() {
   return set.keySet().iterator();
 }
 public int size() {
   return set.size();
 }
 public String toString() {
   return set.keySet().toString();
 }
}

class WordCount {
 private FileReader file;
 private StreamTokenizer st;

 private CounterSet counts = new CounterSet();
 WordCount(String filename)
   throws FileNotFoundException {
   try {
     file = new FileReader(filename);
     st = new StreamTokenizer(
       new BufferedReader(file));
     st.ordinaryChar('.');
     st.ordinaryChar('-');
     st.lowerCaseMode(true);

   } catch(FileNotFoundException e) {
     System.err.println(
       "Could not open " + filename);
     throw e;
   }
 }
 void cleanup() {
   try {
     file.close();
   } catch(IOException e) {
     System.err.println(
       "file.close() unsuccessful");
   }
 }
 void countWords() {
   try {
     while(st.nextToken() !=
       StreamTokenizer.TT_EOF) {
       // Placeholder token; it is too short to pass the length
       // check below, so numbers end up being skipped.
       String s = "a";
       switch(st.ttype) {
         case StreamTokenizer.TT_EOL:
           s = "EOL";
           break;

         case StreamTokenizer.TT_NUMBER:
           // Numbers are ignored:
           // s = Double.toString(st.nval);
           break;

         case StreamTokenizer.TT_WORD:
           s = st.sval;
           break;
         default: // single character in ttype
           s = String.valueOf((char)st.ttype);
       }

       // Count only tokens longer than three characters.
       if(s.length() > 3)
           counts.addOrIncrement(s);
     }
   } catch(IOException e) {
     System.err.println(
       "st.nextToken() unsuccessful");
   }
 }
 public Iterator iterator() {
   return counts.iterator();
 }
 public String toString() {
   return counts.toString();
 }
}

public class KeyWordExtractor {
 public static void main(String[] args)
 throws FileNotFoundException {
   for(int i = 0; i < args.length; i++){
       WordCount wc =  new WordCount(args[i]);
       wc.countWords();
       System.out.println("WORD = " + wc);
       wc.cleanup();
   }
 }
}

It gives me the number of occurrences of every word in the text. For
example, if I give it as input a text like (a silly example) "java
function java library function java", the output is
WORD = [function[2] , java[3] , library[1]], that is, the occurrences
of each word in the text. My problem is that I don't need all the words
of the text in the output, only the words that appear many times in it.
In this case that is java, the keyword of the text: WORD = [java]

I know that there is only a little code left to write, but I do not
know Java well and I can't manage to write it.
Please help me. Thanks!

glen herrmannsfeldt wrote:

> > Does anyone know where I can find the source code of a keyword
> > extractor written in Java? I mean a program that analyzes a text and
[quoted text clipped - 24 lines]
>
> -- glen
glen herrmannsfeldt - 17 Jan 2007 07:14 GMT
> Yes, I have found some code like this:
>
[quoted text clipped - 11 lines]
>   public String toString() {
>     return "\n" + word + " [" + count + "]";

Change this to:

return count=" "+word;

Then the output will be a list of counts, each followed by its word,
which can be given as input to the Unix command

sort -rn  unsortedfile > sortedfile

which will output the list with the most common word first.
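
If the Unix sort command is not at hand, the same ordering can also be done in Java. The following is only a sketch, not part of the posted program: the class TopWords and the method top are hypothetical names, and it assumes the counts are held in a Map from word to Integer rather than in the Counter objects above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopWords {
    // Returns up to n words ordered by descending count;
    // words with equal counts are ordered alphabetically.
    static List<String> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a,
                               Map.Entry<String, Integer> b) {
                int byCount = b.getValue().compareTo(a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            }
        });
        List<String> result = new ArrayList<String>();
        for (int i = 0; i < entries.size() && i < n; i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Counts from the example text in this thread.
        Map<String, Integer> counts = new HashMap<String, Integer>();
        counts.put("function", 2);
        counts.put("java", 3);
        counts.put("library", 1);
        System.out.println("WORD = " + top(counts, 1)); // prints "WORD = [java]"
    }
}
```

Choosing n (or a minimum count) decides how many words are treated as keywords; that choice is left open here.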

(snip)

-- glen
giugy - 17 Jan 2007 09:15 GMT
Sorry, maybe I am making a stupid mistake. If I change
return "\n" + word + " [" + count + "]";
to
return count=" "+word;

I get an error like "found: java.lang.String required: int",
because count is an int and word is a String, and the function has to
return a String. What can I do?

glen herrmannsfeldt wrote:

> > Yes, I have found some code like this:
> >
[quoted text clipped - 26 lines]
>
> -- glen
glen herrmannsfeldt - 17 Jan 2007 09:25 GMT
> Sorry, maybe I am making a stupid mistake. If I change
> return "\n" + word + " [" + count + "]";
> to
> return count=" "+word;
>
> I get an error like "found: java.lang.String required: int",

Sorry, it was supposed to say return count+" "+word;

In both the original and this one, the int is converted to String.
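
A small sketch of the difference, with the hypothetical names ConcatDemo and format standing in for Counter and its toString:

```java
class ConcatDemo {
    // Builds a "count word" line, as in the corrected return statement.
    static String format(int count, String word) {
        // Writing `count = " " + word` here would try to assign a String
        // to the int variable count, which is what the compiler rejected.
        // With +, the expression is a String concatenation and the int
        // is converted to its decimal text.
        return count + " " + word;
    }

    public static void main(String[] args) {
        System.out.println(format(3, "java")); // prints "3 java"
    }
}
```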

By the way, you don't need to post three times for us to read it.

-- glen