Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / December 2006

Tip: Looking for answers? Try searching our database.

Summating Strings

Thread view: 
Mike - 22 Dec 2006 19:04 GMT
Please bear with me.  I'm a bit new to Java.  I have a list of strings
that I'd like to summate.  For example I'd like to take this data:

Apple
Orange
Apple
Apple
Pear
Peach
Banana
Banana

And generate this data:

Term        Count
----        -----
Apple        3
Orange    1
Pear        1
Peach        1
Banana    2

I'd then like to sort this data so the items with the higher counts come
first in the list.   The list of source strings can be anywhere from 1 -
20000 strings having anywhere from 1 - 500 unique terms.  The average
set will be roughly 300 strings with about 30 unique terms.  I am
currently doing this now by looping through the strings and keeping the
term/count data in an ArrayList.  I use a binary search to determine if
the string is already in my list.  If not I insert the string into the
ArrayList in the appropriate location to keep the list sorted so the
binary search will continue to work.

My method is fast, but I'm hoping it can be faster.  I'm running this
code in a web app and it needs to be as fast as possible.  The data
constantly changes so caching is not an option.  It's also not coming
from a SQL data source so performing a query to generate this is also
not an option.

What I'm wondering is if there's some innate Java class that does this
sort of thing already and hopefully does it much faster than I'm doing
it.  The code needs to work with Java Version 1.4.2_09.  I've looked at
TreeMap and some of the other various collection classes, but I'm not
sure from the API docs which one might be best suited for my scenario.  
I'm hoping the collective wisdom of this group can help save me some
time and aggravation.

Thanks in advance!

- Mike
Patricia Shanahan - 22 Dec 2006 19:20 GMT
> Please bear with me.  I'm a bit new to Java.  I have a list of strings
> that I'd like to summate.  For example I'd like to take this data:
[quoted text clipped - 27 lines]
> ArrayList in the appropriate location to keep the list sorted so the
> binary search will continue to work.

Here is a little class I wrote for a similar problem. Once you have the
set of keys, you can sort it using a Comparator that orders based on the
associated count. You will need to modify it to remove the generics to
use with 1.4. I'd be interested in hearing whether it is faster or
slower than your current solution.

/*
 * Created on Aug 14, 2004
 */
package utilities;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * Utility class for counting instances of equal objects.
 */
public class Counter {

  private Map<Object,Count> data = new HashMap<Object,Count>();

  /**
   * Increment by one the count associated with a specified
   * key.
   * @param key
   */
  public void increment(Object key) {
    Count count = data.get(key);
    if (count == null) {
      count = new Count();
      data.put(key, count);
    }
    count.increment();
  }

  /**
   * Get the count associated with a specified key.
   * @param key The key whose count is required.
   * @return The number of times increment has been called with
   * a key equal to this one.
   */
  public int get(Object key) {
    Count count = data.get(key);
    if (count == null) {
      return 0;
    } else {
      return count.get();
    }
  }

  /**
   * Get number of unique counted objects
   *
   * @return Number of key objects for which increment
   * has been called. Equal objects are only counted once.
   */
  public int size() {
    return data.size();
  }

  /**
   * Get all the unique counted objects
   * @return A set containing a refence to the counted objects.
   */
  public Set getKeys() {
    return Collections.unmodifiableSet(data.keySet());
  }

  private static class Count {
    private int val = 0;

    private void increment() {
      val++;
    }

    private int get() {
      return val;
    }
  }

}
Stefan Ram - 22 Dec 2006 19:27 GMT
>Term        Count
>----        -----
>Apple        3
>Orange    1

 So you want

sort | uniq -c

 You might use something like

class NumericMapUtils
{ public static <D> void addTo /* autovivificate the value to 0 */
 ( final java.util.Map<D,java.lang.Integer> map, final D d, final int i )
 { map.put( d, i +( map.containsKey( d )? map.get( d ): 0 )); }}

 and a sorted map

java.util.TreeMap<java.lang.String,java.lang.Integer> map;

 then add each text like

NumericMapUtils.addTo<java.lang.String>( map, "Apple",  1 );
NumericMapUtils.addTo<java.lang.String>( map, "Orange", 1 );
NumericMapUtils.addTo<java.lang.String>( map, "Apple",  1 );
NumericMapUtils.addTo<java.lang.String>( map, "Apple",  1 );

 Then iterate: »for( final java.lang.String key: map.keySet() )«
Mike - 22 Dec 2006 20:18 GMT
Thanks for the replies guys!  I'll give them a shot and see how they
perform.

- Mike
Hemal  Pandya - 22 Dec 2006 20:32 GMT
Now I am confused.

[....]
> I'd then like to sort this data so the items with the higher counts come
> first in the list.

Neither of the suggested solutions would meet this condition. Or would
they? What am I missing?

You need a collection of (word, frequency). When building the
collection, you want it searchable by the word. When reporting, you
need it sorted by frequency. One way is to sort the collection after
the scan. Would that be worse then maintaining two collections during
scan? I don't know...

If I remember right, the K&R C Programming Language book has a sample
code for word frequency.
Mike - 22 Dec 2006 20:50 GMT
You are correct, they don't.  However, I can still see if they're any
faster at building the list.  Currently I am sorting as a separate step so
I can still compare the speed of the various methods.  I included the part
about sorting in the original post because I thought it might influence the
method used.

- Mike

> Neither of the suggested solutions would meet this condition. Or would
> they? What am I missing?
Patricia Shanahan - 22 Dec 2006 20:52 GMT
Hemal Pandya wrote:
> Now I am confused.
>
[quoted text clipped - 4 lines]
> Neither of the suggested solutions would meet this condition. Or would
> they? What am I missing?

Maybe you missed my comment "Once you have the set of keys, you can sort
it using a Comparator that orders based on the associated count."?

> You need a collection of (word, frequency). When building the
> collection, you want it searchable by the word. When reporting, you
> need it sorted by frequency. One way is to sort the collection after
> the scan. Would that be worse then maintaining two collections during
> scan? I don't know...

A continuously sorted collection would be quite expensive to maintain,
especially considering the statistics stated in the base message: "The
list of source strings can be anywhere from 1 - 20000 strings having
anywhere from 1 - 500 unique terms. The average set will be roughly 300
strings with about 30 unique terms."

Maintaining sorted order requires adjustment of the structure for every
input, 300 times in the average case. A HashMap indexed by string has
one insert for each unique string, 30 times in the average case,
followed by a 30 element sort at the end. The remaining 270 operations
do a HashMap get followed by an add.

The OP did say "I'd then like to sort this data so the items with the
higher counts come first in the list." which suggests to me that
solutions that first collect the data and then sort meet the requirements.

> If I remember right, the K&R C Programming Language book has a sample
> code for word frequency.

I don't think K&R gave enough attention to the java.util package to be a
really good guide to Java programming in this area. :-)

Patricia
Hemal  Pandya - 23 Dec 2006 05:20 GMT
[...]

> Maybe you missed my comment "Once you have the set of keys, you can sort
> it using a Comparator that orders based on the associated count."?

Of course. I didn't think it likely that you would not have addressed
it. Perhaps I was thrown off because reading the above to mean to sort
the keys and I was stuck with the through that the collection of pairs
needs to be sorted.

Thank you for the explanation of performance issue.


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.