Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / November 2005

Tip: Looking for answers? Try searching our database.

HashSet performance

Thread view: 
zero - 12 Nov 2005 16:00 GMT
In a chess program I am working on I was using a HashSet to store objects
of class Pattern, which contained a HashSet of Pieces.  During execution
there can be anywhere between 1 and several hundred thousand patterns in
the Set.  I was using Sets because there should be no duplicates of Pieces
in a Pattern, and no duplicate Patterns in the set.

However, the first tests showed that the program was unacceptably slow, and
quickly threw OutOfMemoryExceptions.  I soon found that the bottleneck was
the Set.add() method.  I replaced each occurance of HashSet with ArrayList,
adding my own code to keep duplicates out.  The result is a lot faster (I
can't give an estimate, but trust me it's *a lot*), and uses three times
less memory...

My reason for picking Sets was simply the elimination of duplicates, and
the constant time for add() sounded good too.  Next time I'll think twice
before using a HashSet.
John C. Bollinger - 13 Nov 2005 03:44 GMT
> In a chess program I am working on I was using a HashSet to store objects
> of class Pattern, which contained a HashSet of Pieces.  During execution
[quoted text clipped - 12 lines]
> the constant time for add() sounded good too.  Next time I'll think twice
> before using a HashSet.

You should always think carefully when choosing what type of object best
suits your purpose.

With HashSets you need to appreciate that every add() requires the hash
code of the new object to be computed, and may involve up to size()
equals() comparisons.  If you have objects whose hash codes and equals()
methods are expensive, then storing large numbers of them in a HashSet
may be prohibitively expensive.  All Sets' equals() and hashCode()
methods are based on their contents, and they can indeed be expensive
for perverse or numerous elements.

With ArrayLists, on the other hand, neither a hash code nor any equals()
 comparisons need to be computed when a new element is added, but you
don't get guaranteed uniqueness.  It typically _is_ faster to add
elements to an ArrayList than to a HashSet, but sometimes the guaranteed
uniqueness that comes with a Set is very handy.  And HashSet generally
works nicely with elements that are suitable for it: those that have
fast (and good) hashCode() and equals() methods.

Signature

John Bollinger
jobollin@indiana.edu

Roedy Green - 13 Nov 2005 06:36 GMT
On Sat, 12 Nov 2005 22:44:12 -0500, "John C. Bollinger"
<jobollin@indiana.edu> wrote, quoted or indirectly quoted someone who
said :

> If you have objects whose hash codes and equals()
>methods are expensive, then storing large numbers of them in a HashSet
>may be prohibitively expensive.

hashCode methods are not necessarily expensive.

For example Color just returns the int value of the colour.

hashCode for String is quite expensive though it is cached so it is
computed only once. But if you are not using interned strings, you
will end up computing it for every lookup.

  public int hashCode() {
       int h = hash;
       if (h == 0) {
           int off = offset;
           char val[] = value;
           int len = count;

           for (int i = 0; i < len; i++) {
               h = 31*h + val[off++];
           }
           hash = h;
       }
       return h;
   }

Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

zero - 13 Nov 2005 12:30 GMT
> On Sat, 12 Nov 2005 22:44:12 -0500, "John C. Bollinger"
> <jobollin@indiana.edu> wrote, quoted or indirectly quoted someone who
[quoted text clipped - 26 lines]
>         return h;
>     }

I still use hashCode() and equals() to check if a certain objects is
already in the ArrayList.  I do this in my own code now, instead of
letting the HashSet do it.  However, in the hashCode() implementation I
did change from using "pieces.toString().hashCode()" (where pieces was a
HashSet) to a different algorithm (see code below).  So that may be part
of why it's so much faster now.  I did not know computing the hashCode
of a String was that expensive.

This is my new hash computation code:

toChar() returns a char, square.getRank() and square.getFile() return
ints.

public abstract class Piece
{
  // ...

  public int hashCode()
  {
     return toChar() + square.getRank() + square.getFile();
  }
}

public class Pattern
{
  private ArrayList<Piece> pieces;

  // ...

  public int hashCode()
  {
     int code = 0;

     for(Piece p : pieces)
        code += p.hashCode();

     return code;
  }
}

I have now run into a different problem: when serializing a HashMap and
then de-serializing it on a subsequent run of the program, some of the
entries are missing.  I'll do some testing, and if I can't figure it out
I'll ask you guys for help.
Chris Uppal - 13 Nov 2005 13:03 GMT
>    public int hashCode()
>    {
>       return toChar() + square.getRank() + square.getFile();
>    }

>    public int hashCode()
>    {
[quoted text clipped - 5 lines]
>       return code;
>    }

I think that part of the unexpectedly poor performance of HashSet here, is that
the above are very poor hash functions.  They will map many naturally occurring
unequal instance of Piece to the same code, and similarly for Pattern, so your
HashSets will have many collisions, and consequently perform much worse than
you might expect.

It is, in general, a bad idea to use addition as the sole basis of a hash
function, unless you know something special about the values that are being
added.  Good hash functions tend to be based on one of:

   multiplication by a prime plus addition;
   bit-shifting the values into non-overlapping (bitwise) ranges;
   bit-shifting and XORing;
   ?...

(Just XORing is nearly always a bad idea too, but it can be fine provided that
you bitshift the values first).  Some examples (all untested, and almost
certainly suboptimal):

   public int hashCode()
   {
       return
           (toChar() * 999961)
           + (square.getRank() * 997)
           + square.getFile();
   }

   public int hashCode()
   {
       return
           (toChar() << 16)
           | (square.getRank() << 8)
           | square.getFile();
   }

   public int hashCode()
   {
       return
           ((toChar() << 1)
           | (square.getRank() << 1)
           | square.getFile();
   }

At a guess, the last is probably a lot worse than the first two in this case,
although the technique is valid in general (especially if you rotate the bits
rather than just allowing them to be shifted "off the end" as in my example).

   -- chris
zero - 13 Nov 2005 14:57 GMT
>>    public int hashCode()
>>    {
[quoted text clipped - 16 lines]
> and similarly for Pattern, so your HashSets will have many collisions,
> and consequently perform much worse than you might expect.

Actually the bad HashSet performance didn't use this code.  But you're
right about the collisions, I didn't think of that.

> It is, in general, a bad idea to use addition as the sole basis of a
> hash function, unless you know something special about the values that
[quoted text clipped - 39 lines]
>
>     -- chris

Thanks for the suggestions, I'll check these out and do some testing.
Roedy Green - 13 Nov 2005 16:48 GMT
>Actually the bad HashSet performance didn't use this code.  But you're
>right about the collisions, I didn't think of that.

The only thing a hashCode has to do is avoid collisions as best it
can. Other than speed, there is nothing else to think about.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Thomas Hawtin - 14 Nov 2005 00:15 GMT
>>   public int hashCode()
>>   {
[quoted text clipped - 5 lines]
>>      return code;
>>   }

You are allocating an iterator every time this method is called.
Allocation isn't particularly expensive, but neither is it anywhere near
free. You might find it help to set -Xms to the same value as -Xmx.

When you combine this with potential collision problems, then we have
problems.

>     public int hashCode()
>     {
[quoted text clipped - 3 lines]
>             | square.getFile();
>     }

You will probably find that the performance of this hash function is
highly JRE implementation dependent. Sun's 1.4.1, IIRC, HashMap/HashSet
requires a good spread of values in the least significant bits.

The old multiply by a small but not too simple prime and accumulate
works for me (isn't so good if you have two classes that do the same
calculation with the same data).

    private int hash;
    @Override
    public int hashCode() {
        int hash = this.hash;
        if (hash == 0) {
            hash = 1; // Possibly all fields are 0...
            hash = hash*37 + charValue;
            hash = hash*37 + rank;
            hash = hash*37 + file;
            this.hash = hash;
        }
        return hash;
    }

Tom Hawtin
Signature

Unemployed English Java programmer
http://jroller.com/page/tackline/

zero - 14 Nov 2005 01:12 GMT
Thomas Hawtin <usenet@tackline.plus.com> wrote in news:4377d648$0$1479
$ed2619ec@ptn-nntp-reader01.plus.net:

> The old multiply by a small but not too simple prime and accumulate
> works for me (isn't so good if you have two classes that do the same
[quoted text clipped - 15 lines]
>
> Tom Hawtin

I currently have

in Piece:
public int hashCode()
{
  if(hash == 0)
     (toChar()*19) + (square.getRank()*37) + square.getFile();
  return hash;
}

in Pattern:
public int hashCode()
{
  if(hash == 0)
  {
     for(Piece p : pieces)
        hash += p.hashCode();
  }

  return hash;
}

Given that there are only 12 possible values for toChar, 8 for getRank,
and 8 for getFile, I think the Piece hashCode is unique, and all
possible hash values fall in a small enough range.

For Pattern an Iterator is being created, but I think it's acceptable -
especially with caching the code.  Of course the code has to be reset to
0 each time the Piece or Pattern changes, but that's still better than
re-calculating it every time it's needed.

Thanks everyone for the suggestions, and if you can think of any more
don't hesitate to let me know :-)

zero
Chris Uppal - 13 Nov 2005 14:04 GMT
> public class Pattern
> {
[quoted text clipped - 12 lines]
>    }
> }

I've just noticed something else that might be worth mentioning.  The
hashCode() is dependent on the ordering of the Pieces.  That might be
deliberate, but it seems a bit suspicious to me.  If your equals() method is
/not/ dependent on the order, then the combination will break (and may explain
your serialisation problem).

   -- chris
zero - 13 Nov 2005 14:40 GMT
>> public class Pattern
>> {
[quoted text clipped - 20 lines]
>
>     -- chris

How is it dependent on the ordering?  It's a sum of all positive numbers.
Chris Uppal - 13 Nov 2005 15:37 GMT
> How is it dependent on the ordering?  It's a sum of all positive numbers.

Sorry, my mistake.  I had mentally translated it into one of the other forms I
mentioned, which would be order dependent.

(Coughs, and changes the subject hurriedly...)

Incidentally, from the code you posted, it looks to me as if you can only have
2**11 = 2048 possible distinct Pieces.  If so then it would probably be worth
investigating creating them all upfront, and then only ever passing around
references to those instances.  A sort 2048-way Singleton pattern, or very like
the way that Enums are implemented.  Of course, you may already be doing that,
but if not then it should save a lot of space and also allow you to compare
them using == and the default (identity) hashCode().   The downside is that
moving Pieces and [de]serialisation would be more complicated.

Also, by given each Piece a unique 11-bit index, you could then probably devise
significantly better (in space and/or time, possibly both) internal
representations of a Pattern than as an ArrayList of Pieces.

   -- chris
zero - 13 Nov 2005 19:18 GMT
> Incidentally, from the code you posted, it looks to me as if you can
> only have 2**11 = 2048 possible distinct Pieces.  If so then it would
[quoted text clipped - 12 lines]
>
>     -- chris

Since it took me quite some time to program all the move rules, I think
I'll keep it as is - at least for now.  First I want to get the program
working correctly.  Once it's up and running I'll see about improving
performance.

btw I believe I have found the serialization problem I was having, and as
usual it was a stupid (*hitting myself on the head*) mistake.  Some Pieces
were being changed after the Pattern they were in was added to the HashMap.  
This of course changed the hashCode of the Pattern, causing some Patterns
in the Map to have the same hashCode but different locations.

I think I'll go crawl in some dark hole until the shame subsides...
Roedy Green - 13 Nov 2005 16:46 GMT
> return toChar() + square.getRank() + square.getFile();

you could make that slightly better distributed with
return toChar() ^ square.getRank() ^ square.getFile();

You could make it even better distributed with:
return (toChar() << 8) | (square.getRank()<<4) | square.getFile();

Both of these are still pretty quick.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Mark Thornton - 13 Nov 2005 10:35 GMT
>> In a chess program I am working on I was using a HashSet to store
>> objects of class Pattern, which contained a HashSet of Pieces.  During
[quoted text clipped - 7 lines]
> With HashSets you need to appreciate that every add() requires the hash
> code of the new object to be computed, and may involve up to size()

The problem with large HashSet's is not usually the computation of hash
codes, but the memory consumed by each object added to the set. For
member of the set there is an Entry object (it implements Map.Entry). So
you can end up pushing the limits of memory (or encountering thrashing)
earlier than with more compact Collection implementations. The
performance hit is caused by either excessive garbage collection (cure
by specifying a larger minimum heap size) or memory paging activity.

Mark Thornton


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.