Java Forum / General / November 2005
HashSet performance
zero - 12 Nov 2005 16:00 GMT In a chess program I am working on I was using a HashSet to store objects of class Pattern, which contained a HashSet of Pieces. During execution there can be anywhere between 1 and several hundred thousand patterns in the Set. I was using Sets because there should be no duplicates of Pieces in a Pattern, and no duplicate Patterns in the set.
However, the first tests showed that the program was unacceptably slow, and quickly threw OutOfMemoryExceptions. I soon found that the bottleneck was the Set.add() method. I replaced each occurance of HashSet with ArrayList, adding my own code to keep duplicates out. The result is a lot faster (I can't give an estimate, but trust me it's *a lot*), and uses three times less memory...
My reason for picking Sets was simply the elimination of duplicates, and the constant time for add() sounded good too. Next time I'll think twice before using a HashSet.
John C. Bollinger - 13 Nov 2005 03:44 GMT > In a chess program I am working on I was using a HashSet to store objects > of class Pattern, which contained a HashSet of Pieces. During execution [quoted text clipped - 12 lines] > the constant time for add() sounded good too. Next time I'll think twice > before using a HashSet. You should always think carefully when choosing what type of object best suits your purpose.
With HashSets you need to appreciate that every add() requires the hash code of the new object to be computed, and may involve up to size() equals() comparisons. If you have objects whose hash codes and equals() methods are expensive, then storing large numbers of them in a HashSet may be prohibitively expensive. All Sets' equals() and hashCode() methods are based on their contents, and they can indeed be expensive for perverse or numerous elements.
With ArrayLists, on the other hand, neither a hash code nor any equals() comparisons need to be computed when a new element is added, but you don't get guaranteed uniqueness. It typically _is_ faster to add elements to an ArrayList than to a HashSet, but sometimes the guaranteed uniqueness that comes with a Set is very handy. And HashSet generally works nicely with elements that are suitable for it: those that have fast (and good) hashCode() and equals() methods.
 Signature John Bollinger jobollin@indiana.edu
Roedy Green - 13 Nov 2005 06:36 GMT On Sat, 12 Nov 2005 22:44:12 -0500, "John C. Bollinger" <jobollin@indiana.edu> wrote, quoted or indirectly quoted someone who said :
> If you have objects whose hash codes and equals() >methods are expensive, then storing large numbers of them in a HashSet >may be prohibitively expensive. hashCode methods are not necessarily expensive.
For example Color just returns the int value of the colour.
hashCode for String is quite expensive though it is cached so it is computed only once. But if you are not using interned strings, you will end up computing it for every lookup.
public int hashCode() { int h = hash; if (h == 0) { int off = offset; char val[] = value; int len = count;
for (int i = 0; i < len; i++) { h = 31*h + val[off++]; } hash = h; } return h; }
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
zero - 13 Nov 2005 12:30 GMT > On Sat, 12 Nov 2005 22:44:12 -0500, "John C. Bollinger" > <jobollin@indiana.edu> wrote, quoted or indirectly quoted someone who [quoted text clipped - 26 lines] > return h; > } I still use hashCode() and equals() to check if a certain objects is already in the ArrayList. I do this in my own code now, instead of letting the HashSet do it. However, in the hashCode() implementation I did change from using "pieces.toString().hashCode()" (where pieces was a HashSet) to a different algorithm (see code below). So that may be part of why it's so much faster now. I did not know computing the hashCode of a String was that expensive.
This is my new hash computation code:
toChar() returns a char, square.getRank() and square.getFile() return ints.
public abstract class Piece { // ...
public int hashCode() { return toChar() + square.getRank() + square.getFile(); } }
public class Pattern { private ArrayList<Piece> pieces;
// ...
public int hashCode() { int code = 0;
for(Piece p : pieces) code += p.hashCode();
return code; } }
I have now run into a different problem: when serializing a HashMap and then de-serializing it on a subsequent run of the program, some of the entries are missing. I'll do some testing, and if I can't figure it out I'll ask you guys for help.
Chris Uppal - 13 Nov 2005 13:03 GMT > public int hashCode() > { > return toChar() + square.getRank() + square.getFile(); > }
> public int hashCode() > { [quoted text clipped - 5 lines] > return code; > } I think that part of the unexpectedly poor performance of HashSet here, is that the above are very poor hash functions. They will map many naturally occurring unequal instance of Piece to the same code, and similarly for Pattern, so your HashSets will have many collisions, and consequently perform much worse than you might expect.
It is, in general, a bad idea to use addition as the sole basis of a hash function, unless you know something special about the values that are being added. Good hash functions tend to be based on one of:
multiplication by a prime plus addition; bit-shifting the values into non-overlapping (bitwise) ranges; bit-shifting and XORing; ?...
(Just XORing is nearly always a bad idea too, but it can be fine provided that you bitshift the values first). Some examples (all untested, and almost certainly suboptimal):
public int hashCode() { return (toChar() * 999961) + (square.getRank() * 997) + square.getFile(); }
public int hashCode() { return (toChar() << 16) | (square.getRank() << 8) | square.getFile(); }
public int hashCode() { return ((toChar() << 1) | (square.getRank() << 1) | square.getFile(); }
At a guess, the last is probably a lot worse than the first two in this case, although the technique is valid in general (especially if you rotate the bits rather than just allowing them to be shifted "off the end" as in my example).
-- chris
zero - 13 Nov 2005 14:57 GMT >> public int hashCode() >> { [quoted text clipped - 16 lines] > and similarly for Pattern, so your HashSets will have many collisions, > and consequently perform much worse than you might expect. Actually the bad HashSet performance didn't use this code. But you're right about the collisions, I didn't think of that.
> It is, in general, a bad idea to use addition as the sole basis of a > hash function, unless you know something special about the values that [quoted text clipped - 39 lines] > > -- chris Thanks for the suggestions, I'll check these out and do some testing.
Roedy Green - 13 Nov 2005 16:48 GMT >Actually the bad HashSet performance didn't use this code. But you're >right about the collisions, I didn't think of that. The only thing a hashCode has to do is avoid collisions as best it can. Other than speed, there is nothing else to think about.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Thomas Hawtin - 14 Nov 2005 00:15 GMT >> public int hashCode() >> { [quoted text clipped - 5 lines] >> return code; >> } You are allocating an iterator every time this method is called. Allocation isn't particularly expensive, but neither is it anywhere near free. You might find it help to set -Xms to the same value as -Xmx.
When you combine this with potential collision problems, then we have problems.
> public int hashCode() > { [quoted text clipped - 3 lines] > | square.getFile(); > } You will probably find that the performance of this hash function is highly JRE implementation dependent. Sun's 1.4.1, IIRC, HashMap/HashSet requires a good spread of values in the least significant bits.
The old multiply by a small but not too simple prime and accumulate works for me (isn't so good if you have two classes that do the same calculation with the same data).
private int hash; @Override public int hashCode() { int hash = this.hash; if (hash == 0) { hash = 1; // Possibly all fields are 0... hash = hash*37 + charValue; hash = hash*37 + rank; hash = hash*37 + file; this.hash = hash; } return hash; }
Tom Hawtin
 Signature Unemployed English Java programmer http://jroller.com/page/tackline/
zero - 14 Nov 2005 01:12 GMT Thomas Hawtin <usenet@tackline.plus.com> wrote in news:4377d648$0$1479 $ed2619ec@ptn-nntp-reader01.plus.net:
> The old multiply by a small but not too simple prime and accumulate > works for me (isn't so good if you have two classes that do the same [quoted text clipped - 15 lines] > > Tom Hawtin I currently have
in Piece: public int hashCode() { if(hash == 0) (toChar()*19) + (square.getRank()*37) + square.getFile(); return hash; }
in Pattern: public int hashCode() { if(hash == 0) { for(Piece p : pieces) hash += p.hashCode(); }
return hash; }
Given that there are only 12 possible values for toChar, 8 for getRank, and 8 for getFile, I think the Piece hashCode is unique, and all possible hash values fall in a small enough range.
For Pattern an Iterator is being created, but I think it's acceptable - especially with caching the code. Of course the code has to be reset to 0 each time the Piece or Pattern changes, but that's still better than re-calculating it every time it's needed.
Thanks everyone for the suggestions, and if you can think of any more don't hesitate to let me know :-)
zero
Chris Uppal - 13 Nov 2005 14:04 GMT > public class Pattern > { [quoted text clipped - 12 lines] > } > } I've just noticed something else that might be worth mentioning. The hashCode() is dependent on the ordering of the Pieces. That might be deliberate, but it seems a bit suspicious to me. If your equals() method is /not/ dependent on the order, then the combination will break (and may explain your serialisation problem).
-- chris
zero - 13 Nov 2005 14:40 GMT >> public class Pattern >> { [quoted text clipped - 20 lines] > > -- chris How is it dependent on the ordering? It's a sum of all positive numbers.
Chris Uppal - 13 Nov 2005 15:37 GMT > How is it dependent on the ordering? It's a sum of all positive numbers. Sorry, my mistake. I had mentally translated it into one of the other forms I mentioned, which would be order dependent.
(Coughs, and changes the subject hurriedly...)
Incidentally, from the code you posted, it looks to me as if you can only have 2**11 = 2048 possible distinct Pieces. If so then it would probably be worth investigating creating them all upfront, and then only ever passing around references to those instances. A sort 2048-way Singleton pattern, or very like the way that Enums are implemented. Of course, you may already be doing that, but if not then it should save a lot of space and also allow you to compare them using == and the default (identity) hashCode(). The downside is that moving Pieces and [de]serialisation would be more complicated.
Also, by given each Piece a unique 11-bit index, you could then probably devise significantly better (in space and/or time, possibly both) internal representations of a Pattern than as an ArrayList of Pieces.
-- chris
zero - 13 Nov 2005 19:18 GMT > Incidentally, from the code you posted, it looks to me as if you can > only have 2**11 = 2048 possible distinct Pieces. If so then it would [quoted text clipped - 12 lines] > > -- chris Since it took me quite some time to program all the move rules, I think I'll keep it as is - at least for now. First I want to get the program working correctly. Once it's up and running I'll see about improving performance.
btw I believe I have found the serialization problem I was having, and as usual it was a stupid (*hitting myself on the head*) mistake. Some Pieces were being changed after the Pattern they were in was added to the HashMap. This of course changed the hashCode of the Pattern, causing some Patterns in the Map to have the same hashCode but different locations.
I think I'll go crawl in some dark hole until the shame subsides...
Roedy Green - 13 Nov 2005 16:46 GMT > return toChar() + square.getRank() + square.getFile(); you could make that slightly better distributed with return toChar() ^ square.getRank() ^ square.getFile();
You could make it even better distributed with: return (toChar() << 8) | (square.getRank()<<4) | square.getFile();
Both of these are still pretty quick.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Mark Thornton - 13 Nov 2005 10:35 GMT >> In a chess program I am working on I was using a HashSet to store >> objects of class Pattern, which contained a HashSet of Pieces. During [quoted text clipped - 7 lines] > With HashSets you need to appreciate that every add() requires the hash > code of the new object to be computed, and may involve up to size() The problem with large HashSet's is not usually the computation of hash codes, but the memory consumed by each object added to the set. For member of the set there is an Entry object (it implements Map.Entry). So you can end up pushing the limits of memory (or encountering thrashing) earlier than with more compact Collection implementations. The performance hit is caused by either excessive garbage collection (cure by specifying a larger minimum heap size) or memory paging activity.
Mark Thornton
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|