Java Forum / General / July 2006
hash codes
rmacnak@gmail.com - 26 Jul 2006 23:01 GMT What propose/benefit does Object.hashCode() serve? Why would I want to use this?
VisionSet - 26 Jul 2006 23:14 GMT > What propose/benefit does Object.hashCode() serve? Why would I want to > use this? It is used to perform fast hashing ie equality checks of objects. the equals() method is often slow to evaluate, hashcode should provide a quickly evaluated value such that 2 equal objects always produce the same hashcode. If 2 equal hashcodes do not however indicate an object is equal, but rather if this is the case the calling code should also then call equals(). This reduces hugely the number of times in total equals() is called on say a large collection of objects.
-- Mike W
rmacnak@gmail.com - 27 Jul 2006 00:50 GMT > > What propose/benefit does Object.hashCode() serve? Why would I want to > > use this? [quoted text clipped - 9 lines] > -- > Mike W Thanks alot, but are there any other purposes besides faster equals()? I think I heard something about checksums and hashs.
Mark Space - 27 Jul 2006 01:44 GMT > Thanks alot, but are there any other purposes besides faster equals()? > I think I heard something about checksums and hashs. Besides the obvious (using hashcodes in a hashmap to lookup/search faster), there's Bloom Filters. Bloom Filters speed computation of a true/false value when the computation is very expensive.
http://www.beachsoftware.com/bloom/white_paper.html
Simon - 27 Jul 2006 09:22 GMT VisionSet schrieb:
>> What propose/benefit does Object.hashCode() serve? Why would I want to >> use this? [quoted text clipped - 3 lines] > quickly evaluated value such that 2 equal objects always produce the same > hashcode. Assume two objects are represented by n and m bits, respectively. Then, obviously, comparison is in O(min(n,m)). Can you give an example where hash code computation for both objects is in o(min(n,m))?
What you can do, of course, is cache the hash value. Then, hash value computation often is only a lookup and then is in O(1). Also, if you want to compare to a remote object, comparing the hash value first can save communication cost. However, I don't see how you can save computation time.
> If 2 equal hashcodes do not however indicate an object is equal, > but rather if this is the case the calling code should also then call > equals(). This reduces hugely the number of times in total equals() is I agree. So actually, if hash values are cached, comparing hashValue()s should be the first thing to do inside the equals() method (as opposed to letting the user do this each time before he uses the equals). Do you know why String.equals() doen't do this?
Cheers, Simon
Patricia Shanahan - 27 Jul 2006 14:14 GMT > VisionSet schrieb: ...
>> If 2 equal hashcodes do not however indicate an object is equal, >> but rather if this is the case the calling code should also then call [quoted text clipped - 4 lines] > user do this each time before he uses the equals). Do you know why > String.equals() doen't do this? The hash values are cached, and are pre-checked before calling equals, in classes such as HashMap.
I'm not sure that calculating and caching the hash values in String would be a gain. It would depend on the average number of comparisons per String, and also the lengths of the matching prefixes.
HashMap needs the hash code anyway, so pre-checking is effectively free.
Patricia
Simon - 27 Jul 2006 15:06 GMT Patricia Shanahan schrieb:
>> VisionSet schrieb: > ... [quoted text clipped - 11 lines] > The hash values are cached, and are pre-checked before calling equals, > in classes such as HashMap. Probably I expressed myself incorrectly. I should have said "*since* hash values are cached..." I'm not asking why String doesn't cache them, but why String doesn't use them in its equals() method.
> I'm not sure that calculating and caching the hash values in String > would be a gain. It would depend on the average number of comparisons > per String, and also the lengths of the matching prefixes. Let's forget about the prefix length for a second and assume that the strings differ only in the last character compared. Then, comparison and hash code computation need, I believe, almost the same computation time. So you gain something as soon as a String is involved in more than only a few (I would guess 2) comparisons.
Now, assume the strings differ after a short prefix. Then, as you said, you can easily lose time by computing hash values instead of starting the comparison and terminating with false quickly. Now you can do several things:
1 - Use the hash value at least if it was computed by some other reason before (this is free)
2 - Many Strings of length n are created in linear time anyway (e.g. by reading them from a file or stream, creating them with a StringBuffer, user input, ...). So you can afford computing the hash value on the fly or after it has been created and still have linear time. The only way to create a string of length n in sublinear time I can think of is by substring() etc. E.g. the constructors String(char[]), String(char[],int,int), and String(String) could call hashValue() almost for free since they need linear time anyway.
3 - In String.equals() you could compute the hash value as you go along. If the Strings are equal, you have computed the hash code for free and you can cache it. If you terminate earlier, you can throw away your intermediate result. You could still improve this by completing the hash value computation if the matching prefix has length at least n/2, say. Ok, that doesn't improve the situation much for adversarial input, but for typical inputs it could be competitive. And you can also randomize that, and...
What? You think this is getting clumsy...
Well. One could use at least the first option since this one is essentially free.
Cheers, Simon
Patricia Shanahan - 27 Jul 2006 15:50 GMT > Patricia Shanahan schrieb: >>> VisionSet schrieb: [quoted text clipped - 14 lines] > are cached..." I'm not asking why String doesn't cache them, but why String > doesn't use them in its equals() method. You seem to be ignoring the issue of which class caches hash codes when.
String does not cache hash codes. HashMap, for example, does cache the hash code for each key in the map, and only calculates the probe key hash code once during a get.
>> I'm not sure that calculating and caching the hash values in String >> would be a gain. It would depend on the average number of comparisons [quoted text clipped - 12 lines] > 1 - Use the hash value at least if it was computed by some other reason before > (this is free) Checking the hash codes costs three comparisons. The condition to abandon is that the hashes are different and neither of them is zero, the value used for a hash that has not yet been calculated.
If either hash code has not been calculated, or the strings being compared are equal, or differ but have equal hash codes, or differ somewhere in the first few characters, it is a net loss.
Note that the pre-check behavior of HashMap tends to reduce the probability of comparisons between strings whose hash codes have already been calculated. While it does force hash code calculation, it never calls equals for a pair of objects with different hash codes.
It is possible that it might be a gain, but not obvious. It would require experiments across a wide range of benchmarks.
> 2 - Many Strings of length n are created in linear time anyway (e.g. by reading > them from a file or stream, creating them with a StringBuffer, user input, ...). [quoted text clipped - 3 lines] > String(char[]), String(char[],int,int), and String(String) could call > hashValue() almost for free since they need linear time anyway. String(char[]) etc. use System.arraycopy. Although it is also linear time, for any reasonably long string is it much faster than character by character copying.
> 3 - In String.equals() you could compute the hash value as you go along. If the > Strings are equal, you have computed the hash code for free and you can cache [quoted text clipped - 7 lines] > > Well. One could use at least the first option since this one is essentially free. Nope. If it does any good at all, it costs three comparisons.
Patricia
Simon - 27 Jul 2006 16:20 GMT > You seem to be ignoring the issue of which class caches hash codes when. > > String does not cache hash codes. From the source of String.java, JDK 1.6.0
/** Cache the hash code for the string */ private int hash; // Default to 0
[...]
public int hashCode() { int h = hash; if (h == 0) { int off = offset; char val[] = value; int len = count;
for (int i = 0; i < len; i++) { h = 31*h + val[off++]; } hash = h; } return h; }
> It is possible that it might be a gain, but not obvious. It would > require experiments across a wide range of benchmarks. ACK.
>> Well. One could use at least the first option since this one is >> essentially free. > > Nope. If it does any good at all, it costs three comparisons. You wrote in an earlier post:
> HashMap needs the hash code anyway, so pre-checking is effectively free. I'm just saying: If String.equals() does have the hash code anyway (by chance), pre-checking is effectively free. Three comparisons aren't all that much are they? You need 2 comparisons in any iteration of the comparison loop.
Cheers, Simon
Patricia Shanahan - 27 Jul 2006 19:12 GMT >> You seem to be ignoring the issue of which class caches hash codes when. >> [quoted text clipped - 6 lines] > > [...] Sorry about that, I did check and found it does cache when hashCode is called, but forgot I had already typed that line.
>> It is possible that it might be a gain, but not obvious. It would >> require experiments across a wide range of benchmarks. [quoted text clipped - 12 lines] > pre-checking is effectively free. Three comparisons aren't all that much are > they? You need 2 comparisons in any iteration of the comparison loop. There are two important differences between HashMap and String on this:
1. HashMap must calculate the hash code in order to find the right bucket, so it never has to test whether the hash code has been calculated. It only has to do one compare, not three, to know that two strings have different hash codes. String may nor may not already have one or both hash codes, so it has to test whether they exist.
2. HashMap is going to have to incur the cost of a method call, at an absolute minimum, if it does call equals. String is already sitting in its own equals method, all ready to start comparing characters.
A lot depends on both the probability of an equals call for two String instances that have different hash codes and both have been calculated, and the average number of iterations of the comparison loop, given the strings are different.
Patricia
Patricia Shanahan - 27 Jul 2006 21:28 GMT ...
> I'm just saying: If String.equals() does have the hash code anyway (by chance), > pre-checking is effectively free. Three comparisons aren't all that much are > they? You need 2 comparisons in any iteration of the comparison loop. ...
I should perhaps explain why I think comparisons can be a big deal. The problem is not the comparison itself, which can be just as fast as an integer add, but the conditional branch that follows.
One of the nastiest, most disruptive things you can do to a processor is force it to empty its pipeline and begin again.
The processor has to decide what to load next in the pipeline immediately after loading the branch, many cycles before it executes and the processor finds out if it is taken or not taken. The processor has to guess, and if it guesses wrong all the code in the pipeline following the branch is junk, and must be thrown away. It will have to wait for the correct instructions to work their way through before it can do any more useful work.
To reduce the frequency, processor designs incorporate various attempts to predict the branch direction for frequently executed branches. Branch prediction depends on the same sort of issues as caching - essentially, it is an attempt to predict behavior based on recent history.
However, I don't think the branches involved in testing for hash codes unequal but both non-zero are going to be particularly predictable, so there is a good chance that at least one of them will be mispredicted.
Again, this is something that can only really be evaluated by experiment, but the stakes are far higher than adding a few integer instructions.
Patricia
Thomas Fritsch - 27 Jul 2006 16:27 GMT > String does not cache hash codes. I have looked into Sun's "String.java" source (Java1.4.2), and as I see it, String does cache its hash code:
<quote> [...] /** Cache the hash code for the string */ private int hash = 0; [...] public int hashCode() { int h = hash; if (h == 0) { int off = offset; char val[] = value; int len = count;
for (int i = 0; i < len; i++) { h = 31*h + val[off++]; } hash = h; } return h; } [...] </quote>
> HashMap, for example, does cache the > hash code for each key in the map, and only calculates the probe key > hash code once during a get. Agreed.
 Signature Thomas
cp - 26 Jul 2006 23:30 GMT The purpose of computing a hash code is that the hash should be quicker to calculate and compare than computing full object equality. eg. someObject.equals(otherObejct);
Chris Uppal - 27 Jul 2006 10:06 GMT > The purpose of computing a hash code is that the hash should be quicker to > calculate and compare than computing full object equality. eg. > someObject.equals(otherObejct); Not necessarily true, nor is it a requirement. For the hash to be performance-effective, it only needs to be fast enough that the time saved by computing it /once/ is greater than the time saved by not having to do /many/ equality comparisons.
-- chris
Lasse Reichstein Nielsen - 27 Jul 2006 22:41 GMT > What propose/benefit does Object.hashCode() serve? Why would I want to > use this? On top of the other benefits mentioned, it is also important that hashCode generates a numeric value from an object. This allows it to be used in numeric calculations like picking a bucket in HashMap.
You could make a "hash value" from objects that allowed quick comparisons and was always equal if the objects were equal, but that was not a number, but that would not sufficient for implementing a HashMap.
/L
 Signature Lasse Reichstein Nielsen - lrn@hotpop.com DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html> 'Faith without judgement merely degrades the spirit divine.'
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|