Java Forum / General / March 2007
String based hashCode
jlukar@gmail.com - 08 Mar 2007 00:31 GMT The javadoc says the formula to calcuate the hashCode for String is:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
this can result in negative numbers which is not desirable for me.
>From the formula this is not clear why as always positive number should be generated. Any idea folks ?
Second question:
I am counting on hashCode to help me with generating a unique Integer key using two concatenated strings the combination of which is gauranteed to be unique. .
e.g. ("LASTNAME"+"."+"EMPLOYEE-CATEGORY") will always be unique.
so
("LASTNAME"+"."+"EMPLOYEE-CATEGORY") .hashCode() is what I wanted to use.
will above hashCode() always be unique given that the string combination used to generate it is unique ?
help??? k.
Patricia Shanahan - 08 Mar 2007 01:06 GMT > The javadoc says the formula to calcuate the hashCode for String is: > [quoted text clipped - 4 lines] > should be generated. > Any idea folks ? Java int arithmetic is 2's complement signed, and can wrap around from positive to negative. For example, Integer.MAX_VALUE+1 is negative.
> Second question: > > I am counting on hashCode to help me with generating a unique Integer > key using two concatenated strings the combination of which is > gauranteed to be unique. . ...
String hash codes are not unique. If a type has more than 2**32 different values, its hash code cannot possibly be unique.
A hash that is unique for all values in its domain is called a "perfect hash". There are algorithms that will generate a perfect hash for a small fixed set of strings. Would that help? If not, you need to resort to the conventional solution - use hashing to reduce the number of potential collisions, but then search for the exact one you want.
Presumably, you are doing this to solve a higher level problem. If you were to describe that problem, someone may suggest a better, more Java-ish, solution.
Patricia
Lew - 08 Mar 2007 02:38 GMT jlukar@gmail.com wrote:
>> The javadoc says the formula to calcuate the hashCode for String is: >> >> s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] (N.b., the symbol ^ in this context refers to exponentiation, not XOR.)
>> this can result in negative numbers which is not desirable for me. Why are negative numbers undesirable?
Incidentally, this hash code is not arbitrary. It is difficult to make a well-designed hash function and this one happens to suit well for 32-bit twos-complement arithmetic.
See Knuth for more on this.
<http://www-cs-faculty.stanford.edu/~knuth/> <http://en.wikipedia.org/wiki/Donald_Knuth> <http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming>
I believe it's /Volume 2 - Seminumerical Algorithms/ that covers this topic.
Beware: changing hashCode() for a class mandates that you also override equals().
-- Lew
Chris Smith - 08 Mar 2007 05:09 GMT > Beware: changing hashCode() for a class mandates that you also override equals(). The necessity is the other way around. You may choose any hashCode implementation you like, so long as it is consistent with equals. Any hash code implementation that is repeatable is consistent with Object's implementation of equals, so hashCode can literally do practically anything.
It's when you override equals that maintaining that consistency (namely, any two objects that are equal also have the same hash code) requires some effort.
 Signature Chris Smith
jlukar@gmail.com - 08 Mar 2007 22:47 GMT > jlu...@gmail.com wrote: > >> The javadoc says the formula to calcuate the hashCode for String is: [quoted text clipped - 6 lines] > > Why are negative numbers undesirable? Hi. I guess to think about it within the context of how I am using it a negative number is ok. I am using the generated number as a key into the HashMap for the object. But the comments from all implies that this is futile because the ("String1"+"String2").hashCode() will not guarntee a unique key anyways.
> Incidentally, this hash code is not arbitrary. It is difficult to make a > well-designed hash function and this one happens to suit well for 32-bit [quoted text clipped - 5 lines] > <http://en.wikipedia.org/wiki/Donald_Knuth> > <http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming> thank you for the links. I was hoping for a JAVA API for this without the use of a algorithm.
> I believe it's /Volume 2 - Seminumerical Algorithms/ that covers this topic. > > Beware: changing hashCode() for a class mandates that you also override equals(). > > -- Lew jlukar@gmail.com - 08 Mar 2007 22:40 GMT > jlu...@gmail.com wrote: > > The javadoc says the formula to calcuate the hashCode for String is: [quoted text clipped - 8 lines] > Java int arithmetic is 2's complement signed, and can wrap around from > positive to negative. For example, Integer.MAX_VALUE+1 is negative. thank you for that. it makes sense now.
> > Second question: > [quoted text clipped - 10 lines] > hash". There are algorithms that will generate a perfect hash for a > small fixed set of strings. Would that help? If not, you need to resort If by "small fixed set" you mean something like only "A","F","d" will appear then that wont work.
I mean that the lengght of string is fixed an no more than 11 chars overall. You see I know that LASTNAME can not be more than 5 characters. I also know that EMPLOYEE-CATEGORY can not be bigger than 5 characters. So the combination is LASTNAME+.+EMPLOYEE-CATEGORY = 11 characters total.
I used LASTNAME for clarity sake. The field is actually something else and can't be more than 5 chars. Same with EMPLOYEE-CATEGORY.
> to the conventional solution - use hashing to reduce the number of > potential collisions, but then search for the exact one you want. > > Presumably, you are doing this to solve a higher level problem. If you > were to describe that problem, someone may suggest a better, more > Java-ish, solution. The problem is that I want to generate uniqueue integer keys for the combonation of "String1"+.+"String2" so that I can use it as a key in a HashMap type of cache.
I wanted to do it this way so that I can alway derive the key from "String1" and "String2" that exist in my domain object and therefore do a cache lookup based on a key derived from "String1" and "String2".
thanks much for your comments already.
> Patricia Patricia Shanahan - 08 Mar 2007 22:47 GMT ...
> The problem is that I want to generate uniqueue integer keys for the > combonation of "String1"+.+"String2" so that I can use it as a key in > a HashMap type of cache. Why not use a HashMap with the concatenated string as key? HashMap knows all about potential collisions, and deals with them correctly.
Patricia
jlukar@gmail.com - 08 Mar 2007 22:56 GMT > jlu...@gmail.com wrote: > [quoted text clipped - 8 lines] > > Patricia You mean instead of using the int I simply use the "String1"."String2" for the key ?
The only reason I was trying to get a unique integer out of it was because the domain Object I was using (lets say Employee) has a employee ID as a key. Some employees from one building (building A) all have employee ID's. Employees from another building (building B) do not have employee ID's yet the legacy system I am working on uses employee ID as the key into a HashMap.
I was trying to use the same interface to generate integer unique ID's for the employess from building B.
thanks
Lew - 09 Mar 2007 02:29 GMT Patricia Shanahan wrote:
>> Why not use a HashMap with the concatenated string as key? HashMap knows >> all about potential collisions, and deals with them correctly.
> You mean instead of using the int I simply use the > "String1"."String2" for the key ? Yes.
-- Lew
Joshua Cranmer - 08 Mar 2007 23:06 GMT >> jlu...@gmail.com wrote: >>> The javadoc says the formula to calcuate the hashCode for String is: [quoted text clipped - 29 lines] > 5 characters. So the combination is LASTNAME+.+EMPLOYEE-CATEGORY > = 11 characters total. Can't be unique. Each character has 2^16 possible values and, not including the period, there are 10 values, so there are 2^16^10 = 2^160 ~ 10^48 different possible combinations. Of course, I assume that you are using only ~128 = 2^7 values of the characters (7-bit ASCII), but even so, a hash would produce 2^7^10 = 2^70 ~ 10^21 combinations.
> I used LASTNAME for clarity sake. The field is actually something else > and can't be more than 5 chars. Same with EMPLOYEE-CATEGORY. [quoted text clipped - 13 lines] > "String1" and "String2" that exist in my domain object and therefore > do a cache lookup based on a key derived from "String1" and "String2". Most HashMap types of cache implement bucketed algorithms or such.
> thanks much for your comments already. > >> Patricia Eric Sosman - 09 Mar 2007 03:37 GMT > [concerning non-uniqueness of hash codes] > I mean that the lengght of string is fixed an no more than 11 chars > overall. You see I know that LASTNAME can not be more than 5 > characters. I also know that EMPLOYEE-CATEGORY can not be bigger than > 5 characters. So the combination is LASTNAME+.+EMPLOYEE-CATEGORY > = 11 characters total. Let's do some arithmetic.
There is exactly one String of length zero.
There are 65536 different Strings of length one.
There are 65536^2 = 4,294,967,296 Strings of length two (using ^ to indicate exponentiation).
...
There are 65536^11 Strings of length eleven.
All told, there are (65536^12 - 1) / 65535 Strings of length eleven or less. That's just a little under 1e53 (the 11-character Strings make up the vast majority).
Meanwhile, there are 2^32 = 4,294,967,296 distinct hashCode() values, because that's how many different `int' values exist.
4,294,967,296 may seem like a lot of hashCode() values, but next to 95,782,432,828,056,470,036,404,813,049,000,000,000,000,- 000,000,000,000+ it is as nothing. If you try to marry the Strings and the hashCode()s, you will commit bigamy on a truly grand scale.
And that is why your hope for a universal and unique hash code for Strings -- even for short Strings -- is in vain.
 Signature Eric Sosman esosman@acm-dot-org.invalid
jlukar@gmail.com - 09 Mar 2007 19:30 GMT that was funny. I also now know what bigamy is which I had to lookup.
Thanks very much for this detail explanation. It is very clear why my string hashCode can not possibly result in a unique key fitting into an integer.
I have since settled for using the concatentated string object as a secondary key, still utilizing my original domain business object interface but just filling bogus int. for the records primary key.
thanks. k.
> jlu...@gmail.com wrote: > > [concerning non-uniqueness of hash codes] [quoted text clipped - 36 lines] > Eric Sosman > esos...@acm-dot-org.invalid Eric Sosman - 10 Mar 2007 02:31 GMT > that was funny. I also now know what bigamy is which I had to lookup. Obligatory warning:
"Kids: Don't try this at home!"
 Signature Eric Sosman esosman@acm-dot-org.invalid
Mark Rafn - 08 Mar 2007 18:31 GMT >The javadoc says the formula to calcuate the hashCode for String is: > s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] >this can result in negative numbers which is not desirable for me. Don't use hashCode for something it's not suitable for. If negative numbers are undesirable, you're using it wrong, and probably want a different method.
>>From the formula this is not clear why as always positive number >>should be generated. Think overflow.
>Second question: >I am counting on hashCode to help me with generating a unique Integer >key using two concatenated strings the combination of which is >gauranteed to be unique. . Then you're using it wrong again. hashCode() has no guarantee of uniqueness, and for many data types CANNOT guarantee uniqueness.
>("LASTNAME"+"."+"EMPLOYEE-CATEGORY") will always be unique. So equals() should be based on that. great.
>("LASTNAME"+"."+"EMPLOYEE-CATEGORY") .hashCode() is what I wanted to >use. Not unique. There are only 2^32 possible hashCode() return values, and there are FAR more possible strings than that, so duplicate hash for different values are alwas an issue. This is called "hash collision".
>will above hashCode() always be unique given that the string >combination used to generate it is unique ? No. Not even close. There's no possible hashing function that will guarantee uniqueness. ALWAYS account for collisions.
For some purposes, you can get away with using a bigger hash and deciding that you'll live with the astronomically low chance of collision. You could use a 128-bit hash (say, MD5), which makes accidental collisions very very unlikely, but you can't call it hashCode(), because it's not an int. -- Mark Rafn dagon@dagon.net <http://www.dagon.net/>
Patricia Shanahan - 08 Mar 2007 21:06 GMT ...
> No. Not even close. There's no possible hashing function that will > guarantee uniqueness. ALWAYS account for collisions. I think you are overstating the situation. Given a limited set of possible key strings, it is possible to construct a perfect hash that will never map two strings in the set to the same code.
Of course, the Java String hash is not, and cannot be, a perfect hash because there are more String values than there are hash codes. On the other hand, the hashCode function for Integer is perfect.
Patricia
Mark Rafn - 08 Mar 2007 21:50 GMT >... >> No. Not even close. There's no possible hashing function that will >> guarantee uniqueness. ALWAYS account for collisions.
>I think you are overstating the situation. Given a limited set of >possible key strings, it is possible to construct a perfect hash that >will never map two strings in the set to the same code. As long as there is an enumerated list of less than 2^32 of them, and you're willing to do the work, sure. But the OP had something called "lastname" in his example, of which there are an unlimited number of possibilities. -- Mark Rafn dagon@dagon.net <http://www.dagon.net/>
jlukar@gmail.com - 08 Mar 2007 23:00 GMT > >... > >> No. Not even close. There's no possible hashing function that will [quoted text clipped - 7 lines] > willing to do the work, sure. But the OP had something called "lastname" in > his example, of which there are an unlimited number of possibilities. Hi. Let me clarify. The Lastname is used to oversimplify my example. In actuality the "lastname" is a business domain field that represents a string of 5 characters. So it could be "ABCDE" or any combination of. That combined with "EMPLOYEE-CATEGORY" which is also 5 characters gives a combined number of 10 characters. So something like this "ABCDE"."XIEPE" is a posiblity. But any permutation of characters could be in those two fields.
thanks
> -- > Mark Rafn d...@dagon.net <http://www.dagon.net/> Joshua Cranmer - 08 Mar 2007 23:20 GMT >>> ... >>>> No. Not even close. There's no possible hashing function that will [quoted text clipped - 19 lines] >> -- >> Mark Rafn d...@dagon.net <http://www.dagon.net/> Just saw this. There are 5^5*26^5 (I assume) = (130)^5, 5 * lg (130) ~ 35.1, so there are too many to make into a unique int. That said, if you only had 10 characters with 5 possible entries each, a potential would be to treat it as a 10-digit base 5 number: ((a[0]*5+a[1])*5+a[2])*5+...
jlukar@gmail.com - 09 Mar 2007 00:31 GMT > jlu...@gmail.com wrote: > >>> ... [quoted text clipped - 26 lines] > would be to treat it as a 10-digit base 5 number: > ((a[0]*5+a[1])*5+a[2])*5+... Thats more than I'd like to chew on. I really needed a simple unique integer key derived from two strings concat of which is guaruanteed to be unique within my business application.
looks like it might not be possible in my particular situation using a simple Java API. I might have to write my own key generator derived from the string concat.
anything in apache commons library that might help me there if anyone knows about.
thanks much for all input.
Mark Rafn - 09 Mar 2007 01:12 GMT >Thats more than I'd like to chew on. I really needed a simple unique >integer key derived from two strings concat of which is guaruanteed to >be unique within my business application. One possible way is to use a database. Store the string with a sequence-defined identifier, and use that identifier as your ID, guaranteed unique. Or just use the string as the key.
You haven't really explained why you need it to be unique, though. Java's HashMap implementation (and every other implementation of hash I've seen) handles collisions just fine, and you can too. -- Mark Rafn dagon@dagon.net <http://www.dagon.net/>
Lew - 09 Mar 2007 02:33 GMT jlukar@gmail.com <jlukar@gmail.com> wrote:
>> Thats more than I'd like to chew on. I really needed a simple unique >> integer key derived from two strings concat of which is guaruanteed to [quoted text clipped - 7 lines] > HashMap implementation (and every other implementation of hash I've seen) > handles collisions just fine, and you can too. Really he should just use the concatenated String as the key and forget all about using an int.
-- Lew
jlukar@gmail.com - 09 Mar 2007 19:25 GMT > jlu...@gmail.com <jlu...@gmail.com> wrote: > >Thats more than I'd like to chew on. I really needed a simple unique [quoted text clipped - 4 lines] > sequence-defined identifier, and use that identifier as your ID, guaranteed > unique. Or just use the string as the key. can't do DB. It is a real-time high throughput system which means no DB in between a transaction.
> You haven't really explained why you need it to be unique, though. Java's > HashMap implementation (and every other implementation of hash I've seen) > handles collisions just fine, and you can too. I want to be able to retrieve a HashMap value using a key. So I don't understand your comment about HashMap handling collisions fine. (which couple of other posters referred to as well).
e.g.
if I have "xyz123" and "abc345" both happen to get the same key value of 124444 and I search the HashMap for key 124444, I will get two objects returned. That would not be desirable.
Do you mean I should then do another check and see that I should pick "xyz123" or "abc345" ??
thanks much.
> -- > Mark Rafn d...@dagon.net <http://www.dagon.net/> Patricia Shanahan - 09 Mar 2007 20:10 GMT ...
>>You haven't really explained why you need it to be unique, though. Java's >>HashMap implementation (and every other implementation of hash I've seen) [quoted text clipped - 12 lines] > Do you mean I should then do another check and see that I should pick > "xyz123" or "abc345" ?? No, we mean that the HashMap keys should be the strings "xyz123" and "abc345". It is possible, but unlikely, that they have the same hash code. HashMap knows that, and deals with it.
If you do:
myMap.put("xyz123",emp1); myMap.put("abc345",emp2);
where emp1 and emp2 are different Employee references, then myMap.get("xyz123") will definitely return emp1, not emp2, regardless of hashCode collisions.
Patricia
djthomp - 09 Mar 2007 20:25 GMT On Mar 9, 1:25 pm, "jlu...@gmail.com" <jlu...@gmail.com> wrote:
> I want to be able to retrieve a HashMap value using a key. So I don't > understand your comment about HashMap handling collisions fine. (which [quoted text clipped - 10 lines] > > thanks much. A HashMap is a reliable mapping between a set of keys and a collection of values, and any object in java can be used as a key or stored as a value.
If you do myMap.put("xyz123", objectA) and myMap.put("abc345", objectB), then myMap.get("xyz123") will always return objectA and myMap.get("abc345") will always return objectB (provided that you don't overwrite either entry by using the same key with a different object). It doesn't matter what the hashCode result is for either string, HashMap knows how to keep your objects distinct even if their keys have the same hashCode as long as the keys themselves are not equal.
If all of the objects you are storing are represented uniquely by the 11 character string you have described, then you should be able to use that 11 character string as the direct key into the HashMap without worrying about the details of the implementation.
Patricia Shanahan - 09 Mar 2007 20:56 GMT ...
> A HashMap is a reliable mapping between a set of keys and a collection > of values, and any object in java can be used as a key or stored as a > value. The only limitation on this is that the key object's classes MUST have equals and hashCode implemented in accordance with the contract described in the Object API documentation.
Of course, String does have correct equals and hashCode implementations.
Patricia
Tom Hawtin - 09 Mar 2007 22:30 GMT > ... >> A HashMap is a reliable mapping between a set of keys and a collection [quoted text clipped - 4 lines] > equals and hashCode implemented in accordance with the contract > described in the Object API documentation. Point of pedantry: And it mustn't do anything silly with recursion. For instance, using a HashMap as a key for a HashMap it uses as a key (if you follow). It's also possible to break deserialisation dependent upon the order (and nesting) of the object.
> Of course, String does have correct equals and hashCode implementations. More pedantry: Way back in JLS1 String was defined to have an unimplementable hashCode.
Tom Hawtin
Chris Uppal - 10 Mar 2007 16:46 GMT > More pedantry: Way back in JLS1 String was defined to have an > unimplementable hashCode. While that is primarily of academic interest, I don't think that mentioning it is /pedantry/, as such, is it ?
;-)
-- chris
Mark Rafn - 09 Mar 2007 22:59 GMT >> One possible way is to use a database. Store the string with a >> sequence-defined identifier, and use that identifier as your ID, guaranteed >> unique. Or just use the string as the key.
>can't do DB. It is a real-time high throughput system which means no >DB in between a transaction. Using a sequence in a DB doesn't mean you need to query it every time. You can cache dozens or hundreds of them, use them as needed, and throw away the ones you don't use.
If you need a truly unique number (say, for a primary key), this is really your best bet.
>> You haven't really explained why you need it to be unique, though. Java's >> HashMap implementation (and every other implementation of hash I've seen) [quoted text clipped - 3 lines] >understand your comment about HashMap handling collisions fine. (which >couple of other posters referred to as well). If you're using a java.util.HashMap, why were you worried about generating integers, and why did you care if they were negative? It's keyed on Object, so your string ID is fine.
>if I have "xyz123" and "abc345" both happen to get the same key >value of 124444 and I search the HashMap for key 124444, I will get >two objects returned. That would not be desirable. Right, so search on your ACTUAL key - either a Key object you define, or a string concatenation with some separator that's not part of the strings. Let the Map implementation deal with how the hashing works.
>Do you mean I should then do another check and see that I should pick >"xyz123" or "abc345" ?? No. If "xyz123abc345" is your business key, use it as your Map key. There's no reason to mess with picking an integer for a key. -- Mark Rafn dagon@dagon.net <http://www.dagon.net/>
Daniel Pitts - 09 Mar 2007 01:12 GMT On Mar 8, 4:31 pm, "jlu...@gmail.com" <jlu...@gmail.com> wrote:
> > jlu...@gmail.com wrote: > > >>> ... [quoted text clipped - 39 lines] > > thanks much for all input. It is very unlikely that you will find a hash code that is unique. As a matter of fact, you may have misunderstood the use of hashes.
A hash code helps you decide which "bucket" to look in for the key that you want. If you have 5 buckets and 30 unique keys, you aren't going to have 30 unique hash codes, but 5. This lets you reduce your average look up by a factor of 5. You only have to search (on average) a bucket with 6 items in it.
So, to reiterate. A hash isn't the same as a key. The hash is a property of the key, and is defined so that if keyA is equal to keyB, then keyA's hash is equal to keyB's hash. The converse isn't necessarily true.
This lets you know that if hashA != hashB, then A != B. if hashA == hashB, A might = B.
Having said all that... If you restrict the range of values an object can take, then you *might* be able to construct a hash function that results in unique hashs for unique objects.
If your business field is exactly 5 uppercase letters, then that gives you a possibility of 26^5 combinations, which fits within a 24 bit number. You can get your hash value by treating the string of 5 uppercase letters as a base 26 number, A = 0, B = 1, ..., Z = 25. The string AZWAA would be the number 0 * 26^0 + 25 * 26^1 + 22 * 26^2 + 0 * 26^3 + 0 * 26^4 Notice that each term is a factor of a power of 26.
Hope this helps.
jlukar@gmail.com - 09 Mar 2007 19:19 GMT > On Mar 8, 4:31 pm, "jlu...@gmail.com" <jlu...@gmail.com> wrote: > > It is very unlikely that you will find a hash code that is unique. As > a matter of fact, you may have misunderstood the use of hashes. Yes I believe so. I thought it can be used in place of a key as it implied in my mind a unique key.
> This lets you know that if hashA != hashB, then A != B. if hashA == > hashB, A might = B. aha!
> If your business field is exactly 5 uppercase letters, then that gives > you a possibility of 26^5 combinations, which fits within a 24 bit [quoted text clipped - 3 lines] > + 0 * 26^3 + 0 * 26^4 > Notice that each term is a factor of a power of 26. business field will be a 11 chars combined. The calculation another poster did, makes this impossible for me.
> Hope this helps. very much so. thank you.
Ben Kaufman - 10 Mar 2007 23:02 GMT >The javadoc says the formula to calcuate the hashCode for String is: > [quoted text clipped - 24 lines] >help??? >k. Why do you believe that ("LASTNAME"+"."+"EMPLOYEE-CATEGORY") will always unique?
Ben
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|