Java Forum / General / December 2006
hash two keys to one index
Mark - 15 Nov 2006 23:38 GMT Hi,
I need to figure out how to map two different keys (a string and an integer) to the same index. I don't even know where to start... but I don't want two copies of the the hash table.
Any ideas?
Eric Sosman - 16 Nov 2006 01:23 GMT > Hi, > [quoted text clipped - 3 lines] > > Any ideas? My only "idea" is that you should take a few steps back from the immediate tactical issue and explain your overall strategic goal. *Why* do you want "ABC" and new Integer(42) to produce the same hashCode(), and what good do you think it will do you? On first sight, at least, it seems daft. Pray explain the well-concealed sanity that lies beneath.
 Signature Eric Sosman esosman@acm-dot-org.invalid
Mark - 16 Nov 2006 01:57 GMT > > Hi, > > [quoted text clipped - 14 lines] > Eric Sosman > esosman@acm-dot-org.invalid Well, if I want to pull up all the data pertaining to a student, but I only know his last name or his student number, not both...
Patricia Shanahan - 16 Nov 2006 02:22 GMT >>> Hi, >>> [quoted text clipped - 16 lines] > Well, if I want to pull up all the data pertaining to a student, but I > only know his last name or his student number, not both... That is a very good reason for having a HashMap with names as keys, and a HashMap with student numbers as keys...
Note that every non-null non-primitive variable or expression in Java is a pointer to some object, not the value of an object. That means that the two HashMap instances can point to the same Student instances.
For the specific case of name and number, you could get away with a single table. It might be slightly slower than two tables, but probably only slightly. However, mixed key tables cannot take advantage of generics, and you must be very sure that you never need e.g. to index by two different numbers, where there could be a risk of hitting on the wrong key.
Generally, two tables is much cleaner.
Patricia
hongbo9t@yahoo.com.cn - 16 Nov 2006 02:47 GMT Mark - 18 Nov 2006 09:28 GMT > >>> Hi, > >>> [quoted text clipped - 34 lines] > > Patricia Well, two tables is okay I guess as long as there aren't actually two instances of the data. So basically... I have two arrays of references...
Eric Sosman - 18 Nov 2006 14:17 GMT >>>>>Hi, >>>>> [quoted text clipped - 33 lines] > instances of the data. > So basically... I have two arrays of references... Almost: you have two Maps that associate keys with references. But you seem to have grasped the main point, which is that the references point to the objects but are not the object instances. What goes into the map are pairs of (reference to key, reference to target), not the keys and the targets themselves.
Faulty analogy: You have your best friends' phone numbers in your Palm Pilot, and maybe written down in a few places, and for that matter you could always look 'em up in the phone book. But your friends' actual, physical phones are in none of these places; the things you store and retrieve are "references" to those phones. Through the magic of the phone network you can use a phone number to make a friend's phone ring, just as through the magic of the JVM you can use a reference to make an object instance do something. But the friend's phone, like the object instance, is not "in your possession;" all you're holding is the reference.
 Signature Eric Sosman esosman@acm-dot-org.invalid
Mark - 21 Nov 2006 21:39 GMT > >>>>>Hi, > >>>>> [quoted text clipped - 54 lines] > Eric Sosman > esosman@acm-dot-org.invalid I've run into a bit of a snag.
When I insert an object into the hash table, I pass in (a reference to) the object, and a hash code, which can either be based on the last name or the student number. To deal with collisions, I use a quadratic probing technique...
void insert(Object obj, int hash) throws HashTableFull { if( count == table.length ) throw new HashTableFull("Cannot insert: hash table is full"); int key = probe(hash); table[key] = obj; ++count; }
int sign(int i) { if( i < 0 ) return -1; return 1; }
int probe(int hash) { int probe = 0;
while(true) { int key = (hash + probe*probe*sign(probe)) % table.length;
if( table[key] == null ) return key;
if( probe <= 0 ) probe = -probe+1; else probe = -probe; } }
Which, in theory, should work nicely. However... when retrieving the object, how do I know when I've found the right one?
Object find(int hash) throws ObjectNotFound { int probe = 0;
for(int i=0; i<table.length; i++) { int key = (hash + probe*probe*sign(probe)) % table.length;
if( /* table[key] == ??? */ ) return table[key];
if( probe <= 0 ) probe = -probe+1; else probe = -probe; } throw new ObjectNotFound("Object not found"); }
Mark - 21 Nov 2006 21:57 GMT > > >>>>>Hi, > > >>>>> [quoted text clipped - 117 lines] > throw new ObjectNotFound("Object not found"); > } Rereading what you wrote, you indicate that a reference to the key should also be stored. I didn't really understand why a reference to the key should be stored, if the key could easily be derived from hashing the object. However... I suppose it could be useful when trying to find an object...
For example, if one object has a hash "5" and another object has hash "15", and the hash table capacity is 10... then 5%10 and 15%10 both equal 5..so they have the same key, but different hashes.. then we can check if we've found the right object like so...
Object find(int hash) throws ObjectNotFound { int probe = 0;
for(int i=0; i<m_table.length; i++) { int key = (hash + probe*probe*sign(probe)) % m_table.length;
if( m_hash[key] == hash ) return m_table[key];
if( probe <= 0 ) probe = -probe+1; else probe = -probe; }
throw new ObjectNotFound("Object not found"); }
using the modified insert...
void insert(Object obj, int hash) throws HashTableFull { if( count == m_table.length ) throw new HashTableFull("Cannot insert: hash m_table is full"); int key = probe(hash); m_table[key] = obj; m_hash[key] = hash; ++count; }
am I correct?
Eric Sosman - 21 Nov 2006 22:55 GMT Mark wrote On 11/21/06 16:57,:
>>I've run into a bit of a snag. >> >>When I insert an object into the hash table, I pass in (a reference to) >>the object, and a hash code, which can either be based on the last name >>or the student number. To deal with collisions, I use a quadratic >>probing technique... If you're trying to learn how to implement a hash table, fine: go ahead and fiddle around to your curiosity's limits. But if you're just trying to use a hash table in the solution of some other problem, I recommend java.util.HashMap to your attention. The wheel has already been invented, and is not too badly out of round ...
>> void insert(Object obj, int hash) throws HashTableFull >> { [quoted text clipped - 30 lines] >> >>Which, in theory, should work nicely. I haven't studied it closely, but at a quick glance it doesn't look noticeably better than linear probing. Every hash value that starts by probing a particular bucket [k] then follows the same path: [k+1], [k-1], [k+4], [k-4], ... Also, I haven't convinced myself that this probe sequence will eventually visit every table location instead of just cycling back on itself.
Also, the % operator doesn't do quite what you want. For example, -7 % 4 yields -3, not 1.
>>However... when retrieving the >>object, how do I know when I've found the right one? [quoted text clipped - 5 lines] > hashing the object. However... I suppose it could be useful when > trying to find an object... The "key" is the actual Student name or number, not the hash code derived from it nor the location of some bucket in the table. You hash the key, do some numerical hocus-pocus, and derive a bucket number. In that bucket, if it's not empty, you find a (key,value) pair. You compare the bucket's key to the search key to find out whether this is the desired pair or whether you need to probe further.
Note: Since you stop searching as soon as you've found a bucket whose key equals the search key, it follows that the keys (not necessarily their hash codes) must be unique. If you've got two different Students both named "John Smith" there'll be trouble. You could, if you wanted, have a Map that associated a name with a collection of all the Students sharing that same name; many collections would contain just one Student apiece, but some might contain several.
Recommended reading: D.E. Knuth, "The Art of Computer Programming, Volume III: Sorting and Searching." Some people are intimidated by Knuth's rigor (I confess that some sections are beyond my own mathematical skills), but he always ends the deeper derivations with a straightforward statement of the conclusions -- very helpful for those of us unlikely to win a Fields Medal any time soon.
 Signature Eric.Sosman@sun.com
Mark - 24 Nov 2006 05:53 GMT Ahh.. yes. I was getting my terminology mixed up. Ive been using "hash" to mean the value that is returned by the hash function, and "key" to mean the array index (after applying hocus pocus to the hash).
> You compare the bucket's key to > the search key to find out whether this is the desired pair or > whether you need to probe further. I was thinking about passing a reference to the object into the "find" method and then comparing the two objects... but of couse, this really doesn't make sense. If you already know what the object is, you don't need to search for it. However passing just the key makes much more sense.
> If you're trying to learn how to implement a hash table, > fine: go ahead and fiddle around to your curiosity's limits. > But if you're just trying to use a hash table in the solution > of some other problem, I recommend java.util.HashMap to your > attention. The wheel has already been invented, and is not > too badly out of round ... It's a learning process :) And evidentally a needed one too.
> I haven't studied it closely, but at a quick glance it > doesn't look noticeably better than linear probing. Every [quoted text clipped - 3 lines] > will eventually visit every table location instead of just > cycling back on itself. Quadratic probing is actually considerably better. Yes, each inserted object will follow the same path, but you won't get these large clusters around the bucket, which means other objects that are inserted nearby won't need to probe as far. However, a random or rehash probing technique would be much better.
> Also, the % operator doesn't do quite what you want. > For example, -7 % 4 yields -3, not 1. Heheh... I realized this when I started getting "ArrayIndexOutOfBounds" exceptions. Easily fixed.
> Recommended reading: D.E. Knuth, "The Art of Computer > Programming, Volume III: Sorting and Searching." Some people [quoted text clipped - 3 lines] > conclusions -- very helpful for those of us unlikely to win > a Fields Medal any time soon. I'm not really a textbook sort of guy (more hands on), but I'll keep this is mind.
Thank you so much for your help.
Eric Sosman - 24 Nov 2006 13:52 GMT > Ahh.. yes. I was getting my terminology mixed up. Ive been using > "hash" to mean the value that is returned by the hash function, and [quoted text clipped - 8 lines] > doesn't make sense. If you already know what the object is, you don't > need to search for it. [...] No, but you might want to ask a different kind of question, like "Is this Student enrolled in Basket Weaving 101?" Given a Student and a collection of all the enrollees in BW101, the problem isn't so much to find the Student as to determine whether he is or isn't in the collection. That's why Java has HashSet (and other Set flavors, too).
> [in re Knuth:] > I'm not really a textbook sort of guy (more hands on), but I'll keep > this is mind. As in many other endeavors, hands-on will suffice for a while and take you for a certain distance. But eventually you will find that progress depends on being able to gain from the experience of others without expending all the time they did while acquiring it. (Unless you happen to be immortal with a lot of time on your hands, or such a blazing genius that you can re-invent an entire field of knowledge overnight and without aid.) To paraphrase a fairly smart person, it is foolish to turn down a giant's offer to allow you to stand on his shoulder.
 Signature Eric Sosman esosman@acm-dot-org.invalid
Chris Uppal - 24 Nov 2006 14:37 GMT > > [in re Knuth:] > > I'm not really a textbook sort of guy (more hands on), but I'll keep [quoted text clipped - 4 lines] > that progress depends on being able to gain from the experience of > others without expending all the time they did while acquiring it. [...]
> To paraphrase a fairly smart > person, it is foolish to turn down a giant's offer to allow you to > stand on his shoulder. Though that shoulder may itself to be too high and dry to be reachable without the aid of a lesser giant.
Or, to put it less indirectly: maybe a less daunting text would be a better place to start (assuming the desire to start at all). If so, then I'd recommend Sedgewick's one-volume "Algorithms in <Language>" book(s). (That's the one-volume versions, not the significantly expanded, and not yet complete, n-volume versions.)
-- chris
Mark - 25 Nov 2006 05:03 GMT > > Ahh.. yes. I was getting my terminology mixed up. Ive been using > > "hash" to mean the value that is returned by the hash function, and [quoted text clipped - 33 lines] > Eric Sosman > esosman@acm-dot-org.invalid Well... I *had* a textbook (~$130), but then I decided it wasn't worth my money since almost all the information I need is on the internet, so I returned it. Nevertheless, I've learnt about "chaining"... never would have thought of it myself. Much easier to implement AND much more efficient than probing. I'm using the binary search tree class I used earlier for the chains.
This is going to be one crazy abstract data type... it can retrieve any student by last name or student number in O(1), or administrator via a password in O(1)...and print all students in ascending order by last name or student number in O(n)... however, at the minor cost of worsening insertion time to O(log n). Heh.. I'm using two hash tables, and two binary search trees... kind of space inefficient, yes.
Mark - 25 Nov 2006 05:18 GMT > Heh.. I'm using two hash tables, > and two binary search trees... kind of space inefficient, yes. Scratch that. A doubly linked list instead of two hash tables. Just realized I couldn't insert the same object in both trees and have them sorted differently.
Chris Uppal - 25 Nov 2006 13:44 GMT > Nevertheless, I've learnt about "chaining"... never > would have thought of it myself. Much easier to implement AND much > more efficient than probing. When considering the efficiency of data-structures in the modern world, it's always worth thinking a bit about cache effects. Both open chaining, and any form of double hashing, have poor effects on locality-of-reference, with memory accesses jumping all over the shop rather than sequential.
May not make much difference in Java (which tends to be cache-unfriendly anyway for objects), but worth thinking about. Especially if you end up considering on-disk structures (I realise that isn't relevant to your immediate task).
-- chris
Mark Thornton - 25 Nov 2006 15:31 GMT > May not make much difference in Java (which tends to be cache-unfriendly anyway > for objects), Newly allocated objects are usually adjacent in memory regardless of their type. Depending on the garbage collector, objects with links to each other may well be copied to nearby locations in the main heap. Many C/C++ allocators might often place objects allocated in succession a considerable distance apart if the heap is badly fragmented.
Mark Thornton
Chris Uppal - 27 Nov 2006 13:47 GMT > Newly allocated objects are usually adjacent in memory regardless of > their type. Depending on the garbage collector, objects with links to > each other may well be copied to nearby locations in the main heap. [I'm mostly repeating myself, but just to bring some closure to this]
Sure, but hash-tables don't tend to place sequentially-allocated objects in sequential slots. The GC will probably group the chain-links together, but probably not the objects refered to by the links. In the end open chaining adds another layer of indirection, double-hashing adds an extra "bounce" to the memory access patterns. It's up to the developer to choose (with or without measurement) a design which appeals; my point is only that considerations of the /number/ of probes doesn't exhaust the influences on performance.
-- chris
Mark Thornton - 27 Nov 2006 19:28 GMT >>Newly allocated objects are usually adjacent in memory regardless of >>their type. Depending on the garbage collector, objects with links to [quoted text clipped - 5 lines] > sequential slots. The GC will probably group the chain-links together, but > probably not the objects refered to by the links. Allegedly the train collector is likely to do exactly that depending on what other links exist and whether the objects are of similar 'age' to the hash table.
> measurement) a design which appeals; my point is only that considerations of > the /number/ of probes doesn't exhaust the influences on performance. Indeed not. Which is why it is good to have many different implementations of Map.
Mark Thornton
Mark - 05 Dec 2006 03:27 GMT > > Nevertheless, I've learnt about "chaining"... never > > would have thought of it myself. Much easier to implement AND much [quoted text clipped - 10 lines] > > -- chris What about with LRU associative mapping? Then it doesn't really matter where it's stored in memory -- although you might only have one useful value in each cache block rather than the multiple you might get by not using a sparse array.
tam@lheapop.gsfc.nasa.gov - 20 Nov 2006 16:08 GMT > >>> Hi, > >>> [quoted text clipped - 34 lines] > > Patricia An alternative implementation is two hashes where one hash gives the student number given the name and the other gives the student record given the student number. I suspect this would mimic the underlying database more closely and it seems a bit cleaner to me. There could still be a convenience method that does studentRecord.get(studentNumber.get(studentName)) so from the user perspective it's just as convenient. [Might be slower but I doubt that will be an issue.] When we create a new student record, we only update a single hash, and when we update a name (e.g., after marriage) we only update items relating to the name, not the the student record hash. Regards, Tom McGlynn
Mark - 21 Nov 2006 19:53 GMT > > >>> Hi, > > >>> [quoted text clipped - 50 lines] > Regards, > Tom McGlynn In that case, the student number is dependent on the student's name, no? Unfortunately, that doesn't work for me.
tam@lheapop.gsfc.nasa.gov - 22 Nov 2006 18:10 GMT ...>
> > An alternative implementation is two hashes where > > one hash gives the student number given the name and the other [quoted text clipped - 15 lines] > no? > Unfortunately, that doesn't work for me. Not sure quite what you mean by 'the number is dependent on the name'. You would need to be able to map to a student's number given a student name but the name does not determine the ID -- they are associated by a table. If you have students who have a name but no number then this approach won't work but if every student has a number if should be fine.
In practice one normally uses a student number because student names are not reliable indices for the database. It's common for multiple students to have the same name and there can be many variations on how the name is coded. E.g., I've seen just my last name written as McGlynn, Mc Glynn, MCGLYNN or Mcglynn. Numbers (or some equivalent coding) insure uniqueness and standardize how the a coding for the record indices.
The database records are indexed by the number/code and when users want to query by name there is a translation from the name to the number and then a search for the associated record.
This allows graceful handling of the case when two students have the same name. If multiple numbers match a name, the user is asked for additional information to resolve the ambiguity.
This code is cleanly separated from the code that uses the student records. There is a single method for getting the student's records once we have resolved any ambiguities about which student is involved.
Of course, this may not fit with your database or your other requirements but it's a bit of what database programmers call database normalization poking into a Java implementation.
Regards, Tom McGlynn
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|