Java Forum / General / December 2005
Hash table: linear probe vs. chaining
Ian Pilcher - 09 Dec 2005 18:43 GMT Background:
I'm working on a Java program, and I need to associate some additional data with a large number of objects. I do not have the freedom to modify or wrap the existing object classes. Additionally, I don't want to interfere with the garbage collection of the objects.
I've come to the conclusion that I need a "WeakIdentityHashMap," a data structure which combines the features of an IdentityHashMap and a WeakHashMap. After a couple of attempts to build such a thing on top of one of the existing classes, I don't think that enough of the internals are exposed, and I'm going to have to create my own hash table.
The question:
Sun's documentation states that they use a linear-probe hash table for the IdentityHashMap; the GNU Classpath does the same thing. As far as I can tell, the other hash table-based collections use chaining.
It seems to me that a linear-probe hash table is a bad idea if mappings are going to be removed from the table very often. When searching for a given key, the search must continue until the key itself is found or a slot which has NEVER been occupied is found. (Finding a slot which is CURRENTLY empty is not enough, since it may have been occupied when the key being sought was inserted into the map.)
Anyone see a flaw in my reasoning?
Thanks!
 Signature ======================================================================== Ian Pilcher i.pilcher@comcast.net ========================================================================
Ben Pfaff - 09 Dec 2005 19:40 GMT > It seems to me that a linear-probe hash table is a bad idea if mappings > are going to be removed from the table very often. When searching for a [quoted text clipped - 4 lines] > > Anyone see a flaw in my reasoning? You are overlooking the algorithm for removing an item from a linear-probing hash table. In Knuth this is Algorithm 6.4R (Deletion with linear probing).
 Signature Ben Pfaff email: blp@cs.stanford.edu web: http://benpfaff.org
Ian Pilcher - 09 Dec 2005 21:42 GMT > You are overlooking the algorithm for removing an item from a > linear-probing hash table. In Knuth this is Algorithm 6.4R > (Deletion with linear probing). Only because I don't know about it. (I'm not a professional programmer; what knowledge I have about hash tables comes from Google and the GNU Classpath source code.)
Can you point me to a description of this algorithm on the web?
Trying to think it through myself, the goal would presumably be to eliminate "holes" caused by removing a mapping that has previously caused an entry to "bounce" to a subsequent slot. How might one go about this...
When an entry is removed, search "forward" in the table for an entry that could occupy the newly freed slot. ("Forward" being the direction in which entries "bounce".) In this case, the search can end at any empty slot.
If an eligible entry is found, move it to the newly freed slot and recursively remove it from its original slot.
How does that sound?
 Signature ======================================================================== Ian Pilcher i.pilcher@comcast.net ========================================================================
Ben Pfaff - 10 Dec 2005 19:14 GMT >> You are overlooking the algorithm for removing an item from a >> linear-probing hash table. In Knuth this is Algorithm 6.4R [quoted text clipped - 5 lines] > > Can you point me to a description of this algorithm on the web? I couldn't easily find a description of it on the web. There is an implementation of it in a hash table of mine (in C), which can be viewed here: http://cvs.savannah.gnu.org/viewcvs/pspp/pspp/src/hash.c?rev=1.17&view=markup
 Signature Ben Pfaff email: blp@cs.stanford.edu web: http://benpfaff.org
Chuck F. - 12 Dec 2005 15:25 GMT >>> You are overlooking the algorithm for removing an item from a >>> linear-probing hash table. In Knuth this is Algorithm 6.4R [quoted text clipped - 10 lines] > be viewed here: > http://cvs.savannah.gnu.org/viewcvs/pspp/pspp/src/hash.c?rev=1.17&view=markup He can also look at the coding for my hashlib package, which allows deletions. See:
<http://cbfalconer.home.att.net/download/hashlib.zip>
 Signature Read about the Sony stealthware that is a security leak, phones home, and is generally illegal in most parts of the world. Also the apparent connivance of the various security software firms. http://www.schneier.com/blog/archives/2005/11/sonys_drm_rootk.html
Ben Pfaff - 12 Dec 2005 23:32 GMT >>>> You are overlooking the algorithm for removing an item from a >>>> linear-probing hash table. In Knuth this is Algorithm 6.4R [quoted text clipped - 14 lines] > > <http://cbfalconer.home.att.net/download/hashlib.zip> hashlib doesn't implement the algorithm in question.
 Signature "The fact is, technical people are better off not looking at patents. If you don't know what they cover and where they are, you won't be knowingly infringing on them. If somebody sues you, you change the algorithm or you just hire a hit-man to whack the stupid git." --Linus Torvalds
Chuck F. - 13 Dec 2005 05:16 GMT >>>>> You are overlooking the algorithm for removing an item from a >>>>> linear-probing hash table. In Knuth this is Algorithm 6.4R [quoted text clipped - 16 lines] > > hashlib doesn't implement the algorithm in question. True. However it does implement deletions in a linear probing table, which is not exactly especially common.
 Signature Read about the Sony stealthware that is a security leak, phones home, and is generally illegal in most parts of the world. Also the apparent connivance of the various security software firms. http://www.schneier.com/blog/archives/2005/11/sonys_drm_rootk.html
Chuck F. - 13 Dec 2005 05:16 GMT >>>>> You are overlooking the algorithm for removing an item from a >>>>> linear-probing hash table. In Knuth this is Algorithm 6.4R [quoted text clipped - 16 lines] > > hashlib doesn't implement the algorithm in question. True. However it does implement deletions in a linear probing table, which is not exactly especially common.
 Signature Read about the Sony stealthware that is a security leak, phones home, and is generally illegal in most parts of the world. Also the apparent connivance of the various security software firms. http://www.schneier.com/blog/archives/2005/11/sonys_drm_rootk.html
Charles Richmond - 13 Dec 2005 05:27 GMT > >>> You are overlooking the algorithm for removing an item from a > >>> linear-probing hash table. In Knuth this is Algorithm 6.4R [quoted text clipped - 15 lines] > > <http://cbfalconer.home.att.net/download/hashlib.zip> Welcome back, CBF!!!
I have been kicking around the idea of creating a hash table myself. I *need* to learn more about hashing, so I plan to study Knuth on the subject. Of course, I intend to study *your* hashlib also, CBF.
I do *not* need deletion, but I plan to use a re-hash function in case of a collision.
Charles Richmond - 13 Dec 2005 05:27 GMT > >>> You are overlooking the algorithm for removing an item from a > >>> linear-probing hash table. In Knuth this is Algorithm 6.4R [quoted text clipped - 15 lines] > > <http://cbfalconer.home.att.net/download/hashlib.zip> Welcome back, CBF!!!
I have been kicking around the idea of creating a hash table myself. I *need* to learn more about hashing, so I plan to study Knuth on the subject. Of course, I intend to study *your* hashlib also, CBF.
I do *not* need deletion, but I plan to use a re-hash function in case of a collision.
MSCHAEF.COM - 09 Dec 2005 20:37 GMT ...
>It seems to me that a linear-probe hash table is a bad idea if mappings >are going to be removed from the table very often. As Ben Pfaff pointed out, there are ways to solve this.
One advantage of linear probing is that it can have better performance on modern hardware. Linear scans through memory are significantly cheaper than linked list traversals, even if the linked list nodes are adjacent in memory. The linked list really has two disadvantages:
* It requires a both a pointer access and a comparison for each node checked. This means that morr data has to be loaded from memory to check a list node.
* In linear probing, adjacent nodes are guaranteed to be adjacent to each other in the address space. This makes it more likely that the memory subsystem can predict the access pattern and prefetch data before you need it.
I've done some simple tests on my Centrino laptop: A linked list scan of adjacent nodes was half the performance [1] of a linear array scan. Randomly ordering the list elements in memory could drop performance to (not by) 4-5% of the array scan.
-Mike
1] This was a simple microbenchmark involving adding lists of 32-bit integers. Pointers were also 32-bits, so a linked list having half the performance of an array is not too suprising.
 Signature http://www.mschaef.com
Ian Pilcher - 09 Dec 2005 21:51 GMT > One advantage of linear probing is that it can have better performance on > modern hardware. Linear scans through memory are significantly cheaper [quoted text clipped - 9 lines] > subsystem can predict the access pattern and prefetch data before > you need it. I think the fact that this is Java, may reduce the impact of these points, particularly since my keys are going to be Java WeakReference objects.
In the linear probing case, each key in the table will actually be a reference to a WeakReference (or a subclass thereof), which can be used to obtain a reference to the actual key. In the chaining case, I can use a subclass of WeakReference as my list node, so the number of objects will be the same, as will the number of references that need to be traversed per comparison.
If Java would allow me to create an actual array of objects, I think your logic would hold, but since objects are *always* accessed through references in Java, I think that the chained table should perform just as well.
What do you think?
Thanks for the feedback, BTW!
 Signature ======================================================================== Ian Pilcher i.pilcher@comcast.net ========================================================================
Thomas Hawtin - 09 Dec 2005 23:53 GMT > I think the fact that this is Java, may reduce the impact of these > points, particularly since my keys are going to be Java WeakReference > objects. As your key? Conventionally entries subclass WeakReferences to point to your key, with a conventional strong reference to the value. Although I guess a custom implementation could collapse the key into the entry.
I don't understand why Sun have not added a WeakIdentityHashMap. WeakHashMap only works if you have complete control.
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4500542
> In the linear probing case, each key in the table will actually be a > reference to a WeakReference (or a subclass thereof), which can be used > to obtain a reference to the actual key. In the chaining case, I can > use a subclass of WeakReference as my list node, so the number of > objects will be the same, as will the number of references that need to > be traversed per comparison. The current custom weak identity hash table in ThreadLocal has a WeakReference for an entry, but does linear probing. I can't quite work out why. Neither that implementation nor my (leak-free) cyclic-list reimplementation are nice when it comes to expired WeakReferences.
> If Java would allow me to create an actual array of objects, I think > your logic would hold, but since objects are *always* accessed through > references in Java, I think that the chained table should perform just > as well. Creating (resettable) arrays of WeakReferences with good performance would, I think, be challenging.
It's perhaps surprising that HashMap is not implemented like IdentityHashMap. Even implemented with chains using offsets into the array would appear to be better.
Tom Hawtin
 Signature Unemployed English Java programmer http://jroller.com/page/tackline/
Ian Pilcher - 10 Dec 2005 01:59 GMT > As your key? Conventionally entries subclass WeakReferences to point to > your key, with a conventional strong reference to the value. Although I > guess a custom implementation could collapse the key into the entry. If I follow the linear probe path, I will likely just use an Object[] twice as large as the capacity of the hash table. The "keys," at even indices, would be references to WeakReference objects (or a simple sub- class); the "values," at odd indices, would be references to the value objects.
> I don't understand why Sun have not added a WeakIdentityHashMap. > WeakHashMap only works if you have complete control. > > http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4500542 LMAO. The best part is that their own evaluation, from back in 2001 or so, agrees with your assessment.
> Creating (resettable) arrays of WeakReferences with good performance > would, I think, be challenging. Not sure I follow, but I don't think it matters. My point was that the "locality" argument for a linear probe table only holds water when the keys are actually *in* the array. When the array only holds references to the keys, examining all the keys that correspond to a particular hash code might actually be marginally faster in a chained hash table.
 Signature ======================================================================== Ian Pilcher i.pilcher@comcast.net ========================================================================
Arash Partow - 10 Dec 2005 13:55 GMT Hi Ian,
The fact of the matter is that memory is cheap these days so USE it!. Implement chaining, as a resolution to hash collisions, DO NOT use linked lists, instead use dynamic arrays (in the case of java use a vector as your bucket). Assess the data or types of data you will be hashing with various types of hash functions to find a balance between data, hash function and initial bucket sizes. A good hash function would have an average of 1-2 in chaining length with an upper 1 STD chaining length of about 5 (aka less that 0.1 % of the buckets should be of that chain length).
DO NOT use linear probing - as it is only a mental exercise for computational theorists and is in the over-whelming majority of cases totally and utterly useless.
You also mentioned weak identity? could this mean you are only after existence information and not the data that is represented by the key? If so, a bloom filter is a much faster and less processor and memory intensive way of achieving such functionality.
The following url contains information regarding these topics: http://www.partow.net/programming/hashfunctions/index.html
Arash Partow ________________________________________________________ Be one who knows what they don't know, Instead of being one who knows not what they don't know, Thinking they know everything about all things. http://www.partow.net
Thomas Hawtin - 10 Dec 2005 20:54 GMT > The fact of the matter is that memory is cheap these days so USE it!. Memory is cheap, but it is not free. Cache is relatively expensive. The crush point these days seems to be memory bandwidth. Look at the Niagra. Each core has four hardware threads per core to smooth over memory latency (Niagra II is planned to have eight, apparently). The chip has four DDR2 channels. I assume the designers did not do that just because it was fun and inexpensive.
If you are witting a class for wide and frequent use, you should be thinking about making it fast and compact.
> Implement chaining, as a resolution to hash collisions, DO NOT use > linked lists, instead use dynamic arrays (in the case of java use [quoted text clipped - 4 lines] > length with an upper 1 STD chaining length of about 5 (aka less > that 0.1 % of the buckets should be of that chain length). Short chains are good. With a short length, there is little advantage of an array over a linked list. A Vector would require even more overhead (plus synchronisation, did you mean ArrayList?). When it comes to rehashing, a linked list requires no extra allocations. Vectors would take even more memory.
Fortunately the mean bucket length should be less than an entry (with 0.75 load factor threshold; actual load should be between 0.375 and 0.75 with resizing doubling table length).
> DO NOT use linear probing - as it is only a mental exercise for > computational theorists and is in the over-whelming majority > of cases totally and utterly useless. Write once. Improve performance everywhere.
> You also mentioned weak identity? Weak references to keys, with only object identity of importance, yes.
> could this mean you are only > after existence information and not the data that is represented > by the key? No. We are talking maps. For instance, I mentioned a thread-local implementation that has one-per-thread maps from thread-local object to value. If the thread-local object disappears, then its entries should be removed.
> If so, a bloom filter is a much faster and less > processor and memory intensive way of achieving such functionality. But isn't much cop where objects can disappear without notice, or where an exact mapping is required.
Tom Hawtin
 Signature Unemployed English Java programmer http://jroller.com/page/tackline/
Russell Shaw - 10 Dec 2005 22:39 GMT > Hi Ian, > > The fact of the matter is that memory is cheap these days so USE it!. > Implement chaining, as a resolution to hash collisions, DO NOT use > linked lists, instead use dynamic arrays Why? Realloc'ing an array to make a longer one often means a new malloc followed by a behind-the-scenes memmove() which is all awfully expensive compared to doing a linked list.
> (in the case of java use > a vector as your bucket). Assess the data or types of data you [quoted text clipped - 15 lines] > The following url contains information regarding these topics: > http://www.partow.net/programming/hashfunctions/index.html Rob Thorpe - 12 Dec 2005 12:20 GMT > > Hi Ian, > > [quoted text clipped - 5 lines] > followed by a behind-the-scenes memmove() which is all awfully expensive > compared to doing a linked list. It is, but it should not happen often. For example, lets say that the implementation initially allocates a 5 element array. It may at some point in the future expand it in the inefficient way you mention, but only if 5 keys have hit the same slot. If this has happened then the hash table is probably too small for the data and should be rehashed, in this case many methods become slow, especially linear probing.
The best argument against using variable length arrays is that the intial length in many languages is probably more like 255 elements, most of these will be wasted and would be much better used making the hash table bigger.
Rob Thorpe - 12 Dec 2005 13:49 GMT > Hi Ian, > [quoted text clipped - 7 lines] > length with an upper 1 STD chaining length of about 5 (aka less > that 0.1 % of the buckets should be of that chain length). Sounds reasonable.
> DO NOT use linear probing - as it is only a mental exercise for > computational theorists and is in the over-whelming majority > of cases totally and utterly useless. I'm not sure about that. The point about memory is that if you use Y bytes outside the hash table in arrays you have to know that it is more worthwhile than using those bytes to make a bigger hash table.
In the example above, lets say that a 5 element array is attached to each bucket. This may well work very well, but does it work better than a non-chained hash that is 5 times the size? (Obviously you could improve this comparison by only allocating an array when it's needed).
I expect the answer depends on the workload.
> You also mentioned weak identity? could this mean you are only > after existence information and not the data that is represented [quoted text clipped - 3 lines] > The following url contains information regarding these topics: > http://www.partow.net/programming/hashfunctions/index.html
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|