Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / December 2005

Tip: Looking for answers? Try searching our database.

Hash table: linear probe vs. chaining

Thread view: 
Ian Pilcher - 09 Dec 2005 18:43 GMT
Background:

I'm working on a Java program, and I need to associate some additional
data with a large number of objects.  I do not have the freedom to
modify or wrap the existing object classes.  Additionally, I don't want
to interfere with the garbage collection of the objects.

I've come to the conclusion that I need a "WeakIdentityHashMap," a data
structure which combines the features of an IdentityHashMap and a
WeakHashMap.  After a couple of attempts to build such a thing on top of
one of the existing classes, I don't think that enough of the internals
are exposed, and I'm going to have to create my own hash table.

The question:

Sun's documentation states that they use a linear-probe hash table for
the IdentityHashMap; the GNU Classpath does the same thing.  As far as I
can tell, the other hash table-based collections use chaining.

It seems to me that a linear-probe hash table is a bad idea if mappings
are going to be removed from the table very often.  When searching for a
given key, the search must continue until the key itself is found or a
slot which has NEVER been occupied is found.  (Finding a slot which is
CURRENTLY empty is not enough, since it may have been occupied when the
key being sought was inserted into the map.)

Anyone see a flaw in my reasoning?

Thanks!

Signature

========================================================================
Ian Pilcher                                        i.pilcher@comcast.net
========================================================================

Ben Pfaff - 09 Dec 2005 19:40 GMT
> It seems to me that a linear-probe hash table is a bad idea if mappings
> are going to be removed from the table very often.  When searching for a
[quoted text clipped - 4 lines]
>
> Anyone see a flaw in my reasoning?

You are overlooking the algorithm for removing an item from a
linear-probing hash table.  In Knuth this is Algorithm 6.4R
(Deletion with linear probing).
Signature

Ben Pfaff
email: blp@cs.stanford.edu
web: http://benpfaff.org

Ian Pilcher - 09 Dec 2005 21:42 GMT
> You are overlooking the algorithm for removing an item from a
> linear-probing hash table.  In Knuth this is Algorithm 6.4R
> (Deletion with linear probing).

Only because I don't know about it.  (I'm not a professional programmer;
what knowledge I have about hash tables comes from Google and the GNU
Classpath source code.)

Can you point me to a description of this algorithm on the web?

Trying to think it through myself, the goal would presumably be to
eliminate "holes" caused by removing a mapping that has previously
caused an entry to "bounce" to a subsequent slot.  How might one go
about this...

When an entry is removed, search "forward" in the table for an entry
that could occupy the newly freed slot.  ("Forward" being the direction
in which entries "bounce".)  In this case, the search can end at any
empty slot.

If an eligible entry is found, move it to the newly freed slot and
recursively remove it from its original slot.

How does that sound?

Signature

========================================================================
Ian Pilcher                                        i.pilcher@comcast.net
========================================================================

Ben Pfaff - 10 Dec 2005 19:14 GMT
>> You are overlooking the algorithm for removing an item from a
>> linear-probing hash table.  In Knuth this is Algorithm 6.4R
[quoted text clipped - 5 lines]
>
> Can you point me to a description of this algorithm on the web?

I couldn't easily find a description of it on the web.  There is
an implementation of it in a hash table of mine (in C), which can
be viewed here:
       http://cvs.savannah.gnu.org/viewcvs/pspp/pspp/src/hash.c?rev=1.17&view=markup
Signature

Ben Pfaff
email: blp@cs.stanford.edu
web: http://benpfaff.org

Chuck F. - 12 Dec 2005 15:25 GMT
>>> You are overlooking the algorithm for removing an item from a
>>> linear-probing hash table.  In Knuth this is Algorithm 6.4R
[quoted text clipped - 10 lines]
> be viewed here:
>   http://cvs.savannah.gnu.org/viewcvs/pspp/pspp/src/hash.c?rev=1.17&view=markup

He can also look at the coding for my hashlib package, which allows
deletions.  See:

  <http://cbfalconer.home.att.net/download/hashlib.zip>

Signature

Read about the Sony stealthware that is a security leak, phones
home, and is generally illegal in most parts of the world.  Also
the apparent connivance of the various security software firms.
http://www.schneier.com/blog/archives/2005/11/sonys_drm_rootk.html

Ben Pfaff - 12 Dec 2005 23:32 GMT
>>>> You are overlooking the algorithm for removing an item from a
>>>> linear-probing hash table.  In Knuth this is Algorithm 6.4R
[quoted text clipped - 14 lines]
>
>   <http://cbfalconer.home.att.net/download/hashlib.zip>

hashlib doesn't implement the algorithm in question.
Signature

"The fact is, technical people are better off not looking at patents. If
you don't know what they cover and where they are, you won't be knowingly
infringing on them. If somebody sues you, you change the algorithm or you
just hire a hit-man to whack the stupid git." --Linus Torvalds

Chuck F. - 13 Dec 2005 05:16 GMT
>>>>> You are overlooking the algorithm for removing an item from a
>>>>> linear-probing hash table.  In Knuth this is Algorithm 6.4R
[quoted text clipped - 16 lines]
>
> hashlib doesn't implement the algorithm in question.

True.  However it does implement deletions in a linear probing
table, which is not exactly especially common.

Signature

Read about the Sony stealthware that is a security leak, phones
home, and is generally illegal in most parts of the world.  Also
the apparent connivance of the various security software firms.
http://www.schneier.com/blog/archives/2005/11/sonys_drm_rootk.html

Chuck F. - 13 Dec 2005 05:16 GMT
>>>>> You are overlooking the algorithm for removing an item from a
>>>>> linear-probing hash table.  In Knuth this is Algorithm 6.4R
[quoted text clipped - 16 lines]
>
> hashlib doesn't implement the algorithm in question.

True.  However it does implement deletions in a linear probing
table, which is not exactly especially common.

Signature

Read about the Sony stealthware that is a security leak, phones
home, and is generally illegal in most parts of the world.  Also
the apparent connivance of the various security software firms.
http://www.schneier.com/blog/archives/2005/11/sonys_drm_rootk.html

Charles Richmond - 13 Dec 2005 05:27 GMT
> >>> You are overlooking the algorithm for removing an item from a
> >>> linear-probing hash table.  In Knuth this is Algorithm 6.4R
[quoted text clipped - 15 lines]
>
>    <http://cbfalconer.home.att.net/download/hashlib.zip>

Welcome back, CBF!!!

I have been kicking around the idea of creating a hash table myself. I *need* to
learn more about hashing, so I plan to study Knuth on the subject. Of course, I
intend to study *your* hashlib also, CBF.

I do *not* need deletion, but I plan to use a re-hash function in case of a
collision.
Charles Richmond - 13 Dec 2005 05:27 GMT
> >>> You are overlooking the algorithm for removing an item from a
> >>> linear-probing hash table.  In Knuth this is Algorithm 6.4R
[quoted text clipped - 15 lines]
>
>    <http://cbfalconer.home.att.net/download/hashlib.zip>

Welcome back, CBF!!!

I have been kicking around the idea of creating a hash table myself. I *need* to
learn more about hashing, so I plan to study Knuth on the subject. Of course, I
intend to study *your* hashlib also, CBF.

I do *not* need deletion, but I plan to use a re-hash function in case of a
collision.
MSCHAEF.COM - 09 Dec 2005 20:37 GMT
   ...
>It seems to me that a linear-probe hash table is a bad idea if mappings
>are going to be removed from the table very often.

As Ben Pfaff pointed out, there are ways to solve this.

One advantage of linear probing is that it can have better performance on
modern hardware. Linear scans through memory are significantly cheaper
than linked list traversals, even if the linked list nodes are adjacent in
memory. The linked list really has two disadvantages:

* It requires a both a pointer access and a comparison for each node
 checked. This means that morr data has to be loaded from  memory to
 check a list node.

* In linear probing, adjacent nodes are guaranteed to be adjacent to each
 other in the address space. This makes it more likely that the memory
 subsystem can predict the access pattern and prefetch data before
 you need it.

I've done some simple tests on my Centrino laptop: A linked list scan of
adjacent nodes was half the performance [1] of a linear array scan.
Randomly ordering the list elements in memory could drop performance to
(not by) 4-5% of the array scan.

-Mike

1] This was a simple microbenchmark involving adding lists of 32-bit
integers.  Pointers were also 32-bits, so a linked list having half the
performance of an array is not too suprising.
Signature

http://www.mschaef.com

Ian Pilcher - 09 Dec 2005 21:51 GMT
> One advantage of linear probing is that it can have better performance on
> modern hardware. Linear scans through memory are significantly cheaper
[quoted text clipped - 9 lines]
>   subsystem can predict the access pattern and prefetch data before
>   you need it.

I think the fact that this is Java, may reduce the impact of these
points, particularly since my keys are going to be Java WeakReference
objects.

In the linear probing case, each key in the table will actually be a
reference to a WeakReference (or a subclass thereof), which can be used
to obtain a reference to the actual key.  In the chaining case, I can
use a subclass of WeakReference as my list node, so the number of
objects will be the same, as will the number of references that need to
be traversed per comparison.

If Java would allow me to create an actual array of objects, I think
your logic would hold, but since objects are *always* accessed through
references in Java, I think that the chained table should perform just
as well.

What do you think?

Thanks for the feedback, BTW!

Signature

========================================================================
Ian Pilcher                                        i.pilcher@comcast.net
========================================================================

Thomas Hawtin - 09 Dec 2005 23:53 GMT
> I think the fact that this is Java, may reduce the impact of these
> points, particularly since my keys are going to be Java WeakReference
> objects.

As your key? Conventionally entries subclass WeakReferences to point to
your key, with a conventional strong reference to the value. Although I
guess a custom implementation could collapse the key into the entry.

I don't understand why Sun have not added a WeakIdentityHashMap.
WeakHashMap only works if you have complete control.

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4500542

> In the linear probing case, each key in the table will actually be a
> reference to a WeakReference (or a subclass thereof), which can be used
> to obtain a reference to the actual key.  In the chaining case, I can
> use a subclass of WeakReference as my list node, so the number of
> objects will be the same, as will the number of references that need to
> be traversed per comparison.

The current custom weak identity hash table in ThreadLocal has a
WeakReference for an entry, but does linear probing. I can't quite work
out why. Neither that implementation nor my (leak-free) cyclic-list
reimplementation are nice when it comes to expired WeakReferences.

> If Java would allow me to create an actual array of objects, I think
> your logic would hold, but since objects are *always* accessed through
> references in Java, I think that the chained table should perform just
> as well.

Creating (resettable) arrays of WeakReferences with good performance
would, I think, be challenging.

It's perhaps surprising that HashMap is not implemented like
IdentityHashMap. Even implemented with chains using offsets into the
array would appear to be better.

Tom Hawtin
Signature

Unemployed English Java programmer
http://jroller.com/page/tackline/

Ian Pilcher - 10 Dec 2005 01:59 GMT
> As your key? Conventionally entries subclass WeakReferences to point to
> your key, with a conventional strong reference to the value. Although I
> guess a custom implementation could collapse the key into the entry.

If I follow the linear probe path, I will likely just use an Object[]
twice as large as the capacity of the hash table.  The "keys," at even
indices, would be references to WeakReference objects (or a simple sub-
class); the "values," at odd indices, would be references to the value
objects.

> I don't understand why Sun have not added a WeakIdentityHashMap.
> WeakHashMap only works if you have complete control.
>
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4500542

LMAO.  The best part is that their own evaluation, from back in 2001 or
so, agrees with your assessment.

> Creating (resettable) arrays of WeakReferences with good performance
> would, I think, be challenging.

Not sure I follow, but I don't think it matters.  My point was that the
"locality" argument for a linear probe table only holds water when the
keys are actually *in* the array.  When the array only holds references
to the keys, examining all the keys that correspond to a particular hash
code might actually be marginally faster in a chained hash table.

Signature

========================================================================
Ian Pilcher                                        i.pilcher@comcast.net
========================================================================

Arash Partow - 10 Dec 2005 13:55 GMT
Hi Ian,

The fact of the matter is that memory is cheap these days so USE it!.
Implement chaining, as a resolution to hash collisions, DO NOT use
linked lists, instead use dynamic arrays (in the case of java use
a vector as your bucket). Assess the data or types of data you
will be hashing with various types of hash functions to find a
balance between data, hash function and initial bucket sizes.
A good hash function would have an average of 1-2 in chaining
length with an upper 1 STD chaining length of about 5 (aka less
that 0.1 % of the buckets should be of that chain length).

DO NOT use linear probing - as it is only a mental exercise for
computational theorists and is in the over-whelming majority
of cases totally and utterly useless.

You also mentioned weak identity? could this mean you are only
after existence information and not the data that is represented
by the key? If so, a bloom filter is a much faster and less
processor and memory intensive way of achieving such functionality.

The following url contains information regarding these topics:
http://www.partow.net/programming/hashfunctions/index.html

Arash Partow
________________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net
Thomas Hawtin - 10 Dec 2005 20:54 GMT
> The fact of the matter is that memory is cheap these days so USE it!.

Memory is cheap, but it is not free. Cache is relatively expensive. The
crush point these days seems to be memory bandwidth. Look at the Niagra.
Each core has four hardware threads per core to smooth over memory
latency (Niagra II is planned to have eight, apparently). The chip has
four DDR2 channels. I assume the designers did not do that just because
it was fun and inexpensive.

If you are witting a class for wide and frequent use, you should be
thinking about making it fast and compact.

> Implement chaining, as a resolution to hash collisions, DO NOT use
> linked lists, instead use dynamic arrays (in the case of java use
[quoted text clipped - 4 lines]
> length with an upper 1 STD chaining length of about 5 (aka less
> that 0.1 % of the buckets should be of that chain length).

Short chains are good. With a short length, there is little advantage of
an array over a linked list. A Vector would require even more overhead
(plus synchronisation, did you mean ArrayList?). When it comes to
rehashing, a linked list requires no extra allocations. Vectors would
take even more memory.

Fortunately the mean bucket length should be less than an entry (with
0.75 load factor threshold; actual load should be between 0.375 and 0.75
with resizing doubling table length).

> DO NOT use linear probing - as it is only a mental exercise for
> computational theorists and is in the over-whelming majority
> of cases totally and utterly useless.

Write once. Improve performance everywhere.

> You also mentioned weak identity?

Weak references to keys, with only object identity of importance, yes.

>                                   could this mean you are only
> after existence information and not the data that is represented
> by the key?

No. We are talking maps. For instance, I mentioned a thread-local
implementation that has one-per-thread maps from thread-local object to
value. If the thread-local object disappears, then its entries should be
removed.

>             If so, a bloom filter is a much faster and less
> processor and memory intensive way of achieving such functionality.

But isn't much cop where objects can disappear without notice, or where
an exact mapping is required.

Tom Hawtin
Signature

Unemployed English Java programmer
http://jroller.com/page/tackline/

Russell Shaw - 10 Dec 2005 22:39 GMT
> Hi Ian,
>
> The fact of the matter is that memory is cheap these days so USE it!.
> Implement chaining, as a resolution to hash collisions, DO NOT use
> linked lists, instead use dynamic arrays

Why? Realloc'ing an array to make a longer one often means a new malloc
followed by a behind-the-scenes memmove() which is all awfully expensive
compared to doing a linked list.

> (in the case of java use
> a vector as your bucket). Assess the data or types of data you
[quoted text clipped - 15 lines]
> The following url contains information regarding these topics:
> http://www.partow.net/programming/hashfunctions/index.html
Rob Thorpe - 12 Dec 2005 12:20 GMT
> > Hi Ian,
> >
[quoted text clipped - 5 lines]
> followed by a behind-the-scenes memmove() which is all awfully expensive
> compared to doing a linked list.

It is, but it should not happen often.  For example, lets say that the
implementation initially allocates a 5 element array.  It may at some
point in the future expand it in the inefficient way you mention, but
only if 5 keys have hit the same slot.  If this has happened then the
hash table is probably too small for the data and should be rehashed,
in this case many methods become slow, especially linear probing.

The best argument against using variable length arrays is that the
intial length in many languages is probably more like 255 elements,
most of these will be wasted and would be much better used making the
hash table bigger.
Rob Thorpe - 12 Dec 2005 13:49 GMT
> Hi Ian,
>
[quoted text clipped - 7 lines]
> length with an upper 1 STD chaining length of about 5 (aka less
> that 0.1 % of the buckets should be of that chain length).

Sounds reasonable.

> DO NOT use linear probing - as it is only a mental exercise for
> computational theorists and is in the over-whelming majority
> of cases totally and utterly useless.

I'm not sure about that.
The point about memory is that if you use Y bytes outside the hash
table in arrays you have to know that it is more worthwhile than using
those bytes to make a bigger hash table.

In the example above, lets say that a 5 element array is attached to
each bucket.  This may well work very well, but does it work better
than a non-chained hash that is 5 times the size? (Obviously you could
improve this comparison by only allocating an array when it's needed).

I expect the answer depends on the workload.

> You also mentioned weak identity? could this mean you are only
> after existence information and not the data that is represented
[quoted text clipped - 3 lines]
> The following url contains information regarding these topics:
> http://www.partow.net/programming/hashfunctions/index.html


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.