Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / February 2008

Tip: Looking for answers? Try searching our database.

HashMap vs linear table lookup

Thread view: 
Roedy Green - 15 Feb 2008 19:31 GMT
For sufficiently small tables, a linear array lookup to find a string
would be faster than a HashMap. Has anyone experimented to get a rough
idea where the break-even point is?
--

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
Mark Space - 15 Feb 2008 20:14 GMT
> For sufficiently small tables, a linear array lookup to find a string
> would be faster than a HashMap. Has anyone experimented to get a rough
> idea where the break-even point is?

If you try to measure this, I think it will depend on the length of the
Strings.  The hashing algorithm for Strings used to only pick a few
characters to "save time."  This didn't work well for large hash tables
since many similar strings would have the letters chosen in common,
resulting in poor hash distribution.  So now I think every letter in a
String is used when computing the hash.

So anyway, yeah, don't forget about string size.
Lord Zoltar - 15 Feb 2008 21:08 GMT
> > For sufficiently small tables, a linear array lookup to find a string
> > would be faster than a HashMap. Has anyone experimented to get a rough
[quoted text clipped - 8 lines]
>
> So anyway, yeah, don't forget about string size.

It would probably also be possible to write your own hash function, if
you know more about the input (such as avg length of strings, if the
strings only use a certain subset of all characters, etc...). Doing
that might provide better performance, in certain cases. You'd just
have to try it to see ;)
Stefan Ram - 15 Feb 2008 21:18 GMT
>It would probably also be possible to write your own hash function,

 Here is an example, using a lookup indexed by the final
 two letters of the month TLA. (This might actually be a
 perfect hash, I am not sure about this.)

 A small table might be used instead of the »switch« in »show«.
 This is no hash map, but a direct array look up by a special
 hash function, I believe O(1).

public class Main
{
 int[] v;

 public Main()
 { v = new int[ 256 ]; for( int i = 0; i < 256; ++i )v[ i ]=15;
   v[  97 ]= 0; v[  98 ]= 9; v[  99 ]= 4; v[ 101 ]= 2; v[ 103 ]= 9;
   v[ 108 ]= 8; v[ 110 ]= 2; v[ 111 ]= 4; v[ 112 ]= 3; v[ 114 ]= 1;
   v[ 116 ]= 3; v[ 117 ]= 1; v[ 118 ]= 4; v[ 121 ]= 0; }

 int hash( final java.lang.String s )
 { return v[ s.charAt( 1 )]+v[ s.charAt( 2 )] ; }

 void show( final java.lang.String s )
 { switch ( hash( s ))
   { case 4: case 3: case 5: case 8:
     java.lang.System.out.println( 30 ); break;
     case 11: break;
     default: java.lang.System.out.println( 31 ); break; }}

 public static void main( final java.lang.String[] args )
 { new Main().show( "Jun" ); new Main().show( "Oct" ); }}

 It prints:

30
31
Christian - 15 Feb 2008 20:15 GMT
Roedy Green schrieb:
> For sufficiently small tables, a linear array lookup to find a string
> would be faster than a HashMap. Has anyone experimented to get a rough
[quoted text clipped - 4 lines]
> The Java Glossary
> http://mindprod.com

well Intresting.. hope you accept this as valid test:

public class TestString {

/**
 * @param args
 */
public static void main(String[] args) {
   
  for (int size=1;size < 100; size++) {
    if (testHashMap(size) <= testArray(size)) {
      System.out.println("hashmap wins at: "+size);
    }
  }
       

}

   
private static long testHashMap(int size) {
  Set<String> map = new HashSet<String>();
  for (int i=0 ; i < size; i++) {
     map.add(i+"+"+(-i)+"= 0");
  }
       
  long time = System.nanoTime();
  for (int repetitions=0; repetitions < 10000; repetitions++ ) {
    for (int i=0;i < size; i++) {
      if (!map.contains(i+"+"+(-i)+"= 0")) {
        throw new IllegalStateException();
      }
    }
  }
  return System.nanoTime()-time;
}
   
private static long testArray(int size) {
  String[] array = new String[size];
  for (int i=0 ; i < size; i++) {
    array[i] =i+"+"+(-i)+"= 0";
  }
       
  long time = System.nanoTime();
  for (int repetitions=0; repetitions < 10000; repetitions++ ) {
    for (int i=0;i < size; i++) {
      String onSearch = i+"+"+(-i)+"= 0";
      for (int x=0; x < size; x++) {
        if (onSearch.equals(array[x])) {
     break;
    }
      }
    }
  }
  return System.nanoTime()-time;
}
}

on my machine with 5 elements it seems about even .. sometimes the
HashSet wins .. and sometimes the array.. after that HashSet is the winner.

Christian
Roedy Green - 15 Feb 2008 20:29 GMT
On Fri, 15 Feb 2008 19:31:58 GMT, Roedy Green
<see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :

>For sufficiently small tables, a linear array lookup to find a string
>would be faster than a HashMap. Has anyone experimented to get a rough
>idea where the break-even point is?
>--

I wrote this code to experiment. Turns out binary search is the pits.
HashSet starts to get better around 11 items.
--

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
Roedy Green - 15 Feb 2008 20:31 GMT
On Fri, 15 Feb 2008 20:29:30 GMT, Roedy Green
<see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :

>I wrote this code to experiment. Turns out binary search is the pits.
>HashSet starts to get better around 11 items.

The sample keys are nicely scrambled to make HashSet happier than it
deserves to be.

package com.mindprod.example;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;

/*
* Test relative speed of 3 methods of looking up a key to see if it is
is the list of legal keys.
* HashSet, binary search, linear search for  different numbers of keys
n.
* <p/>
* composed with IntelliJ IDEA
*
* @author Roedy Green, Canadian Mind Products
* @version 1.0, 2008-02-15
*/
public class TestFinding
   {
   // ------------------------------ FIELDS
------------------------------

   /**
    * lookup the random strings
    */
   private static HashSet<String> h;

   /**
    * random number generator
    */
   private final static Random wheel = new Random();

   /**
    * list of strings we want to look up
    */
   private static String[] keys;

   // -------------------------- STATIC METHODS
--------------------------

   /**
    * build  lookup structure
    */
   private static void build()
       {
       // build HashSet
       h = new HashSet<String>( Arrays.asList( keys ) );

       // build binary search
       Arrays.sort( keys );
       }

   /**
    * find all the keys by BinarySearch
    */
   private static void findAllViaBinarySearch()
       {
       for ( String lookFor : keys )
           {
           int whereFound = Arrays.binarySearch( keys, lookFor );
           // -ve means not found +ve tells where found.
           }
       }

   /**
    * find all the keys by HashSet
    */
   private static void findAllViaHashSet()
       {
       for ( String lookFor : keys )
           {
           boolean found = h.contains( lookFor );
           }
       }

   /**
    * find all the keys by linear search
    */
   private static void findAllViaLinearSearch()
       {
       for ( String lookFor : keys )
           {
           boolean found = false;
           for ( String candidate : keys )
               {
               if ( lookFor.equals( candidate ) )
                   {
                   found = true;
                   break;
                   }
               }
           }
       }

   /**
    * generate N random keys and their lookup structure
    */
   private static void generate( int n )
       {
       keys = new String[n];
       for ( int i = 0; i < n; i++ )
           {
           keys[ i ] = Long.toString( wheel.nextLong() );
           }
       }

   // --------------------------- main() method
---------------------------

   /**
    * main method
    *
    * @param args not used
    */
   public static void main( String[] args )
       {

       for ( int n = 1; n <= 10000; n *= 10 )
           {
           generate( n );
           build();
           final long t1 = System.nanoTime();
           findAllViaBinarySearch();
           final long t2 = System.nanoTime();
           findAllViaHashSet();
           final long t3 = System.nanoTime();
           findAllViaLinearSearch();
           final long t4 = System.nanoTime();

           System.out
                   .println( "n:" +
                             n +
                             " binary search: " +
                             ( t2 - t1 ) +
                             " HashSet: " +
                             ( t3 - t2 ) +
                             " Linear: " +
                             ( t4 - t3 ) );
           }
       // Here in output on my machine With JDK 1.6.0_04
       //Linear is best for 10 or fewer, then HashSet.
       // Binary search is never optimal.
       // n:1 binary search: 21600 HashSet: 9720 Linear: 6400*
       // n:10 binary search: 58720 HashSet: 11360 Linear: 11120*
       // n:100 binary search: 707320 HashSet: 97520* Linear: 695960
       // n:1000 binary search: 1294360 HashSet: 1255480* Linear:
10911040
       // n:10000 binary search: 9823600 HashSet: 3920200* Linear:
1513846600

       // Here in output on my machine with Jet    
       // n:1 binary search: 5520 HashSet: 4760 Linear: 1240*
       // n:10 binary search: 3560 HashSet: 2480 Linear: 2400*
       // n:100 binary search: 28160 HashSet: 5480* Linear: 44840
       // n:1000 binary search: 408840 HashSet: 43560* Linear:
7862960
       // n:10000 binary search: 9142440 HashSet: 2295320* Linear:
1852690520
       }
   }
--

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
Patricia Shanahan - 15 Feb 2008 21:11 GMT
...
> The sample keys are nicely scrambled to make HashSet happier than it
> deserves to be.
...

Although the approximate break-even point is interesting, if I had a
performance problem involving this sort of lookup I would do my own
experiments, using typical table contents and probe sequences from my
application.

Patricia
Roedy Green - 16 Feb 2008 09:45 GMT
On Fri, 15 Feb 2008 20:31:19 GMT, Roedy Green
<see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :

>   for ( int n = 1; n <= 10000; n *= 10 )

at n =  100,000 the benchmark becomes impossibly slow. The linear
search total time is o(n^2).
--

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
Boris Stumm - 21 Feb 2008 08:47 GMT
>             keys[ i ] = Long.toString( wheel.nextLong() );

What about intern'ed strings, i.e. Long.toString().intern()?

If string creation is rare and lookup is often needed, one
would probably do this, and then an .equals() would nearly
behave like an ==, which should much faster for the linear
case.
Mike Schilling - 16 Feb 2008 06:28 GMT
> For sufficiently small tables, a linear array lookup to find a
> string
> would be faster than a HashMap. Has anyone experimented to get a
> rough
> idea where the break-even point is?

I presume that it's small enough that, in all nromal cases, they're
none - 18 Feb 2008 21:10 GMT
> For sufficiently small tables, a linear array lookup to find a string
> would be faster than a HashMap. Has anyone experimented to get a rough
[quoted text clipped - 4 lines]
> The Java Glossary
> http://mindprod.com

HashSet have to compute the hashCode of all Strings on the whole
content, but hashCode is compute only once per String.
String.equals first use length and then each caracter in order to return
result. You have to do it each time in linear comparison.

Basically,
 if n is the total number of strings
    p is the avg length,
  hashset is O(n*p) to O(pn²), worse case if hash function is poor.
  linear is O(pn²) to O(n²), best case if strings are very different
                                               lengths or beginings.

Hence you have to profile your dataset (avg length, number of unique,
similarity, etc.) to know the break point.
Lew - 19 Feb 2008 00:52 GMT
> HashSet have to compute the hashCode of all Strings on the whole
> content, but hashCode is compute only once per String.
> String.equals first use length and then each caracter in order to return
> result. You have to do it each time in linear comparison.

The HashSet has to compute the hash exactly once for each entry when it's
inserted, as you seem to say.

>  if n is the total number of strings
>     p is the avg length,
>   hashset [sic] is O(n*p) to O(pn²), worse case if hash function is poor.

The String hashCode() computation should be reasonable for most purposes.  But
even if the hash were worst-case (e.g., returned 17 for every String), there'd
still only be 'n' comparisons when doing a lookup.

And in big-O notation the 'p' factor goes away.

>   linear is O(pn²) to O(n²), best case if strings are very different
>                                                lengths or beginings.

How are you getting n squared?  A linear lookup only needs to make n
comparisons to find a single input.

> Hence you have to profile your dataset (avg length, number of unique,
> similarity, etc.) to know the break point.

On lookup, i.e., to compare if a given input is already in the Set, you only
compute the hash for the input, not for the elements in the Set already.  The
range for HashSet lookups should be O(1) to O(n) depending on the hash
quality.  (O(n) occurring when all entries hash to the same value, which
devolves the HashSet into the linear lookup.)

Signature

Lew

Roedy Green - 21 Feb 2008 02:13 GMT
>HashSet have to compute the hashCode of all Strings on the whole
>content, but hashCode is compute only once per String.

So you will compute it to put the strings in the HashSet, and you will
compute it for each key to look them up.  You don't need to compute it
when you chase a HashSet chain to compare colliding strings.  They are
pre-computed.
--

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
EJP - 21 Feb 2008 10:01 GMT
>> HashSet have to compute the hashCode of all Strings on the whole
>> content, but hashCode is compute only once per String.
>
> So you will compute it to put the strings in the HashSet, and you will
> compute it for each key to look them up.
As the poster said, the hashCode for a String is computed once per
String. It is not recomputed every time you call String.hashCode().
Lew - 21 Feb 2008 14:43 GMT
> As the poster said, the hashCode for a String is computed once per
> String. It is not recomputed every time you call String.hashCode().

Are you sure of that?  The Javadocs don't mention that.

However, it is true that the Map does not invoke hashCode() but once for each
key, upon insertion, and once for the query object upon lookup.

P.S., please attribute your citations.

Signature

Lew

Thomas Schodt - 21 Feb 2008 15:19 GMT
>> As the poster said, the hashCode for a String is computed once per
>> String. It is not recomputed every time you call String.hashCode().
>
> Are you sure of that?  The Javadocs don't mention that.

I guess java implementations are not required to
but the sun implementation caches the hash
(kinda changing the internal state of an immutable object...)

It is only recomputed every time for the empty String and for any other
String that has a hash of 0(zero).
Lew - 21 Feb 2008 15:31 GMT
>>> As the poster said, the hashCode for a String is computed once per
>>> String. It is not recomputed every time you call String.hashCode().
[quoted text clipped - 4 lines]
> but the sun implementation caches the hash
> (kinda changing the internal state of an immutable object...)

Actually, the immutability of an object only applies to its external state.
Lazy instantiation of things like hashCode() or (for classes other than
String) the toString() result is a standard idiom.  One hopes that the String
implementation of hashCode() is suitably thread-safe, otherwise they are
changing the external state.

Signature

Lew

Mike Schilling - 21 Feb 2008 16:33 GMT
>>>> As the poster said, the hashCode for a String is computed once
>>>> per String. It is not recomputed every time you call
[quoted text clipped - 11 lines]
> hopes that the String implementation of hashCode() is suitably
> thread-safe, otherwise they are changing the external state.

What would "suitably thread-safe" require here?  The external behavior
is that all calls to hashCode() should return the same value.  I think
this algorithm works fine:

1. define a private int _hashCode;
2. on a call to hashCode(), check the value of _hashCode.  If it's
non-zero, return it.
3.  otherwise, calculate the hash code of the string.
4. store this value in _hashcode
5. return _hashcode

The only thread-safety issue is that the store of _hashcode in step 4
be atomic, and I think that that's guaranteed by the JVM..  You could
minimize the number of hash code calculations by locking between steps
2 and 4, ensuring that all threads will see the changed value of
_hashcode once it's ready, but that's merely an optimization.
Lew - 22 Feb 2008 01:58 GMT
> What would "suitably thread-safe" require here?  The external behavior
> is that all calls to hashCode() should return the same value.  I think
[quoted text clipped - 12 lines]
> 2 and 4, ensuring that all threads will see the changed value of
> _hashcode once it's ready, but that's merely an optimization.

You're right, because hashCode() is an idempotent calculation.  If it gets
calculated twice it does no harm.

The safety in String's hashCode() case has nothing to do with whether storing
the value is atomic or not.  Checking the value, then calculating it, then
storing it is not atomic.

Lazy initialization is a check-and-set, so there's no guarantee that the
calculation will occur only once, absent synchronization.  It works all right
for idempotent operations but if the calculation were for something that could
change, you'd potentially lose more than time to failure to synchronize.

Nevertheless, it works for calculation of hashCode() for immutable instances
like String's, because the same value is calculated even if repeated.

Thus, "suitably thread-safe" is guaranteed for String hashCode(), but not
always for all calculations for all classes.  I thank you for pointing out the
hole in my reasoning - thread safety really doesn't apply to String hashCode().

Signature

Lew

Mike Schilling - 22 Feb 2008 02:10 GMT
>> What would "suitably thread-safe" require here?  The external
>> behavior is that all calls to hashCode() should return the same
[quoted text clipped - 23 lines]
> storing the value is atomic or not.  Checking the value, then
> calculating it, then storing it is not atomic.

There would be an issue if the value were a long or a double.  There
is no guarantee that the whole 64 bits is stored or fetched as an
atomic operation, so that if one thread does

   _hashCode64 = l;
   return _hashCode64;

as another does

   if (_hashCode64 != 0)
       return _hashCode;

The return might contain partly the correct result and partly the
original 0.

At least under the original Java memory model.  This might have
changed with the current one.
Patricia Shanahan - 22 Feb 2008 02:13 GMT
...
> The safety in String's hashCode() case has nothing to do with whether
> storing the value is atomic or not.  Checking the value, then
> calculating it, then storing it is not atomic.
...

It does depend on atomic stores. Suppose, for example, Java were
implemented on a machine with a maximum store width of 16 bits, so that
the only way to change the memory copy of an int is by doing two stores.
Suppose the actual hash code for a given String is 0xABCD.

The hash field would go through the following series of values:

0000 (initial value)
AB00 (store one half of hash field)
ABCD (store other half of hash field)

There is a window during which another thread, executing the hashCode
method in this String, could see AB00, a non-zero value, and use it as
the hashCode() result.

On a machine with an atomic 32 bit store, which includes all systems on
which Java runs, any read of the hash field will either see 0x0000 or
0xABCD, and consistently use 0xABCD as hash code.

Patricia
Lew - 22 Feb 2008 04:27 GMT
Lew wrote:
>> The safety in String's hashCode() case has nothing to do with whether
>> storing the value is atomic or not.  Checking the value, then
>> calculating it, then storing it is not atomic.

> On a machine with an atomic 32 bit store, which includes all systems on
> which Java runs, any read of the hash field will either see 0x0000 or
> 0xABCD, and consistently use 0xABCD as hash code.

All right.  Atomicity is important, but not sufficient for the safety of the
lazy initialization.  I also pointed out that if the value is not
synchronized, the value might be set more than once.  The safety also stems
from the idempotency.  Without the idempotency, the atomicity wouldn't
guarantee correct reads without synchronization.

The point is that before there is a store, there is a read.  One thread could
read zero and set the hashCode().  Let's say the calculated hashCode() is
0xCAFEBABE.  So thread 1 sets the hashCode to that value.  Meanwhile, another
thread reads the value.  Because there is no synchronization, thread 2 also
sees zero as the value.  Because the calculation is deterministic, it also
sets the hashCode to 0xCAFEBABE.  Meanwhile thread 3 wants to use the
hashCode().  It, too, sees zero and calculates the value.

The store is atomic.  The cycle of read, calculate and store is not atomic.
Because of that, a String without synchronization, such as by calculating the
hashCode during construction, might have its calculated hashCode() set more
than once.

It is also true, as Patricia and others point out, that a non-atomic write
would break lazy initialization without synchronization.

Signature

Lew

Patricia Shanahan - 22 Feb 2008 05:02 GMT
> Lew wrote:
>>> The safety in String's hashCode() case has nothing to do with whether
[quoted text clipped - 10 lines]
> stems from the idempotency.  Without the idempotency, the atomicity
> wouldn't guarantee correct reads without synchronization.

Agreed. I'm only claiming that atomic store is necessary, not
sufficient, for the String code to work without synchronization.

Patricia
EJP - 21 Feb 2008 22:59 GMT
> Are you sure of that?

Yes.

> P.S., please attribute your citations.

The source code.

Or do you think I just make this stuff up?
Peter Duniho - 21 Feb 2008 23:16 GMT
>> Are you sure of that?
>
[quoted text clipped - 5 lines]
>
> Or do you think I just make this stuff up?

Well, to be fair: there is in fact a difference between a general  
statement such as yours ("the hashCode for a String is computed once per  
String") and a qualified statement such as the one Thomas offers ("the sun  
implementation caches the hash").

Looking at the source code for the Sun implementation doesn't tell you  
what all implementations do, and unless the Java specification  
specifically says that all implementations must cache the hash code,  
there's no guarantee that they do.

You seem offended by Lew's question, but I think it's a reasonable one.  
Is there in fact a genuine guarantee that every implementation of String  
will always cache the hash code?  Or is this an implementation-dependent  
behavior?

Granted, the answer may be academic.  It's not like one has much, if any,  
control over the behavior and I don't think it's the sort of thing that  
should, all else being equal, cause someone to dismiss a HashMap as a  
solution.

But we're all programmers here, and as such we often have a need for being  
very specific and literal in our understanding of what's being said.  
Unless you have some reason to believe that _all_ Java implementations are  
_guaranteed_ to cache the hash code, a statement about what the actual  
behavior of String is should IMHO be qualified as Thomas's was, if for no  
other reason than to offer enhanced clarity regarding what you're actually  
saying.

Pete
Mike Schilling - 21 Feb 2008 23:32 GMT
>>> Are you sure of that?
>>
[quoted text clipped - 34 lines]
> as Thomas's was, if for no other reason than to offer enhanced
> clarity regarding what you're actually saying.

I agree 100% with the points you're making.  Anything not guaranteed
by the documentation is implementation-specific, and cannot be relied
on.

In this case, though, I'll qualifiy that:  the main implementation, on
which all others not developed in a clean room are based, has AFAIK
always cached the hash code.  Moreover, doing so is almost free, and
there are no disadvantages beyond the extra two allocated bytes.  (I'm
not such an expert in JVM implementation that I know what the actual
cost in space is: it may be zero, or it may be as much as N bytes
where all objects are allocated on N-byte boundaries.)  So I'm
willing, *in this case*, to act as if that all implementations do it
until I'm shown one that doesn't.
Peter Duniho - 21 Feb 2008 23:57 GMT
> [...] So I'm
> willing, *in this case*, to act as if that all implementations do it
> until I'm shown one that doesn't.

I would as well.  Frankly, it's such a no-brainer to cache the result that  
even if I did run into an implementation that didn't, and as a result  
caused some sort of serious performance problem, my first inclination  
would be to tell whatever user ran across the problem "go use a decent  
Java implementation".  :)

But I still think it's reasonable to question whether this is documented  
as guaranteed behavior, or is simply implementation-dependent, even if any  
sensible implementation would in fact do it.

But that's just me.  I've been known to be extra-particular about such  
things, sometimes for no apparent reason except for the sake of being  
particular.  :)

Pete
Mike Schilling - 22 Feb 2008 00:26 GMT
>> [...] So I'm
>> willing, *in this case*, to act as if that all implementations do
[quoted text clipped - 18 lines]
> being
> particular.  :)

Well, hell, if we agree on this any more violently, one of us is going
to propose, and I'm already married, so I'll stop here.
Peter Duniho - 22 Feb 2008 00:38 GMT
> Well, hell, if we agree on this any more violently, one of us is going
> to propose, and I'm already married, so I'll stop here.

It wouldn't have worked out anyway.  One of the reasons I love my wife so  
much is that she argues with me.  :)
Lew - 22 Feb 2008 02:01 GMT
>> Are you sure of that?
>
[quoted text clipped - 5 lines]
>
> Or do you think I just make this stuff up?

I was referring to your citations of other people's Usenet posts.  I didn't
think you had made anything up, but I wasn't clear who said what in your
quotations.

As far as the one-time calculation of String hashCode(), the source code for
one JVM is not authoritative.  Another JVM could do it differently.  Couldn't it?

Furthermore, in the presence of multithreaded code, if String hashCode() is
not synchronized, the value could be calculated more than once.  I'm not
familiar with the source you're using - is the hashCode() calculation
synchronized, or done in String construction?

Signature

Lew

EJP - 23 Feb 2008 12:06 GMT
> I was referring to your citations of other people's Usenet posts.  I
> didn't think you had made anything up, but I wasn't clear who said what
> in your quotations.

Sorry, I misunderstood: point taken.

> As far as the one-time calculation of String hashCode(), the source code
> for one JVM is not authoritative.  Another JVM could do it differently.  
> Couldn't it?

I suppose so, but given the discussion here it's not very likely, is it?
It would about the first thing that any sane tuning process would show up.

> Furthermore, in the presence of multithreaded code, if String hashCode()
> is not synchronized, the value could be calculated more than once.  I'm
> not familiar with the source you're using - is the hashCode()
> calculation synchronized, or done in String construction?

In the 1.5.0 source it's done in hashCode() which is not synchronized.
So yes, multiple threads could compute it lazily.

I first noticed the lazy evaluation in about 1997 so I don't think it's
changed much between versions ;-)
Arne Vajhøj - 24 Feb 2008 17:17 GMT
>> As far as the one-time calculation of String hashCode(), the source
>> code for one JVM is not authoritative.  Another JVM could do it
>> differently.  Couldn't it?
>
> I suppose so, but given the discussion here it's not very likely, is it?
> It would about the first thing that any sane tuning process would show up.

Does not really matter.

It is still a very bad habit to code assuming "any sane tuning process".

Arne
EJP - 26 Feb 2008 00:20 GMT
> It is still a very bad habit to code assuming "any sane tuning process".
On its own, perhaps. It's debatable. I don't think it's a bad habit to
code assuming that String.hashCode is evaluated once per String, which
is what we are really talking about.
Hendrik Maryns - 26 Feb 2008 16:08 GMT
EJP schreef:
>> It is still a very bad habit to code assuming "any sane tuning process".
> On its own, perhaps. It's debatable. I don't think it's a bad habit to
> code assuming that String.hashCode is evaluated once per String, which
> is what we are really talking about.

I think it is good habit to code what is most clearly doing what you
intend and then, when your program turns out to be a hog, check where
things are slowing down.  You might as well stumble upon a ‘bad’ string
hash implementation (but probably won’t).  You know, the premature
optimization thingy.

H.
Signature

Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html

EJP - 28 Feb 2008 06:21 GMT
I agree. On further consideration the first program that would slow to a
crawl if String.hashCode() was re-evaluated would be 'javac' ... so the
likelihood of such an implementation is genuinenly pretty small.
Lew - 29 Feb 2008 14:19 GMT
> I agree. On further consideration the first program that would slow to a
> crawl if String.hashCode() was re-evaluated would be 'javac' ... so the
> likelihood of such an implementation is genuinenly pretty small.

Just out of curiosity, how would that affect javac?

Signature

Lew

Mike Schilling - 29 Feb 2008 18:11 GMT
>> I agree. On further consideration the first program that would slow
>> to a crawl if String.hashCode() was re-evaluated would be 'javac'
[quoted text clipped - 3 lines]
>
> Just out of curiosity, how would that affect javac?

I'd guess that string-valued tokens (keywords, class and method names,
declared variables, etc.) are stored in hash tables [1], so that a
slowdown in looking them up would directly affect compiler speed.

1. At least, when I've written parsers, that's how I've done it.
Lasse Reichstein Nielsen - 21 Feb 2008 18:08 GMT
>> So you will compute it to put the strings in the HashSet, and you
>> will compute it for each key to look them up.

> As the poster said, the hashCode for a String is computed once per
> String. It is not recomputed every time you call String.hashCode().

Even if they are computed every time, you don't need to know the hashCode
of elements in the HashSet to find them.

When you add an element to a HashSet/HashMap, the hash code is used
once, to pick a bucket, and the element is put into the linked list in
that bucket.

To look up an element, you use the hash code of the key/element to
search for, find a bucket based on that, and traverse the linked list
to find one that .equals() what you search for.

If the HashSet/HashMap grows its internal table, it needs the hash
codes again to put elements into new buckets.

The implementation of String in the Sun standard library does
cache the hashCode (unless it happens to be 0, which they take
as an uninitialized cache - but that's documented :).

/L
Signature

Lasse Reichstein Nielsen  -  lrn@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
 'Faith without judgement merely degrades the spirit divine.'

Eric Sosman - 21 Feb 2008 18:42 GMT
> Even if they are computed every time, you don't need to know the hashCode
> of elements in the HashSet to find them.
[quoted text clipped - 3 lines]
> to find one that .equals() what you search for.
> [...]

    You don't *need* the hash code after the bucket is
located, but it's useful to store the hashes anyhow.  As
you traverse the list, you can compare the key's hash to
the entry's hash and call .equals() only if they happen
to match.  Comparing two already-computed int values is
likely to be faster than calling .equals(), so this can
save some time.  HashMap uses this optimization, storing
each item's hash code in its HashMap.Entry.

<topicality level="marginal">

    I once did some experiments to determine whether this
optimization was worth while in an "open addressed" hash
table (not the chained style used by HashMap).  Storing a
hash value with every item pointer made the table twice as
big as one that contained just the unadorned pointers -- or
to look at it another way, it allowed for only half as many
table slots for a given amount of memory.  I wondered whether
avoiding the .equals() calls was enough of a benefit to repay
doubling the table's "load factor" (entry count divided by
slot count); higher load factors lead to longer searches.

    In my tests, at least, storing the hash was a clear
winner.  Even when a search visited twice as many entries,
it almost never called .equals() in an unsuccessful search
or called it more than once in a successful search (when at
least one .equals() call is unavoidable).  The pointers-only
table visited fewer entries, but had to call .equals() on
every single one of them, every time.  Pointers-with-hashes
was so much faster that the memory investment to store the
hash values was clearly worth while.

</topicality>

Signature

Eric.Sosman@sun.com

Mike Schilling - 21 Feb 2008 19:46 GMT
>> Even if they are computed every time, you don't need to know the
>> hashCode
[quoted text clipped - 38 lines]
>
> </topicality>

Cool.  I wonder, does String.equals() make the same optimization?
(Looks....)  As of 1.5, it does not.  It checks the length, and if
that matches, it immediately begins marching down the array of bytes.
I suppose it doesn't want to take the hit of calculating the hash
codes unnecessarily.

One other oddity: a string whose hash code is 0 [1] recalculates its
hash code every times it's requested.  I'd probably have defined
hashCode() for such a string to return -1, just to avoid that.

1. Say, one which contains all zero bytes.


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.