Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / January 2008

Tip: Looking for answers? Try searching our database.

Hash Code Compression

Thread view: 
j1mb0jay - 11 Jan 2008 22:11 GMT
I am currently working on a dictionary populating program. I currently
have a socket connection my local news server and am trawling through
all of the articles looking for new words. Java's String class has a
method that hashes strings. I was wondering if i should still be using
these even though I have over two million words in the hash table.
Although the hash table is currently Big 0(4).

I am using the Multiply Add and Divide (MAD) method for the compression
of the hash code, does Java have any built in functions(methods) that
will do this for me, or does anyone know of a more efficient way?

j1mb0jay
Eric Sosman - 11 Jan 2008 22:38 GMT
> I am currently working on a dictionary populating program. I currently
> have a socket connection my local news server and am trawling through
> all of the articles looking for new words. Java's String class has a
> method that hashes strings. I was wondering if i should still be using
> these even though I have over two million words in the hash table.
> Although the hash table is currently Big 0(4).

    This makes no sense.  O(4) = O(1) = O(0.01) = O(1000000),
by definition.  What do you really mean?

> I am using the Multiply Add and Divide (MAD) method for the compression
> of the hash code, does Java have any built in functions(methods) that
> will do this for me, or does anyone know of a more efficient way?

    The value delivered by hashCode -- for any class, not
just for String -- is a Java int, 32 bits wide.  How (and why)
are you "compressing" this value?

Signature

Eric Sosman
esosman@ieee-dot-org.invalid

j1mb0jay - 11 Jan 2008 23:05 GMT
>> I am currently working on a dictionary populating program. I currently
>> have a socket connection my local news server and am trawling through
[quoted text clipped - 14 lines]
> just for String -- is a Java int, 32 bits wide.  How (and why)
> are you "compressing" this value?

My hash table is made up of an array of n LinkedLists (where n is a
prime number that is roughly double the number of words in the dictionary).

I firstly use the String.hashCode() method on a given word. I then
compress this number so that i can use it as a index into the array of
LinkedList; as this 32bit number is often far to large. I then insert
the word into the LinkedList array at the compressed value index(The
fact the hashTable is an array of LinkedLists is so that it handles
collisions)

After inserting all of the words into the dictionary the largest
LinkedList in the array has only four elements. I thought Big O(4) was
the correct way of describing this.

Would it help if i posted my classes on here, or offer you a place to
download the program.

j1mb0jay
Daniel Pitts - 11 Jan 2008 23:13 GMT
> >> I am currently working on a dictionary populating program. I currently
> >> have a socket connection my local news server and am trawling through
[quoted text clipped - 33 lines]
>
> j1mb0jay

Why aren't you using the existing HashMap class?

If you want a compact representation of the words you come across,
consider a prefix tree data structure instead.

Just so you know, Big O measures the dominant term without
multipliers,  For instance, if your algorithm takes N *n + N + 4
steps, then it is O(N*N).  If it takes 4*n*n steps, it is still O(N*N)
j1mb0jay - 11 Jan 2008 23:20 GMT
>>>> I am currently working on a dictionary populating program. I currently
>>>> have a socket connection my local news server and am trawling through
[quoted text clipped - 38 lines]
> multipliers,  For instance, if your algorithm takes N *n + N + 4
> steps, then it is O(N*N).  If it takes 4*n*n steps, it is still O(N*N)

I have been asked to create my own data structures to help aid
understanding for the course material for my degree module.(Check
article header)

Because i am currently building the dictionary file by trawling news
articles each word I pull from an article needs to be checked in the
dictionary to see if we already have it(I don't want to store each word
more than once). My current methodology means I only have to look at a
maximum of 4 words(out of 2.5 million) to see if I already have this
word stored in memory. Does this still mean my retrieval method is Big(N
Squared)

j1mb0jay
Patricia Shanahan - 11 Jan 2008 23:26 GMT
>>>>> I am currently working on a dictionary populating program. I currently
>>>>> have a socket connection my local news server and am trawling through
[quoted text clipped - 43 lines]
> understanding for the course material for my degree module.(Check
> article header)

That is indeed a good reason to avoid using the standard classes.

Perhaps you should try a few different hash code to bucket number
mappings, and compare performance. In some situations I have found that
a really simple, quickly calculated mapping such as reduction modulo a
power of two had about the same collision rate as more complicated,
slower to compute, functions.

> Because i am currently building the dictionary file by trawling news
> articles each word I pull from an article needs to be checked in the
[quoted text clipped - 3 lines]
> word stored in memory. Does this still mean my retrieval method is Big(N
> Squared)

Note that the java.util hash-based classes do offer ways of controlling
the number of buckets.

Big-O is about limits as the problem size tends to infinity. If you only
have to look at 4 words regardless of the number of words, then your
lookup performance would be O(1) (Equivalent to O(42), O(1e100) etc. but
O(1) is more conventional). If the number of words you have to examine
depends on an upper bound on the number of words you process, you would
need to examine the effect of increasing the number of words to get a
computational complexity.

Patricia
Daniel Pitts - 11 Jan 2008 23:28 GMT
> >>>> I am currently working on a dictionary populating program. I currently
> >>>> have a socket connection my local news server and am trawling through
[quoted text clipped - 52 lines]
>
> j1mb0jay

Actually, it means it has Big O(1) (As hash tables tend to)

Don't use the hash for the index in the linked list.  Don't even
BOTHER with indexes in the linked list.  The hash should be an index
into an array of lists (of whatever sort, linked or otherwise).

Then each list should be relatively small, so trivial to search/
insert.
j1mb0jay - 11 Jan 2008 23:40 GMT
>>>>>> I am currently working on a dictionary populating program. I currently
>>>>>> have a socket connection my local news server and am trawling through
[quoted text clipped - 53 lines]
> Then each list should be relatively small, so trivial to search/
> insert.

My lists are currently smallish, i was just hoping someone would know a
better hash Code function or a better compression method which would
make each linked list have 1 element rather than up to 4 elements. So
that would mean each linked list could be replaced with a simple
KeyValuePair.

Perfection means better marks :D

j1mb0jay
Patricia Shanahan - 11 Jan 2008 23:43 GMT
>>>>>>> I am currently working on a dictionary populating program. I
>>>>>>> currently
[quoted text clipped - 66 lines]
>
> Perfection means better marks :D

In that case, do a search for "perfect hash". :-)

If you have a fixed list of words, it may be possible to construct a
hash function such that every member of the list has a different code. I
don't know whether it is practical for 2.5 million words - perfect
hashing is usually used for small sets of keywords.

Patricia
j1mb0jay - 12 Jan 2008 00:01 GMT
>>>>>>>> I am currently working on a dictionary populating program. I
>>>>>>>> currently
[quoted text clipped - 79 lines]
>
> Patricia

I have two main ways of using the dictionary, one for inserting words
for which i use the hash table, lucky for me i have access to servers
with lots of RAM at university, this is why the hash table is double the
size of the words in the dictionary. The second is when I am trying to
break different encryption types. For which I load the words in a binary
sorted tree where the comparable data is the popularity of each word
during the last trawl of the newsgroups.
Roedy Green - 12 Jan 2008 05:50 GMT
>My lists are currently smallish, i was just hoping someone would know a
>better hash Code function or a better compression method which would
>make each linked list have 1 element rather than up to 4 elements. So
>that would mean each linked list could be replaced with a simple
>KeyValuePair.

There is an algorithm to create an algorithm to get perfect
disbursement IF you know the values in advance, called GPerf.  There
is a link to it at http://mindprod.com/jgloss/hashcode.html
Signature

Roedy Green, Canadian Mind Products
The Java Glossary, http://mindprod.com

Chris - 12 Jan 2008 22:01 GMT
>>>>>>> I am currently working on a dictionary populating program. I
>>>>>>> currently
[quoted text clipped - 68 lines]
>
> j1mb0jay

Your choice of data structure should depend on what you're trying to
optimize for.

If you're trying to optimize for speed, then a traditional hash table is
a good choice. You're doing it right.

If you're trying to optimize for space, I would try to reduce your
object count. Each object takes up a lot of memory. You could try to
keep the words, and the pointers to them, in big, possibly paged arrays.
(I've done this with a big int [] array and pages of char [] arrays. It
works and it's pretty quick). Another way to optimize for space is to
implement some kind of trie, like a patricia trie. Yet another way would
be to do something btree-like, and implement key prefix compression.
(It's a ton of work, but really efficient).

If you're trying to optimize for marks, then implement a Bloom Filter
for dictionary lookup. :) You'll blow them away...
Eric Sosman - 12 Jan 2008 16:07 GMT
>>>>> I am currently working on a dictionary populating program. I currently
>>>>> have a socket connection my local news server and am trawling through
[quoted text clipped - 18 lines]
>>> compress this number so that i can use it as a index into the array of
>>> LinkedList; as this 32bit number is often far to large.

    Aha!  Your "compression" is the computation of a bucket
index from the hash code.  Here's a hint about one reasonable
way to do it: Ask yourself why you chose a prime number as the
bucket count.  Was it because prime numbers swirl around hash
tables like a mystical numerological fog, or was there a concrete
reason?  If the latter, what was it?

> Because i am currently building the dictionary file by trawling news
> articles each word I pull from an article needs to be checked in the
[quoted text clipped - 3 lines]
> word stored in memory. Does this still mean my retrieval method is Big(N
> Squared)

    It depends what you're counting, and how you're counting it.

    If you're counting the number of String comparisons involved
in looking up one String value in a table that already holds N
values, the effort is O(N).  You choose one of M lists, which holds
(on the average) N/M items, and you search the list.  The search
effort is proportional to the list length, hence (since M is a
constant and since Big-Oh swallows proportionality constants) the
search effort is O(N).  With a very small multiplier, to be sure,
but still O(N).

    But maybe M isn't constant after all; you said you chose it
to be roughly twice the size of the expected N.  If you regard M
as a function of N rather than a constant, then the average list
length N/M is N/(2*N) is 0.5, itself a constant, and the search
effort is O(1).  Different ground rules, different outcomes.

    Or maybe you're looking at the total effort to insert all N
items into an initially empty table.  Then under the first model
the effort is proportional to 0/M + 1/M + ... + (N-1)/M, which
is N*(N-1)/(2*M) or O(N*N).  Under the second model, the effort
is 0.5 + 0.5 + ... + 0.5 = N/2 which is O(N).

    Or maybe you're noting that each word is inserted only once
but looked up many times, in which case things get much more
complex and you need to start worrying about effects like "common"
words entering into the table pretty early in the game while the
table is fairly small, and "rare" words appearing later on when
the table has grown large.  Also, searches for "common" words are
almost always successful, while those for "rare" words have more
chance of being unsuccessful.  A thorny thicket, to be sure.

    Or maybe you're also trying to account for the time spent in
computing the hash codes, in which case the lengths of the Strings
enter into the picture, and not just their quantity.

    Big-Oh exists for the purpose of sweeping unimportant details
under the rug, deliberately discarding precision.  But you can only
use Big-Oh well if you're precise about what you're approximating,
about what "N" means, about what things you model as constants and
what you consider variable, and so on.  You can only throw away
precision if you're precise about throwing it away!

Signature

Eric Sosman
esosman@ieee-dot-org.invalid

Patricia Shanahan - 11 Jan 2008 23:19 GMT
>>> I am currently working on a dictionary populating program. I
>>> currently have a socket connection my local news server and am
[quoted text clipped - 31 lines]
> Would it help if i posted my classes on here, or offer you a place to
> download the program.

This is very similar to the design of java.util.HashSet, except it
already has methods for mapping from hashCode to bucket number that have
been tested with Java String.

Is there some particular reason for rolling your own rather than using
the java.util class?

Patricia
j1mb0jay - 11 Jan 2008 23:27 GMT
>>>> I am currently working on a dictionary populating program. I
>>>> currently have a socket connection my local news server and am
[quoted text clipped - 41 lines]
>
> Patricia

I have to roll out my own for coursework for my data structures module.

j1mb0jay
Roedy Green - 11 Jan 2008 23:40 GMT
>I have over two million words in the hash table

That is not very much compared with the size of the hashCode. However,
if you needed a bigger hashCode for some reason, there are several
popular digest algorithms that will give you various sized results.
See http://mindprod.com/jgloss/digest.html

The traditional way to prune a hash code to size without losing
"randomness" is to take the absolute value of the modulus.

see http://mindprod.com/jgloss/hashcode.html
http://mindprod.com/jgloss/hashtable.html
http://mindprod.com/jgloss/hashmap.html
for a discussion.
Signature

Roedy Green, Canadian Mind Products
The Java Glossary, http://mindprod.com

j1mb0jay - 11 Jan 2008 23:43 GMT
>> I have over two million words in the hash table
>
[quoted text clipped - 10 lines]
> http://mindprod.com/jgloss/hashmap.html
> for a discussion.

Thank you, I am looking to prune the hash code rather than make it
larger and will take a look at mindprod.

Thanks again

j1mb0jay
Lew - 12 Jan 2008 00:30 GMT
> Thank you, I am looking to prune the hash code rather than make it
> larger and will take a look at mindprod.

Patricia's suggestion of taking hashCode = value % k, where k is the current
size of your hash table, is probably the best.

Signature

Lew

j1mb0jay - 12 Jan 2008 00:43 GMT
>> Thank you, I am looking to prune the hash code rather than make it
>> larger and will take a look at mindprod.
>
> Patricia's suggestion of taking hashCode = value % k, where k is the
> current size of your hash table, is probably the best.

Am i correct is saying this will work best if the current size of the
hash table is a prime number ?

j1mb0jay
Patricia Shanahan - 12 Jan 2008 01:10 GMT
>>> Thank you, I am looking to prune the hash code rather than make it
>>> larger and will take a look at mindprod.
[quoted text clipped - 4 lines]
> Am i correct is saying this will work best if the current size of the
> hash table is a prime number ?

That seems like a good subject for experiments. It is far from obvious,
especially given the design of the String hashCode method. I would,
however, avoid 31.

Patricia
Roedy Green - 12 Jan 2008 05:51 GMT
>Am i correct is saying this will work best if the current size of the
>hash table is a prime number ?

You preserve the most high order randomness that way. This is
discussed in the links I gave you earlier.
Signature

Roedy Green, Canadian Mind Products
The Java Glossary, http://mindprod.com

Charles - 12 Jan 2008 04:53 GMT
> I am currently working on a dictionary populating program. I currently
> have a socket connection my local news server and am trawling through
[quoted text clipped - 8 lines]
>
> j1mb0jay

Hi Group:

I don't know much about the MAD method, but LZW is another useful
compression technique. I consider compression akin to folding spaces.
I also think of greatest common divisor algorithms.

I am thinking of it, but I am not showing how to use it. Sorry.

If you have time implement two solutions and continuously improve.

You could also look up compression patents.

:-)

Good luck.
Lew - 12 Jan 2008 05:45 GMT
> I don't know much about the MAD method, but LZW is another useful
> compression technique. I consider compression akin to folding spaces.
[quoted text clipped - 3 lines]
>
> If you have time implement two solutions and continuously improve.

Since the data to be "compressed" is exactly four bytes long, LZW and the like
are way too heavy.  Patricia's advice to use the remainder operator % is still
the best.

> You could also look up compression patents.

There are at least two U.S. patents for compression algorithms that are not
mathematically feasible.  Be careful if you are searching patents.

Signature

Lew

j1mb0jay - 12 Jan 2008 12:10 GMT
>> I don't know much about the MAD method, but LZW is another useful
>> compression technique. I consider compression akin to folding spaces.
[quoted text clipped - 12 lines]
> There are at least two U.S. patents for compression algorithms that are
> not mathematically feasible.  Be careful if you are searching patents.

I am not allowed to use any patents for my coursework it states this in
the student hand book very clearly, and as stated i feel it is over the
top for what i need.

j1mb0jay
Roedy Green - 12 Jan 2008 05:54 GMT
On Fri, 11 Jan 2008 20:53:12 -0800 (PST), Charles
<charlesalopez@gmail.com> wrote, quoted or indirectly quoted someone
who said :

>I don't know much about the MAD method, but LZW is another useful
>compression technique. I consider compression akin to folding spaces.
>I also think of greatest common divisor algorithms.
>
>I am thinking of it, but I am not showing how to use it. Sorry.

see http://mindprod.com/jgloss/gzip.html

The FileIO amanuensis will show you how to do an purely in-RAM
implementation.

See http://mindprod.com/jgloss/applet/fileio.html

As Patricia said, do experiments and measure.  It is hard to know in
advance which algorithms will perform best.
Signature

Roedy Green, Canadian Mind Products
The Java Glossary, http://mindprod.com

j1mb0jay - 12 Jan 2008 12:02 GMT
>> I am currently working on a dictionary populating program. I currently
>> have a socket connection my local news server and am trawling through
[quoted text clipped - 24 lines]
>
> Good luck.

I am not trying to compress a file or image so Lempel-Ziv-Welch
Compression wont work for me (or so i think), i just want to compress an
integer so i can use it as a index to a fixed size array. I seem to have
improved my current method by by simply using the % of the 2 number and
the size of the array!!!

j1mb0jay
Patricia Shanahan - 12 Jan 2008 15:48 GMT
>>> I am currently working on a dictionary populating program. I currently
>>> have a socket connection my local news server and am trawling through
[quoted text clipped - 32 lines]
>
> j1mb0jay

I think the situation would be clearer if you thought of it, and
described it, as the mapping from hash code to hash table bucket number.

In the current situation, with a hash function that returns a 32 bit int
and a relatively small hash table, the mapping will be a compression
algorithm in the sense that the range is smaller than the domain, but
the important properties are different.

The main property you need is a tendency for the mapping from a String
representing a word, through the String's hash code, to the bucket
number, to distribute words as evenly as possible among buckets. That
does not correspond to any normal requirement on compression algorithms.

Patricia


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.