Java Forum / General / January 2008
Hash Code Compression
j1mb0jay - 11 Jan 2008 22:11 GMT I am currently working on a dictionary populating program. I currently have a socket connection my local news server and am trawling through all of the articles looking for new words. Java's String class has a method that hashes strings. I was wondering if i should still be using these even though I have over two million words in the hash table. Although the hash table is currently Big 0(4).
I am using the Multiply Add and Divide (MAD) method for the compression of the hash code, does Java have any built in functions(methods) that will do this for me, or does anyone know of a more efficient way?
j1mb0jay
Eric Sosman - 11 Jan 2008 22:38 GMT > I am currently working on a dictionary populating program. I currently > have a socket connection my local news server and am trawling through > all of the articles looking for new words. Java's String class has a > method that hashes strings. I was wondering if i should still be using > these even though I have over two million words in the hash table. > Although the hash table is currently Big 0(4). This makes no sense. O(4) = O(1) = O(0.01) = O(1000000), by definition. What do you really mean?
> I am using the Multiply Add and Divide (MAD) method for the compression > of the hash code, does Java have any built in functions(methods) that > will do this for me, or does anyone know of a more efficient way? The value delivered by hashCode -- for any class, not just for String -- is a Java int, 32 bits wide. How (and why) are you "compressing" this value?
 Signature Eric Sosman esosman@ieee-dot-org.invalid
j1mb0jay - 11 Jan 2008 23:05 GMT >> I am currently working on a dictionary populating program. I currently >> have a socket connection my local news server and am trawling through [quoted text clipped - 14 lines] > just for String -- is a Java int, 32 bits wide. How (and why) > are you "compressing" this value? My hash table is made up of an array of n LinkedLists (where n is a prime number that is roughly double the number of words in the dictionary).
I firstly use the String.hashCode() method on a given word. I then compress this number so that i can use it as a index into the array of LinkedList; as this 32bit number is often far to large. I then insert the word into the LinkedList array at the compressed value index(The fact the hashTable is an array of LinkedLists is so that it handles collisions)
After inserting all of the words into the dictionary the largest LinkedList in the array has only four elements. I thought Big O(4) was the correct way of describing this.
Would it help if i posted my classes on here, or offer you a place to download the program.
j1mb0jay
Daniel Pitts - 11 Jan 2008 23:13 GMT > >> I am currently working on a dictionary populating program. I currently > >> have a socket connection my local news server and am trawling through [quoted text clipped - 33 lines] > > j1mb0jay Why aren't you using the existing HashMap class?
If you want a compact representation of the words you come across, consider a prefix tree data structure instead.
Just so you know, Big O measures the dominant term without multipliers, For instance, if your algorithm takes N *n + N + 4 steps, then it is O(N*N). If it takes 4*n*n steps, it is still O(N*N)
j1mb0jay - 11 Jan 2008 23:20 GMT >>>> I am currently working on a dictionary populating program. I currently >>>> have a socket connection my local news server and am trawling through [quoted text clipped - 38 lines] > multipliers, For instance, if your algorithm takes N *n + N + 4 > steps, then it is O(N*N). If it takes 4*n*n steps, it is still O(N*N) I have been asked to create my own data structures to help aid understanding for the course material for my degree module.(Check article header)
Because i am currently building the dictionary file by trawling news articles each word I pull from an article needs to be checked in the dictionary to see if we already have it(I don't want to store each word more than once). My current methodology means I only have to look at a maximum of 4 words(out of 2.5 million) to see if I already have this word stored in memory. Does this still mean my retrieval method is Big(N Squared)
j1mb0jay
Patricia Shanahan - 11 Jan 2008 23:26 GMT >>>>> I am currently working on a dictionary populating program. I currently >>>>> have a socket connection my local news server and am trawling through [quoted text clipped - 43 lines] > understanding for the course material for my degree module.(Check > article header) That is indeed a good reason to avoid using the standard classes.
Perhaps you should try a few different hash code to bucket number mappings, and compare performance. In some situations I have found that a really simple, quickly calculated mapping such as reduction modulo a power of two had about the same collision rate as more complicated, slower to compute, functions.
> Because i am currently building the dictionary file by trawling news > articles each word I pull from an article needs to be checked in the [quoted text clipped - 3 lines] > word stored in memory. Does this still mean my retrieval method is Big(N > Squared) Note that the java.util hash-based classes do offer ways of controlling the number of buckets.
Big-O is about limits as the problem size tends to infinity. If you only have to look at 4 words regardless of the number of words, then your lookup performance would be O(1) (Equivalent to O(42), O(1e100) etc. but O(1) is more conventional). If the number of words you have to examine depends on an upper bound on the number of words you process, you would need to examine the effect of increasing the number of words to get a computational complexity.
Patricia
Daniel Pitts - 11 Jan 2008 23:28 GMT > >>>> I am currently working on a dictionary populating program. I currently > >>>> have a socket connection my local news server and am trawling through [quoted text clipped - 52 lines] > > j1mb0jay Actually, it means it has Big O(1) (As hash tables tend to)
Don't use the hash for the index in the linked list. Don't even BOTHER with indexes in the linked list. The hash should be an index into an array of lists (of whatever sort, linked or otherwise).
Then each list should be relatively small, so trivial to search/ insert.
j1mb0jay - 11 Jan 2008 23:40 GMT >>>>>> I am currently working on a dictionary populating program. I currently >>>>>> have a socket connection my local news server and am trawling through [quoted text clipped - 53 lines] > Then each list should be relatively small, so trivial to search/ > insert. My lists are currently smallish, i was just hoping someone would know a better hash Code function or a better compression method which would make each linked list have 1 element rather than up to 4 elements. So that would mean each linked list could be replaced with a simple KeyValuePair.
Perfection means better marks :D
j1mb0jay
Patricia Shanahan - 11 Jan 2008 23:43 GMT >>>>>>> I am currently working on a dictionary populating program. I >>>>>>> currently [quoted text clipped - 66 lines] > > Perfection means better marks :D In that case, do a search for "perfect hash". :-)
If you have a fixed list of words, it may be possible to construct a hash function such that every member of the list has a different code. I don't know whether it is practical for 2.5 million words - perfect hashing is usually used for small sets of keywords.
Patricia
j1mb0jay - 12 Jan 2008 00:01 GMT >>>>>>>> I am currently working on a dictionary populating program. I >>>>>>>> currently [quoted text clipped - 79 lines] > > Patricia I have two main ways of using the dictionary, one for inserting words for which i use the hash table, lucky for me i have access to servers with lots of RAM at university, this is why the hash table is double the size of the words in the dictionary. The second is when I am trying to break different encryption types. For which I load the words in a binary sorted tree where the comparable data is the popularity of each word during the last trawl of the newsgroups.
Roedy Green - 12 Jan 2008 05:50 GMT >My lists are currently smallish, i was just hoping someone would know a >better hash Code function or a better compression method which would >make each linked list have 1 element rather than up to 4 elements. So >that would mean each linked list could be replaced with a simple >KeyValuePair. There is an algorithm to create an algorithm to get perfect disbursement IF you know the values in advance, called GPerf. There is a link to it at http://mindprod.com/jgloss/hashcode.html
 Signature Roedy Green, Canadian Mind Products The Java Glossary, http://mindprod.com
Chris - 12 Jan 2008 22:01 GMT >>>>>>> I am currently working on a dictionary populating program. I >>>>>>> currently [quoted text clipped - 68 lines] > > j1mb0jay Your choice of data structure should depend on what you're trying to optimize for.
If you're trying to optimize for speed, then a traditional hash table is a good choice. You're doing it right.
If you're trying to optimize for space, I would try to reduce your object count. Each object takes up a lot of memory. You could try to keep the words, and the pointers to them, in big, possibly paged arrays. (I've done this with a big int [] array and pages of char [] arrays. It works and it's pretty quick). Another way to optimize for space is to implement some kind of trie, like a patricia trie. Yet another way would be to do something btree-like, and implement key prefix compression. (It's a ton of work, but really efficient).
If you're trying to optimize for marks, then implement a Bloom Filter for dictionary lookup. :) You'll blow them away...
Eric Sosman - 12 Jan 2008 16:07 GMT >>>>> I am currently working on a dictionary populating program. I currently >>>>> have a socket connection my local news server and am trawling through [quoted text clipped - 18 lines] >>> compress this number so that i can use it as a index into the array of >>> LinkedList; as this 32bit number is often far to large. Aha! Your "compression" is the computation of a bucket index from the hash code. Here's a hint about one reasonable way to do it: Ask yourself why you chose a prime number as the bucket count. Was it because prime numbers swirl around hash tables like a mystical numerological fog, or was there a concrete reason? If the latter, what was it?
> Because i am currently building the dictionary file by trawling news > articles each word I pull from an article needs to be checked in the [quoted text clipped - 3 lines] > word stored in memory. Does this still mean my retrieval method is Big(N > Squared) It depends what you're counting, and how you're counting it.
If you're counting the number of String comparisons involved in looking up one String value in a table that already holds N values, the effort is O(N). You choose one of M lists, which holds (on the average) N/M items, and you search the list. The search effort is proportional to the list length, hence (since M is a constant and since Big-Oh swallows proportionality constants) the search effort is O(N). With a very small multiplier, to be sure, but still O(N).
But maybe M isn't constant after all; you said you chose it to be roughly twice the size of the expected N. If you regard M as a function of N rather than a constant, then the average list length N/M is N/(2*N) is 0.5, itself a constant, and the search effort is O(1). Different ground rules, different outcomes.
Or maybe you're looking at the total effort to insert all N items into an initially empty table. Then under the first model the effort is proportional to 0/M + 1/M + ... + (N-1)/M, which is N*(N-1)/(2*M) or O(N*N). Under the second model, the effort is 0.5 + 0.5 + ... + 0.5 = N/2 which is O(N).
Or maybe you're noting that each word is inserted only once but looked up many times, in which case things get much more complex and you need to start worrying about effects like "common" words entering into the table pretty early in the game while the table is fairly small, and "rare" words appearing later on when the table has grown large. Also, searches for "common" words are almost always successful, while those for "rare" words have more chance of being unsuccessful. A thorny thicket, to be sure.
Or maybe you're also trying to account for the time spent in computing the hash codes, in which case the lengths of the Strings enter into the picture, and not just their quantity.
Big-Oh exists for the purpose of sweeping unimportant details under the rug, deliberately discarding precision. But you can only use Big-Oh well if you're precise about what you're approximating, about what "N" means, about what things you model as constants and what you consider variable, and so on. You can only throw away precision if you're precise about throwing it away!
 Signature Eric Sosman esosman@ieee-dot-org.invalid
Patricia Shanahan - 11 Jan 2008 23:19 GMT >>> I am currently working on a dictionary populating program. I >>> currently have a socket connection my local news server and am [quoted text clipped - 31 lines] > Would it help if i posted my classes on here, or offer you a place to > download the program. This is very similar to the design of java.util.HashSet, except it already has methods for mapping from hashCode to bucket number that have been tested with Java String.
Is there some particular reason for rolling your own rather than using the java.util class?
Patricia
j1mb0jay - 11 Jan 2008 23:27 GMT >>>> I am currently working on a dictionary populating program. I >>>> currently have a socket connection my local news server and am [quoted text clipped - 41 lines] > > Patricia I have to roll out my own for coursework for my data structures module.
j1mb0jay
Roedy Green - 11 Jan 2008 23:40 GMT >I have over two million words in the hash table That is not very much compared with the size of the hashCode. However, if you needed a bigger hashCode for some reason, there are several popular digest algorithms that will give you various sized results. See http://mindprod.com/jgloss/digest.html
The traditional way to prune a hash code to size without losing "randomness" is to take the absolute value of the modulus.
see http://mindprod.com/jgloss/hashcode.html http://mindprod.com/jgloss/hashtable.html http://mindprod.com/jgloss/hashmap.html for a discussion.
 Signature Roedy Green, Canadian Mind Products The Java Glossary, http://mindprod.com
j1mb0jay - 11 Jan 2008 23:43 GMT >> I have over two million words in the hash table > [quoted text clipped - 10 lines] > http://mindprod.com/jgloss/hashmap.html > for a discussion. Thank you, I am looking to prune the hash code rather than make it larger and will take a look at mindprod.
Thanks again
j1mb0jay
Lew - 12 Jan 2008 00:30 GMT > Thank you, I am looking to prune the hash code rather than make it > larger and will take a look at mindprod. Patricia's suggestion of taking hashCode = value % k, where k is the current size of your hash table, is probably the best.
 Signature Lew
j1mb0jay - 12 Jan 2008 00:43 GMT >> Thank you, I am looking to prune the hash code rather than make it >> larger and will take a look at mindprod. > > Patricia's suggestion of taking hashCode = value % k, where k is the > current size of your hash table, is probably the best. Am i correct is saying this will work best if the current size of the hash table is a prime number ?
j1mb0jay
Patricia Shanahan - 12 Jan 2008 01:10 GMT >>> Thank you, I am looking to prune the hash code rather than make it >>> larger and will take a look at mindprod. [quoted text clipped - 4 lines] > Am i correct is saying this will work best if the current size of the > hash table is a prime number ? That seems like a good subject for experiments. It is far from obvious, especially given the design of the String hashCode method. I would, however, avoid 31.
Patricia
Roedy Green - 12 Jan 2008 05:51 GMT >Am i correct is saying this will work best if the current size of the >hash table is a prime number ? You preserve the most high order randomness that way. This is discussed in the links I gave you earlier.
 Signature Roedy Green, Canadian Mind Products The Java Glossary, http://mindprod.com
Charles - 12 Jan 2008 04:53 GMT > I am currently working on a dictionary populating program. I currently > have a socket connection my local news server and am trawling through [quoted text clipped - 8 lines] > > j1mb0jay Hi Group:
I don't know much about the MAD method, but LZW is another useful compression technique. I consider compression akin to folding spaces. I also think of greatest common divisor algorithms.
I am thinking of it, but I am not showing how to use it. Sorry.
If you have time implement two solutions and continuously improve.
You could also look up compression patents.
:-) Good luck.
Lew - 12 Jan 2008 05:45 GMT > I don't know much about the MAD method, but LZW is another useful > compression technique. I consider compression akin to folding spaces. [quoted text clipped - 3 lines] > > If you have time implement two solutions and continuously improve. Since the data to be "compressed" is exactly four bytes long, LZW and the like are way too heavy. Patricia's advice to use the remainder operator % is still the best.
> You could also look up compression patents. There are at least two U.S. patents for compression algorithms that are not mathematically feasible. Be careful if you are searching patents.
 Signature Lew
j1mb0jay - 12 Jan 2008 12:10 GMT >> I don't know much about the MAD method, but LZW is another useful >> compression technique. I consider compression akin to folding spaces. [quoted text clipped - 12 lines] > There are at least two U.S. patents for compression algorithms that are > not mathematically feasible. Be careful if you are searching patents. I am not allowed to use any patents for my coursework it states this in the student hand book very clearly, and as stated i feel it is over the top for what i need.
j1mb0jay
Roedy Green - 12 Jan 2008 05:54 GMT On Fri, 11 Jan 2008 20:53:12 -0800 (PST), Charles <charlesalopez@gmail.com> wrote, quoted or indirectly quoted someone who said :
>I don't know much about the MAD method, but LZW is another useful >compression technique. I consider compression akin to folding spaces. >I also think of greatest common divisor algorithms. > >I am thinking of it, but I am not showing how to use it. Sorry. see http://mindprod.com/jgloss/gzip.html
The FileIO amanuensis will show you how to do an purely in-RAM implementation.
See http://mindprod.com/jgloss/applet/fileio.html
As Patricia said, do experiments and measure. It is hard to know in advance which algorithms will perform best.
 Signature Roedy Green, Canadian Mind Products The Java Glossary, http://mindprod.com
j1mb0jay - 12 Jan 2008 12:02 GMT >> I am currently working on a dictionary populating program. I currently >> have a socket connection my local news server and am trawling through [quoted text clipped - 24 lines] > > Good luck. I am not trying to compress a file or image so Lempel-Ziv-Welch Compression wont work for me (or so i think), i just want to compress an integer so i can use it as a index to a fixed size array. I seem to have improved my current method by by simply using the % of the 2 number and the size of the array!!!
j1mb0jay
Patricia Shanahan - 12 Jan 2008 15:48 GMT >>> I am currently working on a dictionary populating program. I currently >>> have a socket connection my local news server and am trawling through [quoted text clipped - 32 lines] > > j1mb0jay I think the situation would be clearer if you thought of it, and described it, as the mapping from hash code to hash table bucket number.
In the current situation, with a hash function that returns a 32 bit int and a relatively small hash table, the mapping will be a compression algorithm in the sense that the range is smaller than the domain, but the important properties are different.
The main property you need is a tendency for the mapping from a String representing a word, through the String's hash code, to the bucket number, to distribute words as evenly as possible among buckets. That does not correspond to any normal requirement on compression algorithms.
Patricia
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|