Java Forum / General / September 2007
Java algorithm (diff) question.
nebiyou1@gmail.com - 08 Sep 2007 04:36 GMT I have two CSV files (file1 and file2). Each file contains 50,000 + lines. Each line is very long. The first entry in each line is a key as shown in the example below.
Example: key1,val1,val2,val3.... key2,val1,val2,val3..... etc
I would like to do a "diff"(unix like) between file1 and file2 and find out the lines that are different. The entries in each file are not necessarily sorted by key.
Approaches I considered. 1. To load all the entries into a hashmap using the key's as a key and the rest of the line as a value. This is a problem because of memory.
2. To generate the hashcode for each line and use the key and the hashcode to store in a hash map and do the comparison. The problem here is that two different strings could have the same hashcode (false positive).
Any suggestions would be appreciated.
nebiyou1@gmail.com - 08 Sep 2007 04:38 GMT On Sep 7, 11:36 pm, nebiy...@gmail.com wrote:
> I have two CSV files (file1 and file2). > Each file contains 50,000 + lines. Each line is very long. [quoted text clipped - 20 lines] > > Any suggestions would be appreciated. made a clarification to the original post.
nebiyou1@gmail.com - 08 Sep 2007 05:14 GMT On Sep 7, 11:38 pm, nebiy...@gmail.com wrote:
> On Sep 7, 11:36 pm, nebiy...@gmail.com wrote: > [quoted text clipped - 24 lines] > > made a clarification to the original post. actually i can use the key+hashcode. Thanks for reading:)
Patricia Shanahan - 08 Sep 2007 06:07 GMT > On Sep 7, 11:38 pm, nebiy...@gmail.com wrote: >> On Sep 7, 11:36 pm, nebiy...@gmail.com wrote: [quoted text clipped - 21 lines] > > actually i can use the key+hashcode. Thanks for reading:) I find the final remark a little bit worrying. If the key alone is sufficient, why use hash code at all? On the other hand, if the key is not unique, what prevents a hash code collision among lines with the same key?
Patricia
Daniel Pitts - 10 Sep 2007 16:33 GMT On Sep 7, 9:14 pm, nebiy...@gmail.com wrote:
> On Sep 7, 11:38 pm, nebiy...@gmail.com wrote: > [quoted text clipped - 28 lines] > > actually i can use the key+hashcode. Thanks for reading:) My suggestion is to actually sort the files based on key, and then compare the files one line at a time, only moving to the next line in a particular file if the key for that line is less than the key for the current line in the other file.
Ofcourse, this only works if the key's have some sort of natural ordering, and you can manage to sort the files in the first place.
Roedy Green - 08 Sep 2007 06:43 GMT >I would like to do a "diff"(unix like) between file1 and file2 and >find out >the lines that are different. The entries in each file are not >necessarily sorted by key. If these files are too huge to put in RAM, here is a rough outline of a diff algorithm.
Create entries like this: line#, offset, len, digest-64
for each file. sort by digest, line#
If the files are identical you should see
two identical lists of quads.
If they are not identical, you can identify the parts of the files that are identical, and also the parts that are identical, but shifted.
You can then identify insertions and deletions. You can then build the traditional DIFF display of any portion of the two files.
The method assumes all lines in the original file are unique. If you have blank lines, or other non-unique lines, they need to be made unique
see http://mindprod.com/jgloss/digest.html
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
NebiyouGirma@gmail.com - 08 Sep 2007 19:34 GMT On Sep 8, 1:43 am, Roedy Green <see_webs...@mindprod.com.invalid> wrote:
> On Fri, 07 Sep 2007 20:36:16 -0700, nebiy...@gmail.com wrote, quoted > or indirectly quoted someone who said : [quoted text clipped - 32 lines] > Roedy Green Canadian Mind Products > The Java Glossaryhttp://mindprod.com Thanks. The lines in each file are guaranteed to be unique. Further, each line has also a unique key to identify it. My requirement is only to compare if lines with the same key in file1 and file2.
For this I can build, HashMap<Key,Key-hashcode> for each line and do the comparison. My worry here is that there is a chance that I could get a false positive for two strings of the same Key.
Does digest-64 guarantee that two different strings always hash to different values? How would it help?
NebiyouGirma@gmail.com - 08 Sep 2007 19:42 GMT On Sep 8, 2:34 pm, NebiyouGi...@gmail.com wrote:
> On Sep 8, 1:43 am, Roedy Green <see_webs...@mindprod.com.invalid> > wrote: [quoted text clipped - 48 lines] > Does digest-64 guarantee that two different strings always hash to > different values? How would it help? Thanks. The lines in each file are guaranteed to be unique. Further, each line has also a unique key to identify it. The lines are not sorted by key, however. My requirement is only to compare if lines with the same key in file1 and file2 are different.
For this I can build, HashMap<Key,Key-hashcode> for a file (key= the key, value= hashcode). My worry here is that there is a chance that I could get a false positive for two strings of the same Key.
Does digest-64 guarantee that two different strings always hash to different values? How would it help?
can you elaborate on ur algorithm.
Lew - 08 Sep 2007 19:48 GMT > Thanks. The lines in each file are guaranteed to be unique. Further, > each line has also a unique key to identify it. [quoted text clipped - 5 lines] > My worry here is that there is a chance that I could get a false > positive for two strings of the same Key. You just said that the key is unique; doesn't that mean that there cannot be two Strings with the same "Key"?
Note that HashMap uses the hashCode() of each key to locate the linked list of possible values for that key. Note also that any Map promises exactly one entry for each key that is equivalent according to equals(). In other words, in any Map the key provides a unique lookup.
It is pathologically unusual to store the hash code of an item as the value of that item in a Map. What do you think this will allow you to accomplish?
> Does digest-64 guarantee that two different strings always hash to > different values? By definition, a hash guarantees that at least two items from the hash function domain will map to the same hash code.
> How would it help?
 Signature Lew
Roedy Green - 09 Sep 2007 10:16 GMT >Does digest-64 guarantee that two different strings always hash to >different values? How would it help? A 64 bit digest collision with a properly done hash function is 1 in 2^64 which is roughly one in 10,000,000,000,000,000,000 Your odds of winning the lottery are considerably better than that.
Even with 32 bit hashes, (properly implemented) the odds of a collision are one in 1 billion for any given pair. but the odds get worse as you compare more pairs.
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Lew - 09 Sep 2007 13:21 GMT >> Does digest-64 guarantee that two different strings always hash to >> different values? How would it help? [quoted text clipped - 6 lines] > collision are one in 1 billion for any given pair. but the odds get > worse as you compare more pairs. I, too, am curious how the digest will help.
 Signature Lew
Rogan Dawes - 10 Sep 2007 18:00 GMT >>> Does digest-64 guarantee that two different strings always hash to >>> different values? How would it help? [quoted text clipped - 8 lines] > > I, too, am curious how the digest will help. Perhaps I misunderstand the OP, but IIUC, he wants to be able to identify new lines and deleted lines between two large unsorted files. In particular, each line is very long.
Each line begins with a unique key (but I think that the rest of the line MAY change?). So, rather than storing the unique key, and the whole line (which is large), store the unique key, and the hash of the line, which is a fixed length (128 - 256 bits, depending on hash). Depending on the collision space, you may want to choose a better hash.
e.g. HashMap<UniqueKey, MD5(line-UniqueKey)> HashMap<UniqueKey, SHA256(line-UniqueKey)>
So, if the hashes for the unique key change, then report the difference, otherwise move to the next item.
Your diff report will be in the order chosen by the hash's iterator, of course. If you have specific requirements, use a TreeMap with a suitable Comparator.
Rogan
Lew - 11 Sep 2007 00:39 GMT > So, if the hashes for the unique key change, then report the difference, > otherwise move to the next item. But if the hashes are the same, it's not a guarantee that the item has not changed. It may, unlikely but it may have changed to another value that hashes to the same value.
The odds of this might be acceptable, but it does require very careful choice of hashing algorithm.
 Signature Lew
Rogan Dawes - 12 Sep 2007 10:58 GMT >> So, if the hashes for the unique key change, then report the >> difference, otherwise move to the next item. [quoted text clipped - 5 lines] > The odds of this might be acceptable, but it does require very careful > choice of hashing algorithm. Of course. That is why I mentioned using something like SHA-256, which would be VERY unlikely to collide. 1 in 2^256 = 1.15792089 × 10^77
Even SHA-1 is unlikely to collide: 1 in 2^160 = 1.46150164 × 10^48
Rogan
Ingo R. Homann - 10 Sep 2007 15:13 GMT Hi,
> I have two CSV files (file1 and file2). > Each file contains 50,000 + lines. Each line is very long. [quoted text clipped - 20 lines] > > Any suggestions would be appreciated. IMHO the only clean solution is to store the data in a database and then iterate over all keys.
Ciao, Ingo
bugbear - 19 Sep 2007 09:08 GMT > I have two CSV files (file1 and file2). > Each file contains 50,000 + lines. Each line is very long. [quoted text clipped - 9 lines] > the lines that are different. The entries in each file are not > necessarily sorted by key. Start here:
http://citeseer.ist.psu.edu/myers86ond.html
BugBear
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|