Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / September 2007

Tip: Looking for answers? Try searching our database.

Java algorithm (diff) question.

Thread view: 
nebiyou1@gmail.com - 08 Sep 2007 04:36 GMT
I have two CSV files (file1 and file2).
Each file contains 50,000 + lines. Each line is very long.
The first entry in each line is a key as shown in the example below.

Example:
key1,val1,val2,val3....
key2,val1,val2,val3.....
etc

I would like to do a "diff"(unix like) between file1 and file2 and
find out
the lines that are different. The entries in each file are not
necessarily sorted by key.

Approaches I considered.
1. To load all the entries into a hashmap using the key's as a key and
the rest of the line as a value. This is a problem because of memory.

2. To generate the hashcode for each line and use the key and the
hashcode to
store in a hash map and do the comparison. The problem here is that
two different strings could have the same hashcode (false positive).

Any suggestions would be appreciated.
nebiyou1@gmail.com - 08 Sep 2007 04:38 GMT
On Sep 7, 11:36 pm, nebiy...@gmail.com wrote:
> I have two CSV files (file1 and file2).
> Each file contains 50,000 + lines. Each line is very long.
[quoted text clipped - 20 lines]
>
> Any suggestions would be appreciated.

made a clarification to the original post.
nebiyou1@gmail.com - 08 Sep 2007 05:14 GMT
On Sep 7, 11:38 pm, nebiy...@gmail.com wrote:
> On Sep 7, 11:36 pm, nebiy...@gmail.com wrote:
>
[quoted text clipped - 24 lines]
>
> made a clarification to the original post.

actually i can use the key+hashcode. Thanks for reading:)
Patricia Shanahan - 08 Sep 2007 06:07 GMT
> On Sep 7, 11:38 pm, nebiy...@gmail.com wrote:
>> On Sep 7, 11:36 pm, nebiy...@gmail.com wrote:
[quoted text clipped - 21 lines]
>
> actually i can use the key+hashcode. Thanks for reading:)

I find the final remark a little bit worrying. If the key alone is
sufficient, why use hash code at all? On the other hand, if the key is
not unique, what prevents a hash code collision among lines with the
same key?

Patricia
Daniel Pitts - 10 Sep 2007 16:33 GMT
On Sep 7, 9:14 pm, nebiy...@gmail.com wrote:
> On Sep 7, 11:38 pm, nebiy...@gmail.com wrote:
>
[quoted text clipped - 28 lines]
>
> actually i can use the key+hashcode. Thanks for reading:)

My suggestion is to actually sort the files based on key, and then
compare the files one line at a time, only moving to the next line in
a particular file if the key for that line is less than the key for
the current line in the other file.

Ofcourse, this only works if the key's have some sort of natural
ordering, and you can manage to sort the files in the first place.
Roedy Green - 08 Sep 2007 06:43 GMT
>I would like to do a "diff"(unix like) between file1 and file2 and
>find out
>the lines that are different. The entries in each file are not
>necessarily sorted by key.

If these files are too huge to put in RAM, here is a rough outline of
a diff algorithm.

Create entries like this:
line#, offset, len, digest-64

for each file.
sort by digest, line#

If the files are identical you should see

two identical lists of quads.

If they are not identical, you can identify the parts of the files
that are identical, and also the parts that are identical, but
shifted.

You can then identify insertions and deletions.  You can then
build the traditional DIFF display  of any portion of the two files.

The method assumes all lines in the original file are unique. If you
have blank lines, or other non-unique lines, they need to be made
unique

see http://mindprod.com/jgloss/digest.html
Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

NebiyouGirma@gmail.com - 08 Sep 2007 19:34 GMT
On Sep 8, 1:43 am, Roedy Green <see_webs...@mindprod.com.invalid>
wrote:
> On Fri, 07 Sep 2007 20:36:16 -0700, nebiy...@gmail.com wrote, quoted
> or indirectly quoted someone who said :
[quoted text clipped - 32 lines]
> Roedy Green Canadian Mind Products
> The Java Glossaryhttp://mindprod.com

Thanks. The lines in each file are guaranteed to be unique. Further,
each line has also a unique key to identify it.
My requirement is only to compare if lines with the same key in file1
and file2.

For this I can build,
HashMap<Key,Key-hashcode> for each line and do the comparison.
My worry here is that there is a chance that I could get a false
positive for two strings of the same Key.

Does digest-64 guarantee that two different strings always hash to
different values? How would it help?
NebiyouGirma@gmail.com - 08 Sep 2007 19:42 GMT
On Sep 8, 2:34 pm, NebiyouGi...@gmail.com wrote:
> On Sep 8, 1:43 am, Roedy Green <see_webs...@mindprod.com.invalid>
> wrote:
[quoted text clipped - 48 lines]
> Does digest-64 guarantee that two different strings always hash to
> different values? How would it help?

Thanks. The lines in each file are guaranteed to be unique. Further,
each line has also a unique key to identify it. The lines are not
sorted by key, however.
My requirement is only to compare if lines with the same key in file1
and file2 are different.

For this I can build, HashMap<Key,Key-hashcode> for a file (key= the
key, value= hashcode).
My worry here is that there is a chance that I could get a false
positive for two strings of the same Key.

Does digest-64 guarantee that two different strings always hash to
different values? How would it help?

can you elaborate on ur algorithm.
Lew - 08 Sep 2007 19:48 GMT
> Thanks. The lines in each file are guaranteed to be unique. Further,
> each line has also a unique key to identify it.
[quoted text clipped - 5 lines]
> My worry here is that there is a chance that I could get a false
> positive for two strings of the same Key.

You just said that the key is unique; doesn't that mean that there cannot be
two Strings with the same "Key"?

Note that HashMap uses the hashCode() of each key to locate the linked list of
possible values for that key.  Note also that any Map promises exactly one
entry for each key that is equivalent according to equals().  In other words,
in any Map the key provides a unique lookup.

It is pathologically unusual to store the hash code of an item as the value of
that item in a Map.  What do you think this will allow you to accomplish?

> Does digest-64 guarantee that two different strings always hash to
> different values?

By definition, a hash guarantees that at least two items from the hash
function domain will map to the same hash code.

> How would it help?

Signature

Lew

Roedy Green - 09 Sep 2007 10:16 GMT
>Does digest-64 guarantee that two different strings always hash to
>different values? How would it help?

A 64 bit digest collision with a properly done hash function is 1 in
2^64 which is roughly one in 10,000,000,000,000,000,000 Your odds of
winning the lottery are considerably better than that.

Even with 32 bit hashes, (properly implemented)  the odds of a
collision are one in 1 billion for any given pair.  but the odds get
worse as you compare more pairs.
Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Lew - 09 Sep 2007 13:21 GMT
>> Does digest-64 guarantee that two different strings always hash to
>> different values? How would it help?
[quoted text clipped - 6 lines]
> collision are one in 1 billion for any given pair.  but the odds get
> worse as you compare more pairs.

I, too, am curious how the digest will help.

Signature

Lew

Rogan Dawes - 10 Sep 2007 18:00 GMT
>>> Does digest-64 guarantee that two different strings always hash to
>>> different values? How would it help?
[quoted text clipped - 8 lines]
>
> I, too, am curious how the digest will help.

Perhaps I misunderstand the OP, but IIUC, he wants to be able to
identify new lines and deleted lines between two large unsorted files.
In particular, each line is very long.

Each line begins with a unique key (but I think that the rest of the
line MAY change?). So, rather than storing the unique key, and the whole
line (which is large), store the unique key, and the hash of the line,
which is a fixed length (128 - 256 bits, depending on hash). Depending
on the collision space, you may want to choose a better hash.

e.g.
HashMap<UniqueKey, MD5(line-UniqueKey)>
HashMap<UniqueKey, SHA256(line-UniqueKey)>

So, if the hashes for the unique key change, then report the difference,
otherwise move to the next item.

Your diff report will be in the order chosen by the hash's iterator, of
course. If you have specific requirements, use a TreeMap with a suitable
Comparator.

Rogan
Lew - 11 Sep 2007 00:39 GMT
> So, if the hashes for the unique key change, then report the difference,
> otherwise move to the next item.

But if the hashes are the same, it's not a guarantee that the item has not
changed.  It may, unlikely but it may have changed to another value that
hashes to the same value.

The odds of this might be acceptable, but it does require very careful choice
of hashing algorithm.

Signature

Lew

Rogan Dawes - 12 Sep 2007 10:58 GMT
>> So, if the hashes for the unique key change, then report the
>> difference, otherwise move to the next item.
[quoted text clipped - 5 lines]
> The odds of this might be acceptable, but it does require very careful
> choice of hashing algorithm.

Of course. That is why I mentioned using something like SHA-256, which
would be VERY unlikely to collide. 1 in 2^256 = 1.15792089 × 10^77

Even SHA-1 is unlikely to collide: 1 in 2^160 = 1.46150164 × 10^48

Rogan
Ingo R. Homann - 10 Sep 2007 15:13 GMT
Hi,

> I have two CSV files (file1 and file2).
> Each file contains 50,000 + lines. Each line is very long.
[quoted text clipped - 20 lines]
>
> Any suggestions would be appreciated.

IMHO the only clean solution is to store the data in a database and then
 iterate over all keys.

Ciao,
Ingo
bugbear - 19 Sep 2007 09:08 GMT
> I have two CSV files (file1 and file2).
> Each file contains 50,000 + lines. Each line is very long.
[quoted text clipped - 9 lines]
> the lines that are different. The entries in each file are not
> necessarily sorted by key.

Start here:

http://citeseer.ist.psu.edu/myers86ond.html

  BugBear


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.