Java Forum / General / January 2008

Trying to somehow measure similarity of text files

lbrtchx@gmail.com - 07 Jan 2008 05:59 GMT
~
It is easy for files which are binary equal or can somehow be matched
line by line with some diff-like code, but other than that I couldn't
really find anything in Java or otherwise
~
There is the dump3.sourceforge.net project, but I find it a little
too much and way too demanding to be run on a server. It appears to
have been conceived with pictures as the data format in mind
~
Do you know of any C, C++ or Java project, or a good, well-written
white paper on this problem?
~
Thanks
lbrtchx
Roedy Green - 07 Jan 2008 12:25 GMT
> It is easy for files which are binary equal or can be somehow match
>line by line with some diff-like code, but other than I couldn't
>really find anything in java or otherwise

If you had a way of chunking the file, e.g. sentences in a text file
or lines in a CSV file, then you could compute a hashCode for each
"sentence".

You could then process your two files and create a list of hash codes
for each. Then sort each list and compare them, counting matches. That
gives you a rough idea of how many sentences they have in common and
how many are unique to each. Compute the ratio of common sentences to
total unique sentences.

Ignore collisions (two sentences, either the same or different, in the
same file producing the same hash code).
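
Something along these lines might do as a starting point. It is only a
rough sketch of the idea above, not tested on real data. The class name
SentenceSimilarity, the method names and the delimiter regex are made up
for illustration, and it assumes both files fit in memory as Strings and
that String.hashCode() is a good enough hash. Duplicate hashes within
one file are collapsed, which is one way of ignoring collisions.

```java
import java.util.Arrays;

class SentenceSimilarity {

    /** Split text into chunks; return their hash codes, sorted and de-duplicated. */
    static int[] sortedHashes(String text, String delimiterRegex) {
        return Arrays.stream(text.split(delimiterRegex))
                .map(String::trim)
                .filter(s -> !s.isEmpty())      // skip empty chunks
                .mapToInt(String::hashCode)
                .sorted()
                .distinct()                     // ignore repeats/collisions within one file
                .toArray();
    }

    /** Ratio of chunks common to both texts over total unique chunks (0.0 to 1.0). */
    static double similarity(String a, String b, String delimiterRegex) {
        int[] ha = sortedHashes(a, delimiterRegex);
        int[] hb = sortedHashes(b, delimiterRegex);
        int i = 0, j = 0, common = 0;
        // Walk both sorted lists in step, counting hashes present in both.
        while (i < ha.length && j < hb.length) {
            if (ha[i] == hb[j]) {
                common++;
                i++;
                j++;
            } else if (ha[i] < hb[j]) {
                i++;
            } else {
                j++;
            }
        }
        int totalUnique = ha.length + hb.length - common;
        // Assumption: two texts with no chunks at all are treated as identical.
        return totalUnique == 0 ? 1.0 : (double) common / totalUnique;
    }

    public static void main(String[] args) {
        String first = "The cat sat. The dog ran. Birds fly.";
        String second = "Birds fly. The dog ran. Fish swim.";
        // Chunk on sentence-ending punctuation; use "\\R" to chunk on line breaks instead.
        System.out.println(similarity(first, second, "[.!?]\\s*"));
    }
}
```

Because only the sets of hashes are compared, reordering the sentences
does not change the score. Two different sentences that happen to hash
to the same value would be counted as a match, so treat the result as
an estimate.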

It is rude to ask questions in one group with followup to another.

I was thinking of some logic like this for creating delta files, that
could efficiently transmit changes to text files that have mainly been
reordered.

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com


