Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / June 2006

Tip: Looking for answers? Try searching our database.

Merge strategy

Thread view: 
Chris - 29 Jun 2006 20:26 GMT
This isn't Java-specific, but I this would be a good group to ask...

I have directory containing many files of varying sizes. An external
process adds small files periodically. Small files need to merged into
large files so as to reduce the total number of files in the directory.
We want to reduce the total number of files as quickly as possible.

What's the best merge strategy, assuming that the time require to merge
two or more files is roughly proportional to their size?

Clearly it makes sense to merge small files together first before going
after large files. One strategy would be to just merge the smallest X
files on each pass. This is not optimal, though, if the smallest X files
contains one or more very large files.

Assume that 1) there is a maximum number of files we want to involve in
any given merge operation, and 2) there is a maximum file size.

Thoughts?
Eric Sosman - 29 Jun 2006 21:06 GMT
Chris wrote On 06/29/06 15:26,:
> This isn't Java-specific, but I this would be a good group to ask...
>
[quoted text clipped - 5 lines]
> What's the best merge strategy, assuming that the time require to merge
> two or more files is roughly proportional to their size?

   Under that assumption, an N-way merge of all N files
in one pass is the best strategy.  Yes, an N-way merge takes
more computation time than a merge of lower order, but the
CPU is manymanymany orders of magnitude faster than the I/O.
(In the time for one ten-millisecond seak-and-read, the clock
on your 3GHz CPU ticks thirty million times.)

   If N is extremely large, the proportionality assumption
may not hold.  You may get non-linear slowdowns from effects
like the file system reducing the amount of read-ahead to combat
overcrowding in its caches and things.  It's a good deal slower
to perform a hundred 10kB reads than one 1MB read, even though
the same volume of data is transferred.

> Clearly it makes sense to merge small files together first before going
> after large files.

   Not clear at all.  Merging in multiple stages requires
making multiple passes over at least some of the data (reading
it, writing it, and reading it back in agagin), and this is
more work than if you only read and wrote each item once.  If
proportionality holds, doing everything in one stage wins.  You
will need to make some measurements on your system to see how
well or poorly the proportionality assumption matches reality.

> One strategy would be to just merge the smallest X
> files on each pass. This is not optimal, though, if the smallest X files
> contains one or more very large files.
>
> Assume that 1) there is a maximum number of files we want to involve in
> any given merge operation, and 2) there is a maximum file size.

   Suppose you can merge K files simultaneously without breaking
the proportionality assumption.  Then a reasonable strategy for
merging your N files is

    while (N > 1) {
       M = min(N, K, N+1-K)
       merge the M smallest files into one
       delete (or forget) the merge inputs
       N = N - M + 1
    }

   This is probably not optimal because it has only limited
"foresight."  For some ideas about better patterns, take a
look at Knuth TAOCP section 5.4.9 -- he's discussing merge
patterns for external sorting and the circumstances don't seem
quite like yours, but I'm only suggesting that you seek "ideas,"
not "recipes."

Signature

Eric.Sosman@sun.com

Eric Sosman - 29 Jun 2006 21:19 GMT
Eric Sosman wrote On 06/29/06 16:06,:

> [...]  Then a reasonable strategy for
> merging your N files is
>
>     while (N > 1) {
>        M = min(N, K, N+1-K)

   Oh, drat: that's not right.  Try

    M = (N <= K) ? N : min(K, N+1-K)

Signature

Eric.Sosman@sun.com



Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.