Java Forum / General / July 2008

Efficiently concatenating contents of multiple files

sasuke - 02 Jul 2008 17:51 GMT
Hello to all Java programmers out there. :-)

I was just wondering what would be the most time / space efficient way
of concatenating contents of different files to a single file. Sample
usage would be:
java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...

Using threads to open a stream to the source files is out of the question
since the data needs to be written in the order in which it
exists in the source files, i.e. no ad hoc writing. Reading the entire
contents of the file into memory (by using a StringBuffer /
StringBuilder) also isn't a good choice considering that we can come
across really large text files (~10 MB, typical for db dumps). Reading
the source file line by line doesn't seem attractive given that it
would increase I/O and, for really large files, might turn out to
be an I/O bottleneck. One solution which comes to mind is to read the
file in chunks, i.e. read the data into a char array of 8 KB or a string
array of size 100.

My question here is: is there any ideal solution which comes to
mind when solving this problem or does the solution really depend on
the domain in consideration and the kind of sacrifices we are ready to
make (e.g. lose the ordering of data, memory trade off when reading
entire file in a buffer, I/O hit)?

Pardon me for asking such a trivial / silly question; it was just a
thought. :-)

Regards,
/~sasuke
RedGrittyBrick - 02 Jul 2008 18:18 GMT
> Hello to all Java programmers out there. :-)
>
> I was just wondering what would be the most time / space efficient way
> of concatenating contents of different files to a single file. Sample
> usage would be:
> java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...

The most efficient usage of your time is not to reinvent wheels.

> Using threads to open a stream to the source files is out of the question
> since the data needs to be written in the order in which it
> exists in the source files, i.e. no ad hoc writing.

Having multiple threads doing I/O to the same disk is likely to slow
things down.

> Reading the entire
> contents of the file into memory (by using a StringBuffer /
> StringBuilder) also isn't a good choice considering that we can come
> across really large text files (~10 MB, typical for db dumps).

I see no benefit in reading a whole file into memory.

> Reading
> the source file line by line doesn't seem attractive given that it
> would increase I/O and, for really large files, might turn out to
> be an I/O bottleneck.

You don't need the JVM to be doing conversion to UTF-16, or pointless
line-oriented processing (e.g. scanning for line endings).

> One solution which comes to mind is to read the
> file in chunks; i.e. read the data in char array of 8KB or a string
> array of size 100.
>
> My question here is: is there any ideal solution which comes to
> mind when solving this problem

:-)

cat sourceFileOne.txt sourceFileTwo.txt ... > targetFile.txt

or

copy sourceFileOne.txt+sourceFileTwo.txt ... targetFile.txt

depending on operating system

> or does the solution really depend on
> the domain in consideration and the kind of sacrifices we are ready to
> make (e.g. lose the ordering of data, memory trade off when reading
> entire file in a buffer, I/O hit)?

I wouldn't reinvent this wheel but if you are doing it I suggest you
treat the files as binary not as text (especially not using anything
that translates encodings). Reading in large fixed-size chunks would
seem to be sensible. Given that the task is I/O bound I wouldn't try too
hard to optimise anything else.
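
For anyone who does want to write it, here is a minimal sketch of the "binary, large fixed-size chunks" approach. The class name ChunkedConcat and the 64 KB buffer size are assumptions made for illustration, not anything measured, and the try-with-resources syntax needs Java 7 or later.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

final class ChunkedConcat {
    // Assumed chunk size; the advice above only says "large fixed-size chunks".
    private static final int CHUNK_SIZE = 64 * 1024;

    /** Writes the raw bytes of each source, in argument order, to a newly created target. */
    static void concat(String target, String... sources) throws IOException {
        byte[] buffer = new byte[CHUNK_SIZE];
        try (OutputStream out = new FileOutputStream(target)) {
            for (String source : sources) {
                try (InputStream in = new FileInputStream(source)) {
                    int n;
                    // No Reader/Writer involved, so no charset decoding or line scanning.
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                }
            }
        }
    }

    // Usage: java ChunkedConcat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...
    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java ChunkedConcat target source...");
            return;
        }
        concat(args[0], Arrays.copyOfRange(args, 1, args.length));
    }
}
```

Memory use is bounded by the one buffer regardless of file size, and ordering is preserved because the sources are processed one after another.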

Signature

RGB

Zig - 02 Jul 2008 21:32 GMT
> Hello to all Java programmers out there. :-)
>
> I was just wondering what would be the most time / space efficient way
> of concatenating contents of different files to a single file. Sample
> usage would be:
> java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...

What encoding are your text files in? If the source and target files are
in the same encoding, and do not have a BOM character at the beginning of
the file, then a binary transfer is the way to go. Take a look at
java.nio.channels.FileChannel.transferTo / transferFrom:
http://java.sun.com/javase/6/docs/api/java/nio/channels/FileChannel.html#transferTo(long, long, java.nio.channels.WritableByteChannel)

Those methods should give you very fast file content transfer for
binary data.
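
A rough sketch of what that could look like follows. The class name ChannelConcat is made up for the example, and the code uses Java 7 try-with-resources. FileChannel itself is abstract, but concrete instances come from getChannel() on the file streams, so nothing has to be subclassed.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

final class ChannelConcat {
    /** Appends each source to a newly created target using FileChannel.transferTo. */
    static void concat(String target, String... sources) throws IOException {
        try (FileOutputStream fos = new FileOutputStream(target);
             FileChannel out = fos.getChannel()) {
            for (String source : sources) {
                try (FileInputStream fis = new FileInputStream(source);
                     FileChannel in = fis.getChannel()) {
                    long position = 0;
                    long size = in.size();
                    // transferTo may move fewer bytes than requested, so loop until done.
                    while (position < size) {
                        long moved = in.transferTo(position, size - position, out);
                        if (moved <= 0) {
                            break; // nothing more could be transferred; stop rather than spin
                        }
                        position += moved;
                    }
                }
            }
        }
    }
}
```

The target channel writes at its own current position, which advances with each transfer, so the sources end up back to back in argument order.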

> One solution which comes to mind is to read the
> file in chunks; i.e. read the data in char array of 8KB or a string
> array of size 100.

If you need to deal with different encodings (from your example usage, you  
might check to see if your source files were using different BOMs), then  
reading a block of characters (decoding from source), and writing them  
back to the target (encoding them with the target file's encoding) may be  
more appropriate. If they all have the same encoding, but use BOMs, then  
you can use a binary transfer, skipping the BOM character from all but the  
first source file.
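
As an illustration of the BOM-skipping variant, here is a sketch that assumes every file is UTF-8, whose BOM is the three bytes EF BB BF; UTF-16 and UTF-32 BOMs are different and are not handled. The class and method names are invented for the example.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PushbackInputStream;
import java.util.Arrays;

final class BomAwareConcat {
    // Assumption: all inputs are UTF-8, so only the UTF-8 BOM is recognised.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns a stream positioned just past a leading UTF-8 BOM, if there is one. */
    static InputStream skipUtf8Bom(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int filled = 0;
        while (filled < head.length) {
            int n = in.read(head, filled, head.length - filled);
            if (n == -1) {
                break; // file is shorter than a BOM
            }
            filled += n;
        }
        boolean hasBom = filled == UTF8_BOM.length && Arrays.equals(head, UTF8_BOM);
        if (!hasBom && filled > 0) {
            in.unread(head, 0, filled); // not a BOM: put the bytes back
        }
        return in;
    }

    /** Binary concatenation that keeps the first file's BOM and drops the others. */
    static void concat(String target, String... sources) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        try (OutputStream out = new FileOutputStream(target)) {
            for (int i = 0; i < sources.length; i++) {
                try (InputStream raw = new FileInputStream(sources[i])) {
                    InputStream in = (i == 0) ? raw : skipUtf8Bom(raw);
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                }
            }
        }
    }
}
```

Everything after the first three bytes is still copied as raw bytes, so the cost over a plain binary copy is negligible.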

Reading & decoding blocks of data will also give you the flexibility to
support more options, such as reading separately gzip'ed log files, and
writing them out as a single gzip'ed text file.

HTH,

-Zig
sasuke - 05 Jul 2008 19:09 GMT
Thanks to all for their replies. True, when programming we must seek
real life solutions to real world problems and the only efficient way
here seems to be making use of platform specific trickery.

I also completely agree with the general consensus that reading /
writing raw bytes is much faster than reading in bytes,
converting them into a string using a given or default encoding, and writing
the string to the target file, where it is again encoded into a byte
array based on the encoding.

A few queries though:

> What encoding are your text files in? If the source and target files are
> in the same encoding, and do not have a BOM character at the beginning of
> the file, then a binary transfer is the way to go. Take a look at
> java.nio.channels.FileChannel.transferTo / transferFrom
> http://java.sun.com/javase/6/docs/api/java/nio/channels/FileChannel.h...,
> long, java.nio.channels.WritableByteChannel)

Isn't this method an abstract method? So it implies that I need to
subclass this class and create my own specialized class which deals
with the content transfer? I wonder how that is any different from
doing it the raw way...

> If you need to deal with different encodings (from your example usage, you
> might check to see if your source files were using different BOMs), then
[quoted text clipped - 3 lines]
> you can use a binary transfer, skipping the BOM character from all but the
> first source file.

BOM? Googling says that this is some sort of Byte order mark but I
don't think I have ever worked with BOM files before. If this is some
special byte which occurs at the start of every file (like some sort
of header) I wonder how you can call them plain text files?

Your inputs are much appreciated.

Thanks and regards,
/sasuke
Eric Sosman - 02 Jul 2008 22:57 GMT
> Hello to all Java programmers out there. :-)
>
[quoted text clipped - 3 lines]
> java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...
> [...]

    The fastest and most efficient way of all is -- Don't Do That.
Do you really *need* a second copy of the contents of all those
files?  Or could you use a java.io.SequenceInputStream to read
the originals /in situ/?
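
A small sketch of the SequenceInputStream idea, for illustration only: the class name InSituReader is invented, and the files are opened lazily, one at a time, through an Enumeration. UncheckedIOException needs Java 8 or later.

```java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.util.Enumeration;
import java.util.NoSuchElementException;

final class InSituReader {
    /** Presents the given files as one continuous stream without copying them anywhere. */
    static InputStream open(final String... sources) {
        Enumeration<InputStream> lazy = new Enumeration<InputStream>() {
            private int next = 0;

            @Override
            public boolean hasMoreElements() {
                return next < sources.length;
            }

            @Override
            public InputStream nextElement() {
                if (next >= sources.length) {
                    throw new NoSuchElementException();
                }
                try {
                    // Each file is opened only when the previous one is exhausted.
                    return new FileInputStream(sources[next++]);
                } catch (FileNotFoundException e) {
                    throw new UncheckedIOException(e);
                }
            }
        };
        return new SequenceInputStream(lazy);
    }
}
```

Whatever was going to consume the concatenated file can read from this stream instead, and no second copy of the data ever hits the disk.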

    If you actually do need to concatenate, it's highly unlikely
that anything you can do in Java will be as fast as the platform's
own file-concatenation utility.  Those beasts tend to be heavily
optimized, using platform-specific trickery and undocumented API's
to move the data from hither to yon at great speed.  If it's speed
you care about, spend your time figuring out how to launch the
native utility instead of spending it trying to optimize an
alternative that's hobbled by portability concerns.
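
Launching the native utility from Java might look roughly like the sketch below. It assumes a Unix-like system with cat on the PATH, and it relies on ProcessBuilder.redirectOutput, which only exists from Java 7 on; the class name NativeCat is made up for the example.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class NativeCat {
    /** Runs "cat source... > target" and returns the exit code of cat. */
    static int concat(File target, List<File> sources)
            throws IOException, InterruptedException {
        List<String> command = new ArrayList<>();
        command.add("cat"); // assumption: Unix-like platform with cat on the PATH
        for (File source : sources) {
            command.add(source.getPath());
        }
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectOutput(target); // plays the role of the shell's "> target"
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        return builder.start().waitFor();
    }
}
```

A non-zero return value means cat reported a problem (a missing source file, for instance), so the caller should check it before trusting the target.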

Signature

Eric.Sosman@sun.com

Abhijat Vatsyayan - 03 Jul 2008 00:30 GMT
> Hello to all Java programmers out there. :-)
>
[quoted text clipped - 26 lines]
> Regards,
> /~sasuke
Why not use the concat task that comes with Ant? Or, if you can use a shell on
a *nix box, use "cat". Or install the cat binary from Cygwin on the Windows
box (the list goes on). There are many solutions out there, the least
recommended being writing something like this from scratch (unless you
are doing this just for learning or for fun).
Abhijat
Roedy Green - 05 Jul 2008 23:05 GMT
On Wed, 2 Jul 2008 09:51:55 -0700 (PDT), sasuke
<database666@gmail.com> wrote, quoted or indirectly quoted someone who
said :

>I was just wondering what would be the most time / space efficient way
>of concatenating contents of different files to a single file. Sample
>usage would be:
>java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...

1. If you want a platform-specific solution, you could spawn a command
processor shell.

2. The simplest code would just be to read each file with a
BufferedReader using a whacking huge buffer size and write in turn to a
buffered output. See http://mindprod.com/applet/fileio.html for
sample code. That has needless overhead for converting from bytes to
char and back, though in theory you could concatenate files of
different encodings if you knew what they were.

3.  if you read the files as raw bytes rather than chars, you know
their precise lengths, and the offset where they will fit in the final
file.  You could use random access to implement your thread idea.
However, I doubt the game will be worth the candle unless the files to
be gathered  live on different _physical_ drives. All you will succeed
in doing is jerking the heads all over.

4. If you want a canned solution, use the FileTransfer class
downloadable from
http://mindprod.com/products.html#FILETRANSFER

It does it rapidly in large raw-byte chunks.

// test FileTransfer.append
import com.mindprod.filetransfer.FileTransfer;
import java.io.File;
public class Concat
  {
  /**
   * Test harness to concatenate c onto the end of b, leaving the
   * result in a.
   *
   * @param args not used
   */
  public static void main ( String[] args )
     {
     File a = new File ("C:/temp/temp.txt");  // does not exist yet
     File b = new File ("E:/mindprod/feedback/peaceincorrect.html");
     File c = new File ("E:/mindprod/jgloss/j.html");

     FileTransfer ft = new FileTransfer ( 50000 /* buffsize */ );
     // source, target
     ft.append( b, a );
     ft.append( c, a );
     }
  }
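
Point 3 above (known byte lengths, hence known offsets in the final file) can be sketched as follows. This is only an illustration of the idea, not a recommendation: the class name OffsetConcat and the cap of four worker threads are assumptions, and as noted above it is unlikely to pay off unless the sources sit on different physical drives.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class OffsetConcat {
    /** Copies every source into its precomputed slot of the target, possibly in parallel. */
    static void concat(final Path target, List<Path> sources) throws Exception {
        // Each file's offset is the sum of the lengths of the files before it.
        long[] offsets = new long[sources.size()];
        long total = 0;
        for (int i = 0; i < sources.size(); i++) {
            offsets[i] = total;
            total += Files.size(sources.get(i));
        }
        Files.write(target, new byte[0]); // create or truncate the target

        int threads = Math.max(1, Math.min(sources.size(), 4)); // assumed cap
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < sources.size(); i++) {
                final Path source = sources.get(i);
                final long start = offsets[i];
                pending.add(pool.submit(() -> {
                    copyAt(source, target, start);
                    return null;
                }));
            }
            for (Future<?> task : pending) {
                task.get(); // rethrows any copy failure
            }
        } finally {
            pool.shutdown();
        }
    }

    /** Writes the whole of source into target, starting at byte offset start. */
    static void copyAt(Path source, Path target, long start) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target, StandardOpenOption.WRITE)) {
            out.position(start);
            long position = 0;
            long size = in.size();
            while (position < size) {
                long moved = in.transferTo(position, size - position, out);
                if (moved <= 0) {
                    break;
                }
                position += moved;
            }
        }
    }
}
```

Ordering is preserved by the offsets rather than by the order in which the threads happen to finish.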

Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Tom Anderson - 06 Jul 2008 03:27 GMT
> On Wed, 2 Jul 2008 09:51:55 -0700 (PDT), sasuke
> <database666@gmail.com> wrote, quoted or indirectly quoted someone who
[quoted text clipped - 14 lines]
> back, though in theory you could concatenate files of different
> encodings if you knew what they were.

I think what i'd do is memory-map the input file using NIO, and then write
the entire thing to the output in one go. And then cross my fingers and
hope that the OS was smart enough to do the right thing here, rather than
attempting to load the whole input file into memory first. If it does do
the right thing, this avoids a lot of copying of bytes to and from java,
and might even avoid any copying across the kernel/userspace border.
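
For the curious, a sketch of the memory-mapping idea under a few assumptions: the class name MappedConcat is invented, FileChannel.open needs Java 7, and each mapping is capped at Integer.MAX_VALUE bytes because that is the most a single map() call accepts. Whether the OS really avoids the extra copies is exactly the fingers-crossed part.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedConcat {
    /** Maps each source read-only and writes the mapped bytes straight to the target. */
    static void concat(Path target, Path... sources) throws IOException {
        try (FileChannel out = FileChannel.open(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (Path source : sources) {
                try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
                    long size = in.size();
                    long offset = 0;
                    while (offset < size) {
                        // A single mapping cannot exceed Integer.MAX_VALUE bytes.
                        long length = Math.min(size - offset, Integer.MAX_VALUE);
                        MappedByteBuffer mapped =
                                in.map(FileChannel.MapMode.READ_ONLY, offset, length);
                        while (mapped.hasRemaining()) {
                            out.write(mapped); // may write only part of the buffer per call
                        }
                        offset += length;
                    }
                }
            }
        }
    }
}
```

The mapping stays valid until the buffer is garbage-collected, which on some platforms keeps the source file locked for a while after the method returns.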

But yeah, running 'cat' is the right solution here.

tom

Signature

Linux is like a FreeBSD fork maintained by 10 year old retards. --
Encyclopedia Dramatica


