Java Forum / General / January 2006
Fell Swoop I/O
Roedy Green - 14 Jan 2006 05:52 GMT Writing or reading a byte[] in one fell swoop to write or read a file should be extremely efficient. In theory, the bytes could go straight from your array to the hard disk controller.
I wonder if that is indeed true for unbuffered files, or whether they are copied some sub-chunk size at a time. Has anyone peeked under the hood or done some experiments to deduce what happens from timings?
Encoding, though, even when you have a 1-1 char-to-byte encoding, requires Java to allocate some sort of transparent intermediate byte buffer, even for unbuffered Writers. How does Java decide how big to make it? Does it make it big enough to contain the entire String?
Has anyone peeked under the hood or experimented with that?
A practical way of asking this question is:
Is it better to write an entire file unbuffered or with a buffer? If buffered, what is a reasonable buffer size? Making it too big causes more frequent GC. Making it too small causes more physical I/Os.
This is a place where I would like tweakers: you write your code and let the tweaker optimiser AT THE CLIENT SITE home in on the optimum settings for that platform.
see http://mindprod.com/jgloss/tweakable.html
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
NOBODY - 14 Jan 2006 07:07 GMT
> Writing or reading a byte[] in one fell swoop to write or read a file
> should be extremely efficient. In theory, the bytes could go straight
[quoted text clipped - 17 lines]
> it too big causes more frequent GC. Making it too small causes more
> physical i/os.
Let's think about what Sun has done since '95... FileOutputStream has two low-level write methods backed by native code, write(int) and write(byte[], off, len), that thousands of classes depend on. Even the NIO channels are slower, from what I have heard, since they were designed for selectors and locks, not so much for performance. So, yeah, it is safe to say it is fast enough.
The optimal byte[] buffer size comes from one thing: TESTING.
Keep it a multiple of your cluster size to be friendly, and trust the HDD controller and I/O scheduler to pull at least a cluster at a time, with all sorts of read and write prediction, exploiting the disk cache.
Understand that two long writes at the same time on a single HDD will make its head jump all over and drop to much less than half the performance. Your tests could be biased if you are swapping or there is other disk activity.
Use the largest chunk possible, to reduce the I/O scheduling pieces and reassembly, and hope the I/O scheduler will thank you for a big contiguous array of bytes.
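A minimal sketch of the kind of test meant here: it writes the same number of bytes to a temp file through a plain FileOutputStream using several chunk sizes and prints the elapsed time for each. The class name ChunkSizeTimer, the sizes tried and the 16 MB total are arbitrary choices for illustration, and the code assumes a current JDK (try-with-resources), which postdates this thread. The OS file cache will dominate the numbers, so treat them as relative at best.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

final class ChunkSizeTimer {

    /** Writes totalBytes zero bytes to f in chunks of chunkSize; returns elapsed nanoseconds. */
    static long timeWrite(File f, int chunkSize, long totalBytes) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        byte[] chunk = new byte[chunkSize];
        long start = System.nanoTime();
        try (FileOutputStream fos = new FileOutputStream(f)) {
            long remaining = totalBytes;
            while (remaining > 0) {
                int n = (int) Math.min(chunk.length, remaining);
                fos.write(chunk, 0, n); // one write request per chunk
                remaining -= n;
            }
        }
        // The measured time includes closing the stream.
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("_ChunkSizeTimer_", ".tmp");
        f.deleteOnExit();
        long total = 16L * 1024 * 1024; // arbitrary test size
        int[] sizes = {512, 4096, 65536, 1024 * 1024};
        for (int size : sizes) {
            long nanos = timeWrite(f, size, total);
            System.out.println(size + " byte chunks: " + (nanos / 1000000) + " ms");
        }
    }
}
```

Run it several times and on an otherwise idle disk; a single run says very little.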
Roedy Green - 14 Jan 2006 07:41 GMT
>The largest chunk possible, to reduce the i/o scheduling pieces and
>reassembly and hope the i/o scheduler will thank you for a big contiguous
>array of bytes.
There are some complications to the traditional wisdom.
1. Java's buffering can be inserted at various layers. Only the lowest layer offers any help for I/O.
2. Java does encoding transformations. This implies hidden buffers over which you have no control.
I need to do some experiments, but I think the fastest way to read a file of chars will be:
1. Find the length in bytes. This is not necessarily the length in chars.
2. Read the entire file in one read (buffered or unbuffered?) into a byte[].
3. Use new String, which has built-in encoding conversion.
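The three steps above can be sketched roughly as follows. The class and method names (WholeFileReader, readAll) are made up for illustration, the charset is passed in by the caller, and the code assumes a current JDK. It uses DataInputStream.readFully because a single read() call is allowed to return fewer bytes than requested.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

final class WholeFileReader {

    /** Reads the whole file into a byte[] and decodes it with the named charset. */
    static String readAll(File f, String charsetName) throws IOException {
        long len = f.length(); // step 1: length in bytes, not chars
        if (len > Integer.MAX_VALUE - 8) {
            throw new IOException("file too large for a single array: " + len);
        }
        byte[] bytes = new byte[(int) len];
        try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
            in.readFully(bytes); // step 2: loops until the array is full
        }
        return new String(bytes, charsetName); // step 3: decode in one pass
    }
}
```

Whether this actually beats a buffered Reader is exactly the thing that would need measuring; the sketch only shows the shape of the approach.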
Andrey Kuznetsov - 14 Jan 2006 13:18 GMT
>>The largest chunk possible, to reduce the i/o scheduling pieces and
>>reassembly and hope the i/o scheduler will thank you for a big contiguous
[quoted text clipped - 4 lines]
> 1. Java's buffering can be inserted at various layers. Only the lowest
> layer offers any help for I/O.
Roedy,
just put Unified I/O in the lowest layer and forget about performance.
I remember that you asked me about a tutorial.
I don't have one yet, but I can give you some advice:
The Unified I/O interface looks just like that of RandomAccessFile (with some extras).
The important thing is RandomAccessFactory.
It has the following methods:
    RandomAccess create();
    RandomAccessRO createRO();
    RandomAccessBuffer createBuffered();
    RandomAccessBufferRO createBufferedRO();
(RO means read-only)
That was the difficult part.
The easy part is that you can create an InputStream from a RandomAccessRO or an OutputStream from a RandomAccess and use it as usual without changing your code. See com.imagero.uio.io.RandomAccessInputStream and com.imagero.uio.io.RandomAccessOutputStream.
 Signature Andrey Kuznetsov http://uio.imagero.com Unified I/O for Java http://reader.imagero.com Java image reader http://jgui.imagero.com Java GUI components and utilities
NOBODY - 14 Jan 2006 16:50 GMT
> RandomAccess create();
> RandomAccessRO createRO();
> RandomAccessBuffer createBuffered();
> RandomAccessBufferRO createBufferedRO();
Simpler: knowing that a seek on a RAF will move the FD with it, you can reposition buffered streams on it. Here it is (I was too lazy to implement DataInput and DataOutput, but you get the point):
-----------
import java.io.*;

public class SuperRAF {
    public final RandomAccessFile raf;
    public final MyBIS bis;
    public final BufferedOutputStream bos;
    public final DataInputStream dis;
    public final DataOutputStream dos;

    public SuperRAF(RandomAccessFile raf, int bufsize) throws IOException {
        this.raf = raf;
        // Both streams share the RAF's file descriptor, so they follow its position.
        bis = new MyBIS(new FileInputStream(raf.getFD()), bufsize);
        bos = new BufferedOutputStream(new FileOutputStream(raf.getFD()), bufsize);
        dis = new DataInputStream(bis);
        dos = new DataOutputStream(bos);
    }

    public void flush() throws IOException {
        bos.flush();
    }

    public void seek(long pos) throws IOException {
        bos.flush();   // push pending writes out before moving
        bis.clear();   // drop read-ahead bytes that belong to the old position
        raf.seek(pos);
    }

    //=======
    static class MyBIS extends BufferedInputStream {
        MyBIS(InputStream is, int size) {
            super(is, size);
        }

        MyBIS(InputStream is) {
            super(is);
        }

        // Resets the protected bookkeeping fields so the buffer reads as empty.
        void clear() {
            super.count = 0;
            super.markpos = -1;
            super.pos = 0;
            super.marklimit = 0;
            //super.buf = don't waste that
        }
    }
}
Andrey Kuznetsov - 14 Jan 2006 17:32 GMT
> Simpler: knowing that a seek on a RAF will move the FD with it,
> you can reposition buffered streams on it.
Oh yes, and with raf.seek(0) you can just rewind your IS.
NOBODY - 14 Jan 2006 16:13 GMT
>>The largest chunk possible, to reduce the i/o scheduling pieces and
>>reassembly and hope the i/o scheduler will thank you for a big
[quoted text clipped - 4 lines]
> 1. Java's buffering can be inserted at various layers. Only the lowest
> layer offers any help for I/O.
To me, a simple FileOutputStream is the closest to the I/O chunk. Just do your buffering yourself if a layer of uncontrolled buffering scares you. But you did say you had files, not streams, so you control how they are read.
My i/o test:
-----
import java.io.File;
import java.io.FileOutputStream;

public class IOSizer {
    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("_IOSizer_", ".tmp", new File("."));
        f.deleteOnExit();
        FileOutputStream fos = new FileOutputStream(f);
        try {
            fos.write(1);
            fos.write(new byte[Integer.parseInt(args[0])]);
        } finally {
            fos.close();
        }
    }
}
---- and trace system write calls ----
/usr/bin/strace -x -e write java IOSizer 33333
[...]
write(5, "\x01", 1) = 1
write(5, "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"..., 33333) = 33333
[...]
> 2. Java does encoding transformations. This implies hidden buffers of
> which you have no control.
If your first stream is a BufferedInputStream (over a FileInputStream) with a buffer size of your choice, the only other buffers are supposed to be a few bytes long, most probably reused, sized for the longest sequence in the charset. As far as I know, UTF-8 is probably among the longest, at up to 6 bytes in its original definition (31 bits of payload; 4 bytes are enough for 21-bit Unicode).
> I need to do some experiments, but I think the fastest way to read a
> file of chars will be:
[quoted text clipped - 6 lines]
>
> 3. use a new String which has a built in encoding conversion.
How were you intending to read a single string otherwise? :-/ But if you can process your HTML in chunks (tabs, spaces, and all you mentioned), you can probably just use a BufferedReader over an InputStreamReader over the BufferedInputStream. Read a pack of lines (say 200, or until you reach a string length threshold), and process it in smaller pieces, keeping a stateful engine that tracks where you are (open tags and such annoying things).
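A rough sketch of that layering, assuming a current JDK. BatchLineReader and its method names are invented for illustration, the 64 KB buffer size and the UTF-8 charset are arbitrary assumptions, and the "processing" is reduced to counting lines so the example stays short.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

final class BatchLineReader {

    /** Opens BufferedReader over InputStreamReader over BufferedInputStream over the file. */
    static BufferedReader open(File f, int bufSize, String charsetName) throws IOException {
        return new BufferedReader(
                new InputStreamReader(
                        new BufferedInputStream(new FileInputStream(f), bufSize),
                        charsetName));
    }

    /** Reads up to maxLines lines; returns an empty list at end of stream. */
    static List<String> readBatch(BufferedReader in, int maxLines) throws IOException {
        List<String> batch = new ArrayList<String>();
        String line;
        // The size check comes first so no line is consumed and then dropped.
        while (batch.size() < maxLines && (line = in.readLine()) != null) {
            batch.add(line);
        }
        return batch;
    }

    /** Counts lines by pulling packs of 200, the pack size suggested in the post. */
    static int countLines(File f) throws IOException {
        int total = 0;
        try (BufferedReader in = open(f, 64 * 1024, "UTF-8")) {
            List<String> batch;
            while (!(batch = readBatch(in, 200)).isEmpty()) {
                total += batch.size(); // a real parser would process the pack here
            }
        }
        return total;
    }
}
```

A stateful parser would replace the counting step and carry its state (open tags and so on) from one pack to the next.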
Thomas Hawtin - 14 Jan 2006 18:08 GMT
> Writing or reading a byte[] in one fell swoop to write or read a file
> should be extremely efficient. In theory, the bytes could go straight
> from your array to the hard disk controller.
Almost certainly the biggest overhead here is going to be the disc drive: depending on circumstances, either the seek time or, for long files, the transfer time. If buffering causes a spike in memory usage, there could possibly be other problems.
There will be at least one additional copy for your operating system's file cache. Also, you aren't going to want your byte[] pinned while the file system blocks; direct-allocated ByteBuffers may be a win (for the careful, or carefree).
> It is better write an entire file unbuffered or write an entire file
> with a buffer? If buffered, what is a reasonable buffer size? Making
> it too big causes more frequent GC. Making it too small causes more
> physical i/os.
I suspect there is a huge middle ground, where the exact size doesn't matter.
Memory mapping is another way to go.
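For reference, a minimal sketch of the memory-mapped route using java.nio, assuming a current JDK. The class name MappedSum is made up, and summing the bytes is only a stand-in for real processing. Note that a single mapping cannot exceed Integer.MAX_VALUE bytes, so very large files would need several mappings.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

final class MappedSum {

    /** Maps the file read-only and sums its bytes as unsigned values. */
    static long sumBytes(File f) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r");
             FileChannel ch = raf.getChannel()) {
            // The OS pages the file in on demand; there is no explicit read loop.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get() & 0xFF;
            }
            return sum;
        }
    }
}
```

The mapping stays valid after the channel is closed and is only released when the buffer is garbage collected, which is one reason this approach suits the careful.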
Tom Hawtin
 Signature Unemployed English Java programmer http://jroller.com/page/tackline/
Dimitri Maziuk - 14 Jan 2006 18:42 GMT
Roedy Green sez:
> Writing or reading a byte[] in one fell swoop to write or read a file
> should be extremely efficient. In theory, the bytes could go straight
> from your array to the hard disk controller.
There are a couple of buffering stages involved even before the data gets to the JVM:
1. HD read and writes are done in chunks (> 1 byte, configurable on some systems).
2. Assuming a single disk, the slowest part of file copy process is positioning disk head to write to destination file and then re-positioning it back to read from the source. So OS and/or HD controller buffer I/O requests and schedule them for optimal head movement.
3. File data is buffered by OS (size depends on OS, available RAM, number of open files, etc.)
(Now add concurrent I/O requests coming from multiple processes on a time-sharing system to the mix.)
4. Then the data gets to the JVM, which may (or may not) do still more buffering.
5. Finally, you code yet another buffer -- your byte[] -- on top of all that.
In theory, if you could read the entire file into byte[] and then write the entire thing out, it should be the fastest: let JVM, OS, and hardware optimize the actual disk I/O. In practice you seldom have enough RAM for that.
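When the file does not fit in RAM, the usual fallback is to copy through one fixed-size buffer. A minimal sketch, assuming a current JDK; ChunkedCopy and its bufSize parameter are illustrative names, and the right buffer size is left to the caller.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

final class ChunkedCopy {

    /** Copies src to dst through one reusable buffer; returns the number of bytes copied. */
    static long copy(File src, File dst, int bufSize) throws IOException {
        if (bufSize <= 0) {
            throw new IllegalArgumentException("bufSize must be positive");
        }
        byte[] buf = new byte[bufSize];
        long total = 0;
        try (FileInputStream in = new FileInputStream(src);
             FileOutputStream out = new FileOutputStream(dst)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n); // write only the bytes actually read
                total += n;
            }
        }
        return total;
    }
}
```

Memory use stays at bufSize no matter how large the file is, at the cost of more read and write calls.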
In practice, with all that stuff going on behind the scenes (that you have no control over), I wouldn't worry about it at all: code what makes sense for your application. I tend to use buffered readers when I need line-based reads -- not because it's supposed to be faster but because I need readLine().
Dima
 Signature Q276304 - Error Message: Your Password Must Be at Least 18770 Characters and Cannot Repeat Any of Your Previous 30689 Passwords -- RISKS 21.37
Andrey Kuznetsov - 14 Jan 2006 18:58 GMT
> In practice, with all that stuff going on behind the scenes
> (that you have no control over), I wouldn't worry about it
> at all: code what makes sense for your application. I tend
> to use buffered readers when I need line-based reads -- not
> because it's supposed to be faster but because I need readLine().
For small files you can safely ignore buffering. For huge files buffering can significantly speed up I/O.
Raymond DeCampo - 14 Jan 2006 18:51 GMT
> It is better write an entire file unbuffered or write an entire file
> with a buffer? If buffered, what is a reasonable buffer size? Making
> it too big causes more frequent GC. Making it too small causes more
> physical i/os.
Roedy,
What is your reasoning behind saying that a large buffer causes more frequent garbage collection?
Thanks, Ray
 Signature This signature intentionally left blank.
Roedy Green - 14 Jan 2006 19:16 GMT
On Sat, 14 Jan 2006 18:51:55 GMT, Raymond DeCampo <nospam@twcny.rr.com> wrote, quoted or indirectly quoted someone who said :
>What is your reasoning behind saying that a large buffer causes more
>frequent garbage collection?
Imagine a case where you had 1000 files, each 100 bytes long, and you allocated 64K buffers. You would fill up RAM faster than if you had used no buffering or 100-byte buffers.
Raymond DeCampo - 14 Jan 2006 23:51 GMT
> On Sat, 14 Jan 2006 18:51:55 GMT, Raymond DeCampo
> <nospam@twcny.rr.com> wrote, quoted or indirectly quoted someone who
[quoted text clipped - 6 lines]
> allocated 64K buffers. You will fill up ram faster than had you use
> no buffering or 100 byte buffers.
I see; I thought you meant the case where there was one buffer, and I could not imagine how that applied.
Ray