Java Forum / General / February 2006
Fastest way to split a file by columns?
Kevin - 03 Feb 2006 23:23 GMT Just want to know what is the best way for this course coding task. :-)
Task: to split a big file into many files by its columns. Each resulting file consists one column of the original big file.
The original file can be as large as Gbytes (so that we can not hold it in main memory). Each line has exact 200 columns, each column is "," separated. The file is plain text file.
For example, if the input file is:
abc, ee, ef, ww, ee, wwe. aas, we, 64, www, w3, 46. qw, 35, qg, d4, q3, a34. ..... .....
We need to break it into 6 files, first file is: abc aas qw ...
Second file is: ee we 35 ...
Third file is: df 64 qg ... etc.
My current method is: 1) create 200 file writers (each with 0.5M file buffer), each corresponding to one column. 2) read in the original file (with 20M file buffer) line by line, 3) break each line with "," into tokens. 4) write out each token to its corresponding file writer. This method, on a 1.2G input file, with 200 columns, will use about 14 minutes. The physical memory on the machine is 512M, so the above file buffers should fit into the main memory.
Is that the fastest method we can get?
How about the task with reverse goal: if we have above resulting files, and we need to merge them into one big file so that each column is from each file. My code using a similar way as above needs 18 minutes.
I can't think out a better way. Any suggestions?
Andrey Kuznetsov - 04 Feb 2006 00:29 GMT > My current method is: > 1) create 200 file writers (each with 0.5M file buffer), each [quoted text clipped - 13 lines] > > I can't think out a better way. Any suggestions? at-first, you don't need such huge buffers - they can make you slow. at-second, you can try to use Unified I/O - it makes all buffering for you. In some cases Unified I/O may give you speedup up to 120 times. If you use readLine, you become String, this can make it slow. With Unified I/O you can read line to byte array. Its free and open source.
 Signature Andrey Kuznetsov http://uio.imagero.com Unified I/O for Java http://reader.imagero.com Java image reader http://jgui.imagero.com Java GUI components and utilities
Rhino - 04 Feb 2006 05:16 GMT > Just want to know what is the best way for this course coding task. :-) > [quoted text clipped - 49 lines] > > I can't think out a better way. Any suggestions? In 25 years of working with computers professionally, a good bit of it as an instructor of computer courses, I've never seen a requirement to do anything like what you are talking about, namely slicing a file vertically. I suppose the main intent is just to make you learn to handle memory and file I/O but you'd think the instructor of your course could come up with a more realistic problem that would help you develop those same skills....
--- Rhino
Jeffrey Schwab - 04 Feb 2006 05:48 GMT >>Just want to know what is the best way for this course coding task. :-) >> >>Task: to split a big file into many files by its columns. Each >>resulting file consists one column of the original big file. ...
> In 25 years of working with computers professionally, a good bit of it as an > instructor of computer courses, I've never seen a requirement to do anything > like what you are talking about, namely slicing a file vertically. I suppose > the main intent is just to make you learn to handle memory and file I/O but > you'd think the instructor of your course could come up with a more > realistic problem that would help you develop those same skills.... ?!
So you've never used /bin/cut? You've never had to parse one column out of a table stored in a text file? You've never had to use blockwise selection in nedit, emacs, or vi (control-v)?
For almost anyone who does any serious data munging, this comes up all the time. Files are frequently stored with one record per line, and each line divided into columns either according to fixed widths, or by separators: commas in CSV, colons in /etc/passwd, etc. This is actually one of the most useful and relevant examples I've ever heard of being assigned to students. It's a heck of a lot more useful than implementing a Red/Black Tree, or the Game of Life, or most of the other busy-work I was given as an undergraduate.
Rhino - 04 Feb 2006 15:08 GMT >>>Just want to know what is the best way for this course coding task. :-) >>> [quoted text clipped - 24 lines] > Tree, or the Game of Life, or most of the other busy-work I was given as > an undergraduate. Okay, maybe I've taken the assignment too narrowly.
Yes, of course, I've often had to parse lines in files and use the relevant bits for something; I've also had to load CSV files into databases and deal with fixed-width columns on many occasions. And I agree that is a useful and realistic thing to learn.
I just meant that I've never had to slice a file vertically so that each of the many columns was split into separate files and then joined back together at some point. I could see that it might be useful to slice out the primary key column(s) into one file and then put the rest of the columns (combined) into a second file and then have the key point to the data, as would be the case in a hash table. But slicing each column into a separate file? I don't see much point in that.
--- Rhino
Jeffrey Schwab - 04 Feb 2006 18:50 GMT >>>>Just want to know what is the best way for this course coding task. :-) >>>> [quoted text clipped - 40 lines] > case in a hash table. But slicing each column into a separate file? I don't > see much point in that. OK, I hear what you're saying. I actually have had to deal with this case quite a bit, though.
I used to work with a lot of file formats representing digital electronic signals. In some formats, each column represented a vector of binary values; for example, a simulation of an AND gate might produce a file like this, where C is the clock, A and B are inputs, and O is the output:
CABO 0000 1010 0100 1111
This is how the Avanti/Synopsys .VEC digital format works, except for a header at the top of the file giving more information, e.g. the number of picoseconds per clock phase. I think the idea is that it's supposed to look like a truth table.
Of course, other formats want all values in a given vector to be on the same line:
C0101 A0011 B0101 O0001
So I ended up doing a lot of column-by-column extraction. Often, different subsets of the signals would be grouped into separate files, e.g. all inputs in stimulus file for a SPICE simulation.
And in case you're wondering, no, I'm not making this stuff up as I go. :)
Roedy Green - 04 Feb 2006 21:36 GMT On Sat, 4 Feb 2006 10:08:00 -0500, "Rhino" <no.offline.contact.please@nospam.com> wrote, quoted or indirectly quoted someone who said :
>I just meant that I've never had to slice a file vertically so that each of >the many columns was split into separate files and then joined back together >at some point. That is the sort of thing you did all the time with unit record equipment (punch cards) and later the programs that simulated it.
I wrote a suite of utilities that among other things simulated all the unit record operations, letting you combine files by selecting columns and merging, sorting, updating, filtering etc.
To my astonishment lawyers would glue these together 50 steps or so long to accomplish their ends. They would think of each file operation the way you or I would think of a local variable addition. Their primitives worked on whole files of 200,000 to 4 million records.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Chris Uppal - 04 Feb 2006 11:48 GMT > > Task: to split a big file into many files by its columns. Each > > resulting file consists one column of the original big file.
> In 25 years of working with computers professionally, a good bit of it as > an instructor of computer courses, I've never seen a requirement to do > anything like what you are talking about, namely slicing a file > vertically. It the kind of thing that comes up quite often in my experience.
For instance.
Extracting data from almost any kind of log, discarding unwanted fields.
Extracting the useful information from compiler error messages.
Pulling relevant information from formatted files (/etc/passwd...).
...
-- chris
Rhino - 04 Feb 2006 15:08 GMT >> > Task: to split a big file into many files by its columns. Each >> > resulting file consists one column of the original big file. [quoted text clipped - 15 lines] > > ... See my reply to Jeffery Schwab for a clarification of what I meant.
--- Rhino
Chris Uppal - 04 Feb 2006 12:28 GMT > Task: to split a big file into many files by its columns. Each > resulting file consists one column of the original big file.
> My current method is: > 1) create 200 file writers (each with 0.5M file buffer), each [quoted text clipped - 7 lines] > > Is that the fastest method we can get? I suspect that, as Andrey has already said, your buffers are oversized. If you arrived at those sizes by experiment then kudos to you, and please ignore the rest of the paragraph. If not then I'd try using 1M for all of them, or even half that. I've never seen a significant performance gain by using buffers over a M (except when writing to a tape drive).
One thing to consider is that as you write the 200 output files, you will be making the disk heads move a /lot/ and not sequentially, and that may well be slowing your program down substantially -- or there again, it may not. I'd try doing this in N passes, each extracting 200/N columns. I don't know what the best value of N is, it may be 1 as you are currently assuming ;-) but I'd try increasing it to -- say -- 20 and see what happens. It's even possible that the optimum will be 200. Considering disk movements can be a big win -- I once rewrote a single pass program that would have taken >10 hours to run, into a multi-pass (near-)equivalent that ran in about 3 minutes...
A more sophisticated multi-pass version would be to read in the input file in giant chunks (100 M, say), and then do multiple passes over that, appending the data from that chunk to each output file in turn. You might find it worthwhile splitting up the data in the chunk at the beginning rather than duplicating that effort on each pass, but that would massively increase your object counts and thus reduce the size of the chunk that you could use. A tradeoff and only experiment (or careful /quantative/ analysis) can tell you which is better.
You may also be able to micro-optimise the splitting of the file into lines and sub-strings, but I find it hard to believe that that would have any significant effect in comparison with IO costs.
-- chris
Red Orchid - 04 Feb 2006 13:34 GMT "Kevin" <kaidizhao@yahoo.com.sg> wrote or quoted in Message-ID: <1139009028.703566.52430@z14g2000cwz.googlegroups.com>:
> Just want to know what is the best way for this course coding task. :-) > > Task: to split a big file into many files by its columns. Each > resulting file consists one column of the original big file. I assume the followings:
a) Space is valid character. b) Line divider is "\r\n" or '\n'.
Maybe, it is as like:
<code> (this codes is not tested).
public class MainFrame { static InputStream in; static final int colNum = 200; static OutputStream[] out = new OutputStream[colNum]; public static void main(String[] args) { try { initializeStream(); byte[] b = new byte[1]; ByteBuffer buf = new ByteBuffer(512); int colIndex = 0; while (true) { if (in.read() < 0) { break; } switch (b[0]) { case ',': out[colIndex++].write(buf.get(), 0, buf.size()); buf.clear(); break;
case '\r': // skip break; case '\n': colIndex = 0; break; default: buf.add(b[0]); break; } } } catch (Exception e) { // process exception. } finally { closeStream(); } } static void initializeStream() throws Exception { in = getInputStream(); Arrays.fill(out, null); for (int i = 0; i < out.length; i++) { out[i] = getOutputStream(i); } }
static void closeStream() { try { if (in != null) in.close(); for (int i = 0; i < out.length; i++) { if (out[i] != null) out[i].close(); } } catch (Exception e) { } } static InputStream getInputStream() throws Exception {
File f = new File("your input file"); return new BufferedInputStream(new FileInputStream(f)); } static OutputStream getOutputStream(int index) throws Exception { File f = new File("your output file" + index + ".txt"); return new BufferedOutputStream(new FileOutputStream(f)); } static class ByteBuffer { int pos; byte[] buf; public ByteBuffer(int initialCapacity) { buf = new byte[initialCapacity]; } void add(byte b) { if (pos >= buf.length) { resize(); } buf[pos++] = b; } void resize() { byte[] b = new byte[buf.length * 2]; System.arraycopy(buf, 0, b, 0, buf.length); buf = b; } byte[] get() { return buf; } int size() { return pos; } void clear() { pos = 0; } } }
</code>
Kevin - 04 Feb 2006 21:01 GMT This code piece is really useful. I tested, it improves the speed from 14 minutes (using FileReader and BufferedReader) to 7 minutes on my 1.2G data file. Thank you a lot. :-)
PS: just replace the beginning with: if (in.read(b) < 0) // miss the b there. { break; } and increased the buffer from default to 500k.
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|