Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / February 2006

Tip: Looking for answers? Try searching our database.

Fastest way to split a file by columns?

Thread view: 
Kevin - 03 Feb 2006 23:23 GMT
Just want to know what is the best way for this course coding task. :-)

Task: to split a big file into many files by its columns. Each
resulting file consists one column of the original big file.

The original file can be as large as Gbytes (so that we can not hold it
in main memory). Each line has exact 200 columns, each column is ","
separated. The file is plain text file.

For example, if the input file is:

abc, ee, ef, ww, ee, wwe.
aas, we, 64, www, w3, 46.
qw, 35, qg, d4, q3, a34.
.....
.....

We need to break it into 6 files, first file is:
abc
aas
qw
...

Second file is:
ee
we
35
...

Third file is:
df
64
qg
...
etc.

My current method is:
1) create 200 file writers (each with 0.5M file buffer), each
corresponding to one column.
2) read in the original file (with 20M file buffer) line by line,
3) break each line with "," into tokens.
4) write out each token to its corresponding file writer.
This method, on a 1.2G input file, with 200 columns, will use about 14
minutes. The physical memory on the machine is 512M, so the above file
buffers should fit into the main memory.

Is that the fastest method we can get?

How about the task with reverse goal: if we have above resulting files,
and we need to merge them into one big file so that each column is from
each file. My code using a similar way as above needs 18 minutes.

I can't think out a better way. Any suggestions?
Andrey Kuznetsov - 04 Feb 2006 00:29 GMT
> My current method is:
> 1) create 200 file writers (each with 0.5M file buffer), each
[quoted text clipped - 13 lines]
>
> I can't think out a better way. Any suggestions?

at-first, you don't need such huge buffers - they can make you slow.
at-second, you can try to use Unified I/O - it makes all buffering for you.
In some cases Unified I/O may give you speedup up to 120 times.
If you use readLine, you become String, this can make it slow.
With Unified I/O you can read line to byte array.
Its free and open source.

Signature

Andrey Kuznetsov
http://uio.imagero.com Unified I/O for Java
http://reader.imagero.com Java image reader
http://jgui.imagero.com Java GUI components and utilities

Rhino - 04 Feb 2006 05:16 GMT
> Just want to know what is the best way for this course coding task. :-)
>
[quoted text clipped - 49 lines]
>
> I can't think out a better way. Any suggestions?

In 25 years of working with computers professionally, a good bit of it as an
instructor of computer courses, I've never seen a requirement to do anything
like what you are talking about, namely slicing a file vertically. I suppose
the main intent is just to make you learn to handle memory and file I/O but
you'd think the instructor of your course could come up with a more
realistic problem that would help you develop those same skills....

---
Rhino
Jeffrey Schwab - 04 Feb 2006 05:48 GMT
>>Just want to know what is the best way for this course coding task. :-)
>>
>>Task: to split a big file into many files by its columns. Each
>>resulting file consists one column of the original big file.
...

> In 25 years of working with computers professionally, a good bit of it as an
> instructor of computer courses, I've never seen a requirement to do anything
> like what you are talking about, namely slicing a file vertically. I suppose
> the main intent is just to make you learn to handle memory and file I/O but
> you'd think the instructor of your course could come up with a more
> realistic problem that would help you develop those same skills....

?!

So you've never used /bin/cut?  You've never had to parse one column out
of a table stored in a text file?  You've never had to use blockwise
selection in nedit, emacs, or vi (control-v)?

For almost anyone who does any serious data munging, this comes up all
the time.  Files are frequently stored with one record per line, and
each line divided into columns either according to fixed widths, or by
separators:  commas in CSV, colons in /etc/passwd, etc.  This is
actually one of the most useful and relevant examples I've ever heard of
being assigned to students.  It's a heck of a lot more useful than
implementing a Red/Black Tree, or the Game of Life, or most of the other
busy-work I was given as an undergraduate.
Rhino - 04 Feb 2006 15:08 GMT
>>>Just want to know what is the best way for this course coding task. :-)
>>>
[quoted text clipped - 24 lines]
> Tree, or the Game of Life, or most of the other busy-work I was given as
> an undergraduate.

Okay, maybe I've taken the assignment too narrowly.

Yes, of course, I've often had to parse lines in files and use the relevant
bits for something; I've also had to load CSV files into databases and deal
with fixed-width columns on many occasions. And I agree that is a useful and
realistic thing to learn.

I just meant that I've never had to slice a file vertically so that each of
the many columns was split into separate files and then joined back together
at some point. I could see that it might be useful to slice out the primary
key column(s) into one file and then put the rest of the columns (combined)
into a second file and then have the key point to the data, as would be the
case in a hash table. But slicing each column into a separate file? I don't
see much point in that.

---
Rhino
Jeffrey Schwab - 04 Feb 2006 18:50 GMT
>>>>Just want to know what is the best way for this course coding task. :-)
>>>>
[quoted text clipped - 40 lines]
> case in a hash table. But slicing each column into a separate file? I don't
> see much point in that.

OK, I hear what you're saying.  I actually have had to deal with this
case quite a bit, though.

I used to work with a lot of file formats representing digital
electronic signals.  In some formats, each column represented a vector
of binary values; for example, a simulation of an AND gate might produce
a file like this, where C is the clock, A and B are inputs, and O is the
output:

  CABO
  0000
  1010
  0100
  1111

This is how the Avanti/Synopsys .VEC digital format works, except for a
header at the top of the file giving more information, e.g. the number
of picoseconds per clock phase.  I think the idea is that it's supposed
to look like a truth table.

Of course, other formats want all values in a given vector to be on the
same line:

  C0101
  A0011
  B0101
  O0001

So I ended up doing a lot of column-by-column extraction.  Often,
different subsets of the signals would be grouped into separate files,
e.g. all inputs in stimulus file for a SPICE simulation.

And in case you're wondering, no, I'm not making this stuff up as I go. :)
Roedy Green - 04 Feb 2006 21:36 GMT
On Sat, 4 Feb 2006 10:08:00 -0500, "Rhino"
<no.offline.contact.please@nospam.com> wrote, quoted or indirectly
quoted someone who said :

>I just meant that I've never had to slice a file vertically so that each of
>the many columns was split into separate files and then joined back together
>at some point.

That is the sort of thing you did all the time with unit record
equipment (punch cards) and later the programs that simulated it.

I wrote a suite of utilities that among other things simulated all the
unit record operations, letting you combine files by selecting columns
and merging, sorting, updating, filtering etc.

To my astonishment lawyers would glue these together 50 steps or so
long to accomplish their ends.  They would think of each file
operation the way you or I would think of a local variable addition.
Their primitives worked on whole files of 200,000 to 4 million
records.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Chris Uppal - 04 Feb 2006 11:48 GMT
> > Task: to split a big file into many files by its columns. Each
> > resulting file consists one column of the original big file.

> In 25 years of working with computers professionally, a good bit of it as
> an instructor of computer courses, I've never seen a requirement to do
> anything like what you are talking about, namely slicing a file
> vertically.

It the kind of thing that comes up quite often in my experience.

For instance.

Extracting data from almost any kind of log, discarding unwanted fields.

Extracting the useful information from compiler error messages.

Pulling relevant information from formatted files (/etc/passwd...).

...

   -- chris
Rhino - 04 Feb 2006 15:08 GMT
>> > Task: to split a big file into many files by its columns. Each
>> > resulting file consists one column of the original big file.
[quoted text clipped - 15 lines]
>
> ...
See my reply to Jeffery Schwab for a clarification of what I meant.

---
Rhino
Chris Uppal - 04 Feb 2006 12:28 GMT
> Task: to split a big file into many files by its columns. Each
> resulting file consists one column of the original big file.

> My current method is:
> 1) create 200 file writers (each with 0.5M file buffer), each
[quoted text clipped - 7 lines]
>
> Is that the fastest method we can get?

I suspect that, as Andrey has already said, your buffers are oversized.  If you
arrived at those sizes by experiment then kudos to you, and please ignore the
rest of the paragraph.  If not then I'd try using 1M for all of them, or even
half that.  I've never seen a significant performance gain by using buffers
over a M (except when writing to a tape drive).

One thing to consider is that as you write the 200 output files, you will be
making the disk heads move a /lot/ and not sequentially, and that may well be
slowing your program down substantially -- or there again, it may not.  I'd try
doing this in N passes, each extracting 200/N columns.  I don't know what the
best value of N is, it may be 1 as you are currently assuming ;-) but I'd try
increasing it to -- say -- 20 and see what happens.  It's even possible that
the optimum will be 200.  Considering disk movements can be a big win -- I once
rewrote a single pass program that would have taken >10 hours to run, into a
multi-pass (near-)equivalent that ran in about 3 minutes...

A more sophisticated multi-pass version would be to read in the input file in
giant chunks (100 M, say), and then do multiple passes over that, appending the
data from that chunk to each output file in turn.  You might find it worthwhile
splitting up the data in the chunk at the beginning rather than duplicating
that effort on each pass, but that would massively increase your object counts
and thus reduce the size of the chunk that you could use.  A tradeoff and only
experiment (or careful /quantative/ analysis) can tell you which is better.

You may also be able to micro-optimise the splitting of the file into lines and
sub-strings, but I find it hard to believe that that would have any significant
effect in comparison with IO costs.

   -- chris
Red Orchid - 04 Feb 2006 13:34 GMT
"Kevin" <kaidizhao@yahoo.com.sg> wrote or quoted in
Message-ID: <1139009028.703566.52430@z14g2000cwz.googlegroups.com>:

> Just want to know what is the best way for this course coding task. :-)
>
> Task: to split a big file into many files by its columns. Each
> resulting file consists one column of the original big file.

I assume the followings:

a) Space is valid character.
b) Line divider is "\r\n" or '\n'.

Maybe,  it is as like:

<code>
(this codes is not tested).

public class MainFrame {
                                     
    static InputStream in;
    static final int colNum = 200;
    static OutputStream[] out = new OutputStream[colNum];
   
    public static void main(String[] args) {
       
        try {
            initializeStream();
           
            byte[] b = new byte[1];
            ByteBuffer buf = new ByteBuffer(512);
            int colIndex = 0;
           
            while (true) {
               
                if (in.read() < 0) {
                   
                    break;
                }
                switch (b[0]) {
                case ',':
                    out[colIndex++].write(buf.get(), 0, buf.size());
                    buf.clear();
                    break;

                case '\r': // skip
                    break;
                   
                case '\n':
                    colIndex = 0;
                    break;
                   
                default:
                    buf.add(b[0]);
                    break;
                }
            }
        }
        catch (Exception e) {
            // process exception.
        }
        finally {
            closeStream();
        }
    }
   
    static void initializeStream() throws Exception {
       
        in = getInputStream();
        Arrays.fill(out, null);
       
        for (int i = 0; i < out.length; i++) {
           
            out[i] = getOutputStream(i);
        }
    }

    static void closeStream() {
       
        try {
            if (in != null) in.close();
           
            for (int i = 0; i < out.length; i++) {
               
                if (out[i] != null) out[i].close();
            }
        }
        catch (Exception e) {
        }
    }
   
    static InputStream getInputStream() throws Exception {

        File f = new File("your input file");
        return new BufferedInputStream(new FileInputStream(f));
    }
   
    static OutputStream getOutputStream(int index) throws Exception {
       
        File f = new File("your output file" + index + ".txt");
        return new BufferedOutputStream(new FileOutputStream(f));
    }
   
    static class ByteBuffer {
       
        int pos;
        byte[] buf;
       
        public ByteBuffer(int initialCapacity) {
           
            buf = new byte[initialCapacity];
        }
       
        void add(byte b) {
           
            if (pos >= buf.length) {
               
                resize();
            }
            buf[pos++] = b;
        }
       
        void resize() {
           
            byte[] b = new byte[buf.length * 2];
            System.arraycopy(buf, 0, b, 0, buf.length);
           
            buf = b;
        }
       
        byte[] get() {
           
            return buf;
        }
       
        int size() {
           
            return pos;
        }
       
        void clear() {
           
            pos = 0;
        }
    }
}

</code>
Kevin - 04 Feb 2006 21:01 GMT
This code piece is really useful.
I tested, it improves the speed from 14 minutes (using FileReader and
BufferedReader) to 7 minutes on my 1.2G data file.
Thank you a lot. :-)

PS: just replace the beginning with:
if (in.read(b) < 0) // miss the b there.
{
  break;
}
and increased the buffer from default to 500k.


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.