Java Forum / General / September 2006
Find the number of lines in a text file
Chris Brat - 13 Sep 2006 10:23 GMT
Hi,
I need to find the total number of lines in a text file so that I can skip the header and filler information and process just the body.
I've done some searching and can't find a class or method that does this directly, and the solutions I've found require one of the following:
- reading each line of the file (using a BufferedReader) and incrementing a line counter for each line, or
- using the LineNumberReader directly and taking the result of its getLineNumber() method once the entire file is read, or
- searching for the eol characters and counting these, or
- using RandomAccessFile, seeking to the end of the file and dividing the total number of bytes by the number of bytes expected in a line (I believe this relies on a guarantee that each line will have the same number of characters).
I don't like the idea of counting eol characters or having to read the entire file twice (once to get the number of lines and a second time to do my actual processing).
Does anyone have another or better solution?
Thanks, Chris
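For reference, the first option in the list above (read every line and count) can be sketched in Java roughly as follows. This is only a minimal illustration: the class and method names (LineCountSketch, countLines) are made up for the example, and the whole input is read once just to obtain the count.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

final class LineCountSketch {
    /** Counts lines by reading each one; the entire input is consumed. */
    static long countLines(Reader in) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        long count = 0;
        // readLine() returns null only at end of input
        while (reader.readLine() != null) {
            count++;
        }
        return count;
    }
}
```

A caller would pass something like new FileReader(file) and close it afterwards; a final line without a terminator still counts as a line here.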
Ingo R. Homann - 13 Sep 2006 10:44 GMT
Hi,
> Hi,
> [quoted text clipped - 8 lines]
> - using the LineNumberReader directly and using its result from its
> getLineNumber() method once the entire file is read, or
Note that this is the same. (IIRC, LineNumberReader internally does exactly the same as your first suggestion.)
> - searching for the eol characters and counting these.
Note that this is also nearly the same, especially in runtime complexity.
> - using the the RandomAccessFile, seeking to the end of the file and
> dividing the total number of bytes by the number of bytes expected in
> the line (I believe this relies on guarantee that each line will have
> the same number of characters).
Of course this relies on that guarantee! Is it really guaranteed? If yes, this is the best idea. BTW: AFAIK you do not need a RAF for that. IIRC, java.io.File has the method you need as well (length()).
> I dont like the idea of counting eol characters or having to read the
> entire file twice (once to get the number of line numbers and the
> second time to do my actual processing).
>
> Does anyone have another or better solution?
Depends on what *exactly* you want to do, on the format of the file, and on what you mean by "skip the header and filler information and process just the body". Isn't it possible to do everything by reading the file only once? (Do "headers" and "fillers" have a certain prefix? ...?)
Ciao, Ingo
Chris Brat - 13 Sep 2006 11:50 GMT
Hi Ingo,
I effectively want to skip a known number of lines immediately at the beginning of the file and a known number immediately at the end of it. The header is not a problem to skip, but the footer is.
Sorry, I actually meant 'footer' and not 'filler'.
The scenario is: the user defines that the first 6 lines and the last 3 lines of the file are in an unknown format and must be ignored. This means that they must not be checked for validity or processed.
Regards, Chris
> Hi,
> [quoted text clipped - 41 lines]
> Ciao,
> Ingo
Ingo R. Homann - 13 Sep 2006 11:55 GMT
Hi,
> The scenario is :The user defines that the first 6 lines and the last 3
> lines of the file are in an unknown format and must be ignored - this
> means that they must not be checked for validity and processed.
I think the simplest idea would be to buffer 3 lines...
Ciao, Ingo
bugbear - 13 Sep 2006 12:03 GMT
> Hi Ingo,
> [quoted text clipped - 7 lines]
> lines of the file are in an unknown format and must be ignored - this
> means that they must not be checked for validity and processed.
In that case you certainly don't need to count the total number of lines in the file.
Simply count the first 6 lines moving forward, then lseek to the end, and count the last 3 lines backwards.
RandomAccessFile may be useful to you.
You now have the offsets within the file that define your "valid zone".
Either work with these, or create an IO decorator that presents the subset of the file as a stream/reader object.
BugBear
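The backward half of that suggestion could look roughly like the Java sketch below. It is a sketch under assumptions, not a tuned implementation: the names TailOffsetSketch and startOfLastLines are hypothetical, lines are assumed to end in '\n' (which also covers "\r\n"), the encoding is assumed to be one where the byte 0x0A only ever means a line feed, and the file is scanned one byte at a time for clarity rather than speed.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

final class TailOffsetSketch {
    /**
     * Returns the byte offset at which the last k lines begin,
     * or 0 if the file holds k lines or fewer.
     */
    static long startOfLastLines(RandomAccessFile raf, int k) throws IOException {
        long pos = raf.length();
        if (pos == 0 || k <= 0) {
            return pos;
        }
        // A terminator at the very end closes the final line; do not count it.
        raf.seek(pos - 1);
        if (raf.read() == '\n') {
            pos--;
        }
        int found = 0;
        while (pos > 0) {
            raf.seek(pos - 1);
            if (raf.read() == '\n') {
                found++;
                if (found == k) {
                    return pos; // first byte after the k-th terminator from the end
                }
            }
            pos--;
        }
        return 0;
    }
}
```

Everything from the end of the header up to the returned offset is then the "valid zone"; a real implementation would probably read backwards in blocks instead of seeking for every byte.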
Simon - 13 Sep 2006 12:24 GMT
> I effectively want to skip an known number of lines in a file
> (immediately at the beginning of the file) and immediately at the end
[quoted text clipped - 5 lines]
> lines of the file are in an unknown format and must be ignored - this
> means that they must not be checked for validity and processed.
Use a queue, e.g. a List<String>:
1. Skip the header.
2. Create an empty queue to hold lines.
3. Read 3 lines into the queue.
4. As long as there are more lines:
   - read one line and append it to the queue;
   - take the first line out of the queue and process it.
5. Throw away the 3 remaining lines in the queue.
Cheers, Simon
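The steps above can be sketched in Java along these lines. The names BodyLinesSketch and processBody are hypothetical, an ArrayDeque stands in for the List<String> mentioned in the post, and the Consumer callback is just one possible way to hand each body line to the caller. The file is read once, and at most footerLines + 1 lines are held in memory at any time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

final class BodyLinesSketch {
    /** Passes every line except the first headerLines and the last footerLines to the handler. */
    static void processBody(Reader in, int headerLines, int footerLines,
                            Consumer<String> handler) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        // Step 1: skip the header (stop early if the file is shorter than the header).
        for (int i = 0; i < headerLines; i++) {
            if (reader.readLine() == null) {
                return;
            }
        }
        // Steps 2-4: keep a sliding window of the most recent footerLines lines.
        Deque<String> pending = new ArrayDeque<>();
        String line;
        while ((line = reader.readLine()) != null) {
            pending.addLast(line);
            if (pending.size() > footerLines) {
                // The oldest queued line cannot be part of the footer any more.
                handler.accept(pending.removeFirst());
            }
        }
        // Step 5: whatever is still queued is the footer and is simply dropped.
    }
}
```

For the scenario in this thread the call would be something like processBody(new FileReader(file), 6, 3, line -> validate(line)), where validate is whatever per-line processing is needed.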
Chris Brat - 13 Sep 2006 12:42 GMT
That's brilliant!!
Thanks Simon.
> > I effectively want to skip an known number of lines in a file
> > (immediately at the beginning of the file) and immediately at the end
[quoted text clipped - 18 lines]
> Cheers,
> Simon
Andrew Thompson - 13 Sep 2006 10:44 GMT
...
> I dont like the idea of counting eol characters or having to read the
> entire file twice (once to get the number of line numbers and the
> second time to do my actual processing).
..
> Does anyone have another or better solution?
'Use a file-system that counts them for you'?
(Which is my way of saying: other tools that provide a line count are doing something like "counting the EOLs" internally, even if they might imply otherwise and obscure the details.)
Note that if you *know* that further processing is required on the file(s), it probably makes more sense to read the lines into an array on the first pass.
(And a LineNumberReader or similar might be the best way to sort those EOLs.)
Andrew T.
Chris Brat - 13 Sep 2006 11:56 GMT
Hi Andrew,
> 'Use a file-system that counts them for you'?
Unfortunately I do not maintain the environment, and it is very possible that the OS and everything associated with it may change in the future without my knowledge.
> Note that if you *know* that further processing is required
> on the file(s), it probably makes more sense to read the
> lines into an array on the first pass.
True - do you think this is a good idea with a file of 30 000+ lines though? I don't think the memory expense is worth the few extra seconds.
To be honest I was hoping that someone knew of an OSS lib (like commons IO) or a method I didn't know of in the java.io package that already did this.
Thanks for the input though.
Regards, Chris
Ingo R. Homann - 13 Sep 2006 11:59 GMT
Hi,
> To be honest I was hoping that someone knew of a OSS lib (like commons
> IO) or a method I didn't know of in the java.io package that already
> did this.
Well, if the filesystem/OS does not cache this information, how should a library get the information without reading the whole file? It would have to be 'magic'!
Ciao, Ingo
Andrew Thompson - 13 Sep 2006 12:54 GMT
...
> > Note that if you *know* that further processing is required
> > on the file(s), it probably makes more sense to read the
> > lines into an array on the first pass.
> True - do you think this is a good idea with a file of 30 000+ lines
> though?
I would need to run some tests (as I suggest you do, since I do not 'need to know'*)
* For this current environment, in which I have no need to parse text files of such length.
> I dont think the memory expense is worth the few extra seconds.
The results might surprise you (they might not, as well). In situations as fundamental as this, it pays to do a quick test, though.
OTOH - Ingo raised some interesting points re. the file format. There might be some significant 'cheating' you can do if the files are of 'fixed line length'.
Andrew T.
EJP - 13 Sep 2006 12:52 GMT
File file = ...;
LineNumberReader lnr = new LineNumberReader(new FileReader(file));
lnr.skip(file.length()-1);
int lines = lnr.getLineNumber();
Ingo R. Homann - 13 Sep 2006 13:20 GMT
Hi,
> File file = ...;
> LineNumberReader lnr = new LineNumberReader(new FileReader(file));
> lnr.skip(file.length()-1);
> int lines = lnr.getLineNumber();
Bad idea - that's exactly what Chris wanted to avoid! (Or what do you think this code does internally?)
Ciao, Ingo
EJP - 14 Sep 2006 06:43 GMT
> Bad idea - that's exactly what Chris wanted to avoid! (Or what do you
> think this code does internally?)
I don't think he *can* avoid it actually, and thanks, I know exactly what the code does internally too.
Ingo R. Homann - 14 Sep 2006 08:07 GMT
Hi EJP,
>> Bad idea - that's exactly what Chris wanted to avoid! (Or what do you
>> think this code does internally?)
>
> I don't think he *can* avod it actually, and thanks, I know exactly what
> the code does internally too.
Then I think it would be a good idea to tell this to the OP, because I imagine that he does not know exactly what the code does internally.
I think he might find it a very interesting idea and will give it a try just to find out that it is a bad idea and that it is exactly what he wanted to avoid. ;-)
Ciao, Ingo
Chris Brat - 14 Sep 2006 08:42 GMT
Ingo,
I was asking for other, possibly better solutions because none of those that I found myself seemed like the best way to do it.
Please don't make comments on my behalf - I appreciate any suggestions by contributors.
EJP, I tested your solution and it gives a 300ms performance improvement on a 40 Mb file.
Regards, Chris
> Hi EJP,
> [quoted text clipped - 13 lines]
> Ciao,
> Ingo
Ingo R. Homann - 14 Sep 2006 09:07 GMT
Hi,
> I was asking for other possibly better solutions...
>
> EJP, I tested your solution and it gives [no real] performance
> improvement on a 40 Mb file.
Well, internally, it does *exactly* what you wanted to avoid. Without asking someone and without testing anything, just by thinking a bit about the problem, I can tell you:
THERE IS NO POSSIBILITY TO GET THE NUMBER OF LINES IN A FILE WITHOUT READING THE WHOLE FILE.
Sorry for shouting, but that is a fact.
However, your (*other*) problem (reading a file only once, but skipping the last three lines) can be solved otherwise, as Simon and I mentioned.
Ciao, Ingo
Michael Rauscher - 14 Sep 2006 09:24 GMT
Hi Ingo ;)
Ingo R. Homann wrote:
>> EJP, I tested your solution and it gives [no real] performance
>> improvement on a 40 Mb file.
[quoted text clipped - 7 lines]
>
> Sorry for shouting, but that is a fact.
No reason to shout. What happened is exactly what you had already predicted:
<quote> I think he might find it a very interesting idea and will give it a try just to find out that it is a bad idea and that it is exactly what he wanted to avoid. ;-) </quote>
LOL Michael
Tor Iver Wilhelmsen - 14 Sep 2006 16:03 GMT
> THERE IS NO POSSIBILITY TO GET THE NUMER OF LINES IN A FILE WITHOUT
> READING THE WHOLE FILE.
Exception: If it is known the file has a set line (record) size in bytes, and the line separator is known, then the number of lines = file.size()/(recordSize+separatorSize)
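As a rough Java illustration of that arithmetic (a sketch, not a recommendation): FixedRecordSketch and lineCount are hypothetical names, the file size would come from java.io.File.length() since java.io.File has no size() method, and the code assumes that every record really has the same byte length and that the reported file size is exact. It also tolerates a final record that lacks its separator, which is an added assumption.

```java
final class FixedRecordSketch {
    /**
     * Derives the number of lines from the file size alone.
     * Throws if the size does not fit the assumed fixed layout.
     */
    static long lineCount(long fileLength, int recordSize, int separatorSize) {
        long stride = (long) recordSize + separatorSize;
        if (recordSize < 0 || separatorSize < 0 || stride == 0) {
            throw new IllegalArgumentException("record and separator sizes must describe a non-empty line");
        }
        if (fileLength % stride == 0) {
            return fileLength / stride; // every record is followed by a separator
        }
        if ((fileLength + separatorSize) % stride == 0) {
            return (fileLength + separatorSize) / stride; // last record has no separator
        }
        throw new IllegalStateException("file length " + fileLength + " does not match fixed-size records");
    }
}
```

For example, with 10-byte records and a 1-byte separator, a 33-byte file and a 32-byte file would both be reported as 3 lines, while a 30-byte file would be rejected.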
Martin Gregorie - 14 Sep 2006 16:55 GMT
>> THERE IS NO POSSIBILITY TO GET THE NUMER OF LINES IN A FILE WITHOUT
>> READING THE WHOLE FILE.
>
> Exception: If it is known the file has a set line (record) size in
> bytes, and the line separator is known, then the number of lines =
> file.size()/(recordSize+separatorSize)
Depends on what operating system you're dealing with and how the JVM implementation gets the file size from it. Some operating systems return block_size * blocks_in_file as the file size, rather than the space occupied by the file contents.
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
Simon - 15 Sep 2006 13:10 GMT
> Depends what operating system you're dealing with and how the JVM
> implementation gets file size from it. Some operating systems return
> block_size * blocks_in_file as the file size, rather than the space
> occupied by the file contents.
I wasn't aware of this. This implies that creating a byte buffer with "new byte[file.length()]" to read the file contents into memory is not a good idea. Even worse, you won't even get an ArrayIndexOutOfBoundsException when you fill the array, because the array will always be too large and never too small.
Do you have an example where File.length() does not return the actual filesize?
Cheers, Simon
Martin Gregorie - 15 Sep 2006 16:19 GMT
>> Depends what operating system you're dealing with and how the JVM
>> implementation gets file size from it. Some operating systems return
[quoted text clipped - 4 lines]
> "new byte[file.length()]" to read the file contents into memory is not a good
> idea.
It's not so bad from that point of view because the buffer can be at most blocksize-1 (or clustersize-1 for a FAT32 partition) bytes too big.
> Even worse, you won't even get an ArrayIndexOutOfBoundsException when you
> fill the array, because the array will always be too large and never too small.
That's true, but if the file is a serial file it will normally have an end marker. Looking for it works, though it's scarcely portable.
> Do you have an example where File.length() does not return the actual filesize?
The classic example was from the dark ages before the MS FAT filing system introduced clustering to get round disk size limitations. Text files were always terminated with ^Z, and reading past that until EOF was returned picked up all the garbage left over from the last file that used that block. That's why old DOS programs check for ^Z or EOF!
Maybe somebody who knows the innards of current MS NTFS filing systems can say what they do.
I have a FAT32 filing system I can look at later: right now it's being backed up from Linux. I can tell you (I looked) that file lengths in FAT32 filing systems are correctly reported by Linux, but I can't remember what Win95/98/ME does.
If you want to buffer a complete file, the safest way is probably to append bytes or lines to a StringBuffer, or to do the equivalent with bytes, and not take any notice of File.length() except as information.
Here are other ways I know to get file lengths that do not match the amount of data in the file:
- File.length() returns an "unspecified" value if the file is a directory. To me this says either that data files are scanned to determine their length or that the OS is asked how long the file is and its reply is returned without further checks. Either way the value is most likely platform-dependent.
- In UNIX or Linux all files, including directories, have a length, but the length of a directory is usually longer than the data it contains because directories are not sequential files.
- Similarly, you can put gaps in a UNIX/Linux file by doing the following:
    create the file
    seek to n * 1000    /* force the file to be large */
    seek to 0
    write 'n' bytes     /* write at the start of the file */
    seek to n * 100     /* leaving a gap of n * 99 bytes */
    write 'n' bytes     /* write in the middle of the file */
    close the file
Of course, this is exactly what a database manager does. A directory listing will report the file size as n * 1000 but the last 899 * n bytes will be junk.
The bottom line is that, unless you know for sure that the file was created with sequential writes *and* that the OS always returns a file length that's accurate to the exact byte, then doing anything except reading through the file is deeply suspect.
Chris Uppal - 18 Sep 2006 12:36 GMT
> Maybe somebody who knows the innards of current MS NTFS filing systems
> can say what they do.
I don't think any version of Windows has ever had any difficulty supplying the correct size for a file (except maybe for 32-bit limits on integers -- but that's a different issue).
> If you want to buffer a complete file, the safest way is probably to
> append bytes or lines to a StringBuffer or to do the equivalent with
> bytes and don't take any notice of the File.length() except as
> information.
That is probably true, but not so much because the file length may be wrong (I don't know of any system where it could be, but I don't know much about Java on small devices or mainframe-ish machines), as because the file size may change between when you measure it and when you've finished reading.
> - Similarly, you can put gaps in a UNIX/Linux file by doing the
> following:
[quoted text clipped - 5 lines]
> write 'n' bytes /* write in the middle of the file */
> close the file
That doesn't create a file with a length other than what it claims. The size of the file is precisely as specified -- in this case it might claim there were 10,000 bytes in the file and that is precisely what you'll read from it (in Java, C, or any other language). It's just that the on-disk representation is optimised to have some "holes" in it -- but that's not visible or relevant to the application programmer any more than the fact that a file on Windows may be stored on-disk in compressed form.
-- chris
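The point about gaps can be tried out with a small Java sketch like the one below. The names GapFileSketch and writeWithGap are hypothetical, the second write is placed at an arbitrary offset beyond the current end of the file, and whether the gap is actually stored sparsely on disk depends on the file system. The sketch only illustrates that the reported length and the number of bytes read back agree.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

final class GapFileSketch {
    /** Writes n bytes at offset 0 and n more at secondOffset, leaving an unwritten gap in between. */
    static void writeWithGap(File file, int n, long secondOffset) throws IOException {
        byte[] data = new byte[n];
        Arrays.fill(data, (byte) 'x');
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.setLength(0);       // start from an empty file
            raf.write(data);        // bytes 0 .. n-1
            raf.seek(secondOffset); // seeking past the end does not change the length by itself
            raf.write(data);        // the length becomes secondOffset + n once this write happens
        }
    }
}
```

After writeWithGap(file, 10, 990), file.length() reports 1000, and reading the file sequentially to the end also yields 1000 bytes.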
Martin Gregorie - 18 Sep 2006 23:14 GMT
> That doesn't create a file with a length other than what it claims. The size
> of the file is precisely as specified -- in this case it might claim there were
> 10,000 bytes in the file and that is precisely what you'll read from it (in
> Java, C, or any other language).
Of course.
> It's just that the on-disk representation is
> optimised to have some "holes" in it -- but that's not visible or relevant to
> the application programmer any more than the fact that a file on Windows may be
> stored on-disk in compressed form.
Au contraire. The holes are decidedly relevant if you try to read the file sequentially without understanding its format or that it may contain holes.
I've seen this done, not as the artificial example I described, but by the Sculptor 4GL which can create a file to hold a nominated number of records. Doing this helps performance by preventing the file extending and fragmenting as records are added. That DB also created "holes" by writing zeros to deleted records. Again, big trouble if you don't understand what you're reading.
Chris Uppal - 19 Sep 2006 10:07 GMT
> > It's just that the on-disk representation is
> > optimised to have some "holes" in it -- but that's not visible or
[quoted text clipped - 4 lines]
> file sequentially without understanding its format or that it may
> contain holes.
Yes indeed. If you tar/cpio/gzip/zip up a sparse file then it'll stop being sparse (barring odd GNU extensions to tar). I think the same applies to cp and so on. But the only "problem" is that you'll end up with a file where (potentially large) stretches of nuls are represented on-disk as lots of nul-bytes -- the /semantics/ of the file are identical, but the compression has been lost.
And, of course, a clever file copy would preserve those holes -- or would even introduce them in files which had been created with explicit nul-bytes.
-- chris
Martin Gregorie - 19 Sep 2006 16:26 GMT
>>> It's just that the on-disk representation is
>>> optimised to have some "holes" in it -- but that's not visible or
[quoted text clipped - 14 lines]
> And, of course, a clever file copy would preserve those holes -- or would even
> introduce them in files which had been created with explicit nul-bytes.
I think we're in agreement - copying or compressing a sparse file should retain its sparseness while (hopefully) defragmenting the file if it's in a file system that can have physically fragmented files (MS FAT and OS/9 RBF file systems spring to mind). The same should apply if you use the value returned by the File.length() method to allocate a buffer and block the file image into it.
However, we seem to have both strayed somewhat from what I thought the OP was asking: namely about using the file length as an aid to extracting the data from a file, which is not a good idea IMO.
John W. Kennedy - 21 Sep 2006 19:29 GMT
> The classic example was from the dark ages before the MS FAT filing
> system introduced clustering to get round disk size limitations. Text
> files were always terminated with ^Z and reading past that until EOF was
> returned picked up all the garbage left over from the last file that
> used that block. That's why old DOS programs check for ^Z or EOF!
That was a CP/M restriction that carried over into DOS 1.0's version of BASIC, even though DOS never had the problem. The BASIC in DOS 1.1 fixed it, but by then it was too late.
 Signature John W. Kennedy "The blind rulers of Logres Nourished the land on a fallacy of rational virtue." -- Charles Williams. "Taliessin through Logres: Prelude"
Martin Gregorie - 21 Sep 2006 21:26 GMT
>> The classic example was from the dark ages before the MS FAT filing
>> system introduced clustering to get round disk size limitations. Text
[quoted text clipped - 5 lines]
> BASIC, even though DOS never had the problem. The BASIC in DOS 1.1 fixed
> it, but by then it was too late.
IIRC I also ran into it with flavors of Borland C under DOS 4.2 (shudder) and 5.0.
Tor Iver Wilhelmsen - 15 Sep 2006 17:48 GMT
> I wasn't aware of this. This implies that creating a byte buffer
> with "new byte[file.length()]" to read the file contents into memory
> is not a good idea. Even worse, you won't even get an
> ArrayIndexOutOfBoundsException when you fill the array, because the
> array will always be too large and never too small.
Yes, but of course you ALWAYS!!!!! check the return value of read(byte[]) to see how many bytes were actually read, so it's NEVER!!!! an issue if you write your code correctly. :)
Simon - 18 Sep 2006 10:44 GMT
Tor Iver Wilhelmsen wrote:
>> I wasn't aware of this. This implies that creating a byte buffer
>> with "new byte[file.length()]" to read the file contents into memory
[quoted text clipped - 5 lines]
> read(byte[]) to see how many bytes were actually read, so it's
> NEVER!!!! an issue if you write your code correctly. :)
Yes, of course, but it doesn't help :-) If File.length() returned the length of the actual contents, I could do the following. Assume I implement a helper method
public static byte[] getFileContents(File file);
that is supposed to do the obvious thing for a method with this name. In the implementation, I could create a byte array of length file.length(), initialise the offset into the array to 0, and make repeated calls to read(byteArray, offset, byteArray.length-offset), incrementing the offset according to the return value if it is >= 0 and stopping when the return value is -1. This would all be correct. However, only if File.length() were "correct" (in my sense) could I assume that byteArray now contains exactly the file's contents and return it. Since that is not guaranteed, I would still have to create a new byte array of length "offset" and copy the old array into the new one.
Cheers, Simon
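A minimal Java sketch of such a helper, written to the description above, might look like this. FileContentsSketch is a hypothetical class name; the loop stops once the array is full (a read with a remaining length of 0 would never return -1), and the sketch only guards against File.length() reporting more than the actual contents. Bytes beyond the reported length, for example from a file that grows while being read, are not picked up.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class FileContentsSketch {
    /** Reads a file into a byte array, trimming the array if fewer bytes arrive than File.length() reported. */
    static byte[] getFileContents(File file) throws IOException {
        long reported = file.length();
        if (reported > Integer.MAX_VALUE - 8) {
            throw new IOException("File too large for a single array: " + file);
        }
        byte[] byteArray = new byte[(int) reported]; // length() returns long, so a cast is needed
        int offset = 0;
        try (InputStream in = new FileInputStream(file)) {
            while (offset < byteArray.length) {
                int n = in.read(byteArray, offset, byteArray.length - offset);
                if (n < 0) {
                    break; // end of data reached before the reported length
                }
                offset += n;
            }
        }
        // Copy only when the reported length turned out to be too large.
        return offset == byteArray.length ? byteArray : Arrays.copyOf(byteArray, offset);
    }
}
```

The extra copy is only paid when the reported length and the actual contents disagree, so in the common case the array is returned as allocated.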
Chris Brat - 13 Sep 2006 21:14 GMT
Hi EJP,
Thanks - very interesting idea.
Will give it a try.
Chris
> File file = ...;
> LineNumberReader lnr = new LineNumberReader(new FileReader(file));
> lnr.skip(file.length()-1);
> int lines = lnr.getLineNumber();