Java Forum / General / September 2006

Find the number of lines in a text file

Chris Brat - 13 Sep 2006 10:23 GMT
Hi,

I need to find the total number of lines in a text file so that I can
skip the header and filler information and process just the body.

I've done some searching and can't find a class or method that does
this directly and the solutions I've found require either :

- reading each line of the file (using the BufferedReader) and
incrementing a line counter for each line, or
- using the LineNumberReader directly and using its result from its
getLineNumber() method once the entire file is read, or
- searching for the eol characters and counting these.
- using the RandomAccessFile, seeking to the end of the file and
dividing the total number of bytes by the number of bytes expected in
each line (I believe this relies on a guarantee that each line will have
the same number of characters).
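
For reference, the first option in the list above can be sketched as follows. This is a minimal illustration, not code from the thread: the class name LineCounter and the choice of UTF-8 are assumptions, and it uses java.nio.file, which is newer than this discussion.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, named for illustration only.
class LineCounter {
    // Counts lines by reading the file once from start to end.
    // readLine() accepts \n, \r and \r\n as line terminators.
    // Assumption: the file is valid UTF-8; other encodings need another Charset.
    static long countLines(Path file) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }
}
```

A final line without a terminator still counts as a line here, because readLine() returns it before reporting end of file.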

I don't like the idea of counting eol characters or having to read the
entire file twice (once to get the number of lines and the
second time to do my actual processing).

Does anyone have another or better solution?

Thanks,
Chris
Ingo R. Homann - 13 Sep 2006 10:44 GMT
Hi,

> Hi,
>
[quoted text clipped - 8 lines]
> - using the LineNumberReader directly and using its result from its
> getLineNumber() method once the entire file is read, or

Note that this is the same. (IIRC LineNumberReader internally does
exactly the same as your first suggestion)

> - searching for the eol characters and counting these.

Note that this is also nearly the same - especially in 'runtime-complexity'.

> - using the RandomAccessFile, seeking to the end of the file and
> dividing the total number of bytes by the number of bytes expected in
> the line (I believe this relies on guarantee that each line will have
> the same number of characters).

Of course this relies on this guarantee! Is it really guaranteed? If
yes, this is the best idea. BTW: AFAIK you do not need a RAF for that -
java.io.File has the method you need (length()) as well.

> I don't like the idea of counting eol characters or having to read the
> entire file twice (once to get the number of lines and the
> second time to do my actual processing).
>
> Does anyone have another or better solution?

Depends on what *exactly* you want to do, and on the format of the file,
and what you mean with "skip the header and filler information and
process just the body" - isn't it possible to do everything by reading
the file only once? (Do "headers" and "fillers" have a certain prefix? ...?)

Ciao,
Ingo
Chris Brat - 13 Sep 2006 11:50 GMT
Hi Ingo,

I effectively want to skip a known number of lines immediately at the
beginning of the file and immediately at the end of the file. The
header is not a problem to skip but the footer is.

Sorry, I actually meant 'footer' and not 'filler'.

The scenario is: the user defines that the first 6 lines and the last 3
lines of the file are in an unknown format and must be ignored - this
means that they must not be checked for validity or processed.

Regards,
Chris

> Hi,
>
[quoted text clipped - 41 lines]
> Ciao,
> Ingo
Ingo R. Homann - 13 Sep 2006 11:55 GMT
Hi,

> The scenario is :The user defines that the first 6 lines and the last 3
> lines of the file are in an unknown format and must be ignored - this
> means that they must not be checked for validity and processed.

I think the simplest idea would be to buffer 3 lines...

Ciao,
Ingo
bugbear - 13 Sep 2006 12:03 GMT
> Hi Ingo,
>
[quoted text clipped - 7 lines]
> lines of the file are in an unknown format and must be ignored - this
> means that they must not be checked for validity and processed.

In that case you certainly don't need to count the total
number of lines in the file.

Simply count the first 6 lines moving forward,
then lseek to the end, and count the last 3 lines backwards.

RandomAccessFile may be useful to you.

You now have the offsets within the file that define
your "valid zone".

Either work with these, or create
an I/O decorator that presents the subset of the
file as a stream/reader object.
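
A rough sketch of the offset idea, for illustration only. The class and method names are made up, and it assumes that every line ends in '\n' (so "\r\n" also works) and that the byte 0x0A never occurs inside a character, which holds for ASCII, ISO-8859-1 and UTF-8. It reads one byte at a time for clarity, which is slow on a RandomAccessFile; a real version would read blocks.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper, named for illustration only.
class BodyRange {
    // Returns the offset of the first byte after the first `lines` lines
    // (or the file length if the file has fewer lines).
    static long offsetAfterLeadingLines(RandomAccessFile raf, int lines) throws IOException {
        raf.seek(0);
        int seen = 0;
        while (seen < lines) {
            int b = raf.read();
            if (b == -1) {
                break;
            }
            if (b == '\n') {
                seen++;
            }
        }
        return raf.getFilePointer();
    }

    // Returns the offset of the first byte of the last `lines` lines,
    // scanning backwards from the end (or 0 if the file has fewer lines).
    static long offsetOfTrailingLines(RandomAccessFile raf, int lines) throws IOException {
        long length = raf.length();
        if (lines <= 0 || length == 0) {
            return length;
        }
        long pos = length - 1;
        raf.seek(pos);
        if (raf.read() == '\n') {
            pos--; // the terminator of the final line does not start a new line
        }
        int seen = 0;
        while (pos >= 0) {
            raf.seek(pos);
            if (raf.read() == '\n') {
                seen++;
                if (seen == lines) {
                    return pos + 1;
                }
            }
            pos--;
        }
        return 0;
    }
}
```

The body is then the byte range from offsetAfterLeadingLines(raf, 6) up to, but not including, offsetOfTrailingLines(raf, 3).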

  BugBear
Simon - 13 Sep 2006 12:24 GMT
> I effectively want to skip a known number of lines in a file
> (immediately at the beginning of the file) and immediately at the end
[quoted text clipped - 5 lines]
> lines of the file are in an unknown format and must be ignored - this
> means that they must not be checked for validity and processed.

Use a queue, e.g. a List<String>:

1. Skip the header
2. Create a queue containing the lines.
3. Read 3 lines into the queue
4. As long as there are more lines
    - read one line and append it to the queue.
    - take the first line out of the queue and process it
5. Throw away the 3 remaining lines in the queue.
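
The steps above can be sketched roughly like this. It is one possible reading of the recipe, not Simon's own code: the names HeaderFooterSkipper and processBody are invented, an ArrayDeque stands in for the queue, and "process it" is modelled as a Consumer<String> callback.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// Hypothetical helper, named for illustration only.
class HeaderFooterSkipper {
    // Reads the input once, skipping the first headerLines lines and the
    // last footerLines lines, and hands every other line to `process`.
    static void processBody(Reader in, int headerLines, int footerLines,
                            Consumer<String> process) throws IOException {
        BufferedReader reader = new BufferedReader(in);

        // Step 1: skip the header.
        for (int i = 0; i < headerLines; i++) {
            if (reader.readLine() == null) {
                return; // file is shorter than the header
            }
        }

        // Steps 2-4: keep at most footerLines lines waiting in the queue.
        // A line is processed only once footerLines newer lines follow it.
        Deque<String> pending = new ArrayDeque<>();
        String line;
        while ((line = reader.readLine()) != null) {
            pending.addLast(line);
            if (pending.size() > footerLines) {
                process.accept(pending.removeFirst());
            }
        }

        // Step 5: whatever is still queued is the footer and is dropped.
    }
}
```

Memory use stays at footerLines lines regardless of file size, and the file is read exactly once.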

Cheers,
Simon
Chris Brat - 13 Sep 2006 12:42 GMT
That's brilliant !!

Thanks Simon.

> > I effectively want to skip a known number of lines in a file
> > (immediately at the beginning of the file) and immediately at the end
[quoted text clipped - 18 lines]
> Cheers,
> Simon
Andrew Thompson - 13 Sep 2006 10:44 GMT
...
> I don't like the idea of counting eol characters or having to read the
> entire file twice (once to get the number of lines and the
> second time to do my actual processing).
..
> Does anyone have another or better solution?

'Use a file-system that counts them for you'?

(Which is my way of saying that other tools that provide
a line count are doing something like "counting the EOLs"
internally, even if they might imply otherwise and obscure
the details.)

Note that if you *know* that further processing is required
on the file(s), it probably makes more sense to read the
lines into an array on the first pass.

(And a LineNumberReader or similar might be the best
way to sort those EOL's)
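
A minimal sketch of the read-everything-first approach, assuming the whole file fits comfortably in memory. The class name InMemoryBody, the UTF-8 charset and the header/footer parameters are illustrative assumptions; Files.readAllLines comes from java.nio.file, which is newer than this thread.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, named for illustration only.
class InMemoryBody {
    // Loads every line into memory in one pass, then returns a view of the
    // lines between the header and the footer.
    static List<String> bodyLines(Path file, int headerLines, int footerLines) throws IOException {
        List<String> all = Files.readAllLines(file, StandardCharsets.UTF_8);
        int from = Math.min(headerLines, all.size());
        int to = Math.max(from, all.size() - footerLines);
        return all.subList(from, to);
    }
}
```

Once the list is in memory the total line count is simply its size, so no second pass over the file is needed.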

Andrew T.
Chris Brat - 13 Sep 2006 11:56 GMT
Hi Andrew,

> 'Use a file-system that counts them for you'?
Unfortunately I do not maintain the environment and it is very possible
that the OS and everything associated with it may change in the future
without my knowledge.

> Note that if you *know* that further processing is required
> on the file(s), it probably makes more sense to read the
> lines into an array on the first pass.
True - do you think this is a good idea with a file of 30 000+ lines
though?
I don't think the memory expense is worth the few extra seconds.

To be honest I was hoping that someone knew of an OSS lib (like Commons
IO) or a method I didn't know of in the java.io package that already
did this.

Thanks for the input though.

Regards,
Chris
Ingo R. Homann - 13 Sep 2006 11:59 GMT
Hi,

> To be honest I was hoping that someone knew of a OSS lib (like commons
> IO) or a method I didn't know of in the java.io package that already
> did this.

Well, if the filesystem/OS does not cache this information, how should a
library get the information without reading the whole file? It would have
to be 'magic'!

Ciao,
Ingo
Andrew Thompson - 13 Sep 2006 12:54 GMT
...
> > Note that if you *know* that further processing is required
> > on the file(s), it probably makes more sense to read the
> > lines into an array on the first pass.
> True - do you think this is a good idea with a file of 30 000+ lines
> though?

I would need to run some tests (as I suggest
you do, since I do not 'need to know'*)

* For this current environment, in which I have no need to
parse text files of such length.

> I don't think the memory expense is worth the few extra seconds.

The results might surprise you (they might not,
as well).  In situations as fundamental as this,
it pays to do a quick test, though.

OTOH - Ingo raised some interesting points re. the
file format.  There might be some significant 'cheating'
you can do if the files are of 'fixed line length'.

Andrew T.
EJP - 13 Sep 2006 12:52 GMT
File file = ...;
LineNumberReader lnr = new LineNumberReader(new FileReader(file));
lnr.skip(file.length()-1);
int lines = lnr.getLineNumber();
Ingo R. Homann - 13 Sep 2006 13:20 GMT
Hi,

> File file = ...;
> LineNumberReader lnr = new LineNumberReader(new FileReader(file));
> lnr.skip(file.length()-1);
> int lines = lnr.getLineNumber();

Bad idea - that's exactly what Chris wanted to avoid! (Or what do you
think this code does internally?)

Ciao,
Ingo
EJP - 14 Sep 2006 06:43 GMT
> Bad idea - that's exactly what Chris wanted to avoid! (Or what do you
> think this code does internally?)

I don't think he *can* avoid it actually, and thanks, I know exactly what
the code does internally too.
Ingo R. Homann - 14 Sep 2006 08:07 GMT
Hi EJP,

>> Bad idea - that's exactly what Chris wanted to avoid! (Or what do you
>> think this code does internally?)
>
> I don't think he *can* avoid it actually, and thanks, I know exactly what
> the code does internally too.

Then, I think it would be a good idea to tell this to the OP, because I
imagine that he does not know exactly what the code does internally.

I think he might find it a very interesting idea and will give it a try
just to find out that it is a bad idea and that it is exactly what he
wanted to avoid. ;-)

Ciao,
Ingo
Chris Brat - 14 Sep 2006 08:42 GMT
Ingo,

I was asking for other possibly better solutions because none of those
that I found myself seemed like the best way to do it.

Please don't make comments on my behalf - I appreciate any suggestions
by contributors.

EJP, I tested your solution and it gives a 300ms performance
improvement on a 40 MB file.

Regards,
Chris

> Hi EJP,
>
[quoted text clipped - 13 lines]
> Ciao,
> Ingo
Ingo R. Homann - 14 Sep 2006 09:07 GMT
Hi,

> I was asking for other possibly better solutions...
>
> EJP, I tested your solution and it gives [no real] performance
> improvement on a 40 Mb file.

Well, internally, it does *exactly* what you wanted to avoid.
Without asking anyone and without testing anything, just by thinking
a bit about the problem, I can tell you:

THERE IS NO POSSIBILITY TO GET THE NUMBER OF LINES IN A FILE WITHOUT
READING THE WHOLE FILE.

Sorry for shouting, but that is a fact.

However, your (*other*) problem (reading a file only once, but skipping
the last three lines) can be solved otherwise, as Simon and I mentioned.

Ciao,
Ingo
Michael Rauscher - 14 Sep 2006 09:24 GMT
Hi Ingo ;)

Ingo R. Homann schrieb:
>> EJP, I tested your solution and it gives [no real] performance
>> improvement on a 40 Mb file.
[quoted text clipped - 7 lines]
>
> Sorry for shouting, but that is a fact.

No reason to shout. What you had already predicted just happened:

<quote>
I think he might find it a very interesting idea and will give it a try
just to find out that it is a bad idea and that it is exactly what he
wanted to avoid.  ;-)
</quote>

LOL
Michael
Tor Iver Wilhelmsen - 14 Sep 2006 16:03 GMT
> THERE IS NO POSSIBILITY TO GET THE NUMBER OF LINES IN A FILE WITHOUT
> READING THE WHOLE FILE.

Exception: If it is known that the file has a fixed line (record) size in
bytes, and the line separator is known, then the number of lines =
file.length()/(recordSize+separatorSize)
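
As a sketch, that formula might look like this in Java. The class and parameter names are illustrative, and the code assumes every record, including the last, is followed by a separator; it refuses to guess when the length is not an exact multiple.

```java
import java.io.File;

// Hypothetical helper, named for illustration only.
class FixedRecordCount {
    // Number of lines in a file whose lines all have recordSize bytes of
    // data followed by separatorSize bytes of line separator.
    static long lineCount(long fileLength, int recordSize, int separatorSize) {
        int stride = recordSize + separatorSize;
        if (stride <= 0) {
            throw new IllegalArgumentException("record plus separator must be positive");
        }
        if (fileLength % stride != 0) {
            // Either the format assumption is wrong or the last line is cut short.
            throw new IllegalStateException("length is not a whole number of records");
        }
        return fileLength / stride;
    }

    // Convenience overload that takes the length from File.length().
    static long lineCount(File file, int recordSize, int separatorSize) {
        return lineCount(file.length(), recordSize, separatorSize);
    }
}
```

For example, a 410-byte file of 80-byte records with a 2-byte "\r\n" separator gives 410 / 82 = 5 lines.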
Martin Gregorie - 14 Sep 2006 16:55 GMT
>> THERE IS NO POSSIBILITY TO GET THE NUMBER OF LINES IN A FILE WITHOUT
>> READING THE WHOLE FILE.
>
> Exception: If it is known that the file has a fixed line (record) size in
> bytes, and the line separator is known, then the number of lines =
> file.length()/(recordSize+separatorSize)

Depends what operating system you're dealing with and how the JVM
implementation gets file size from it. Some operating systems return
block_size * blocks_in_file as the file size, rather than the space
occupied by the file contents.

Signature

martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |

Simon - 15 Sep 2006 13:10 GMT
> Depends what operating system you're dealing with and how the JVM
> implementation gets file size from it. Some operating systems return
> block_size * blocks_in_file as the file size, rather than the space
> occupied by the file contents.

I wasn't aware of this. This implies that creating a byte buffer with
"new byte[file.length()]" to read the file contents into memory is not a good
idea. Even worse, you won't even get an ArrayIndexOutOfBoundsException when you
fill the array, because the array will always be too large and never too small.

Do you have an example where File.length() does not return the actual filesize?

Cheers,
Simon
Martin Gregorie - 15 Sep 2006 16:19 GMT
>> Depends what operating system you're dealing with and how the JVM
>> implementation gets file size from it. Some operating systems return
[quoted text clipped - 4 lines]
> "new byte[file.length()]" to read the file contents into memory is not a good
> idea.

It's not so bad from that point of view because the buffer can be at
most blocksize-1 (or clustersize-1 for a FAT32 partition) bytes too big.

> Even worse, you won't even get an ArrayIndexOutOfBoundsException when you
> fill the array, because the array will always be too large and never too small.

That's true, but if the file is a serial file it will normally have an
end marker. Looking for it works, though it's scarcely portable.

> Do you have an example where File.length() does not return the actual filesize?

The classic example was from the dark ages before the MS FAT filing
system introduced clustering to get round disk size limitations. Text
files were always terminated with ^Z and reading past that until EOF was
returned picked up all the garbage left over from the last file that
used that block. That's why old DOS programs check for ^Z or EOF!

Maybe somebody who knows the innards of current MS NTFS filing systems
can say what they do.

I have a FAT32 filing system I can look at later: right now it's being
backed up from Linux. I can tell you (I looked) that file lengths in
FAT32 filing systems are correctly reported by Linux but I can't
remember what Win95/98/ME does.

If you want to buffer a complete file, the safest way is probably to
append bytes or lines to a StringBuffer or to do the equivalent with
bytes and don't take any notice of the File.length() except as information.

Here are other ways I know to get file lengths that do not match the
amount of data in the file:

- File.length() returns an "unspecified" value if the file is a
directory. To me this says either that data files are scanned to
determine their length or that the OS is asked how long the file is and
its reply is returned without further checks. Either way the value is
most likely platform-dependent.

- In UNIX or Linux all files, including directories, have a length, but
  the length of a directory is usually longer than the data it contains
  because directories are not sequential files.

- Similarly, you can put gaps in a UNIX/Linux file by doing the
  following:
    create the file
    seek to n * 1000    /* force the file to be large */
    seek to 0
    write 'n' bytes        /* write at the start of the file */
    seek to n * 100        /* leaving a gap of n * 99 bytes */
    write 'n' bytes        /* write in the middle of the file */
    close the file

  Of course, this is exactly what a database manager does.
  A directory listing will report the file size as n * 1000 but the last
  899 * n bytes will be junk.

The bottom line is that, unless you know for sure that the file was
created with sequential writes *and* that the OS always returns a file
length that's accurate to the exact byte, then doing anything except
reading through the file is deeply suspect.

Chris Uppal - 18 Sep 2006 12:36 GMT
> Maybe somebody who knows the innards of current MS NTFS filing systems
> can say what they do.

I don't think any version of Windows has ever had any difficulty supplying the
correct size for a file (except maybe for 32-bit limits on integers -- but
that's a different issue).

> If you want to buffer a complete file, the safest way is probably to
> append bytes or lines to a StringBuffer or to do the equivalent with
> bytes and don't take any notice of the File.length() except as
> information.

That is probably true, but not so much because the file length may be wrong (I
don't know of any system where it could be, but I don't know much about Java on
small devices or mainframe-ish machines), as because the file size may change
between when you measure it and when you've finished reading.

> - Similarly, you can put gaps in a UNIX/Linux file by doing the
>    following:
[quoted text clipped - 5 lines]
> write 'n' bytes /* write in the middle of the file */
> close the file

That doesn't create a file with a length other than what it claims.  The size
of the file is precisely as specified -- in this case it might claim there were
10,000 bytes in the file and that is precisely what you'll read from it (in
Java, C, or any other language).  It's just that the on-disk representation is
optimised to have some "holes" in it -- but that's not visible or relevant to
the application programmer any more than the fact that a file on Windows may be
stored on-disk in compressed form.
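
A small sketch of this point, with invented names. It writes n bytes at the start of a file, seeks past the end, writes n more bytes, and returns the length the file then reports. Whether the gap is stored sparsely on disk depends on the file system and is not something this code can show.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

// Hypothetical demo class, named for illustration only.
class HoleDemo {
    // Writes n bytes at offset 0 and n bytes at offset n * 100, leaving a
    // gap of n * 99 bytes in between, and returns the reported file length.
    static long writeWithGap(File file, int n) throws IOException {
        byte[] data = new byte[n];
        Arrays.fill(data, (byte) 'x');
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.setLength(0);          // start from an empty file
            raf.write(data);           // bytes [0, n)
            raf.seek((long) n * 100);  // seeking alone does not change the length
            raf.write(data);           // bytes [n * 100, n * 101)
            return raf.length();
        }
    }
}
```

With n = 10 the reported length is 1010, and a sequential read of the file returns exactly 1010 bytes.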

   -- chris
Martin Gregorie - 18 Sep 2006 23:14 GMT
> That doesn't create a file with a length other than what it claims.  The size
> of the file is precisely as specified -- in this case it might claim there were
> 10,000 bytes in the file and that is precisely what you'll read from it (in
> Java, C, or any other language).

Of course.

> It's just that the on-disk representation is
> optimised to have some "holes" in it -- but that's not visible or relevant to
> the application programmer any more than the fact that a file on Windows may be
> stored on-disk in compressed form.

Au contraire. The holes are decidedly relevant if you try to read the
file sequentially without understanding its format or that it may
contain holes.

I've seen this done, not as the artificial example I described, but by
the Sculptor 4GL which can create a file to hold a nominated number of
records. Doing this helps performance by preventing the file extending
and fragmenting as records are added. That DB also created "holes" by
writing zeros to deleted records. Again, big trouble if you don't
understand what you're reading.

Chris Uppal - 19 Sep 2006 10:07 GMT
> > It's just that the on-disk representation is
> > optimised to have some "holes" in it -- but that's not visible or
[quoted text clipped - 4 lines]
> file sequentially without understanding its format or that it may
> contain holes.

Yes indeed.  If you tar/cpio/gzip/zip up a sparse file then it'll stop being
sparse (barring odd GNU extensions to tar).  I think the same applies to cp and
so on.   But the only "problem" is that you'll end up with a file where
(potentially large) stretches of nuls are represented on-disk as lots of
nul-bytes -- the /semantics/ of the file are identical, but the compression
has been lost.

And, of course, a clever file copy would preserve those holes -- or would even
introduce them in files which had been created with explicit nul-bytes.

   -- chris
Martin Gregorie - 19 Sep 2006 16:26 GMT
>>> It's just that the on-disk representation is
>>> optimised to have some "holes" in it -- but that's not visible or
[quoted text clipped - 14 lines]
> And, of course, a clever file copy would preserve those holes -- or would even
> introduce them in files which had been created with explicit nul-bytes.

I think we're in agreement - copying or compressing a sparse file should
retain its sparseness while (hopefully) defragmenting the file if it's in
a file system that can have physically fragmented files (MS FAT and OS/9
RBF file systems spring to mind). The same should apply if you use the
value returned by the File.length() method to allocate a buffer and
block the file image into it.

However, we seem to have both strayed somewhat from what I thought the
OP was asking: namely about using the file length as an aid to
extracting the data from a file, which is not a good idea IMO.

John W. Kennedy - 21 Sep 2006 19:29 GMT
> The classic example was from the dark ages before the MS FAT filing
> system introduced clustering to get round disk size limitations. Text
> files were always terminated with ^Z and reading past that until EOF was
> returned picked up all the garbage left over from the last file that
> used that block. That's why old DOS programs check for ^Z or EOF!

That was a CP/M restriction that carried over into DOS 1.0's version of
BASIC, even though DOS never had the problem. The BASIC in DOS 1.1 fixed
it, but by then it was too late.

Signature

John W. Kennedy
"The blind rulers of Logres
Nourished the land on a fallacy of rational virtue."
  -- Charles Williams.  "Taliessin through Logres: Prelude"

Martin Gregorie - 21 Sep 2006 21:26 GMT
>> The classic example was from the dark ages before the MS FAT filing
>> system introduced clustering to get round disk size limitations. Text
[quoted text clipped - 5 lines]
> BASIC, even though DOS never had the problem. The BASIC in DOS 1.1 fixed
> it, but by then it was too late.

IIRC I also ran into it with flavors of Borland C under DOS 4.2
(shudder) and 5.0.

Tor Iver Wilhelmsen - 15 Sep 2006 17:48 GMT
> I wasn't aware of this. This implies that creating a byte buffer
> with "new byte[file.length()]" to read the file contents into memory
> is not a good idea. Even worse, you won't even get an
> ArrayIndexOutOfBoundsException when you fill the array, because the
> array will always be too large and never too small.

Yes, but of course you ALWAYS!!!!! check the return value of
read(byte[]) to see how many bytes were actually read, so it's
NEVER!!!! an issue if you write your code correctly. :)
Simon - 18 Sep 2006 10:44 GMT
Tor Iver Wilhelmsen schrieb:

>> I wasn't aware of this. This implies that creating a byte buffer
>> with "new byte[file.length()]" to read the file contents into memory
[quoted text clipped - 5 lines]
> read(byte[]) to see how many bytes were actually read, so it's
> NEVER!!!! an issue if you write your code correctly. :)

Yes, of course, but it doesn't help :-)
If File.length() returned the length of the actual contents, I could do the
following. Assume I implement a helper method

public static byte[] getFileContents(File file);

that is supposed to do the obvious thing for a method with this
name. In the implementation, I could create a byte array of length
file.length(), initialise the offset into the array to 0, and make repeated calls to
read(byteArray, offset, byteArray.length-offset), incrementing the offset by
the return value if it is >= 0 and stopping when the return value is -1. This
would all be correct. If File.length() were "correct" (in my
sense), I could assume that byteArray now contains exactly the file's contents and
return it. As it is, I cannot. I would still have to create a new byte
array of length "offset" and copy the old array into it.
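
A sketch of such a helper, treating File.length() only as a size hint. The class name FileContents and the growth policy are illustrative choices, not from the thread. It trims the array if fewer bytes arrive than the hint suggested and grows it if more arrive, for example because the file changed while being read.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, named for illustration only.
class FileContents {
    static byte[] getFileContents(File file) throws IOException {
        long hint = file.length();
        if (hint > Integer.MAX_VALUE / 2) {
            throw new IOException("file too large for this sketch: " + hint);
        }
        byte[] buffer = new byte[(int) hint];
        int offset = 0;
        try (InputStream in = new FileInputStream(file)) {
            while (true) {
                if (offset == buffer.length) {
                    // Buffer is full: read one byte to find out whether this is
                    // really the end. (A zero-length read would return 0, not -1.)
                    int next = in.read();
                    if (next == -1) {
                        break;
                    }
                    buffer = Arrays.copyOf(buffer, Math.max(16, buffer.length * 2));
                    buffer[offset++] = (byte) next;
                    continue;
                }
                int n = in.read(buffer, offset, buffer.length - offset);
                if (n == -1) {
                    break;
                }
                offset += n;
            }
        }
        // Copy only if the hint turned out to be wrong.
        return offset == buffer.length ? buffer : Arrays.copyOf(buffer, offset);
    }
}
```

When the hint is exact, the array is returned without the extra copy, which is the saving the paragraph above is after.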

Cheers,
Simon
Chris Brat - 13 Sep 2006 21:14 GMT
Hi EJP,

Thanks - very interesting idea.

Will give it a try.
Chris

> File file = ...;
> LineNumberReader lnr = new LineNumberReader(new FileReader(file));
> lnr.skip(file.length()-1);
> int lines = lnr.getLineNumber();


©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.