Java Forum / General / November 2006

read huge text file from end

quickcur@yahoo.com - 31 Oct 2006 21:45 GMT
Hi,

I have very large text files and I am only interested in the last 200
lines in each file. How can I read a huge text file line by line from
the end, something like the "tail" command in Unix?

Thanks,

qq
Eric Sosman - 31 Oct 2006 22:18 GMT
quickcur@yahoo.com wrote On 10/31/06 15:45,:
> Hi,
>
> I have very large text files and I am only interested in the last 200
> lines in each file. How can I read a huge text file line by line from
> the end, something like the "tail" command in Unix?

   Do as "tail" does: Get the size of the file, seek to
a position (200 * average_line_length + safety_margin) bytes
before the end, and start reading.  Be prepared for some
glitches if you land in the middle of a multi-byte sequence;
you may need to be tolerant of a malformed line and/or
character decoding errors when you start reading.
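
A minimal Java sketch of the seek-and-read approach described above, not a definitive implementation. The class name TailSketch, the method lastLines, the guessBytesPerLine parameter, and the 1024-byte safety margin are all illustrative assumptions. It also assumes an encoding such as UTF-8 in which a newline byte never occurs inside a multi-byte sequence; the first line in the window is discarded because it may be partial or malformed.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TailSketch {

    /**
     * Returns up to the last n lines of the file by seeking near the end.
     * guessBytesPerLine is only an estimate: if the window holds too few
     * lines, the window is doubled and the read is retried.
     */
    public static List<String> lastLines(String path, int n, int guessBytesPerLine, Charset cs)
            throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long size = raf.length();
            // 200 * average_line_length + safety_margin, in the thread's terms.
            long window = Math.max(1, (long) n * guessBytesPerLine + 1024);
            while (true) {
                long start = Math.max(0, size - window);
                if (size - start > Integer.MAX_VALUE - 8) {
                    throw new IOException("window too large for a single buffer");
                }
                byte[] buf = new byte[(int) (size - start)];
                raf.seek(start);
                raf.readFully(buf);

                // Malformed leading bytes decode to replacement characters;
                // they can only affect the first (discarded) line.
                String text = new String(buf, cs);
                List<String> lines = new ArrayList<>(Arrays.asList(text.split("\n", -1)));
                if (!lines.isEmpty() && lines.get(lines.size() - 1).isEmpty()) {
                    lines.remove(lines.size() - 1); // file ended with a newline
                }
                if (start > 0 && !lines.isEmpty()) {
                    lines.remove(0); // we probably landed mid-line
                }
                if (lines.size() >= n || start == 0) {
                    int from = Math.max(0, lines.size() - n);
                    List<String> result = new ArrayList<>(lines.subList(from, lines.size()));
                    for (int i = 0; i < result.size(); i++) {
                        String s = result.get(i);
                        if (s.endsWith("\r")) { // tolerate CRLF line endings
                            result.set(i, s.substring(0, s.length() - 1));
                        }
                    }
                    return result;
                }
                window *= 2; // guess was too small; look further back
            }
        }
    }
}
```

A call such as TailSketch.lastLines("big.log", 200, 80, StandardCharsets.UTF_8) would read only the last few tens of kilobytes of the file in the common case; "big.log" and the 80-byte guess are placeholders.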

   Of course, this simply isn't going to work for files
that contain statefully-encoded regions, or that have been
progressively compressed or encrypted.  For "very large"
files, compression is distinctly likely -- even if you're
not using it now, you might want to ponder before committing
to a strategy that would prevent using it in the future.

Eric.Sosman@sun.com

Oliver Wong - 31 Oct 2006 23:23 GMT
> quickcur@yahoo.com wrote On 10/31/06 15:45,:
>> Hi,
[quoted text clipped - 16 lines]
> not using it now, you might want to ponder before committing
> to a strategy that would prevent using it in the future.

   Hopefully, the compression would be handled by the underlying OS, and it
would all work "transparently" to your application.

   Otherwise, you're no longer dealing with text files (in the traditional
sense), and if you've got custom file formats, you could do tricks like
actually encode the offset of the 200th line from the end into the header.

   - Oliver
Eric Sosman - 31 Oct 2006 23:51 GMT
Oliver Wong wrote On 10/31/06 17:23,:

>>quickcur@yahoo.com wrote On 10/31/06 15:45,:
>>
[quoted text clipped - 17 lines]
>     Hopefully, the compression would be handled by the underlying OS, and it
> would all work "transparently" to your application.

   It might "work" in the sense of "get to the data as
desired," but only by reading and decompressing everything
before that point -- which sort of vitiates the performance
advantage of the seek, don't you think?

Eric.Sosman@sun.com

Mike Schilling - 01 Nov 2006 07:15 GMT
>>     Hopefully, the compression would be handled by the underlying OS,
>> and it
[quoted text clipped - 4 lines]
> before that point -- which sort of vitiates the performance
> advantage of the seek, don't you think?

But that's not how OS file compression works.  Generally, there's a page
size (8K or thereabouts), and each page is compressed separately, with the
OS keeping track of where each compressed page actually starts.  A
random-access read requires figuring out where the pages containing the byte
range live and decompressing only those pages.
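
The page-wise scheme can be illustrated with a small in-memory Java sketch. This is not how any particular OS implements file compression; the class name PagedCompressionSketch, the 8 KB PAGE_SIZE, and the pagesInflated counter are assumptions made for the illustration. Each page is deflated independently, and the list of compressed pages stands in for the OS's record of where each page starts.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PagedCompressionSketch {
    static final int PAGE_SIZE = 8 * 1024; // illustrative page size

    private final List<byte[]> pages = new ArrayList<>(); // one entry per compressed page
    private final int length;                             // uncompressed length
    public int pagesInflated = 0;                         // demo counter only

    public PagedCompressionSketch(byte[] data) {
        length = data.length;
        for (int off = 0; off < data.length; off += PAGE_SIZE) {
            pages.add(deflate(data, off, Math.min(PAGE_SIZE, data.length - off)));
        }
    }

    /** Random-access read: only pages overlapping [pos, pos + count) are inflated. */
    public byte[] read(int pos, int count) throws DataFormatException {
        if (pos < 0 || count < 0 || (long) pos + count > length) {
            throw new IndexOutOfBoundsException();
        }
        byte[] out = new byte[count];
        int copied = 0;
        while (copied < count) {
            int p = pos + copied;
            int pageIndex = p / PAGE_SIZE;
            int inPage = p % PAGE_SIZE;
            int rawLen = Math.min(PAGE_SIZE, length - pageIndex * PAGE_SIZE);
            byte[] page = inflate(pages.get(pageIndex), rawLen);
            int n = Math.min(count - copied, page.length - inPage);
            System.arraycopy(page, inPage, out, copied, n);
            copied += n;
        }
        return out;
    }

    private static byte[] deflate(byte[] data, int off, int len) {
        Deflater d = new Deflater();
        d.setInput(data, off, len);
        d.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) {
            bos.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return bos.toByteArray();
    }

    private byte[] inflate(byte[] compressed, int rawLen) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(compressed);
        byte[] out = new byte[rawLen];
        int n = 0;
        while (n < rawLen && !inf.finished()) {
            n += inf.inflate(out, n, rawLen - n);
        }
        inf.end();
        pagesInflated++;
        return out;
    }
}
```

Reading 100 bytes near the end of a 100,000-byte buffer inflates a single page here, which is the property that keeps a seek cheap under this kind of compression.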
Eric Sosman - 01 Nov 2006 13:46 GMT
>>>    Hopefully, the compression would be handled by the underlying OS,
>>> and it
[quoted text clipped - 10 lines]
> random-access read requires figuring out where the pages containing the byte
> range live and decompressing only those pages.

    Look among the bits and pieces of snippage lying about on the
cutting-room floor, and you'll notice I wrote about files that
were "progressively compressed" or "progressively encrypted."
My terminology is probably inexact, but I meant "progressively"
to describe the sort of compressor/encryptor whose state at a
given point in the data stream is a function of the entire history
of the stream up to that point.  gzip, for example.
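
For a stream compressed this way, one workable fallback is to decompress from the beginning and keep only the most recent lines in a bounded buffer. A minimal Java sketch, assuming a single gzip stream of UTF-8 text; the class name GzipTailSketch and method lastLines are illustrative. Memory use stays proportional to n, but every byte before the tail is still read and inflated, which is the cost described above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.zip.GZIPInputStream;

public class GzipTailSketch {

    /** Returns up to the last n lines of a gzip-compressed text stream. */
    public static List<String> lastLines(InputStream gzipped, int n) throws IOException {
        Deque<String> recent = new ArrayDeque<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            // No seeking is possible: the whole stream is inflated in order.
            while ((line = reader.readLine()) != null) {
                if (n <= 0) {
                    continue;
                }
                if (recent.size() == n) {
                    recent.removeFirst(); // drop the oldest retained line
                }
                recent.addLast(line);
            }
        }
        return new ArrayList<>(recent);
    }
}
```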

Eric Sosman
esosman@acm-dot-org.invalid
