Java Forum / General / July 2004
Caching a large file - 2nd post
Chris - 22 Jul 2004 01:45 GMT I posted this a couple of days ago, but got no responses and it didn't show up in Google groups, so it might have gotten lost along the way. If not, sorry for the redundancy...
I need extremely fast random access to an extremely large file. The file could be 100 gigabytes in size; the machine will have at most 2 or 4 gb of memory. Certain parts of the file will be accessed more frequently than others. The question is, what is the most efficient way to handle caching?
I could organize the file in pages, and write some kind of LRU cache to hold the most active pages in JVM memory. I'm thinking, though, that it might be simpler and more efficient to let the operating system handle it through normal disk caching.
The advantages of this scheme will be 1) I won't have to write code, 2) the cache memory will be owned by the operating system and not the JVM, which means the user won't have to worry about memory configuration, 3) the operating system can release disk cache memory when it's needed for some other process, and 4) the operating system's disk caching algorithms are probably much smarter than anything I could write.
The disadvantages are 1) accessing parts of the file still has to go through the Java IO code, which will be slow compared to direct memory access, and 2) I don't know if this really works.
Thoughts?
Also -- does anyone know how Linux allocates memory to the disk cache? Does it use all available memory? Or is there some way to configure it?
Will Hartung - 22 Jul 2004 02:42 GMT > I posted this a couple of days ago, but got no responses and it didn't show > up in Google groups, so it might have gotten lost along the way. If not, [quoted text clipped - 9 lines] > simpler and more efficient to let the operating system handle it through > normal disk caching. If the files are that large, and you have a sufficiently sized sample, I'm just go with FileChannels and ByteBuffers, and let the OS deal with the details.
HOWEVER, depending on what you're doing with that data it may well behoove you to cache the RESULTS of what you derive from the data.
You don't mention the application, but you basically have two costs. One is physical I/O, and the other is actually processing the data.
Take this extreme example. Say that you are simply calculating checksums for blocks of read-only data. It is obvious that you should simply cache the resulting checksums rather than rereading and recalcuating the checksum everytime. Doesn't matter if the pages are cached, you still have to pay the price of the calculation.
That kind of thing can readily cached on an LRU, if that is practical.
Finally, before doing anything, set up some tests and do some benchmarks. Do the simplest things, and then try the more advanced tricks in some testing environments that mimic what you actually do in the app and go from there. Don't pre-optimize until you know where the problems are.
Regards,
Will Hartung (willh@msoft.com)
Roedy Green - 22 Jul 2004 03:05 GMT >The disadvantages are 1) accessing parts of the file still has to go through >the Java IO code, which will be slow compared to direct memory access, and >2) I don't know if this really works. > >Thoughts? Cascading buffering systems is a bit of a waste. You want to put all your space at one level.
Think about what would happen if you wrote 10 disk "optimisers" and chained them, giving them each 1/10 of available ram to play with to cache frequently used data. What happens? Each cache tends to fill up with the same records. You are best to use just one, then you don't have any duplicates in your cache.
Since the OS is already caching disk records, it does not pay to duplicate its efforts. Better to give the extra RAM back to it. Further, it has a more global view of what to keep in cache.
This is not completely true, since a small aux cache can still be helpful, since typically it will be quite a bit faster than an OS-level cache.
If it really matters, try it several ways and see in practice which ACTUALLY works best. It is so easy to convince yourself why some method has to be faster than another, but there are so many factors, you can easily be wrong.
 Signature Canadian Mind Products, Roedy Green. Coaching, problem solving, economical contract programming. See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
marcus - 22 Jul 2004 10:03 GMT I love to read what roedy writes and translate it into how I would say things.
Write the application in the simplest, most basic manner possible, then benchmark it on numerous systems and find out where the bottlenecks really are. Why waste your time solving a potential problem? Spend your effort solving real problems and optimize on later revs.
BTW, the harddrive (RAID, SAN) has a cache as well, which is blazingly fast. I'm thinking an end user who can afford 100 gigs of storage for your application understands things like this.
Lastly, is a single file really the best place to store 100g? if it is structured data, wouldn't a database serve your needs better? Even a trivial one you write and manage yourself? A set of files, or trees of files? Frankly, in my ignorance I cannot imagine 100g in a single file serving any reasonable purpose. My thought is always to store data in the manner in which I anticipate it will be used. (Roedy help me out here!)
Post-lastly, I beleive the optimizations required to solve your problem (for a single file) probably vary widely depending on whether your end-user is using a unix-like file structure or any of the various windows file structures, or something more exotic. Some (file systems) are built for throughput, others for searching, others for transparancy over multiple types of underlying hardware. Some customers tweak their boxes for fast IO, some for massive file transfers. All of this will affect performance, IMHO. If you don't know your end user, I would leave this one alone.
-- clh
>>The disadvantages are 1) accessing parts of the file still has to go through >>the Java IO code, which will be slow compared to direct memory access, and [quoted text clipped - 23 lines] > method has to be faster than another, but there are so many factors, > you can easily be wrong. Roedy Green - 22 Jul 2004 15:38 GMT > My thought is always to store data in >the manner in which I anticipate it will be used. (Roedy help me out here!) Generally people asking questions that imply a big budget have to be tight-lipped about what they are doing. Usually the best you can do is answer their questions literally. Of course the more they can divulge about what they are doing the more imaginative and broad-ranging the responses can be.
 Signature Canadian Mind Products, Roedy Green. Coaching, problem solving, economical contract programming. See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
Chris - 22 Jul 2004 19:44 GMT > > My thought is always to store data in > >the manner in which I anticipate it will be used. (Roedy help me out here!) [quoted text clipped - 4 lines] > divulge about what they are doing the more imaginative and > broad-ranging the responses can be. You're right; I can't really say what the app is. But think search engine. Large amounts of data that needs to be searched quickly. Disk seek time and IO dominate.
It is a database-like app, but a traditional SQL database is much too slow (yes, we've benchmarked it). The file format is database-like, but mostly read-only.
Maybe these extra details will trigger a few ideas...
Chris - 22 Jul 2004 19:48 GMT > > > My thought is always to store data in > > >the manner in which I anticipate it will be used. (Roedy help me out [quoted text clipped - 15 lines] > > Maybe these extra details will trigger a few ideas... And one other thing: having 100 separate 1 Gb files would *definitely* be more manageable. I just don't think that 100 operations could possibly be as fast as one large operation. But we will do the benchmarks.
Chris Uppal - 23 Jul 2004 11:04 GMT > And one other thing: having 100 separate 1 Gb files would *definitely* be > more manageable. I just don't think that 100 operations could possibly be > as fast as one large operation. But we will do the benchmarks. Do that. The time taken will, amongst other things, be affected by the number of levels of indirection in the on-disk datastructure necessary for the file system to manage the blocks of data in the file. To me it seems quite feasible that the extra level(s) of disk access needed to find block X in a 100Gb file could make the operation (whatever it is ;-) slower.
Don't forget that this will be different according the file system in use.
If (as you say) disk seek time is going to be an issue, and if your application is (or can be) multi-threaded, then you should get higher throughput by putting the data on many physical disks.
If (as you hint) IO bandwidth is going to be an issue then it may be worth considering compressing the data. In some cases this can be a biggish win.
-- chris
Roedy Green - 23 Jul 2004 01:07 GMT >Maybe these extra details will trigger a few ideas... Here's a little free brainstorming for you...
Silicon Graphics years ago developed special disk controllers to do some processing as the data was flying by the heads. Perhaps the idea may be revived for use by the giant search engines.
In the olden days, we used to always think in terms of cylinders and tracks, and rotational latency, and would plot the trajectories of the disk heads to do each transaction. You would then place files at absolute positions and interleave files to minimize head motion. Disk I/O works much better on cylinder and track boundaries.
You would place different files on different drives so that heads could be seeking while i/o was happening on other drives. Back then only one drive at a time could be transferring.
Since the physical drives are SO slow relatively, and the need for speed so acute, perhaps it is time to go back to these old ways.
You might look into my idea of Marthas -- continuous defragging and writing ordering data LRU by track. I envisioned them in hardware, but you could do them in software. See http://mindprod.com/jgloss/defragger.html
The basic idea is that a background task keeps moving dull data to the not-so prime real estate, freeing it up for fresh writes. You might even use a two armed disk, with one arm almost stationary reading and writing active data, and the other shuffling data in and out of the prime real estate. The background arm elevators back and forth like a bus, picking up and dropping off tracks, transporting them to locations more fitting their LRU status. RAM tracks the physical location of all the logical tracks. The problem is crashing. You lose track of everything if you are not careful. I have been pushing this idea of marthas since around 1985. I am surprised it still has not shown up in high end disk controllers. I sent it to all the disk makers years ago.
I'm not sure what happens in a modern disk if you do a full track read. Is it smart enough to start transferring right away, or it is stupid and waits until sector 0 comes around? Perhaps someone buying trainloads of disks could prod the disk controller makers to add this cleverness if it is not there already. You might convince them to add hardware marthaing as well.
I assume that search engines work by having massive amounts of RAM, and each server on the farm looks after its corner of the universe. The big job is to farm the request to the servers that have the info in RAM and then collect it back together.
Google tends to keep feeding back the same sites for all queries. This makes their caching job easier.
You might design a search engine around 64-bit java and use a humongous address space with all data memory-mapped. Then you let the OS deal with the caching, and the GC deal with keeping active objects densely packed in the virtual address space.
Since the data is mostly read-only, you could afford a fair bit of compression. see http://mindprod.com/projects/supercompressor.html
See http://mindprod.com/projects/htmlcompactor.html. Google could spider faster if pages were more compact. "Bribe" websites to keep their pages compact and properly dated.
enough for now...
 Signature Canadian Mind Products, Roedy Green. Coaching, problem solving, economical contract programming. See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
Roedy Green - 23 Jul 2004 01:31 GMT >You might look into my idea of Marthas -- continuous defragging and >writing ordering data LRU by track. I envisioned them in hardware, but >you could do them in software. see http://mindprod.com/jgloss/martha.html
here is an essay I wrote back in the OS/2 days on the idea.
Martha, a new class of disk accelerator software
Martha Stewart is an American TV homemaker whose idea of a good time is to sort and organize her attic.
So I name this new class of disk accelerator software after her:
A Martha is a program that runs at low priority in the background. It examines the LAST ACCESS date of HPFS files. It moves files that have not been used in a long time to the outer two edges of the disk drive. This frees up space near the centre. As a side effect , it moves files recently accessed toward the centre. The net effect is reducing the head travel to access the current working set. of files, which might have equivalent effect of a 3 times faster hard disk.
It may be difficult to implement an effective Martha because I don't think HPFS makes any provision to move a file in active use by another process. However, even a Martha that only worked while everyone else slept could still be very effective.
Hardware Implementation of a Martha
Not to worry. This idea is prior art in the public domain. I have posted this idea electronically many times over the last decade. The BIX electronic conference maintains a record of all postings. The idea needs a small processor in the SCSI disk drive and a CMOS memory. Its operation is totally transparent to the operating system.
For simplicity of explanation, I will presume there are an equal number of sectors in each track. You "timestamp" each track into an array indexed by logical track number stored in CMOS each time you access a track. This is not an actual time of day, just an incremental counter.
Whenever the disk is idle, you physically swap tracks so that tracks that have not been accessed in a long time are near the outside edges, and ones accessed recently are near the middle. You maintain an index by logical track number translating to physical track number in CMOS. You also store the logical track number on each track on disk along with the timestamp so that the CMOS table can be reconstructed in case it is damaged. You can make the thing crash proof by always making the copy before you update the CMOS table.
The net effect is to migrate the rarely accessed junk dynamically to the outer edges. This means that in practice the disk head will most of the time only have to traverse a very small part of the entire travel. No matter how the disk is logically organized, it moves toward an efficient layout.
You can even swap tracks on the fly by having a low priority background process clear out old junk from the central tracks and schedule any full track write directly to the freed central tracks, then post a "free" to its old position in the CMOS table.
I'm sure you can handle the implementation details, complications and refinements. The key is, the disk could be three times faster without any improvement to mechanicals or changes to the operating system drivers. It would work better than a "Martha" utility optimising a single partition since the Adaptec method would work globally across multiple partitions on the same physical disk even if they ran under differing operating systems.
Most users are ZIP file packrats and have a ton of material rarely accessed. Even active applications have huge amounts of material rarely accessed -- such as help files or initialization code. The effects may be much more spectacular in practice than you would hope for.
In any case, they would make you look extremely good on any benchmark!
 Signature Canadian Mind Products, Roedy Green. Coaching, problem solving, economical contract programming. See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
kjc - 23 Jul 2004 01:41 GMT Just use one of the open source java caching solutions like JCS or OSCache or SwarmCache.
>>You might look into my idea of Marthas -- continuous defragging and >>writing ordering data LRU by track. I envisioned them in hardware, but [quoted text clipped - 85 lines] > In any case, they would make you look extremely good on > any benchmark! Liz - 23 Jul 2004 07:02 GMT there is a commercial defrag package that starts with a p (can't remember the name right now) that puts less used data at the begin/end of the disk.
> >You might look into my idea of Marthas -- continuous defragging and > >writing ordering data LRU by track. I envisioned them in hardware, but [quoted text clipped - 90 lines] > Coaching, problem solving, economical contract programming. > See http://mindprod.com/jgloss/jgloss.html for The Java Glossary. Roedy Green - 23 Jul 2004 07:50 GMT >there is a commercial defrag package that starts >with a p (can't remember the name right now) that puts less >used data at the begin/end of the disk. see http://mindprod.com/jgloss/defragger.html
You are probably thinking of perfect disk.
What I am suggesting is a continuously running low priority defragger that is not actually a defragger so much as a cacher that keeps data in bands by time of last use. It would happily scatter files all over if parts of them were highly used but other parts were not. It is maintaining last use data on a per track basis.
If you can keep sufficient free prime real estate, all writes can go to it. You don't move the arms to the nether regions of the disk for write, you effectively move the data to prime real estate on every write. Think of a mostly-write database. Its arms would almost never move to write if the background could keep sufficient free space in the prime real estate.
 Signature Canadian Mind Products, Roedy Green. Coaching, problem solving, economical contract programming. See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
ak - 22 Jul 2004 10:51 GMT > I need extremely fast random access to an extremely large file. The file > could be 100 gigabytes in size; the machine will have at most 2 or 4 gb of > memory. Certain parts of the file will be accessed more frequently than > others. The question is, what is the most efficient way to handle caching? hmm, how mach time take one seek() on 100 GB file? ;-)
http://www.google.de/search?q=java+bulk+i%2Fo&ie=UTF-8&hl=de&meta=
 Signature Andrei Kouznetsov http://uio.dev.java.net Unified I/O for Java http://reader.imagero.com Java image reader
ak - 22 Jul 2004 10:59 GMT Bulk IO people says that they achieve speedups for array reading/writing of over 60x. Last version of Unified I/O gives speedup for array reading/writing between 80x and 130x. However I didn't yet tested it with really HUGE files.
 Signature Andrei Kouznetsov http://uio.dev.java.net Unified I/O for Java http://reader.imagero.com Java image reader
Roedy Green - 22 Jul 2004 15:39 GMT >hmm, how mach time take one seek() on 100 GB file? ;-) same as on 100 1 gb files. It all depends on how well the data are localised.
 Signature Canadian Mind Products, Roedy Green. Coaching, problem solving, economical contract programming. See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
Liz - 23 Jul 2004 07:03 GMT > > I need extremely fast random access to an extremely large file. The file > > could be 100 gigabytes in size; the machine will have at most 2 or 4 gb of [quoted text clipped - 9 lines] > http://uio.dev.java.net Unified I/O for Java > http://reader.imagero.com Java image reader A seek is a seek is a seek, 1/2 rotation time.
Roedy Green - 23 Jul 2004 07:54 GMT >A seek is a seek is a seek, 1/2 rotation time. A seek is the time to move the head to the correct cylinder, typically 1/3 full span seek, plus 1/2 rotation rotational latency.
If you do full track reads and the disks have what IBM used to call Rotational Positional Sensing, you can start reading almost right away, simply by reading the sectors in a slightly different order. I don't know if this exists in today's desktop machines.
 Signature Canadian Mind Products, Roedy Green. Coaching, problem solving, economical contract programming. See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
Markus Schaber - 22 Jul 2004 13:27 GMT Hi, Chris,
> Also -- does anyone know how Linux allocates memory to the disk cache? > Does it use all available memory? Or is there some way to configure > it? Linux uses (nearly) all available memory, but it also utilizes some more sophisticated tricks, e. G. it writes out pages that were not accessed for some time so that it can quickly free them when memory is needed.
In /proc, there are some entries where you can tune some parameters, e. G. the minimum of pages that Linux is trying to keep free (so it doesn't have to wait for the purging when allocating), or the swappiness (the likelyhood for linux to swap out some long-unused data to make more space for Disk buffers).
Googling or reading the lkml archive should give you details on this.
HTH, Markus
 Signature markus schaber | dipl. informatiker logi-track ag | rennweg 14-16 | ch 8001 zürich phone +41-43-888 62 52 | fax +41-43-888 62 53 mailto:schabios@logi-track.com | www.logi-track.com
Liz - 23 Jul 2004 07:15 GMT > I posted this a couple of days ago, but got no responses and it didn't show > up in Google groups, so it might have gotten lost along the way. If not, [quoted text clipped - 4 lines] > memory. Certain parts of the file will be accessed more frequently than > others. The question is, what is the most efficient way to handle caching? Bell Labs / Lucent Tech. has a product that is perfect for you. It is called DataBlitz. Don't know the cost, but not free.
> I could organize the file in pages, and write some kind of LRU cache to hold > the most active pages in JVM memory. I'm thinking, though, that it might be [quoted text clipped - 16 lines] > Also -- does anyone know how Linux allocates memory to the disk cache? Does > it use all available memory? Or is there some way to configure it? Michael Borgwardt - 23 Jul 2004 09:25 GMT Sounds a like a perfect opportunity to make use of the memory-mapped files in java.nio
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|