Java Forum / General / May 2008
Virtual files
Roedy Green - 14 May 2008 16:11 GMT I have written quite a bit of code that loads an entire file into RAM then does indexOf, substring, regexes etc, working on the giant String, often creating a new giant String and writing it out.
I wondered if anyone had developed some sort of package to allow such code to be easily transformed to work on arbitrarily large text files, e.g. 10 gigabytes.
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Andreas Leitgeb - 14 May 2008 17:46 GMT > I have written quite a bit of code that loads an entire file into RAM > then does indexOf, substring, regexes etc, working on the giant > String, often creating a new giant String and writing it out. Don't know of such a library, but depending on the exact actions, such a library may be impossible to exist.
A regex quite surely needs its "haystack" all in memory, so unless you can be sure that your particular regex will never match a string of length greater than some "n", you'd at least have to load the whole file at once. If instead you can exclude the possibility that a match ever exceeds say 10000 chars, then you can apply it to 20000-length blocks, and then keep the upper 10000, load another 10000 from file and re-apply the regexp ...
The other operations are probably less of a principial problem.
Another thing is, that for substrings you perhaps do not always need to create new strings separately in memory to write out. The OutputStreamWriter has a version of "write" that takes a String and two ints offset,length to write out only a substring. I don't know, if that just obtains a .substring in memory, or if it uses the data directly from the given String. If that isn't trustworthy enough, you'll have to obtain shorter substrings and write them out sequentially one by one.
Finally, you can use JNI and have C-code memory-map the whole file at once. But then you won't get a String, but at most a byte[] visible in Java... Paging in/out the appropriate parts of the file is then made OS's job.
PS: There's some hearsay and speculation behind those hints. Where necessary: corrections are welcome.
Tom Anderson - 14 May 2008 19:55 GMT >> I have written quite a bit of code that loads an entire file into RAM >> then does indexOf, substring, regexes etc, working on the giant [quoted text clipped - 7 lines] > length greater than some "n", you'd at least have to load the whole file > at once. Fine, so constrain the length of a match. You lose generality, but i bet it makes no practical difference. How often do you search text for a pattern >100 characters long, let alone >1 000 000 000?
> If instead you can exclude the possibility that a match ever exceeds say > 10000 chars, then you can apply it to 20000-length blocks, and then keep > the upper 10000, load another 10000 from file and re-apply the regexp > ... Something like that. If you were clever at block boundaries, you might be able to work through the blocks quicker than that.
Back in the olden days, when memory was small (and made of magnetic doughnuts) and data lived on tapes, a lot of work was done on algorithms which didn't need to have their whole dataset in memory, and did the minimal amount of IO, and particularly non-sequential IO, to do their jobs. Particularly algorithms related to data-crunching, as you might well want to do with a computer of that period, databases and whatnot, so plenty of sorting and searching work. I bet there's stuff that could be applied to this case. Sadly, i think a lot of this knowledge has been lost - it's all written down, of course, but there isn't a living community of practice which would help one make use if it.
> The other operations are probably less of a principial problem. > [quoted text clipped - 4 lines] > just obtains a .substring in memory, or if it uses the data directly > from the given String. Good idea. I have this vague memory of String doing this internally anyway, possibly in some ancient version of java - there was a thing about getting memory leaks where you take a 10-char substring of a 10 000 000-char string, drop the big string, but the char[] doesn't get collected becaue the little string is holding on to it. Maybe that was a dream.
But a CraftyStringBuffer object that worked by maintaining indices into an existing buffer, and coalescing them or converting them to strings when it saved memory (ie when they were short) would be cool.
> Finally, you can use JNI and have C-code memory-map the whole file at > once. But then you won't get a String, but at most a byte[] visible in > Java... Paging in/out the appropriate parts of the file is then made > OS's job. No need for JNI (your own, at least) - you can memory-map files using NIO:
String fname = ... ; File f = new File(fname) ; MappedByteBuffer buf = new RandomAccessFile(f, "r").getChannel().map(FileChannel.MapMode.READ_ONLY, 0L, f.length()) ;
However, i note that Buffers use ints for indexing, so they can't handle
>2 GB files. Doh. Okay, but the offset and length for the map *are* long, so you can create a sequence of Buffers which between them tile the entire file. You can then wrap these in a GignormoBuffer class which looks like a buffer but uses long indices and routes calls to the appropriate underlying buffer.
Then you write a GignormoCharBuffer which wraps a GignormoBuffer and translates bytes to chars (and doesn't handle encodings which are not 1:k, eg Latin-1 or UTF-16, so no UTF-8).
Then you rewrite string searching and regexps to use a GignormoCharBuffer instead of a String. String searching is easy, you just rock some Boyer-Moore action. Regexps are harder. Your best bet might be to let GignormoCharBuffer produce CharSequences which are windows onto a 2 gigachar region, then apply java's regexps to those, and add logic to search a whole buffer by doing it in bits and sliding the window, including handling the joins (which might be quite hard to do cleverly). Otherwise, write regexps from the ground up to use long-based offsets (enjoy!).
So no, i don't know of a library for doing this. But it doesn't sound like an insuperable problem. But i suspect Roedy knew that already.
tom
 Signature Hit to death in the future head
Eric Sosman - 14 May 2008 18:31 GMT > I have written quite a bit of code that loads an entire file into RAM > then does indexOf, substring, regexes etc, working on the giant [quoted text clipped - 3 lines] > code to be easily transformed to work on arbitrarily large text files, > e.g. 10 gigabytes. Since a String can hold no more than two gigachars, such a package couldn't be a drop-in replacement for your current techniques. You'll need to divide the file into chunks, and your code will need to deal with the cracks between them.
Using CharSequence instead of String should allow a sort of "sliding window" to avoid storing the entire file in memory at once, but CharSequence is also limited to two gigachars: You can't implement a CharSequence that encompasses the whole ten-gig file. It could get you some space economy, but you'd still need to deal with the chunks.
 Signature Eric.Sosman@sun.com
Roedy Green - 14 May 2008 21:46 GMT > Since a String can hold no more than two gigachars, such >a package couldn't be a drop-in replacement for your current >techniques. You'll need to divide the file into chunks, and >your code will need to deal with the cracks between them. I was imagining some sort of package like String/StringBuilder but with 64-bit offsets. It would cache parts of a backing file in RAM.
The regex would have to work by working on chunks, stopping it before it got too close to a boundary, before shifting so that it was always working in the middle of a buffer. It could then do all the logic with standard Regex, but with a wrapper to handle this buffer shifting nonsense.
nio memory mapped files are not quite clever enough since they are limited by address space. nio is probably the tool you need to have a sliding window.
 Signature
Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Lew - 15 May 2008 06:02 GMT >> I have written quite a bit of code that loads an entire file into RAM >> then does indexOf, substring, regexes etc, working on the giant [quoted text clipped - 15 lines] > ten-gig file. It could get you some space economy, but you'd > still need to deal with the chunks. I wonder if a java.nio.MappedByteBuffer might be of use here.
 Signature Lew
Logan Shaw - 17 May 2008 19:32 GMT >>> I have written quite a bit of code that loads an entire file into RAM >>> then does indexOf, substring, regexes etc, working on the giant [quoted text clipped - 17 lines] > > I wonder if a java.nio.MappedByteBuffer might be of use here. I was going to suggest that, but it appears that all the offsets are ints, so the same 2^32 limitation (actually, 2^31 limitation, I guess) would apply.
- Logan
Kenneth P. Turvey - 17 May 2008 20:51 GMT > I was going to suggest that, but it appears that all the offsets are > ints, so the same 2^32 limitation (actually, 2^31 limitation, I guess) > would apply. This kind of thing is coming up more an more often. The API is going to require some major rewriting in the not too distant future. Sun is going to emphasize backward compatibility and what we are going to end up with is a rather broken and fragmented language. I don't know any good way to handle this, maybe a new set of packages for parts of the API based on longs instead?
 Signature Kenneth P. Turvey <kt-usenet@squeakydolphin.com>
Mark Space - 18 May 2008 20:24 GMT >> I was going to suggest that, but it appears that all the offsets are >> ints, so the same 2^32 limitation (actually, 2^31 limitation, I guess) [quoted text clipped - 6 lines] > handle this, maybe a new set of packages for parts of the API based on > longs instead? That's good point. It's too bad that ints and longs and other basic data types aren't treated as more like objects and able to be handled polymorphically.
I wonder how bad it would be just to take the existing API, say to heck with the spec, and replace all ints with longs, in place, without changing any method names or class names?
You'd get a lot of "need to cast from long to int here" but the you could suppress that and it would still run, right? And then you could upgrade your code at some leisure. I'm not sure of the best way for Sun to do this. Maybe some future version (Java 8?) and give the current version an extra long life so everyone has time to react.
The worst part would be in the transition, narrowing casts do not throw overflow errors, so you could have a big issue and not know it. (*coughfixthisSuncough*) I'd like to see this changed at the same time, just so such errors are caught a runtime.
I know some here are against that, because of how "important" and "convenient" bit slicing integers is, but that's a load of b.s. The obvious solution is that the idiom of catching and ignoring a cast related overflow error could be optimized to the older, faster operators that simulate bit slicing.
int i; long val = 0; try { i = (int)val; } catch( CastOverflowException x ) {}
means "just bit slice it." For convenience, Sun could add an annotation that does the same thing:
@SupressCastOverflow public int longToInt( int i ) { return val; }
will also use the "faster bit slice operator." (Sun could also make it a warning rather than an error to narrow primitives without an explicit cast, and use SuppressWarnings.) Yes, you'll have to touch your code. Sorry.
Logan Shaw - 18 May 2008 22:34 GMT > You'd get a lot of "need to cast from long to int here" but the you > could suppress that and it would still run, right? Yes, for all the things that you rebuild from source. Which creates hassles tracking everything if you do have the source, and causes other more serious problems if you use components or libraries whose source you don't have access to or whose builds aren't as reproducible as they should be.
Also, don't you not get any benefit for the code that you don't change? Although I guess maybe that's OK if you just want to change the subset of the code that matters.
Also, seems like you could get some corner cases where there are overloaded method signatures that must be resolved that weren't ambiguous before but now are because the set of possible things that it could resolve to has changed. Sadly, I don't know enough about the rules of how method calls are resolved in order to easily come up with a good example.
- Logan
Kenneth P. Turvey - 18 May 2008 22:40 GMT >> You'd get a lot of "need to cast from long to int here" but the you >> could suppress that and it would still run, right? [quoted text clipped - 4 lines] > you don't have access to or whose builds aren't as reproducible as they > should be. I think the way to handle it would be a new set of packages that had the same classes in them, but were built to handle things with longs.
Would arrays have problems with this? We would certainly want to increase the size limit of arrays as well. I think that might be a problem.
Also, we would need to deprecate all the old interfaces.
 Signature Kenneth P. Turvey <kt-usenet@squeakydolphin.com>
Arne Vajhøj - 19 May 2008 01:59 GMT > I wonder how bad it would be just to take the existing API, say to heck > with the spec, and replace all ints with longs, in place, without [quoted text clipped - 5 lines] > to do this. Maybe some future version (Java 8?) and give the current > version an extra long life so everyone has time to react. You can suppress warning not errors.
I think it would be rather bad.
You will either break a lot of existing code or make Java a lot less type safe.
Arne
Roedy Green - 21 May 2008 19:46 GMT On Sun, 18 May 2008 12:24:02 -0700, Mark Space <markspace@sbc.global.net> wrote, quoted or indirectly quoted someone who said :
> "just bit slice it." could you please define "bit slice".
do you just mean chopping off the high order bits?
 Signature
Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Jeff Higgins - 17 May 2008 20:56 GMT >> I wonder if a java.nio.MappedByteBuffer might be of use here. > > I was going to suggest that, but it appears that all the offsets > are ints, so the same 2^32 limitation (actually, 2^31 limitation, > I guess) would apply. What's wrong with java.nio.channels.FileChannel?
> - Logan Tom Anderson - 18 May 2008 00:37 GMT >>> I wonder if a java.nio.MappedByteBuffer might be of use here. >> [quoted text clipped - 3 lines] > > What's wrong with java.nio.channels.FileChannel? It's not memory-mapped.
tom
 Signature Death to all vowels! The Ministry of Truth says vowels are plus undoublethink. Vowels are a Eurasian plot! Big Brother, leading us proles to victory!
Jeff Higgins - 18 May 2008 01:10 GMT >>>> I wonder if a java.nio.MappedByteBuffer might be of use here. >>> [quoted text clipped - 5 lines] > > It's not memory-mapped. That is true. Going back to the OP, I don't a *requirement* for a memory-mapped, well anything. I suppose my point was that FileChannel has long indexes.
[quote OP] I have written quite a bit of code that loads an entire file into RAM then does indexOf, substring, regexes etc, working on the giant String, often creating a new giant String and writing it out.
I wondered if anyone had developed some sort of package to allow such code to be easily transformed to work on arbitrarily large text files, e.g. 10 gigabytes. [unquote]
JH
> tom Tom Anderson - 18 May 2008 13:23 GMT >>>>> I wonder if a java.nio.MappedByteBuffer might be of use here. >>>> [quoted text clipped - 8 lines] > That is true. Going back to the OP, I don't a *requirement* for a > memory-mapped, well anything. True! But that's what Logan and whoever came before him were getting at. If you can memory-map, you avoid having to do explicit IO, which radically simplifies searching - you just use the buffer, and let the system bring you the data at the right time.
> I suppose my point was that FileChannel has long indexes. True, but so does RandomAccessFile. A FileInputStream doesn't use indices, and that means it can read files of any size. The only thing that FileChannel adds is the ability to do asynchronous IO, which doesn't seem relevant here. All of these approaches are still hamstrung by the fact that even with long indices on the file object, you can only load up to 2 GB of data at a time, because arrays and buffers use int indices.
tom
 Signature 1 p4WN 3v3Ry+h1n G!!!
Jeff Higgins - 18 May 2008 17:17 GMT > All of these approaches are still hamstrung by the fact that even with > long indices on the file object, you can only load up to 2 GB of data at a > time, because arrays and buffers use int indices. Well, there we are.
> 1 p4WN 3v3Ry+h1n G!!! How did your opponent feel about that?
Tom Anderson - 18 May 2008 18:59 GMT >> All of these approaches are still hamstrung by the fact that even with >> long indices on the file object, you can only load up to 2 GB of data at a >> time, because arrays and buffers use int indices. > > Well, there we are. Exactly.
>> 1 p4WN 3v3Ry+h1n G!!! > > How did your opponent feel about that? Insubstantial.
tom
 Signature These spoiled youths forget that when they are shaven they look like boiled potatoes. -- Tara Singh
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|