Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / May 2008

Tip: Looking for answers? Try searching our database.

Virtual files

Thread view: 
Roedy Green - 14 May 2008 16:11 GMT
I have written quite a bit of code that loads an entire file into RAM
then does indexOf, substring, regexes etc, working on the giant
String, often creating a new giant String and writing it out.

I wondered if anyone had developed some sort of package to allow such
code to be easily transformed to work on arbitrarily large text files,
e.g. 10 gigabytes.

Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Andreas Leitgeb - 14 May 2008 17:46 GMT
> I have written quite a bit of code that loads an entire file into RAM
> then does indexOf, substring, regexes etc, working on the giant
> String, often creating a new giant String and writing it out.

Don't know of such a library, but depending on the exact actions,
such a library may be impossible to exist.

A regex quite surely needs its "haystack" all in memory, so unless
you can be sure that your particular regex will never match a string
of length greater than some "n", you'd at least have to load the whole
file at once.
If instead you can exclude the possibility that a match ever exceeds
say 10000 chars, then you can apply it to 20000-length blocks, and
then keep the upper 10000, load another 10000 from file and re-apply
the regexp ...

The other operations are probably less of a principial problem.

Another thing is, that for substrings you perhaps do not
always need to create new strings separately in memory to
write out.  The OutputStreamWriter has a version of "write"
that takes a String and two ints offset,length to write out
only a substring.  I don't know, if that just obtains a
.substring in memory, or if it uses the data directly from
the given String.
If that isn't trustworthy enough, you'll have to obtain
shorter substrings and write them out sequentially one by one.

Finally, you can use JNI and have C-code memory-map the whole
file at once.  But then you won't get a String, but at most a byte[]
visible in Java...  Paging in/out the appropriate parts of the file
is then made OS's job.

PS: There's some hearsay and speculation behind those hints.
Where necessary: corrections are welcome.
Tom Anderson - 14 May 2008 19:55 GMT
>> I have written quite a bit of code that loads an entire file into RAM
>> then does indexOf, substring, regexes etc, working on the giant
[quoted text clipped - 7 lines]
> length greater than some "n", you'd at least have to load the whole file
> at once.

Fine, so constrain the length of a match. You lose generality, but i bet
it makes no practical difference. How often do you search text for a
pattern >100 characters long, let alone >1 000 000 000?

> If instead you can exclude the possibility that a match ever exceeds say
> 10000 chars, then you can apply it to 20000-length blocks, and then keep
> the upper 10000, load another 10000 from file and re-apply the regexp
> ...

Something like that. If you were clever at block boundaries, you might be
able to work through the blocks quicker than that.

Back in the olden days, when memory was small (and made of magnetic
doughnuts) and data lived on tapes, a lot of work was done on algorithms
which didn't need to have their whole dataset in memory, and did the
minimal amount of IO, and particularly non-sequential IO, to do their
jobs. Particularly algorithms related to data-crunching, as you might well
want to do with a computer of that period, databases and whatnot, so
plenty of sorting and searching work. I bet there's stuff that could be
applied to this case. Sadly, i think a lot of this knowledge has been lost
- it's all written down, of course, but there isn't a living community of
practice which would help one make use if it.

> The other operations are probably less of a principial problem.
>
[quoted text clipped - 4 lines]
> just obtains a .substring in memory, or if it uses the data directly
> from the given String.

Good idea. I have this vague memory of String doing this internally
anyway, possibly in some ancient version of java - there was a thing about
getting memory leaks where you take a 10-char substring of a 10 000
000-char string, drop the big string, but the char[] doesn't get collected
becaue the little string is holding on to it. Maybe that was a dream.

But a CraftyStringBuffer object that worked by maintaining indices into an
existing buffer, and coalescing them or converting them to strings when it
saved memory (ie when they were short) would be cool.

> Finally, you can use JNI and have C-code memory-map the whole file at
> once.  But then you won't get a String, but at most a byte[] visible in
> Java...  Paging in/out the appropriate parts of the file is then made
> OS's job.

No need for JNI (your own, at least) - you can memory-map files using NIO:

String fname = ... ;
File f = new File(fname) ;
MappedByteBuffer buf =
    new RandomAccessFile(f, "r").getChannel().map(FileChannel.MapMode.READ_ONLY, 0L, f.length()) ;

However, i note that Buffers use ints for indexing, so they can't handle
>2 GB files. Doh.

Okay, but the offset and length for the map *are* long, so you can create
a sequence of Buffers which between them tile the entire file. You can
then wrap these in a GignormoBuffer class which looks like a buffer but
uses long indices and routes calls to the appropriate underlying buffer.

Then you write a GignormoCharBuffer which wraps a GignormoBuffer and
translates bytes to chars (and doesn't handle encodings which are not 1:k,
eg Latin-1 or UTF-16, so no UTF-8).

Then you rewrite string searching and regexps to use a GignormoCharBuffer
instead of a String. String searching is easy, you just rock some
Boyer-Moore action. Regexps are harder. Your best bet might be to let
GignormoCharBuffer produce CharSequences which are windows onto a 2
gigachar region, then apply java's regexps to those, and add logic to
search a whole buffer by doing it in bits and sliding the window,
including handling the joins (which might be quite hard to do cleverly).
Otherwise, write regexps from the ground up to use long-based offsets
(enjoy!).

So no, i don't know of a library for doing this. But it doesn't sound like
an insuperable problem. But i suspect Roedy knew that already.

tom

Signature

Hit to death in the future head

Eric Sosman - 14 May 2008 18:31 GMT
> I have written quite a bit of code that loads an entire file into RAM
> then does indexOf, substring, regexes etc, working on the giant
[quoted text clipped - 3 lines]
> code to be easily transformed to work on arbitrarily large text files,
> e.g. 10 gigabytes.

    Since a String can hold no more than two gigachars, such
a package couldn't be a drop-in replacement for your current
techniques.  You'll need to divide the file into chunks, and
your code will need to deal with the cracks between them.

    Using CharSequence instead of String should allow a sort
of "sliding window" to avoid storing the entire file in memory
at once, but CharSequence is also limited to two gigachars:
You can't implement a CharSequence that encompasses the whole
ten-gig file.  It could get you some space economy, but you'd
still need to deal with the chunks.

Signature

Eric.Sosman@sun.com

Roedy Green - 14 May 2008 21:46 GMT
>     Since a String can hold no more than two gigachars, such
>a package couldn't be a drop-in replacement for your current
>techniques.  You'll need to divide the file into chunks, and
>your code will need to deal with the cracks between them.

I was imagining some sort of package like String/StringBuilder but
with 64-bit offsets.  It would cache parts of a backing file in RAM.

The regex would have to work by working on chunks, stopping it before
it got too close to a boundary, before shifting so that it was always
working in the middle of a buffer.  It could then do all the logic
with standard Regex, but with a wrapper to handle this buffer shifting
nonsense.

nio memory mapped files are not quite clever enough since they are
limited by address space.  nio is probably the tool you need to have a
sliding window.
Signature


Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Lew - 15 May 2008 06:02 GMT
>> I have written quite a bit of code that loads an entire file into RAM
>> then does indexOf, substring, regexes etc, working on the giant
[quoted text clipped - 15 lines]
> ten-gig file.  It could get you some space economy, but you'd
> still need to deal with the chunks.

I wonder if a java.nio.MappedByteBuffer might be of use here.

Signature

Lew

Logan Shaw - 17 May 2008 19:32 GMT
>>> I have written quite a bit of code that loads an entire file into RAM
>>> then does indexOf, substring, regexes etc, working on the giant
[quoted text clipped - 17 lines]
>
> I wonder if a java.nio.MappedByteBuffer might be of use here.

I was going to suggest that, but it appears that all the offsets
are ints, so the same 2^32 limitation (actually, 2^31 limitation,
I guess) would apply.

  - Logan
Kenneth P. Turvey - 17 May 2008 20:51 GMT
> I was going to suggest that, but it appears that all the offsets are
> ints, so the same 2^32 limitation (actually, 2^31 limitation, I guess)
> would apply.

This kind of thing is coming up more an more often.  The API is going to
require some major rewriting in the not too distant future.  Sun is going
to emphasize backward compatibility and what we are going to end up with
is a rather broken and fragmented language.  I don't know any good way to
handle this, maybe a new set of packages for parts of the API based on
longs instead?

Signature

Kenneth P. Turvey <kt-usenet@squeakydolphin.com>

Mark Space - 18 May 2008 20:24 GMT
>> I was going to suggest that, but it appears that all the offsets are
>> ints, so the same 2^32 limitation (actually, 2^31 limitation, I guess)
[quoted text clipped - 6 lines]
> handle this, maybe a new set of packages for parts of the API based on
> longs instead?

That's good point.  It's too bad that ints and longs and other basic
data types aren't treated as more like objects and able to be handled
polymorphically.

I wonder how bad it would be just to take the existing API, say to heck
with the spec, and replace all ints with longs, in place, without
changing any method names or class names?

You'd get a lot of "need to cast from long to int here" but the you
could suppress that and it would still run, right?  And then you could
upgrade your code at some leisure.  I'm not sure of the best way for Sun
to do this.  Maybe some future version (Java 8?) and give the current
version an extra long life so everyone has time to react.

The worst part would be in the transition, narrowing casts do not throw
overflow errors, so you could have a big issue and not know it.
(*coughfixthisSuncough*)  I'd like to see this changed at the same time,
just so such errors are caught a runtime.

I know some here are against that, because of how "important" and
"convenient" bit slicing integers is, but that's a load of b.s.  The
obvious solution is that the idiom of catching and ignoring a cast
related overflow error could be optimized to the older, faster operators
that simulate bit slicing.

  int i;
  long val = 0;
  try {
      i = (int)val;
  } catch( CastOverflowException x )
  {}

means "just bit slice it."  For convenience, Sun could add an annotation
that does the same thing:

  @SupressCastOverflow
  public int longToInt( int i ) {
    return val;
  }

will also use the "faster bit slice operator." (Sun could also make it a
warning rather than an error to narrow primitives without an explicit
cast, and use SuppressWarnings.)  Yes, you'll have to touch your code.
Sorry.
Logan Shaw - 18 May 2008 22:34 GMT
> You'd get a lot of "need to cast from long to int here" but the you
> could suppress that and it would still run, right?

Yes, for all the things that you rebuild from source.  Which creates
hassles tracking everything if you do have the source, and causes
other more serious problems if you use components or libraries whose
source you don't have access to or whose builds aren't as reproducible
as they should be.

Also, don't you not get any benefit for the code that you don't change?
Although I guess maybe that's OK if you just want to change the subset
of the code that matters.

Also, seems like you could get some corner cases where there are
overloaded method signatures that must be resolved that weren't
ambiguous before but now are because the set of possible things
that it could resolve to has changed.  Sadly, I don't know enough
about the rules of how method calls are resolved in order to easily
come up with a good example.

  - Logan
Kenneth P. Turvey - 18 May 2008 22:40 GMT
>> You'd get a lot of "need to cast from long to int here" but the you
>> could suppress that and it would still run, right?
[quoted text clipped - 4 lines]
> you don't have access to or whose builds aren't as reproducible as they
> should be.

I think the way to handle it would be a new set of packages that had the
same classes in them, but were built to handle things with longs.

Would arrays have problems with this?  We would certainly want to
increase the size limit of arrays as well.  I think that might be a
problem.  

Also, we would need to deprecate all the old interfaces.

Signature

Kenneth P. Turvey <kt-usenet@squeakydolphin.com>

Arne Vajhøj - 19 May 2008 01:59 GMT
> I wonder how bad it would be just to take the existing API, say to heck
> with the spec, and replace all ints with longs, in place, without
[quoted text clipped - 5 lines]
> to do this.  Maybe some future version (Java 8?) and give the current
> version an extra long life so everyone has time to react.

You can suppress warning not errors.

I think it would be rather bad.

You will either break a lot of existing code or make
Java a lot less type safe.

Arne
Roedy Green - 21 May 2008 19:46 GMT
On Sun, 18 May 2008 12:24:02 -0700, Mark Space
<markspace@sbc.global.net> wrote, quoted or indirectly quoted someone
who said :

> "just bit slice it."
could you please define "bit slice".

do you just mean chopping off the high order bits?
Signature


Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Jeff Higgins - 17 May 2008 20:56 GMT
>> I wonder if a java.nio.MappedByteBuffer might be of use here.
>
> I was going to suggest that, but it appears that all the offsets
> are ints, so the same 2^32 limitation (actually, 2^31 limitation,
> I guess) would apply.

What's wrong with java.nio.channels.FileChannel?
>   - Logan
Tom Anderson - 18 May 2008 00:37 GMT
>>> I wonder if a java.nio.MappedByteBuffer might be of use here.
>>
[quoted text clipped - 3 lines]
>
> What's wrong with java.nio.channels.FileChannel?

It's not memory-mapped.

tom

Signature

Death to all vowels! The Ministry of Truth says vowels are plus
undoublethink. Vowels are a Eurasian plot! Big Brother, leading us proles
to victory!

Jeff Higgins - 18 May 2008 01:10 GMT
>>>> I wonder if a java.nio.MappedByteBuffer might be of use here.
>>>
[quoted text clipped - 5 lines]
>
> It's not memory-mapped.

That is true.  Going back to the OP, I don't a *requirement*
for a memory-mapped, well anything.  I suppose my point was that
FileChannel has long indexes.

[quote OP]
I have written quite a bit of code that loads an entire file into RAM
then does indexOf, substring, regexes etc, working on the giant
String, often creating a new giant String and writing it out.

I wondered if anyone had developed some sort of package to allow such
code to be easily transformed to work on arbitrarily large text files,
e.g. 10 gigabytes.
[unquote]

JH

> tom
Tom Anderson - 18 May 2008 13:23 GMT
>>>>> I wonder if a java.nio.MappedByteBuffer might be of use here.
>>>>
[quoted text clipped - 8 lines]
> That is true.  Going back to the OP, I don't a *requirement* for a
> memory-mapped, well anything.

True! But that's what Logan and whoever came before him were getting at.
If you can memory-map, you avoid having to do explicit IO, which radically
simplifies searching - you just use the buffer, and let the system bring
you the data at the right time.

> I suppose my point was that FileChannel has long indexes.

True, but so does RandomAccessFile. A FileInputStream doesn't use indices,
and that means it can read files of any size. The only thing that
FileChannel adds is the ability to do asynchronous IO, which doesn't seem
relevant here. All of these approaches are still hamstrung by the fact
that even with long indices on the file object, you can only load up to 2
GB of data at a time, because arrays and buffers use int indices.

tom

Signature

1 p4WN 3v3Ry+h1n G!!!

Jeff Higgins - 18 May 2008 17:17 GMT
> All of these approaches are still hamstrung by the fact that even with
> long indices on the file object, you can only load up to 2 GB of data at a
> time, because arrays and buffers use int indices.

Well, there we are.

> 1 p4WN 3v3Ry+h1n G!!!
How did your opponent feel about that?
Tom Anderson - 18 May 2008 18:59 GMT
>> All of these approaches are still hamstrung by the fact that even with
>> long indices on the file object, you can only load up to 2 GB of data at a
>> time, because arrays and buffers use int indices.
>
> Well, there we are.

Exactly.

>> 1 p4WN 3v3Ry+h1n G!!!
>
> How did your opponent feel about that?

Insubstantial.

tom

Signature

These spoiled youths forget that when they are shaven they look like
boiled potatoes. -- Tara Singh



Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.