Java Forum / General / December 2006
Bignums, object grind, and garbage collection
John Ersatznom - 16 Dec 2006 02:58 GMT I'm working on a Java project in which I'm wrangling with three intertwined issues: bignum arithmetic, a need to optimize memory use/gc behavior, and a need for speed. As such, I've got the following apparent options, which each have caveats as well as good points:
1. Use BigInteger and BigDecimal Pros: * Already done for me. * Clean and safe. * Using code will be easy to understand by others. * 100% Pure Java; platform independent. Cons: * Since they're immutable, lengthy calculations will generate large numbers of "new BigFoo" and heaps of discarded objects for the GC to process. * I'm not sure the implementation is as fast as possible. In particular, it doesn't seem to be documented anywhere whether these classes switch to Karatsuba multiplication and then to FFTs at appropriate bit lengths or just stick with slow, O(n^2) long multiplication throughout. 2. Use something like the GNU math library via JNI; there are libraries out there that use sensible multiplication implementations at large bit sizes. Pros: * Already done for me save creating the wrapper classes * I can probably make my bignums mutable, so instances can be assigned to and otherwise recycled reducing newing and GCing. * Fast multiplies. Cons: * Non-immutable objects will make the using code harder to understand and more error-prone. * Not using BigDecimal and BigInteger will make the using code less clear. * Going through JNI is slow, so some inner loops would also have to be implemented in C and called down through JNI. * Not 100% Pure Java; will require compiling some C and some Java to build on any given box. The C can at least be highly portable, since all the UI/graphics/whatever would be done in Java. 3. Roll my own mutable bignum classes, presumably in Java. Pros: * I can definitely make my bignums mutable, so instances can be assigned to and otherwise recycled reducing newing and GCing. * I can use fast implementations for various bit sizes. * I can avoid JNI and have 100% Pure Java, platform independent. Cons: * Non-immutable objects will make the using code harder to understand and more error-prone. * Not using BigDecimal and BigInteger will make the using code less clear. * A lot of work to do myself, some of it doubtless wheel- reinventing. * May not be as fast as native code. * May not be as fast as BigDecimal and BigInteger for small bit lengths until the effects of object churn are felt in the form of lots of GC CPU use and a process size in the nine-figures. * Implementing operations will be tricky with no easy way to trap overflows. For efficient adds, there's a fairly obvious method using arrays of longs with 63 bits of bignum per long, using the sum becoming negative as a carry flag. OTOH, multiplies look like they'd prefer using ints with 31 bits per int and intermediate longs, for the basic long division multiply or Karatsuba. (FFT would require a wholly different implementation and I'd cross that when I came to it; my early priority would be getting a working bignum using just long multiplication.) 4. Find some library or code that resulted from someone else doing 3.
The tradeoffs among the above seem to depend at least partially on the vagaries of the VM, and in particular of the GC. It's not hard to imagine a VM implementation smart enough to recycle objects of a frequently used class under the hood. It still seems unlikely that it can be as efficient as explicitly freeing or assigning to the bignums under some circumstances.
My determinations so far are: * Using JNI for anything is a last resort. * The choice is really down to mutable bignums vs. BigInteger and BigDecimal.
If the bignums need to be mutable to get efficiently running code that doesn't bloat up the process to hundreds of megs and spend more time in GC than in my own code, then option 1 is out. On the other hand, if getting decent multiplies at large bit sizes isn't going to happen with BigFoo, then option 1 is out and I may as well make the darn things mutable.
Since I'm not aware of any pre-existing code suitable for 4, and JNI is a last resort, it's basically down to 1 vs. 3.
What I'd like is: a) People to weigh in on bignums and garbage collection: * Would code that generates billions of intermediate values during calculations but keeps none of them for very long be seriously bogged down and the memory requirements bloated by using the immutable bignums? * Are the multiplies in the standard BigFoos inefficient when the values are hundreds or even thousands of bits long? b) In case either answer above is "yes" and I need to make my own: * How best to go about implementing the basic, non-FFT bignum? Use ints or longs? Tips on implementing multiplication (and subtraction!)? * What are sensible thresholds for switching implementations, in terms of bit lengths, to keep multiplies fast? * How best to implement a BigDecimal-alike given a BigInteger-alike?
Thank you.
John W. Kennedy - 16 Dec 2006 03:40 GMT > * I'm not sure the implementation is as fast as > possible. In particular, it doesn't seem to be > documented anywhere whether these classes switch to > Karatsuba multiplication and then to FFTs at > appropriate bit lengths or just stick with slow, O(n^2) > long multiplication throughout. For whatever it's worth, Ruby finds the first 14 perfect numbers a /lot/ faster than Java does. (Perl is a lot slower. GCLISP is fastest.)
 Signature John W. Kennedy "The blind rulers of Logres Nourished the land on a fallacy of rational virtue." -- Charles Williams. "Taliessin through Logres: Prelude"
John Ersatznom - 16 Dec 2006 03:59 GMT >> * I'm not sure the implementation is as fast as >> possible. In particular, it doesn't seem to be [quoted text clipped - 5 lines] > For whatever it's worth, Ruby finds the first 14 perfect numbers a /lot/ > faster than Java does. (Perl is a lot slower. GCLISP is fastest.) Is Ruby a broadly-ported system with free* development tools, a similar range of standard classes and APIs including decent GUI and other functionality, with a decent free* IDE available, and one that's somehow on-topic for comp.lang.java.programmer? If so, I may look into it.
And I very much doubt any flavor of LISP is faster than C or native assembly for this kind of thing, absent some kind of contrived circumstance or using purpose-built LISP-oriented hardware.
* As in speech but just as importantly as in beer
John W. Kennedy - 16 Dec 2006 04:34 GMT >>> * I'm not sure the implementation is as fast as >>> possible. In particular, it doesn't seem to be [quoted text clipped - 10 lines] > functionality, with a decent free* IDE available, and one that's somehow > on-topic for comp.lang.java.programmer? If so, I may look into it. It's got most of that, in fact. But my point is that, even though Ruby is interpreted -- not even bytecoded -- its big-number performance is about five times faster than Java's, HotSpot and all, which rather implies that Java's big numbers could be a lot sleeker than they are.
> And I very much doubt any flavor of LISP is faster than C or native > assembly for this kind of thing, absent some kind of contrived > circumstance or using purpose-built LISP-oriented hardware. C and Assembler don't have intrinsic support of big numbers.
 Signature John W. Kennedy "The blind rulers of Logres Nourished the land on a fallacy of rational virtue." -- Charles Williams. "Taliessin through Logres: Prelude"
John Ersatznom - 16 Dec 2006 05:15 GMT >>>> * I'm not sure the implementation is as fast as >>>> possible. In particular, it doesn't seem to be [quoted text clipped - 16 lines] > about five times faster than Java's, HotSpot and all, which rather > implies that Java's big numbers could be a lot sleeker than they are. Sounds like I need to go with my own bignums then.
Any suggestions on implementing the basic array-of-primitive-integer-type case are welcome, as well as any pointers to freely reusable libraries for the purpose. (They must be freely distributable with the calling code. LGPL preferred, and failing that some kind of open source preferred, but anything that can be called from GPL and LGPL code and binary distributed with same would do. Requiring a nonstandard native library is a big minus.)
nebulous99@gmail.com - 16 Dec 2006 05:57 GMT > Any suggestions on implementing the basic > array-of-primitive-integer-type case are welcome, as well as any [quoted text clipped - 3 lines] > from GPL and LGPL code and binary distributed with same would do. > Requiring a nonstandard native library is a big minus.) Here's two, but they might have some downsides.
Apfloat: http://www.apfloat.org/apfloat_java/ Claims to be faster than BigInteger etc. and it's LGPL. But these seem to be immutable objects. If you're running some kind of science simulation that generates and discards billions of distinct values in the space of a few hours there'll be a lot of gc overhead and memory wastage, I expect (in another post you implied these rates of object generation would occur in your application). Apfloat seems to use the disk for some reason for temporary files and
JScience: http://en.wikipedia.org/wiki/JScience Also claims to be faster, and to have some kind of recycling system under the hood to cut down on GC use, but it is also huge and complex. Studying it to learn how to effectively use the bits you need for AP math would probably take a while. It's also open source (BSD license). See also http://en.wikipedia.org/wiki/Javolution which mentions the object recycling, and the home pages linked from the wikipedia articles. JScience is apparently built on it. Javolution is also BSD-licensed.
Both libraries appear to be 1.5-aware (generics syntax is evident in their javadocs, e.g. Foo<Bar>) and may require 1.5 (I'm pretty sure Apfloat does).
If I were you, I'd go with JScience or roll-your-own, with Apfloat a dark-horse candidate if you have a magic VM that can cope with millions of allocations a second of small to medium-size objects and arrays without turning into molasses or selling OutOfMemoryErrors like hotcakes, or a multi-GB-RAM workstation with 64-bit everything out the wazoo. If you need speed, the built-in BigFoos are, of course, out of the question.
John Ersatznom - 16 Dec 2006 06:18 GMT > JScience: http://en.wikipedia.org/wiki/JScience > Also claims to be faster, and to have some kind of recycling system [quoted text clipped - 5 lines] > articles. JScience is apparently built on it. Javolution is also > BSD-licensed. This may be just what the doctor ordered; thank you.
I especially like these bits:
"Operations on instances of this class are quite fast as information substantially below the precision level (aka noise) is not processed/stored. There is no limit on a real precision but precision degenerates (due to numeric errors) and calculations accelerate as more and more operations are performed." (Real class API.)
"Transparent Object Recycling For example, our benchmark indicates that adding immutable LargeInteger is up to 8-10x faster than adding java.math.BigInteger" (Main page. And with semantically-immutable objects?!)
I can use these under the hood, and maybe even use some of the realtime stuff throughout the project, which may eventually want or need distributed capabilities and suchlike. XML externalization as an alternative to Serializable is also attractive on a number of levels.
Lew - 16 Dec 2006 18:44 GMT > "Transparent Object Recycling For example, our benchmark indicates that > adding immutable LargeInteger is up to 8-10x faster than adding > java.math.BigInteger" (Main page. And with semantically-immutable > objects?!) Be cautious making assumptions about how fast or slow GC makes things.
The GC is fast indeed for multiplicities of small objects with short lifespans. They are quickly discarded and quickly reclaimed.
It is also tunable, and different algorithms are invokable.
Object creation time may be a tad difficult, but I have read that Java's creation times are faster than most people believe. I have never heard that Sun's JVMs re-use objects in some sort of creation pool, but that doesn't mean that they don't.
The Java lib versions may use O(n^2) algorithms, but a hand-rolled BigFoo with immutable fields and good algorithms may be faster than you suspect.
Do not overlook the possible influence of JIT compilation to optimize away apparent inefficiencies in your source, especially under load over time.
Anything involving billions of anything will be slow.
You are probably stuck with having to implement prototypes of more than one approach, and actually profile and measure them. You could start with the LPGL libraries nebulous pointed out, compared with Java's standard implementations.
You clearly know your algorithms. The right algorithms will have so much more effect than putative difficulties with GC or object creation times.
You can also throw hardware at the problem, like multiway / multithreaded multiprocessors.
- Lew
John Ersatznom - 16 Dec 2006 22:38 GMT > The GC is fast indeed for multiplicities of small objects with short > lifespans. They are quickly discarded and quickly reclaimed. > > It is also tunable, and different algorithms are invokable. I can't expect the end-users to do lots of tuning as part of the installation, I'm afraid. Unless there's a way to include such tuning in the distribution, and have the tuning work under different use patterns by different users, and have it not impact anything else on the deployment system, including other Java stuff that may live there.
> Object creation time may be a tad difficult, but I have read that Java's > creation times are faster than most people believe. I have never heard > that Sun's JVMs re-use objects in some sort of creation pool, but that > doesn't mean that they don't. It still seems to me that given these two choices, the former must invariably be faster:
x.add(y) changing the fields in x, whose old value is no longer needed and where the add can be implemented using either no temporaries, or reused temporaries.
Foo z = x.add(y) where in addition to the fields in z having to be set, which is the same work done above, but also: * a new Foo has to be allocated * The disused x has to be garbage collected later (Here the add impmementation ends with something like "return new Foo(args)".)
However little amortized overhead the additional allocation of z and the collecting of x takes, it isn't zero and it adds up.
The memory usage is clearly also higher.
> Do not overlook the possible influence of JIT compilation to optimize > away apparent inefficiencies in your source, especially under load over > time. If you can convince me that the Hotspot server VM JIT is smart enough to change Foo z = x.add(y) under the hood into effectively x.add(y) where x is simply changed in content and renamed as z. If x is a local variable and the VM can *prove* that x gets the sole reference to its object, and that reference never leaves the function (x being a temporary in a calculation), then that might be possible. Of course, that depends on several things: probably the object x refers to had to be allocated in the same method using x, rather than coming in in a parameter; the class must be provably not keeping an internal reference, such as for a cache...
> Anything involving billions of anything will be slow. True. And it will be more sensitive to inefficiencies. A single calculation taking a fraction of a second can be 30% slower and no-one cares, so long as it's isolated and the throughput bottleneck remains user interaction or I/O. A simulation running for hours taking 130 hours instead of 100 is taking more than an extra *day*, which is definitely noticeable! Also, anything involving billions of anything is potentially going to crash with OOME. (Most VMs including Sun's also get unstable when memory gets scarce, and abnormal termination of the VM is IME frequent when memory is tight and especially if the app has already recovered from OOME, and quite rare otherwise at least for Sun VMs.) If we have a sequential calculation, such as a simulation with successive states, ideally we only need a process size around twice the size of the state plus a little -- to hold the existing state, the next state being built with reference to the existing state, and some temporaries. At the end of the cycle, the references to next state and existing state can simply be swapped, and the new state overwriting the two-iterations-ago state as it gets built in turn, as well as the temporaries overwritten.
Use immutable objects and lots of "new" each cycle, however, and the picture changes. The size grows with every iteration. Then if you're lucky the gc begins to use a significant fraction of the CPU and the size stabilizes, with old state data being deallocated at the same rate as new state data is allocated. The process is somewhat larger and (maybe only slightly) slower than the picture painted above, but it's stable and it's pretty internally rigorous, as there's no danger something is overwritten whose old value is still being used, creating errors that creep into the calculations quietly.
If you're unlucky the gc can't keep up with the sheer rate of object allocation demand and either the gc dominates the cpu time as the simulation is forced to slow down until it only wants new objects at the maximum rate the gc/allocation code of the VM can produce them or the gc can't cope and OOME gets thrown. Then the simulation crashes, or it tries to save or recover only to risk the VM crashing.
> You are probably stuck with having to implement prototypes of more than > one approach, and actually profile and measure them. You could start > with the LPGL libraries nebulous pointed out, compared with Java's > standard implementations. Actually they were BSD licensed, but that's even more liberal from the look of things.
> You clearly know your algorithms. The right algorithms will have so much > more effect than putative difficulties with GC or object creation times. The right algorithms combined with the sleekest memory management will be better than either by itself. :)
> You can also throw hardware at the problem, like multiway / > multithreaded multiprocessors. The right algorithms, the sleekest memory management, and the fastest hardware will be better than any two by themselves. ;)
I know it's not crucial to heavily optimize quick, event-driven bits of code; only stuff that, in total, takes a user-perceptible amount of time. But the runs may take hours between the user initiating one and getting a result. A 3-ms routine taking one millisecond longer to respond to a keystroke is not noticeable. A 3-hour job slowing to take 4 hours is.
Lew - 17 Dec 2006 04:40 GMT > ... several salient points ... You make good points. I am curious how well the JVM would keep up with immutable classes in an app like yours, and how much difference it really makes to use mutable instances.
Since the Sun classes evidently use a slow algorithm you are thrown into the world of third parties or roll-yer-own anyway.
- Lew
John Ersatznom - 17 Dec 2006 14:30 GMT > > ... several salient points ... > [quoted text clipped - 4 lines] > Since the Sun classes evidently use a slow algorithm you are thrown into > the world of third parties or roll-yer-own anyway. I'm currently leaning toward rolling my own wrapper classes, but using the JScience LargeInteger under the hood with a PoolContext. I had a look at the LargeInteger source code, and it uses normal multiplies up to a point and then Karatsuba multiplies, which looks efficient. OTOH, the Real class in JScience does pretty much every operation twice to maintain a scientifically-controlled error bound. For some numerical stuff this would be important, but for some of my plans I need speed and behavior similar to a normal float or double that happens to have a huge mantissa, which means wrapping LargeInteger with a power-of-2 exponent and doing everything only once. :)
Eventually I want an FFT implementation to throw at floating-point calculations reaching the several thousands of bits, though, as well, but I'll cross that one when I come to it.
One thing I'll probably be doing is dropping the LSBs off the more accurate operand in adds and subtracts. Their effect disappears in the noise anyway.
Lew - 17 Dec 2006 17:29 GMT > I'm currently leaning toward rolling my own wrapper classes, but using > the JScience LargeInteger under the hood with a PoolContext. Is a PoolContext an object pool for the LargeInteger objects? If so, it might have poorer performance than small immutable objects, since long-lived objects are subject to more expensive GC than short-lived ones.
Or not.
I haven't tried such a thing as yours, and everything I've read about GC makes it sound unpredictable and mysterious.
Only test runs of your classes with immutable vs. pooled objects will tell us for sure. With different GC strategies. I wonder if anyone has run comparable tests before.
- Lew
John Ersatznom - 17 Dec 2006 19:00 GMT >> I'm currently leaning toward rolling my own wrapper classes, but using >> the JScience LargeInteger under the hood with a PoolContext. > > Is a PoolContext an object pool for the LargeInteger objects? If so, it > might have poorer performance than small immutable objects, since > long-lived objects are subject to more expensive GC than short-lived ones. This assumes that a) LargeIntegers are, despite their name, small :) and b) the PoolContext objects get gc'd so much more expensively it outweighs their getting gc'd much less frequently. (One LargeInteger that gets 10 uses gets 1 more expensive gc per 10 uses instead of 10 less expensive gcs per 10 uses. So it has to be at least Nx more expensive for one that lives N times as long as usual.
> Only test runs of your classes with immutable vs. pooled objects will > tell us for sure. With different GC strategies. I wonder if anyone has > run comparable tests before. I'm going to, before too too much longer, although there's a somewhat confounded benchmark at http://jscience.org/doc/benchmark.html -- confounded because it doesn't separate the effects of PoolContexting from the effects of using LargeInteger in place of BigInteger. Separate comparisons of LargeInteger (PoolContext) to LargeInteger (no context) and LargeInteger (no context) to BigInteger would be more informative.
Chris Smith - 16 Dec 2006 03:46 GMT > What I'd like is: > a) People to weigh in on bignums and garbage collection: [quoted text clipped - 3 lines] > memory requirements bloated by using the immutable > bignums? Because your application will be different from all others, this is hard to answer in general. What I can do is say that there is good reason to entertain the possibility that this may not actually be a problem at all. Implementations of garbage collectors for Java are tuned to be as efficient as possible with relatively small, short-lived objects; and it is even theoretically faster that it will be faster than a non-garbage- collected implementation in this case.
The performance figures here will depend critically on the environmental factors: how much memory is available as GC slack, the sizes of the nursery and first intermediate generations relative to the rate of allocation, and so forth. The best thing to do would be to implement a non-trivial part of the calculation -- something reasonably representative of the kinds of calculation that you expect to do in the final product -- and test it to determine whether the performance is acceptable. For helpful information, try reading the advice at http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html. Especially the info on command line options for tuning allocation segment sizes should be worthwhile.
> * Are the multiplies in the standard BigFoos inefficient > when the values are hundreds or even thousands of bits > long? The source code for BigInteger and BigDecimal are available in the src.zip file that comes with every JDK installation. Briefly scanning the BigInteger implementation, it looks like it's quadratic on the lengths of the values, which means you're out of luck here.
I'll leave your (b) questions for someone who knows more than I.
 Signature Chris Smith
John Ersatznom - 16 Dec 2006 04:06 GMT > Because your application will be different from all others, this is hard > to answer in general. What I can do is say that there is good reason to [quoted text clipped - 3 lines] > is even theoretically faster that it will be faster than a non-garbage- > collected implementation in this case. I don't see how it can be faster, at least for the case of using new...gc versus a mutable object for a temporary, rather than something long-lived with an indefinite lifecycle.
> Especially the > info on command line options for tuning allocation segment sizes should > be worthwhile. Not something I want the users to have to muck about with if I can help it. For them it should be plug-and-play if at all possible. And without making global settings changes that will muck up every *other* Java app or applet they use, either.
> The source code for BigInteger and BigDecimal are available in the > src.zip file that comes with every JDK installation. Briefly scanning > the BigInteger implementation, it looks like it's quadratic on the > lengths of the values, which means you're out of luck here. And BigDecimal? Or is it implemented over BigInteger?
Chris Smith - 16 Dec 2006 15:28 GMT > > Because your application will be different from all others, this is hard > > to answer in general. What I can do is say that there is good reason to [quoted text clipped - 7 lines] > new...gc versus a mutable object for a temporary, rather than something > long-lived with an indefinite lifecycle. Loosely speaking, it's because the time required for garbage collection can be made proportional to the size of objects that survive the collection (+ the size of the root set). Manual memory management is always proportional to the number (perhaps also size) of objects that do not survive the collection. If you're using a lot of temporary objects, the number of objects that don't survive in the newest generations can be far higher than the number that do.
> > Especially the > > info on command line options for tuning allocation segment sizes should [quoted text clipped - 4 lines] > making global settings changes that will muck up every *other* Java app > or applet they use, either. Presumably, you'll have some kind of script (or Windows .lnk file, or whatever) that runs your application. That's where you'd put the command line options. It would affect anything that's not run with that script.
> And BigDecimal? Or is it implemented over BigInteger? I don't know. The point was that you can check src.zip and read the source code.
 Signature Chris Smith
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|