Java Forum / General / January 2006
Optimising Garbage Collection
Roedy Green - 14 Jan 2006 05:43 GMT Consider a program where you read a file into RAM in one great String, process it, create a new version, then if it is different, write it back on one fell swoop. Repeat for thousands of files.
From GC's point of view, you are creating whacking two huge byte arrays for the I/O, a huge StringBuilder and two huge String objects then discarding them for every file.
Ideally there should be some sort of mini GC where just these objects are reclaimed, leaving the rest alone.
A generational collector obviously helps. Is there anything simple I can do to boost this along.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Tony Morris - 13 Jan 2006 20:11 GMT > Consider a program where you read a file into RAM in one great String, > process it, create a new version, then if it is different, write it [quoted text clipped - 12 lines] > Canadian Mind Products, Roedy Green. > http://mindprod.com Java custom programming, consulting and coaching. Hello Roedy, If you are doing small modifications to the class java.lang.String before performing some operation on them, you might consider not creating large copies of them at all. For example, in order to change one character of a String, you must perform a data copy with your modification. If what you really want is that String to appear to have that character changed through the String contract (its methods), then it is an unfortunate limitation of both java.lang.String and arrays in that they are restricted to an in-memory representation of their backing data while exposing too much contract i.e. not workaroundable (java.lang.CharSequence is a frivoleous attempt to prevent this restriction). It would be better instead to have that data arbitrarily represented in a type that provides an appropriate abstraction, and some method that makes that abstraction appear as if a character were changed, without enforcing that a data copy occurred. In fact, it would be better if that data copy didn't occur and allow the client to decide when to copy for performance reasons.
To illustrate an example, consider the following, more appropriate abstraction of an array: interface Array<E> { int length(); void set(E e, int index) throws ArrayIndexOutOfBoundsException; E get(int index) throws ArrayIndexOutOfBoundsException; }
An implementation of this interface is not restricted to an in-memory representation, but suppose it was anyway - for simplicity. Now if I wanted to change the element at the nth index, a typical array would force a copy with something like System.arraycopy. Since you don't have that restriction here - since you have a more appropriate abstraction - this enforcement does not hold.
You could write the following interface: interface ReplaceNthElementOfArray { <E> Array<E> replace(int n, E e, Array<E> a); } Implementations are not forced to perform a data copy in order to fulfill the contract, since it has much less excess than the contract of a typical array.
<plug shame="high"> I am not sure if all of this is even relevant to your question (I am a bit vague on what you really want), but if it is, I can point you to ContractualJ [http://contractualj.com], which has attempted to provide a more appropriate abstraction of both arrays, and Strings by way of a net.tmorris.adt.Sequence. In the case of an ordered sequence of characters, I prefer to use a Sequence<Character> along with a net.tmorris.primitives.Charable over a java.lang.String, since this allows me to perform operations on the contract without the intrinsic excess contract of most (all?) Java types. As a consequence, this often results in significant performance improvements - no need to copy data due to excess type contract. </plug>
-- Tony Morris http://tmorris.net/ Java Questions and Answers http://jqa.tmorris.net/
Roedy Green - 14 Jan 2006 06:25 GMT >If you are doing small modifications to the class java.lang.String before >performing some operation on them, you might consider not creating large >copies of them at all I have a number of applications that process a file in one fell swoop.
1. seek out the pattern <!-- macro XXXX parm1 parm2 ... -->and expand the macros, replacing any previous expansion. The macros are written in Java.
2. compact removing whitespace from HTML that has no effect on rendering.
3. clean up the "see" section at the bottom of each page, sorting links, and resolving words into links.
4. remove excess blank lines
5. align in columns.
6. spaces to tabs
7. tabs to spaces
8. search for <DT entries and extract items for indexing.
9. convert & to & appropriately
10. remove duplicate lines
11. reflow text to standard line length.
12. split files into pieces based on embedded markers.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Tony Morris - 13 Jan 2006 20:33 GMT > >If you are doing small modifications to the class java.lang.String before > >performing some operation on them, you might consider not creating large [quoted text clipped - 33 lines] > Canadian Mind Products, Roedy Green. > http://mindprod.com Java custom programming, consulting and coaching. That sounds very much to me like you do indeed need a more appropriate abstraction than java.lang.String, since performing those operations individually will incur a performance hit by way of a data copy each time (as each String instance is created). You could of course, try to do something clever with collections, but they too have issues related to inappropriate abstractions. I don't want to plug myself again, but in my opinion, I cannot find a better recommendation - except of course, write such an abstraction yourself.
Tony Morris http://tmorris.net/ Java Questions and Answers http://jqa.tmorris.net/
Stefan Ram - 14 Jan 2006 18:46 GMT >In the case of an ordered sequence of characters, I prefer to >use a Sequence<Character> along with a [quoted text clipped - 4 lines] >improvements - no need to copy data due to excess type >contract. It somewhat reminds me of the way I implement "toString" in "de.dclj.ram.type.tuple.ComparableTuple0".
Why do I have to convert my CharSequence to a String if the client would be happy with a CharSequence?
Therefore, I offer "toCharSequence" for all clients:
/** {@inheritDoc} */ public java.lang.CharSequence toCharSequence() { java.lang.CharSequence result = "()"; if( this.size > 0 ) { final StringBuilder builder = new StringBuilder(); /* ... */ result = builder; } return result; }
An then "toString" for clients who really need a string:
/** {@inheritDoc} */ public java.lang.String toString() { return this.toCharSequence().toString(); }
(The full code is avaiable at: http://www.purl.org/stefan_ram/html/ram.jar/de/dclj/ram/type/tuple/ComparableTup le.html http://www.purl.org/stefan_ram/html/ram.jar/de/dclj/ram/type/tuple/ComparableTup le0.html http://www.purl.org/stefan_ram/java/de/dclj/ram/type/tuple/ComparableTuple.java http://www.purl.org/stefan_ram/java/de/dclj/ram/type/tuple/ComparableTuple0.java )
Robert Klemme - 16 Jan 2006 09:56 GMT > Consider a program where you read a file into RAM in one great String, > process it, create a new version, then if it is different, write it [quoted text clipped - 3 lines] > arrays for the I/O, a huge StringBuilder and two huge String objects > then discarding them for every file. You don't need those huge byte arrays if you do conversion during reading from the stream.
> Ideally there should be some sort of mini GC where just these objects > are reclaimed, leaving the rest alone. > > A generational collector obviously helps. Is there anything simple I > can do to boost this along. I agree with Tony, you should think of a different representation of your file contents in memory. This soulds like a typical application of a parser. You might get away with a scanner that hacks your input into tokens though.
HTH
robert
Roedy Green - 16 Jan 2006 11:33 GMT >I agree with Tony, you should think of a different representation of your >file contents in memory. This soulds like a typical application of a >parser. You might get away with a scanner that hacks your input into >tokens though. The giant string works well. I do the i/o as fast as theoretically possible. You can't beat indexof for for finding the parts of the string I have to process. This way I can really rip through a large number of files quickly.
My biggest file is only about half a megabyte and most are much smaller.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Alun Harford - 17 Jan 2006 18:31 GMT > >I agree with Tony, you should think of a different representation of your > >file contents in memory. This soulds like a typical application of a [quoted text clipped - 8 lines] > My biggest file is only about half a megabyte and most are much > smaller. Why not read it into a StringBuffer, change it, write it back? (you can even recycle the StringBuffer if you want). StringBuffer has indexof, which is still nice and fast.
Alun Harford
Roedy Green - 17 Jan 2006 20:50 GMT On Tue, 17 Jan 2006 18:31:24 -0000, "Alun Harford" <usenet@alunharford.co.uk> wrote, quoted or indirectly quoted someone who said :
>Why not read it into a StringBuffer, change it, write it back? (you can even >recycle the StringBuffer if you want). StringBuffer has indexof, which is >still nice and fast. I am not changing it in the sense of overlaying a few pieces.. I am taking pieces of it, and replacing other pieces with a different length piece. The most efficient way to do that is to create a new StringBuilder, rather than deleting and inserting from the old one, at least in terms of CPU to slide to make room for the pieces. It should work with less RAM though with your method, so long as StringBuilder does not go allocating temp space to do its sliding.
I'll put that idea on my todo stack to experiment with to see what the net effect is. The program toggles back and forth between consuming 35% and 65% of the CPU cycles on an idle machine. I don't know if that is mostly in my algorithm or in GC. GC should be quick. There are that many live objects other than classes.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Roedy Green - 17 Jan 2006 21:21 GMT On Tue, 17 Jan 2006 20:50:01 GMT, Roedy Green <my_email_is_posted_on_my_website@munged.invalid> wrote, quoted or indirectly quoted someone who said :
>I'll put that idea on my todo stack to experiment with to see what the >net effect is. The program toggles back and forth between consuming >35% and 65% of the CPU cycles on an idle machine. I don't know if that >is mostly in my algorithm or in GC. GC should be quick. There are that >many live objects other than classes. Your algorithm is actually much better than I first thought. Most of the time when I expand a macro, the expansion is the same as last time. There is no need to replace anything, and even if I did, it would be the same length as the old expansion, and hence StringBuilder. replace should not need to shift anything.
So long as my estimate for StringBuilder size is good, this should cut my memory use in half. Thanks! I never would have thought of this on my own.
Perhaps I can do this up brown by recycling the StringBuilder. I noticed there is no StringBuilder.clear or setSize. I wonder what the most efficient way to clear for reuse is.
I need to chew on the problem of deciding when to strink a StringBuilder. There is no point is keeping it huge just because one file was huge. Maybe I could simply process the files in ascending size order.
To make this even faster, I would need a way to do I/O direct to/from the char[] in the StringBuilder.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Roedy Green - 17 Jan 2006 21:37 GMT On Tue, 17 Jan 2006 21:21:21 GMT, Roedy Green <my_email_is_posted_on_my_website@munged.invalid> wrote, quoted or indirectly quoted someone who said :
>Perhaps I can do this up brown by recycling the StringBuilder. I >noticed there is no StringBuilder.clear or setSize. I wonder what the >most efficient way to clear for reuse is. it is StringBuilder.setLength
See http://mindprod.com/jgloss/stringbuilder.html
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Thomas Weidenfeller - 18 Jan 2006 08:49 GMT > The giant string works well. I do the i/o as fast as theoretically > possible. You can't beat indexof for for finding the parts of the > string I have to process. From you original description of the problem, where you wrote you look for certain sequences of characters, I have to disagree! Like I mentioned in previous discussions, look for Boyer-Moore, Knuth-Morris-Pratt, Rabin-Karp or similar algorithms. You can beat linear, character by character text searches under certain conditions.
As for chopping the the data into pieces, and rebuild the stuff in memory, have you considered to not copy the data around, but assemble the result from "chunks" of the original data, and your changed data?
Your output would be described as a sequence of "chunks". Each chunks either refers to original data (e.g. by an offset and length), or some new data you just changed.
You have gathering write operations in nio, so writing these chunks could be relatively fast. "Could", because it depends on the nio implementation and the operating system. If the operating system supports scatter/gather I/O, and if nio uses this, it might work and be somewhat faster.
/Thomas
 Signature The comp.lang.java.gui FAQ: ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/
Roedy Green - 18 Jan 2006 10:20 GMT On Wed, 18 Jan 2006 09:49:20 +0100, Thomas Weidenfeller <nobody@ericsson.invalid> wrote, quoted or indirectly quoted someone who said :
>You have gathering write operations in nio, so writing these chunks >could be relatively fast. It would be interesting to see how clever nio can get about copying. It still has to decode and encode. decoding does not know how long the result will be ahead of time. It makes a conservative guess and trims the result, before creating the string. FileReader thus dithers horribly copying the array over and over and over before you get your string.
Your gather algorithm is extremely memory efficient, especially if the expansions for the most part are the same as last time -- the usual case.
I wrote a Boyer Moore. It was only marginally faster than indexOf. Perhaps it could do better by working on a raw char[].
see http://mindprod.com/jgloss/products1.html#BOYER
I do quite a bit of this sort of programming, so I am interested in the general case the best way to write such programs -- how to you get the file in, modified and out with minimal fuss and RAM consumption.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|