Java Forum / General / January 2006
new String ( byte[] , encoding ) under the hood
Roedy Green - 14 Jan 2006 08:16 GMT I was curious how new String ( byte[], encoding ) could guess the correct size of the buffer to convert into String.
It makes an estimate based on the number of bytes times the max number of chars per byte, an attribute of the encoding. This will be slightly on the high side if there are any multibyte chars, but accurate for Latin-1. It then decodes, and calls trim, which uses System.arraycopy to get a char[] the right size. The new String then does another System.arraycopy.
You leave in your wake the original byte[], two char[] and the string.
Going the other way String -> byte uses similar logic, but the buffer size is not so fortunate. For UTF-8 it makes the conservative assumption each char might need 3 bytes, making the buffer 3 times bigger than it needs to be in the ordinary case.
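For anyone who wants to see the two estimates in numbers, here is a minimal sketch. It assumes the worst-case factors are the ones exposed by CharsetDecoder.maxCharsPerByte() and CharsetEncoder.maxBytesPerChar(). The names BufferEstimates, decodeEstimate and encodeEstimate are made up for illustration, and this is not the JDK's internal code. StandardCharsets only arrived in Java 7; Charset.forName("UTF-8") does the same job on older JVMs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not JDK internals: worst-case buffer sizes for a charset.
class BufferEstimates {

    // Upper bound on the number of chars produced by decoding byteCount bytes.
    static int decodeEstimate(int byteCount, Charset cs) {
        return (int) Math.ceil(byteCount * (double) cs.newDecoder().maxCharsPerByte());
    }

    // Upper bound on the number of bytes produced by encoding charCount chars.
    static int encodeEstimate(int charCount, Charset cs) {
        return (int) Math.ceil(charCount * (double) cs.newEncoder().maxBytesPerChar());
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo"; // one char outside ASCII, so UTF-8 needs an extra byte
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);

        System.out.println("decode: estimated "
                + decodeEstimate(utf8.length, StandardCharsets.UTF_8)
                + " chars, actual " + s.length());
        System.out.println("encode: estimated "
                + encodeEstimate(s.length(), StandardCharsets.UTF_8)
                + " bytes, actual " + utf8.length);
    }
}
```

Both estimates are upper bounds, so the interesting part is how far above the actual size they land for typical text.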
Sun could streamline these operations to cut out the intermediate objects.
Here's an idea. Why not allow strings and char arrays etc to temporarily be too big. They are logically sized. Only on the next GC do the objects get pruned to size if need be. You would save a lot of copying and new object creating just to get arrays the precise correct size. There would be a method to prune an array to size that just logically chopped it and marked it for later true pruning. Most of the time though such objects will soon be discarded, and you then get away without ever doing the copy.
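The "logically sized" idea can at least be approximated in user code, without any GC support. The sketch below is only that: LogicalChars and its methods are hypothetical names, the JVM offers no such feature, and the deferred copy has to be requested by hand instead of happening at the next collection.

```java
import java.util.Arrays;

// Hypothetical user-level sketch of a "logically sized" char buffer.
class LogicalChars {
    private char[] data; // physical storage, possibly bigger than needed
    private int length;  // logical size seen by callers

    LogicalChars(int capacity) {
        data = new char[capacity];
        length = capacity;
    }

    // Logical prune: just moves the length marker, no allocation and no copy.
    void prune(int newLength) {
        if (newLength < 0 || newLength > length) {
            throw new IllegalArgumentException("bad length: " + newLength);
        }
        length = newLength;
    }

    // True prune: the copy is paid only if and when someone asks for it.
    void compact() {
        if (data.length != length) {
            data = Arrays.copyOf(data, length);
        }
    }

    int length() { return length; }

    int capacity() { return data.length; }

    char get(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index + ", length " + length);
        }
        return data[index];
    }

    void set(int index, char c) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index + ", length " + length);
        }
        data[index] = c;
    }
}
```

If the object is discarded before compact() is ever called, the copy is never made, which is the saving the proposal is after.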
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Stefan Schulz - 14 Jan 2006 09:09 GMT > Here's an idea. Why not allow strings and char arrays etc to > temporarily be too big. They are logically sized. Only on the next GC [quoted text clipped - 4 lines] > time though such objects will soon be discarded, and you then get away > without ever doing the copy. I would mainly object that doing things this way would make the garbage collector do work it is not normally supposed to do, that is, the housekeeping of the String class. The garbage collector is supposed to reclaim unreachable objects. It is pretty good at that. It is not supposed to do much else.
If this truly and absolutely does become the bottleneck of your application, I would suggest doing it "by hand" in a more efficient way (for example, re-using one char[] as decode buffer).
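A minimal sketch of the "by hand" approach, assuming a single-threaded caller that keeps one decoder object around. ReusableDecoder is a hypothetical name. The scratch char[] is reused across calls, so the only per-call allocation is the final String and its internal copy.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Hypothetical helper: decodes byte[] to String through one reusable char[] buffer.
// Not thread-safe, because the decoder and the scratch buffer are shared state.
class ReusableDecoder {
    private final CharsetDecoder decoder;
    private char[] scratch = new char[256];

    ReusableDecoder(Charset cs) {
        decoder = cs.newDecoder();
    }

    String decode(byte[] bytes) {
        // Worst-case output size; grow the scratch buffer only when it is too small.
        int needed = (int) Math.ceil(bytes.length * (double) decoder.maxCharsPerByte());
        if (scratch.length < needed) {
            scratch = new char[needed];
        }
        CharBuffer out = CharBuffer.wrap(scratch);
        decoder.reset();
        CoderResult result = decoder.decode(ByteBuffer.wrap(bytes), out, true);
        if (result.isError()) {
            throw new IllegalArgumentException("undecodable input: " + result);
        }
        result = decoder.flush(out);
        if (result.isError()) {
            throw new IllegalArgumentException("undecodable input: " + result);
        }
        // The String constructor still copies, but only the chars actually used.
        return new String(scratch, 0, out.position());
    }
}
```

Usage would be along the lines of creating one ReusableDecoder per thread and calling decode(bytes) for each incoming buffer.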
Chris Uppal - 14 Jan 2006 13:10 GMT > Here's an idea. Why not allow strings and char arrays etc to > temporarily be too big. They are logically sized. Only on the next GC [quoted text clipped - 4 lines] > time though such objects will soon be discarded, and you then get away > without ever doing the copy. Some GCed languages do allow you to change the size of array objects, and that /could/ be implemented in the way you describe (though I'm not sure it'd be worth it). Off the top of my head, I cannot think of a persuasive reason why Java does not allow dynamic resizing of arrays.
-- chris
Roedy Green - 14 Jan 2006 22:57 GMT On 14 Jan 2006 13:10:06 GMT, "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly quoted someone who said :
>Some GCed languages do allow you to change the size of array objects, >and that /could/ be implemented in the way you describe (though I'm not >sure it'd be worth it). Off the top of my head, I cannot think of a >persuasive reason why Java does not allow dynamic resizing of arrays. Resize down should be easier than resize up. With down you don't HAVE to change any allocation right away. Resize down is very common. Resize up is usually done with ArrayList, where you allocate new RAM and copy.
Chris Uppal - 16 Jan 2006 12:18 GMT I wrote:
> Some GCed languages do allow you to change the size of array objects, > and that /could/ be implemented in the way you describe (though I'm not > sure it'd be worth it). Off the top of my head, I cannot think of a > persuasive reason why Java does not allow dynamic resizing of arrays. Having thought more about it, I suspect that the problem is that there is no safe /and/ fast way of allowing multi-threaded access to an array which can change size.
If you get the threading wrong for accessing an array of fixed size, then all that happens is that your application reads the wrong data. If the code that is checking the array bounds uses a stale size value, then you can break security and/or crash the JVM. The latter possibilities -- not unreasonably -- are considered unreasonable ;-)
-- chris
Mike Schilling - 14 Jan 2006 19:34 GMT >I was curious how new String ( byte[], encoding ) could guess the > correct size of the buffer to convert into String. [quoted text clipped - 5 lines] > char[] the right size. The new String then does another > System.arraycopy. Why doesn't it just create the String via new String(char[], int, int), which would eliminate the extra copy? Better still would be a String constructor that takes an array of character arrays and a total length (say, new String(char[][], int)), to eliminate the need to allocate a big contiguous character array in the first place. Instead, smaller buffers (say, 4K) could be allocated as required. String always has to copy all the characters to make an immutable char array; the place to achieve savings would be before this.
Roedy Green - 14 Jan 2006 23:03 GMT On Sat, 14 Jan 2006 19:34:43 GMT, "Mike Schilling" <mscottschilling@hotmail.com> wrote, quoted or indirectly quoted someone who said :
> String always has to copy all the >characters to make an immutable char array, the place to achieve savings >would be before this. I wonder if there could be some way to hand off a char array to be inserted in a string. The problem is ensuring nobody holds onto a reference to it.
That way you could avoid the copy. It gets quite silly how much copying goes on to do the simplest things.
There needs to be some low-level, perhaps even hardware, mechanism to hand off a chunk of RAM in such a way the original owner can no longer meddle with it, and perhaps not even see it. It might be done by remapping a vm page to another address.
Stefan Schulz - 15 Jan 2006 00:01 GMT > There needs to be some low level, perhaps even hardware mechanism to > hand off a chunk of RAM in such a way the original owner can no longer > meddle with it, and perhaps not even see it. It might be done by > remapping a vm page to another address. This is something you generally cannot truly control, unless you remap the entire page to be read-only afterwards (which is, by the way, far below Java's threshold when it comes to OS-specific features).
Also, as much as it makes me sound like a spoilsport: Memory copies are cheap. I would be extremely surprised if the string copying made up even 1% of your typical application's time.
To phrase things a bit differently: Before worrying about how you handle your strings, worry about the performance of your XML parser, your database driver and GUI. ;)
Mike Schilling - 16 Jan 2006 01:51 GMT > Why doesn't it just create the String via new String(char[], int, int), > which would eliminate the extra copy? Better still would be a String [quoted text clipped - 4 lines] > the characters to make an immutable char array, the place to achieve > savings would be before this. Let me refine this. What I'd really like to see is
new String(Reader rdr, int maxLength)
Create a string from characters read from rdr, with the length of the resulting string being the minimum of
1. The number of characters that can be read from rdr, and
2. maxLength
Likewise
new String(InputStream strm, String encoding, int maxBytes)
This would eliminate the need to move all of the characters into a contiguous array to be copied to a second contiguous array. It would be of great help in, for instance, XML parsers, where text that needs to be put into a string quite possibly crosses buffer boundaries.
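Since no such constructor exists, the nearest thing today is a static helper. The sketch below shows the semantics described above (stop at end of input or at maxLength, whichever comes first). ReaderStrings.read is a hypothetical name, and being outside String it cannot avoid the intermediate copies the proposal is meant to eliminate.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the proposed new String(Reader rdr, int maxLength).
class ReaderStrings {

    // Reads at most maxLength chars from rdr and returns them as a String.
    static String read(Reader rdr, int maxLength) {
        if (maxLength < 0) {
            throw new IllegalArgumentException("maxLength must be >= 0");
        }
        char[] chunk = new char[4096]; // small fixed buffer, as suggested in the thread
        StringBuilder sb = new StringBuilder();
        try {
            while (sb.length() < maxLength) {
                int want = Math.min(chunk.length, maxLength - sb.length());
                int n = rdr.read(chunk, 0, want);
                if (n < 0) {
                    break; // end of input reached before maxLength
                }
                sb.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

A call such as ReaderStrings.read(new StringReader(text), 5) would return at most the first five chars of text.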
Chris Uppal - 16 Jan 2006 12:20 GMT > new String(Reader rdr, int maxLength) One problem with that is that it creates a dependency from the -- very core -- String class to the -- rather non-core -- IO classes. I think that would be undesirable, even though your proposed methods otherwise make a lot of sense.
(Incidentally, I've just realised that similar thinking may underlie the absence of a form of String.split() which takes a compiled regexp rather than a String.)
It's a bit awkward really. It would be nice to have such things, but when (a) Java lacks "open" classes, and (b) String is declared final, there's not a lot of room for manoeuvre.
It would be nice to create alternative kinds of String with different internal implementations -- such as using a UTF-32 or UTF-8 encoding internally, or using a variable-sized collection of char[] arrays to hold their data. Sadly we cannot. I don't think that making String final was such a silly idea /at the time/ but in retrospect I think it was an unfortunate choice.
-- chris
Robert Klemme - 16 Jan 2006 12:48 GMT >> new String(Reader rdr, int maxLength) > > One problem with that is that it creates a dependency from the -- > very core -- String class to the -- rather non-core -- IO classes. > I think that would be undesirable, even though your proposed methods > otherwise make a lot of sense. Totally agree.
> (Incidentally, I've just realised that similar thinking may underlie > the absence of a form of String.split() which takes a compiled regexp > rather than a String.) Yeah, likely.
> It's a bit awkward really. It would be nice to have such things, but > when (a) Java lacks "open" classes, and (b) String is declared final, > there's not a lot of room for manoeuvre. That's a euphemism. :-)
> It would be nice to create alternative kinds of String with different > internal implementations -- such as using a UTF-32 or UTF-8 encoding > internally, or using a variable-sized collection of char[] arrays to > hold their data. Sadly we cannot. I don't think that making String > final was such a silly idea /at the time/ but in retrospect I think > it was an unfortunate choice. I'm not so sure. After all, what's the advantage of subclassing String when there's CharSequence? Granted, it came in quite late and quite some methods provide only for String arguments instead of CharSequence. But personally I never ran into a situation where I actually wished I had a String with these properties. YMMV though.
Kind regards
robert
Chris Uppal - 16 Jan 2006 13:09 GMT > > It would be nice to create alternative kinds of String with different > > internal implementations -- such as using a UTF-32 or UTF-8 encoding [quoted text clipped - 6 lines] > when there's CharSequence? Granted, it came in quite late and quite some > methods provide only for String arguments instead of CharSequence. Agreed, up to a point. My reservations being: (A) CharSequence is little used in practice, and I don't see much chance of that changing. (A) "CharSequence" is a silly name for wide use; "String" is the only reasonable name. (B) The CharSequence interface is too narrow -- it doesn't correspond to the abstract API of a String.
(Which, now I come to think of it, is quite a long list of reservations -- perhaps I shouldn't have started by saying "Agreed" ;-)
-- chris
Robert Klemme - 16 Jan 2006 13:52 GMT >>> It would be nice to create alternative kinds of String with >>> different internal implementations -- such as using a UTF-32 or [quoted text clipped - 11 lines] > little used in practise, and I don't see much chance of that > changing. I guess that's a problem of education caused by the late arrival of CS.
> (A) "CharSequence" is a silly name for wide use; "String" > is the only reasonable name. Well, String is definitely much better although, given the circumstances, I find CharSequence describes pretty much what it's about - again, the late arrival...
> (B) The CharSequence interface is too > narrow -- it doesn't correspond to the abstract API of a String. I on the other hand think that String's interface is bloated. There's a lot of stuff in there that probably doesn't belong there. CharSequence contains really the basic stuff for an immutable string but String contains a lot of string processing that IMHO doesn't necessarily belong there (split() for example, maybe concat() and replace*(), other methods which have in part been flagged deprecated). But it's certainly debatable. While for example a general indexOf algorithm could well be put into some class as a static method which solely relies on CharSequence, implementations for String are likely much more efficient if implemented in class String itself. Library design isn't easy... :-)
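To make the trade-off concrete, here is a sketch of a general indexOf written purely against CharSequence. The class name CharSequences is hypothetical and the search is the naive quadratic one. It works for String, StringBuilder, CharBuffer and so on, but every char goes through an interface call, which is the efficiency concern raised above.

```java
// Hypothetical utility: a search that relies only on the CharSequence interface.
class CharSequences {

    // Returns the index of the first occurrence of needle in haystack, or -1.
    static int indexOf(CharSequence haystack, CharSequence needle) {
        int n = haystack.length();
        int m = needle.length();
        for (int i = 0; i + m <= n; i++) {
            int j = 0;
            // Compare char by char through the interface; no access to internals.
            while (j < m && haystack.charAt(i + j) == needle.charAt(j)) {
                j++;
            }
            if (j == m) {
                return i;
            }
        }
        return -1;
    }
}
```

An empty needle matches at index 0, which mirrors what String.indexOf("") does.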
> (Which, now I come to think of it, is quite a long list of > reservations -- perhaps I shouldn't have started by saying "Agreed" > ;-) LOL
Kind regards
robert
Roedy Green - 16 Jan 2006 20:38 GMT >I on the other hand think that String's interface is bloated. So where should such methods go?
What are the disadvantages of putting them on String?
Coding convenience having all the methods at hand for String is a big plus.
Andrew McDonagh - 16 Jan 2006 22:22 GMT >>I on the other hand think that String's interface is bloated. > > So where should such methods go? > > What are the disadvantages of putting them on String? it doesn't conform to SRP or LoD for a start...
> Coding convenience having all the methods at hand for String is a big > plus. No, a good IDE and a better design does this for us, not simplified, fragile, everything-in-one-bucket designs.
Chris Uppal - 17 Jan 2006 11:52 GMT > > What are the disadvantages of putting them on String? > > it doesn't conform to SRP or LoD for a start... Can you state what the "single responsibility" of String /should/ be? I can't. It seems to me to be a general purpose class, and its principal design criterion should be convenience for the user.
And what does the Occasionally Good Advice of Demeter have to do with it ?
-- chris
Chris Smith - 19 Jan 2006 05:52 GMT > And what does the Occasionally Good Advice of Demeter have to do with it ? Wow, I thought I was the only person who considers that to be overstated. Everyone else seems so caught up with it...
Nevertheless, I assumed Andrew meant "Levels of Detail", a term which is commonly used in 3D modeling but I've seen applied metaphorically to programming. Of course, it would help if Andrew would actually say what he means. ;)
 Signature www.designacourse.com The Easiest Way To Train Anyone... Anywhere.
Chris Smith - Lead Software Developer/Technical Trainer MindIQ Corporation
Chris Uppal - 19 Jan 2006 11:45 GMT [me:]
> > And what does the Occasionally Good Advice of Demeter have to do with > > it ? > > Wow, I thought I was the only person who considers that to be > overstated. Everyone else seems so caught up with it... It does seem to get more attention than it warrants (even conceding its -- inflated, IMO -- value). I assume that's the result of having a Cool Name (tm).
> Nevertheless, I assumed Andrew meant "Levels of Detail", a term which is > commonly used in 3D modeling but I've seen applied metaphorically to > programming. I suppose that is a possibility too. Not sure I see the applicability, even so...
-- chris
Googmeister - 16 Jan 2006 14:02 GMT > I don't think that making String final was such a silly idea /at > the time/ but in retrospect I think it was an unfortunate choice. The Java designers wanted String to be immutable, and this is probably the reason it is final. Same with Integer, etc.
But maybe I am wrong, since StringBuilder and StringBuffer are also final, even though they are mutable.
Mike Schilling - 16 Jan 2006 18:02 GMT >> I don't think that making String final was such a silly idea /at >> the time/ but in retrospect I think it was an unfortunate choice. [quoted text clipped - 4 lines] > But maybe I am wrong, since StringBuilder and StringBuffer > are also final, even though they are mutable. They are (or were pre-1.5 at least) intimate enough with String that making them unfinal would have introduced holes into String's immutability.
Thomas Hawtin - 16 Jan 2006 18:56 GMT >>But maybe I am wrong, since StringBuilder and StringBuffer >>are also final, even though they are mutable. > > They are (or were pre-1.5 at least) intimate enough with String that making > them unfinal would have introduced holes into String's immutability. That doesn't explain why StringBuilder is final.
There are other security reasons for requiring final. Say I had a security conscious class and I allowed some method that took a StringBuilder and appended some private objects to it. Now imagine someone malicious comes along and overrides StringBuilder.append(Object). Malicious code now has access to my sensitive object. Presumably that is why ObjectOutputStream.writeObject is final, and why writeObjectOverride and the auditSubclass nonsense were introduced.
Tom Hawtin
 Signature Unemployed English Java programmer http://jroller.com/page/tackline/
Mike Schilling - 16 Jan 2006 23:19 GMT >>>But maybe I am wrong, since StringBuilder and StringBuffer >>>are also final, even though they are mutable. [quoted text clipped - 9 lines] > along and overrides StringBuilder.append(Object). Malicious code now has > access to my sensitive object. As opposed to having access to their text representations through normal StringBuilder behavior?
Chris Uppal - 17 Jan 2006 12:14 GMT > There are other security reasons for requiring final. Say I had a > security conscious class and I allowed some method that took a > StringBuilder and appended some private objects to it. Now imagine > someone malicious comes along and overrides > StringBuilder.append(Object). So the hypothetical security-conscious code should use toString() explicitly before appending the data. I'm not saying you are wrong about Sun's reasons, but the whole thing seems to be part and parcel of the design error of making String final. Very little code /is/ security-conscious, and what there is tends to be anything but trivial. Making String (and pals) significantly less flexible for everyone is a bad bargain if all that's gained is that some complicated and difficult code is marginally less complicated and difficult.
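Both the attack and the explicit toString() defence can be sketched in a few lines. StringBuilder really is final, so the sketch assumes a made-up non-final OpenBuilder to stand in for it. Every class name here is hypothetical and the code illustrates the argument only; it is not a real exploit against the JDK.

```java
import java.util.ArrayList;
import java.util.List;

class LeakDemo {

    // Hypothetical non-final builder, standing in for an "unfinal" StringBuilder.
    static class OpenBuilder {
        private final StringBuilder text = new StringBuilder();

        OpenBuilder append(Object o) {
            text.append(String.valueOf(o));
            return this;
        }

        OpenBuilder append(String s) {
            text.append(s);
            return this;
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Hostile subclass: keeps a reference to every object passed to append(Object).
    static class SpyBuilder extends OpenBuilder {
        final List<Object> captured = new ArrayList<>();

        @Override
        OpenBuilder append(Object o) {
            captured.add(o);
            return super.append(o);
        }
    }

    // Sensitive object whose text form is harmless but whose fields are not.
    static class Secret {
        final char[] key = {'k', 'e', 'y'};

        @Override
        public String toString() {
            return "Secret[redacted]";
        }
    }

    // Naive: hands the object itself to the builder, so a spy can capture it.
    static void describeNaive(OpenBuilder sb, Secret s) {
        sb.append(s);
    }

    // Careful: converts first, so the builder only ever sees the text.
    static void describeCareful(OpenBuilder sb, Secret s) {
        sb.append(s.toString());
    }
}
```

With the careful version, a hostile builder sees only the text representation it could have read through normal builder behaviour anyway.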
-- chris
Thomas Hawtin - 17 Jan 2006 13:59 GMT > Very little code /is/ security-conscious, That is a big problem.
Tom Hawtin
Chris Uppal - 17 Jan 2006 17:42 GMT > > Very little code /is/ security-conscious, > > That is a big problem. A lot depends on what exactly we are talking about when we say "security-conscious". If we mean that there is (far) too much code written with (far) too little thought for security against /external/ attacks -- basically against hostile input -- then I agree whole-heartedly[*]. But if we are (as I thought we were) discussing security against other code running on the same JVM, then I'm much less sure. My impression is that it's rare to have to protect one's code against attack by -- say -- hostile subclasses, yet it's that broad category of protection that the finality of String (and pals) seems to be aimed at.
-- chris
([*] and no, I don't claim to be specially good at it myself, though I'm not a total loser. Secure coding is something that comes with practice, feedback, and constant attention, and I've had too little practice and almost no feedback.)
Roedy Green - 16 Jan 2006 20:43 GMT >But maybe I am wrong, since StringBuilder and StringBuffer >are also final, even though they are mutable. Two thoughts. When reading code, you have no doubts about what a String or StringBuilder is up to. As soon as you take off the final, all bets are off. This is too big a temptation for unmaintainable code. You need something solid and unshifting for your foundations. If you want your own String or StringBuilder you can write your own, cannibalise even, which then is clearly different. They are not that complicated underneath.
The other reason is speed. It is highly convenient for hotspot to know that code is final, not just temporarily final until some dynamic class loads and upsets the apple cart. Imagine what havoc a custom StringBuilder class being loaded could do to all the finely optimised HotSpot native code for StringBuilder. It also allows special tuning for String and StringBuilder, knowing it will not be meddled with by overriding.
Thomas Hawtin - 16 Jan 2006 21:29 GMT > The other reason is speed. It is highly convenient for hotspot to > know that code is final, not just temporarily final until some dynamic [quoted text clipped - 3 lines] > for String and StringBuilder, knowing it will not be meddled with by > overriding. Actually, HotSpot ignores the final flag (although a few methods are recognised as intrinsics, which may make a difference). And it's quite happy to inline code called through interfaces.
Tom Hawtin
Roedy Green - 16 Jan 2006 21:41 GMT On Mon, 16 Jan 2006 21:40:18 +0000, Thomas Hawtin <usenet@tackline.plus.com> wrote, quoted or indirectly quoted someone who said :
>Actually, HotSpot ignores the final flag (although a few methods are >recognised as intrinsics, which may make a difference). And it's quite >happy to inline code called through interfaces. But then if someone later overrides the non-final code, it has to go into a panic, stop everything, and regenerate all optimised machine code that calls the overridden class. It can no longer inline the code it was, for the moment, treating as if final.
Chris Uppal - 17 Jan 2006 12:02 GMT > The Java designers wanted String to be immutable, and this is > probably the reason it is final. Same with Integer, etc. I agree that immutability is a worthwhile design goal. I don't think that /guaranteed/ immutability is worth the price of making String (and pals) final. If such a facility was required (as it might well be) then that could have been implemented by a final subclass of String (which inherited String's immutable API, and was final in order to lock that down). I'd have been tempted to use the ImmutableString subclass for the values of String literals.
-- chris
Mike Schilling - 16 Jan 2006 17:59 GMT >> new String(Reader rdr, int maxLength) > [quoted text clipped - 4 lines] > undesirable, even though your proposed methods otherwise make a lot of > sense. As opposed to System.out and Exception.printStackTrace(Writer) :-)
Java implementations must, by license, implement java.io every bit as fully as java.lang.
Chris Uppal - 17 Jan 2006 12:07 GMT [me:]
> > One problem with that is that it creates a dependency from the -- very > > core -- [quoted text clipped - 4 lines] > > As opposed to System.out and Exception.printStackTrace(Writer) :-) <grin/>
Um, well... Yes.
I'll agree that Exception.printStackTrace() is, er, unfortunate. I'm less bothered by System.out. String, like Object and Class, is one of the classes that is needed just in order to describe the Java language and semantics, as opposed to being merely part of the Java platform (for instance, they need special treatment during bootstrapping). Unfortunately for that argument, so is Throwable...
-- chris
Mike Schilling - 18 Jan 2006 00:01 GMT > [me:] >> > One problem with that is that it creates a dependency from the -- very [quoted text clipped - 19 lines] > argument, so > is Throwable... While Reader is part of java.io, a Reader is, essentially, a source of chars (and might not do any IO e.g. StringReader). It makes perfect sense to me to be able to construct a String from a source of chars. If it would make you feel better to create an interface java.lang.CharacterSource which is a superinterface of Reader, it's OK with me :-)
Chris Uppal - 18 Jan 2006 10:39 GMT > While Reader is part of java.io, a Reader is, essentially, a source of > chars (and might not do any IO e.g. StringReader). It makes perfect > sense to me to be able to construct a String from a source of chars. I agree. It was (and is) only the inverted dependency link that offends.
> If > it would make you feel better to create an interface > java.lang.CharacterSource which is a superinterface of Reader, it's OK > with me :-) I'd settle for that ;-)
Tempting to suggest a String constructor that takes an Iterator<Character> or -- to follow Sun guidelines -- an Iterable<Character>
But that would be horrendously inefficient, and would further promulgate the myth that Java's char type corresponds to Unicode characters. Perhaps, for both reasons, a constructor taking an Iterable<char[]> would be better.
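A rough sketch of what the Iterable<char[]> variant could look like as a static helper, with ChunkStrings.fromChunks as a hypothetical name. The demo also illustrates the char-versus-character point: U+1F600 occupies two chars (a surrogate pair), and here the pair is deliberately split across two chunks.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a String constructor taking an Iterable<char[]>.
class ChunkStrings {

    // Concatenates the chunks in iteration order into one String.
    static String fromChunks(Iterable<char[]> chunks) {
        StringBuilder sb = new StringBuilder();
        for (char[] chunk : chunks) {
            sb.append(chunk);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The high surrogate ends the first chunk, the low surrogate starts the second.
        List<char[]> chunks = Arrays.asList(
                new char[] {'h', 'i', ' ', '\uD83D'},
                new char[] {'\uDE00'});
        String s = fromChunks(chunks);
        System.out.println("chars:       " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
    }
}
```

Because chunks are plain char[] runs, a chunk boundary may fall in the middle of a surrogate pair without harm, whereas an Iterator<Character> would box every char and hide nothing about the UTF-16 representation.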
-- chris