Java Forum / General / January 2006

new String ( byte[] , encoding ) under the hood

Roedy Green - 14 Jan 2006 08:16 GMT
I was curious how new String ( byte[], encoding ) could guess the
correct size of the buffer to convert into String.

It makes an estimate based on the number of bytes times the max number of
chars per byte, an attribute of the encoding.  This will be slightly
on the high side if there are any multibyte chars, but accurate for
Latin-1. It then decodes, and calls trim, which uses System.arraycopy to
get a char[] of the right size. The new String then does another
System.arraycopy.

You leave in your wake the original byte[], two char[] and the string.
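
A rough sketch of that estimate-then-trim sequence, written against the public java.nio.charset API. This is not the JDK's internal code; the class and method names (DecodeSizing, estimateChars, decode) are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

class DecodeSizing {
    // Worst-case number of chars that byteCount bytes can decode to.
    static int estimateChars(Charset charset, int byteCount) {
        float maxCharsPerByte = charset.newDecoder().maxCharsPerByte();
        return (int) Math.ceil(byteCount * (double) maxCharsPerByte);
    }

    static String decode(Charset charset, byte[] bytes) {
        CharsetDecoder decoder = charset.newDecoder();
        char[] oversized = new char[estimateChars(charset, bytes.length)];
        CharBuffer out = CharBuffer.wrap(oversized);
        try {
            CoderResult result = decoder.decode(ByteBuffer.wrap(bytes), out, true);
            if (result.isError()) {
                result.throwException();
            }
            decoder.flush(out);
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("input is not valid " + charset, e);
        }
        // Copy 1: trim the oversized buffer to the chars actually produced.
        char[] trimmed = new char[out.position()];
        System.arraycopy(oversized, 0, trimmed, 0, trimmed.length);
        // Copy 2: the String constructor copies the array again.
        return new String(trimmed);
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        byte[] bytes = "h\u00e9llo".getBytes(utf8);
        System.out.println(estimateChars(utf8, bytes.length) + " chars estimated for "
                + bytes.length + " bytes; decoded length "
                + decode(utf8, bytes).length());
    }
}
```

The two-byte character makes the estimate overshoot by one char, which is why the trim step is needed at all.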

Going the other way String -> byte uses similar logic, but the buffer
size is not so fortunate.  For UTF-8 it makes the conservative
assumption each char might need 3 bytes, making the buffer 3 times
bigger than it needs to be in the ordinary case.
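
The overshoot in the encoding direction can be seen with CharsetEncoder.maxBytesPerChar(). A small sketch, assuming a hypothetical helper named worstCaseBytes:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class EncodeSizing {
    // Worst-case number of bytes that charCount chars can encode to.
    static int worstCaseBytes(Charset charset, int charCount) {
        CharsetEncoder encoder = charset.newEncoder();
        return (int) Math.ceil(charCount * (double) encoder.maxBytesPerChar());
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        String text = "plain ASCII text";
        int worst = worstCaseBytes(utf8, text.length());
        int actual = text.getBytes(utf8).length;
        System.out.println("worst case " + worst + " bytes, actual " + actual + " bytes");
    }
}
```

For pure ASCII input the actual size equals the char count, so everything the worst-case buffer holds beyond that is wasted.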

Sun could streamline these operations to cut out the
intermediate objects.

Here's an idea. Why not allow strings and char arrays etc to
temporarily be too big.  They are logically sized. Only on the next GC
do the objects get pruned to size if need be.  You would save a lot of
copying and new object creating just to get arrays the precise correct
size.  There would be a method to prune an array to size that just
logically chopped it and marked it for later true pruning. Most of the
time though such objects will soon be discarded, and you then get away
without ever doing the copy.

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Stefan Schulz - 14 Jan 2006 09:09 GMT
> Here's an idea. Why not allow strings and char arrays etc to
> temporarily be too big.  They are logically sized. Only on the next GC
[quoted text clipped - 4 lines]
> time though such objects will soon be discarded, and you then get away
> without ever doing the copy.

I would mainly object that doing things this way would make the garbage
collector do work it is not normally supposed to do, that is, the
housekeeping of the String class. The garbage collector is supposed to
reclaim unreachable objects. It is pretty good at that. It is not
supposed to do much else.

If this truly and absolutely does become the bottleneck of your
application, I would suggest doing it "by hand" in a more efficient way
(for example, re-using one char [] as decode buffer).
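
One way the "by hand" approach might look: a decoder object that keeps a single scratch CharBuffer and grows it only when needed. ReusableDecoder is a hypothetical name, and the sketch is not thread-safe.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

class ReusableDecoder {
    private final CharsetDecoder decoder;
    private CharBuffer scratch = CharBuffer.allocate(256);

    ReusableDecoder(Charset charset) {
        decoder = charset.newDecoder();
    }

    String decode(byte[] bytes) {
        int needed = (int) Math.ceil(bytes.length * (double) decoder.maxCharsPerByte());
        if (scratch.capacity() < needed) {
            scratch = CharBuffer.allocate(needed);   // grow only when necessary
        }
        scratch.clear();
        decoder.reset();
        CoderResult result = decoder.decode(ByteBuffer.wrap(bytes), scratch, true);
        if (result.isError()) {
            throw new IllegalArgumentException("malformed input: " + result);
        }
        decoder.flush(scratch);
        scratch.flip();
        // The only per-call copy: the String made from the decoded chars.
        return scratch.toString();
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        ReusableDecoder decoder = new ReusableDecoder(utf8);
        System.out.println(decoder.decode("first".getBytes(utf8)));
        System.out.println(decoder.decode("second".getBytes(utf8)));
    }
}
```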
Chris Uppal - 14 Jan 2006 13:10 GMT
> Here's an idea. Why not allow strings and char arrays etc to
> temporarily be too big.  They are logically sized. Only on the next GC
[quoted text clipped - 4 lines]
> time though such objects will soon be discarded, and you then get away
> without ever doing the copy.

Some GCed languages do allow you to change the size of array objects,
and that /could/ be implemented in the way you describe (though I'm not
sure it'd be worth it).  Off the top of my head, I cannot think of a
persuasive reason why Java does not allow dynamic resizing of arrays.

   -- chris
Roedy Green - 14 Jan 2006 22:57 GMT
On 14 Jan 2006 13:10:06 GMT, "Chris Uppal"
<chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly
quoted someone who said :

>Some GCed languages do allow you to change the size of array objects,
>and that /could/ be implemented in the way you describe (though I'm not
>sure it'd be worth it).  Off the top of my head, I cannot think of a
>persuasive reason why Java does not allow dynamic resizing of arrays.

Resize down should be easier than resize up. With down you don't HAVE
to change any allocation right away. Resize down is very common. Resize
up is usually done with ArrayList, where you allocate new RAM and copy.
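
For reference, resizing a Java array in either direction means allocating a new array and copying, since array length is fixed at creation. A minimal sketch with a hypothetical helper named resize:

```java
class ArrayResize {
    // Returns a new array of newLength holding as much of source as fits.
    static char[] resize(char[] source, int newLength) {
        char[] result = new char[newLength];
        System.arraycopy(source, 0, result, 0, Math.min(source.length, newLength));
        return result;
    }

    public static void main(String[] args) {
        char[] original = {'a', 'b', 'c'};
        char[] smaller = resize(original, 2);   // resize down: still a full copy
        char[] larger = resize(original, 5);    // resize up: the tail is '\0'
        System.out.println(smaller.length + " " + larger.length);
    }
}
```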

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Chris Uppal - 16 Jan 2006 12:18 GMT
I wrote:

> Some GCed languages do allow you to change the size of array objects,
> and that /could/ be implemented in the way you describe (though I'm not
> sure it'd be worth it).  Off the top of my head, I cannot think of a
> persuasive reason why Java does not allow dynamic resizing of arrays.

Having thought more about it, I suspect that the problem is that there is no
safe /and/ fast way of allowing multi-threaded access to an array which can
change size.

If you get the threading wrong for accessing an array of fixed size, then all
that happens is that your application reads the wrong data.  If the code that
is checking the array bounds uses a stale size value, then you can break
security and/or crash the JVM.  The latter possibilities -- not unreasonably --
are considered unreasonable ;-)

   -- chris
Mike Schilling - 14 Jan 2006 19:34 GMT
>I was curious how new String ( byte[], encoding ) could guess the
> correct size of the buffer to convert into String.
[quoted text clipped - 5 lines]
> char[] the right size. The new String then does another
> System.arraycopy.

Why doesn't it just create the String via new String(char[], int, int),
which would eliminate the extra copy?  Better still would be a String
constructor that takes an array of character arrays and a total length
(say, new String(char[][], int)), to eliminate the need to allocate a big
contiguous character array in the first place.  Instead, smaller buffers
(say, 4K) could be allocated as required.  Since String always has to copy
all the characters to make an immutable char array, the place to achieve
savings would be before this.
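
The difference between the two routes can be sketched as follows. Both methods produce the same String; the second uses the existing new String(char[], int, int) constructor and skips the intermediate trimmed array. The class and method names are illustrative only.

```java
class TrimCopies {
    // Trim first, then construct: two array copies.
    static String twoCopies(char[] oversized, int used) {
        char[] trimmed = new char[used];
        System.arraycopy(oversized, 0, trimmed, 0, used);
        return new String(trimmed);
    }

    // Construct directly from the used prefix: one array copy.
    static String oneCopy(char[] oversized, int used) {
        return new String(oversized, 0, used);
    }

    public static void main(String[] args) {
        char[] buffer = new char[16];
        "hello".getChars(0, 5, buffer, 0);
        System.out.println(twoCopies(buffer, 5).equals(oneCopy(buffer, 5)));
    }
}
```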
Roedy Green - 14 Jan 2006 23:03 GMT
On Sat, 14 Jan 2006 19:34:43 GMT, "Mike Schilling"
<mscottschilling@hotmail.com> wrote, quoted or indirectly quoted
someone who said :

>  String always has to copy all the
>characters to make an immutable char array, the place to achieve savings
>would be before this.

I wonder if there could be some way to hand off a char array to be
inserted in a string.  The problem is ensuring nobody holds onto a
reference to it.

That way you could avoid the copy.  It gets quite silly how much
copying goes on to do the simplest things.
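
The reason for the copy is easy to demonstrate: the caller still holds the array after the String is built, so without a copy the String could be changed from outside. A small illustration (class and method names are made up):

```java
class DefensiveCopy {
    static String copyThenMutate() {
        char[] chars = {'j', 'a', 'v', 'a'};
        String s = new String(chars);   // String takes its own copy
        chars[0] = 'l';                 // the caller still owns the original array
        return s;
    }

    public static void main(String[] args) {
        System.out.println(copyThenMutate());   // prints "java", not "lava"
    }
}
```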

There needs to be some low level, perhaps even hardware, mechanism to
hand off a chunk of RAM in such a way that the original owner can no
longer meddle with it, and perhaps not even see it. It might be done by
remapping a vm page to another address.

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Stefan Schulz - 15 Jan 2006 00:01 GMT
> There needs to be some low level, perhaps even hardware mechanism to
> hand off a chunk of RAM in such a way the original owner can no longer
> meddle with it, and perhaps not even see it. It might be done by
> remapping a vm page to another address.

This is something you generally cannot truly control, unless you
remap the entire page to be read-only afterwards (which is, by the way,
far below Java's threshold when it comes to OS-specific features).

Also, as much as it makes me sound like a spoilsport: memory copies are
cheap. I would be extremely surprised if the string copying made up
even 1% of your typical application's time.

To phrase things a bit differently: Before worrying about how you
handle your strings, worry about the performance of your XML parser,
your database driver and GUI. ;)
Mike Schilling - 16 Jan 2006 01:51 GMT
> Why doesn't it just create the String via new String(char[], int, int),
> which would eliminate the extra copy?  Better still would be a String
[quoted text clipped - 4 lines]
> the characters to make an immutable char array, the place to achieve
> savings would be before this.

Let me refine this.  What I'd really like to see is

   new String(Reader rdr, int maxLength)

Create a string from characters read from rdr, with the length of the
resulting string being the minimum of

1. The number of characters that can be read from rdr, and
2. maxLength

Likewise

   new String(InputStream strm, String encoding, int maxBytes)

This would eliminate the need to move all of the characters into a
contiguous array to be copied to a second contiguous array. It would be of
great help in, for instance, XML parsers, where text that needs to be put
into a string quite possibly crosses buffer boundaries.
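
The semantics of the proposed constructor can be approximated today with a static helper. This sketch (hypothetical name readString) shows only the behaviour; it goes through a StringBuilder, so it does not achieve the copy savings the proposal is after.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReaderStrings {
    // Reads at most maxLength chars from reader, stopping early at end of input.
    static String readString(Reader reader, int maxLength) throws IOException {
        if (maxLength < 0) {
            throw new IllegalArgumentException("maxLength must not be negative");
        }
        StringBuilder builder = new StringBuilder(Math.min(maxLength, 4096));
        char[] chunk = new char[4096];
        while (builder.length() < maxLength) {
            int want = Math.min(chunk.length, maxLength - builder.length());
            int got = reader.read(chunk, 0, want);
            if (got < 0) {
                break;   // end of input reached before maxLength
            }
            builder.append(chunk, 0, got);
        }
        return builder.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readString(new StringReader("hello world"), 5));
        System.out.println(readString(new StringReader("hi"), 5));
    }
}
```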
Chris Uppal - 16 Jan 2006 12:20 GMT
>     new String(Reader rdr, int maxLength)

One problem with that is that it creates a dependency from the -- very core --
String class to the -- rather non-core -- IO classes.   I think that would be
undesirable, even though your proposed methods otherwise make a lot of sense.

(Incidentally, I've just realised that similar thinking may underlie the
absence of a form of String.split() which takes a compiled regexp rather than a
String.)
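
The compiled-regexp form of split does exist, but on the regex side of the dependency: java.util.regex.Pattern has a split(CharSequence) method. A short example (the class and field names are illustrative):

```java
import java.util.regex.Pattern;

class CompiledSplit {
    // Compiled once and reused, instead of recompiling on every String.split call.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    static String[] fields(CharSequence line) {
        return COMMA.split(line);
    }

    public static void main(String[] args) {
        for (String field : fields("a, b ,c")) {
            System.out.println(field);
        }
    }
}
```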

It's a bit awkward really.  It would be nice to have such things, but when (a)
Java lacks "open" classes, and (b) String is declared final, there's not a lot
of room for manoeuvre.

It would be nice to create alternative kinds of String with different internal
implementations -- such as using a UTF-32 or UTF-8 encoding internally, or
using a variable-sized collection of char[] arrays to hold their data.  Sadly
we cannot.  I don't think that making String final was such a silly idea /at
the time/ but in retrospect I think it was an unfortunate choice.

   -- chris
Robert Klemme - 16 Jan 2006 12:48 GMT
>>     new String(Reader rdr, int maxLength)
>
> One problem with that is that it creates a dependency from the --
> very core -- String class to the -- rather non-core -- IO classes.
> I think that would be undesirable, even though your proposed methods
> otherwise make a lot of sense.

Totally agree.

> (Incidentally, I've just realised that similar thinking may underlie
> the absence of a form of String.split() which takes a compiled regexp
> rather than a String.)

Yeah, likely.

> It's a bit awkward really.  It would be nice to have such things, but
> when (a) Java lacks "open" classes, and (b) String is declared final,
> there's not a lot of room for manoeuvre.

That's a euphemism. :-)

> It would be nice to create alternative kinds of String with different
> internal implementations -- such as using a UTF-32 or UTF-8 encoding
> internally, or using a variable-sized collection of char[] arrays to
> hold their data.  Sadly we cannot.  I don't think that making String
> final was such a silly idea /at the time/ but in retrospect I think
> it was an unfortunate choice.

I'm not so sure.  After all, what's the advantage of subclassing String
when there's CharSequence?  Granted, it came in quite late and quite some
methods provide only for String arguments instead of CharSequence.  But
personally I have never run into a situation where I actually wished I had a
String with these properties.  YMMV though.

Kind regards

   robert
Chris Uppal - 16 Jan 2006 13:09 GMT
> > It would be nice to create alternative kinds of String with different
> > internal implementations -- such as using a UTF-32 or UTF-8 encoding
[quoted text clipped - 6 lines]
> when there's CharSequence?  Granted, it came in quite late and quite some
> methods provide only for String arguments instead of CharSequence.

Agreed, up to a point.  My reservations being:  (A) CharSequence is little used
in practice, and I don't see much chance of that changing.  (B) "CharSequence"
is a silly name for wide use; "String" is the only reasonable name.  (C) The
CharSequence interface is too narrow -- it doesn't correspond to the abstract
API of a String.

(Which, now I come to think of it, is quite a long list of reservations --
perhaps I shouldn't have started by saying "Agreed"  ;-)

   -- chris
Robert Klemme - 16 Jan 2006 13:52 GMT
>>> It would be nice to create alternative kinds of String with
>>> different internal implementations -- such as using a UTF-32 or
[quoted text clipped - 11 lines]
> little used in practice, and I don't see much chance of that
> changing.

I guess that's a problem of education caused by the late arrival of CS.

>  (B) "CharSequence" is a silly name for wide use; "String"
> is the only reasonable name.

Well, String is definitely much better although, given the circumstances,
I find CharSequence describes pretty much what it's about - again, the
late arrival...

>  (C) The CharSequence interface is too
> narrow -- it doesn't correspond to the abstract API of a String.

I on the other hand think that String's interface is bloated.  There's a
lot of stuff in there that probably doesn't belong there.  CharSequence
contains really the basic stuff for an immutable string but String
contains a lot string processing that IMHO doesn't necessarily belong
there (split() for example, maybe concat() and replace*(), other methods
which have in part been flagged deprecated).  But it's certainly
debatable.  While for example a general indexOf algorithm could well be
put into some class as static method which solely relies on CharSequence,
implementations for String are likely much more efficient if implemented
in class String itself.  Library design isn't easy... :-)
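
A general indexOf of the kind described, relying only on CharSequence, might look like the naive sketch below. The class name CharSequences is hypothetical, and a String-specific implementation with direct access to the backing array would likely be faster, which is the trade-off being discussed.

```java
import java.nio.CharBuffer;

class CharSequences {
    // Index of the first occurrence of needle in haystack, or -1 if absent.
    static int indexOf(CharSequence haystack, CharSequence needle) {
        int n = haystack.length();
        int m = needle.length();
        for (int i = 0; i + m <= n; i++) {
            int j = 0;
            while (j < m && haystack.charAt(i + j) == needle.charAt(j)) {
                j++;
            }
            if (j == m) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Works for any CharSequence implementation, not just String.
        System.out.println(indexOf("hello world", "world"));
        System.out.println(indexOf(new StringBuilder("abc"), "d"));
        System.out.println(indexOf(CharBuffer.wrap("xyz"), "yz"));
    }
}
```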

> (Which, now I come to think of it, is quite a long list of
> reservations -- perhaps I shouldn't have started by saying "Agreed"
> ;-)

LOL

Kind regards

   robert
Roedy Green - 16 Jan 2006 20:38 GMT
>I on the other hand think that String's interface is bloated.

So where should such methods go?

What are the disadvantages of putting them on String?

Coding convenience having all the methods at hand for String is a big
plus.

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Andrew McDonagh - 16 Jan 2006 22:22 GMT
>>I on the other hand think that String's interface is bloated.
>
> So where should such methods go?
>
> What are the disadvantages of putting them on String?

it doesn't conform to SRP or LoD for a start...

> Coding convenience having all the methods at hand for String is a big
> plus.

No, a good IDE and a better design do this for us, not simplified,
fragile, everything-in-one-bucket designs.
Chris Uppal - 17 Jan 2006 11:52 GMT
> > What are the disadvantages of putting them on String?
>
> it doesn't conform to SRP or LoD for a start...

Can you state what the "single responsibility" of String /should/ be.  I can't.
It seems to me to be a general purpose class, and its principal design
criterion should be convenience for the user.

And what does the Occasionally Good Advice of Demeter have to do with it ?

   -- chris
Chris Smith - 19 Jan 2006 05:52 GMT
> And what does the Occasionally Good Advice of Demeter have to do with it ?

Wow, I thought I was the only person who considers that to be
overstated.  Everyone else seems so caught up with it...

Nevertheless, I assumed Andrew meant "Levels of Detail", a term which is
commonly used in 3D modeling but I've seen applied metaphorically to
programming.  Of course, it would help if Andrew would actually say what
he means. ;)

www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation

Chris Uppal - 19 Jan 2006 11:45 GMT
[me:]
> > And what does the Occasionally Good Advice of Demeter have to do with
> > it ?
>
> Wow, I thought I was the only person who considers that to be
> overstated.  Everyone else seems so caught up with it...

It does seem to get more attention than it warrants (even conceding its --
inflated, IMO -- value).  I assume that's the result of having a Cool Name
(tm).

> Nevertheless, I assumed Andrew meant "Levels of Detail", a term which is
> commonly used in 3D modeling but I've seen applied metaphorically to
> programming.

I suppose that is a possibility too.  Not sure I see the applicability, even
so...

   -- chris
Googmeister - 16 Jan 2006 14:02 GMT
> I don't think that making String final was such a silly idea /at
> the time/ but in retrospect I think it was an unfortunate choice.

The Java designers wanted String to be immutable, and this is
probably the reason it is final. Same with Integer, etc.

But maybe I am wrong, since StringBuilder and StringBuffer
are also final, even though they are mutable.
Mike Schilling - 16 Jan 2006 18:02 GMT
>> I don't think that making String final was such a silly idea /at
>> the time/ but in retrospect I think it was an unfortunate choice.
[quoted text clipped - 4 lines]
> But maybe I am wrong, since StringBuilder and StringBuffer
> are also final, even though they are mutable.

They are (or were pre-1.5 at least) intimate enough with String that making
them unfinal would introduce holes into String's immutability.
Thomas Hawtin - 16 Jan 2006 18:56 GMT
>>But maybe I am wrong, since StringBuilder and StringBuffer
>>are also final, even though they are mutable.
>
> They are (or were pre-1.5 at least) intimate enough with String that making
> them unfinal would introduce holes into String's immutability.

That doesn't explain why StringBuilder is final.

There are other security reasons for requiring final. Say I had a
security conscious class and I allowed some method that took a
StringBuilder and appended some private objects to it. Now imagine
someone malicious comes along and overrides
StringBuilder.append(Object). Malicious code now has access to my
sensitive object. Presumably that is why ObjectOutputStream.writeObject
is final, and why writeObjectOverride and the auditSubclass nonsense were
introduced.
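
Since StringBuilder really is final, the scenario can only be illustrated with a stand-in. In this sketch OpenBuilder is a hypothetical non-final builder, and Account and TrustingService are invented for the example. The hostile subclass ends up holding the object itself, not merely its text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a StringBuilder that was NOT declared final.
class OpenBuilder {
    private final StringBuilder text = new StringBuilder();

    public OpenBuilder append(Object value) {
        text.append(String.valueOf(value));
        return this;
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

// Hostile subclass: records every object handed to append(Object).
class SnoopingBuilder extends OpenBuilder {
    final List<Object> captured = new ArrayList<Object>();

    @Override
    public OpenBuilder append(Object value) {
        captured.add(value);
        return super.append(value);
    }
}

class Account {
    private final String owner;
    private final String pin;

    Account(String owner, String pin) {
        this.owner = owner;
        this.pin = pin;
    }

    String pin() {
        return pin;   // reveals more than toString() does
    }

    @Override
    public String toString() {
        return "Account[" + owner + "]";
    }
}

class TrustingService {
    private final Account account = new Account("alice", "1234");

    // Intends to expose only the text form of the private object.
    void describe(OpenBuilder out) {
        out.append(account);
    }

    public static void main(String[] args) {
        SnoopingBuilder spy = new SnoopingBuilder();
        new TrustingService().describe(spy);
        Account leaked = (Account) spy.captured.get(0);
        System.out.println(spy + " leaked pin " + leaked.pin());
    }
}
```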

Tom Hawtin

Unemployed English Java programmer
http://jroller.com/page/tackline/

Mike Schilling - 16 Jan 2006 23:19 GMT
>>>But maybe I am wrong, since StringBuilder and StringBuffer
>>>are also final, even though they are mutable.
[quoted text clipped - 9 lines]
> along and overrides StringBuilder.append(Object). Malicious code now has
> access to my sensitive object.

As opposed to having access to their text representations through normal
StringBuilder behavior?
Chris Uppal - 17 Jan 2006 12:14 GMT
> There are other security reasons for requiring final. Say I had a
> security conscious class and I allowed some method that took a
> StringBuilder and appended some private objects to it. Now imagine
> someone malicious comes along and overrides
> StringBuilder.append(Object).

So the hypothetical security-conscious code should use toString() explicitly
before appending the data.   I'm not saying you are wrong about Sun's reasons,
but the whole thing seems to be part and parcel of the design error of making
String final.  Very little code /is/ security-conscious, and what there is
tends to be anything but trivial.  Making String (and pals) significantly less
flexible for everyone is a bad bargain if all that's gained is that some
complicated and difficult code is marginally less complicated and difficult.

   -- chris
Thomas Hawtin - 17 Jan 2006 13:59 GMT
>                Very little code /is/ security-conscious,

That is a big problem.

Tom Hawtin

Unemployed English Java programmer
http://jroller.com/page/tackline/

Chris Uppal - 17 Jan 2006 17:42 GMT
> >                Very little code /is/ security-conscious,
>
> That is a big problem.

A lot depends on what exactly we are talking about when we say
"security-conscious".  If we mean that there is (far) too much code written
with (far) too little thought for security against /external/ attacks --
basically against hostile input -- then I agree whole-heartedly[*].  But if we
are (as I thought we were) discussing security against other code running on
the same JVM, then I'm much less sure.  My impression is that it's rare to have
to protect one's code against attack by -- say -- hostile subclasses, yet it's
that broad category of protection that the finality of String (and pals) seems
to be aimed at.

   -- chris

([*] and no, I don't claim to be specially good at it myself, though I'm not a
total loser.  Secure coding is something that comes with practice, feedback,
and constant attention, and I've had too little practice and almost no
feedback.)
Roedy Green - 16 Jan 2006 20:43 GMT
>But maybe I am wrong, since StringBuilder and StringBuffer
>are also final, even though they are mutable.

Two thoughts. When reading code, you have no doubts about what a
String or StringBuilder is up to.  As soon as you take off the final,
all bets are off. This is too big a temptation for unmaintainable
code.   You need something solid and unshifting for your foundations.
If you want your own String or StringBuilder you can write your own,
cannibalise even,  which then is clearly different. They are not that
complicated underneath.

The other reason is speed.  It is highly convenient for hotspot to
know that code is final, not just temporarily final until some dynamic
class loads and upsets the apple cart.  Imagine what havoc a custom
StringBuilder class being loaded could do to all the finely optimised
HotSpot native code for StringBuilder.  It also allows special tuning
for String and StringBuilder, knowing it will not be meddled with by
overriding.

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Thomas Hawtin - 16 Jan 2006 21:29 GMT
> The other reason is speed.  It is highly convenient for hotspot to
> know that code is final, not just temporarily final until some dynamic
[quoted text clipped - 3 lines]
> for String and StringBuilder, knowing it will not be meddled with by
> overriding.

Actually, HotSpot ignores the final flag (although a few methods are
recognised as intrinsics, which may make a difference). And it's quite
happy to inline code called through interfaces.

Tom Hawtin

Unemployed English Java programmer
http://jroller.com/page/tackline/

Roedy Green - 16 Jan 2006 21:41 GMT
On Mon, 16 Jan 2006 21:40:18 +0000, Thomas Hawtin
<usenet@tackline.plus.com> wrote, quoted or indirectly quoted someone
who said :

>Actually, HotSpot ignores the final flag (although a few methods are
>recognised as intrinsics, which may make a difference). And it's quite
>happy to inline code called through interfaces.

But then if someone later overrides the non-final code, it has to go
into a panic, stop everything, and regenerate all optimised machine
code that calls the overridden class. It can no longer inline the code
it was presuming, for the moment, to be as good as final.

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Chris Uppal - 17 Jan 2006 12:02 GMT
> The Java designers wanted String to be immutable, and this is
> probably the reason it is final. Same with Integer, etc.

I agree that immutability is a worthwhile design goal.  I don't think that
/guaranteed/ immutability is worth the price of making String (and pals) final.
If such a facility was required (as it might well be) then that could have been
implemented by a final subclass of String (which inherited String's immutable
API, and was final in order to lock that down).  I'd have been tempted to use
the ImmutableString subclass for the values of String literals.

   -- chris
Mike Schilling - 16 Jan 2006 17:59 GMT
>>     new String(Reader rdr, int maxLength)
>
[quoted text clipped - 4 lines]
> undesirable, even though your proposed methods otherwise make a lot of
> sense.

As opposed to System.out and Exception.printStackTrace(Writer) :-)

Java implementations must, by license, implement java.io every bit as fully
as java.lang.
Chris Uppal - 17 Jan 2006 12:07 GMT
[me:]
> > One problem with that is that it creates a dependency from the -- very
> > core --
[quoted text clipped - 4 lines]
>
> As opposed to System.out and Exception.printStackTrace(Writer) :-)

<grin/>

Um, well...  Yes.

I'll agree that Exception.printStackTrace() is, er, unfortunate.  I'm less
bothered by System.out.  String, like Object and Class, is one of the classes
that is needed just in order to describe the Java language and semantics, as
opposed to being merely part of the Java platform (for instance, they need
special treatment during bootstrapping).  Unfortunately, for that argument, so
is Throwable...

   -- chris
Mike Schilling - 18 Jan 2006 00:01 GMT
> [me:]
>> > One problem with that is that it creates a dependency from the -- very
[quoted text clipped - 19 lines]
> argument, so
> is Throwable...

While Reader is part of java.io, a Reader is, essentially, a source of chars
(and might not do any IO e.g. StringReader).  It makes perfect sense to me
to be able to construct a String from a source of chars.  If it would make
you feel better to create an interface java.lang.CharacterSource which is a
superinterface of Reader, it's OK with me :-)
Chris Uppal - 18 Jan 2006 10:39 GMT
> While Reader is part of java.io, a Reader is, essentially, a source of
> chars (and might not do any IO e.g. StringReader).  It makes perfect
> sense to me to be able to construct a String from a source of chars.

I agree.  It was (and is) only the inverted dependency link that offends.

> If
> it would make you feel better to create an interface
> java.lang.CharacterSource which is a superinterface of Reader, it's OK
> with me :-)

I'd settle for that ;-)

Tempting to suggest a String constructor that takes an
   Iterator<Character>
or -- to follow Sun guidelines -- an
   Iterable<Character>

But that would be horrendously inefficient, and would further promulgate the
myth that Java's char type corresponds to Unicode characters.  Perhaps, for
both reasons, a constructor taking an
   Iterable<char[]>
would be better.
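
As a static helper rather than a constructor, the Iterable<char[]> idea might be sketched like this (ChunkedStrings and fromChunks are hypothetical names). The demo splits a surrogate pair across two chunks, which works because a char is a UTF-16 code unit rather than a whole Unicode character.

```java
import java.util.Arrays;
import java.util.List;

class ChunkedStrings {
    // Concatenates the chunks into one String, sizing the builder up front.
    static String fromChunks(Iterable<char[]> chunks) {
        int total = 0;
        for (char[] chunk : chunks) {
            total += chunk.length;
        }
        StringBuilder builder = new StringBuilder(total);
        for (char[] chunk : chunks) {
            builder.append(chunk);
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) is the pair D834 DD1E, split across the chunks.
        List<char[]> chunks = Arrays.asList(
                new char[] {'a', (char) 0xD834},
                new char[] {(char) 0xDD1E, 'b'});
        String s = fromChunks(chunks);
        // prints "4 chars, 3 code points"
        System.out.println(s.length() + " chars, "
                + s.codePointCount(0, s.length()) + " code points");
    }
}
```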

   -- chris

