
Java Forum / General / March 2006


platform's default charset ?

gk - 30 Jan 2006 10:14 GMT
What is the platform's default charset?

String original = new String("A" + "\u00ea" + "\u00f1" +
                "\u00fc" + "C");
        try {
        byte[] utf8Bytes = original.getBytes("UTF8");
        byte[] defaultBytes = original.getBytes();
        String roundTrip = new String(utf8Bytes, "UTF8");
        String defaultTrip = new String(defaultBytes);

        System.out.println("roundTrip = " + roundTrip); // output-1
        System.out.println("defaultTrip = " + defaultTrip); // output-2
        } catch (java.io.UnsupportedEncodingException e) {
            // "UTF8" is always supported, so this should not happen
            e.printStackTrace();
        }

QUESTION:

Why are output-1 and output-2 the same?

REASON FOR THIS QUESTION:

String original = new String("A" + "\u00ea" + "\u00f1" +
                "\u00fc" + "C");

This is a Unicode string, and it looks like "AêñüC".

How can output-2 be the same as output-1?

Output-2 has been encoded and decoded using the "platform's default
charset", because I used

byte[] defaultBytes = original.getBytes();

and

String defaultTrip = new String(defaultBytes);

for output-2.

My system is Windows XP, so how can that produce the same output as
output-1, which uses the UTF-8 encoding?

Does this mean Windows XP supports UTF-8, so it picks UTF-8 as the
default encoding?

On which platform would output-1 and output-2 not be the same? Linux?
Solaris? Somewhere else?

Thank you
Thomas Weidenfeller - 30 Jan 2006 12:25 GMT
> what  is platform's default charset ?

Charset.defaultCharset()
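
A minimal sketch of that call in context (not part of the original reply). The class name DefaultCharsetDemo is made up for illustration, and the code assumes Java 5 or later, where Charset.defaultCharset() is available:

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    /** Name of the charset that String.getBytes() and new String(byte[]) fall back to. */
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The value depends on the machine and JVM settings, so no fixed output is shown.
        System.out.println("default charset = " + defaultCharsetName());
    }
}
```

The result differs from machine to machine, which is exactly why code that relies on it behaves differently across platforms.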

> How could the second output output-2  produces  the same output as
> output-1 ?

Why do you think they should be different at all? You start with the
same Unicode string. Then you convert it into two (possibly different)
byte representations. Then you convert the byte representations with the
correct *matching reverse operation* back to two Unicode strings.

The version where you use the UTF-8 byte encoding can't fail. It is made
to represent Unicode characters, and you provide Unicode characters for
a start. From Java's point of view it is even a very trivial operation,
since the VM uses a modified UTF-8 encoding internally, so there isn't
much to do when converting to a UTF-8 byte sequence.

The only way the version which uses the platform's default encoding
could fail would be if the platform's encoding could not represent a
particular character in a platform-specific byte sequence. In that case
you wouldn't get a full round trip conversion for such characters. This
is, however, very unlikely, since you chose Unicode characters which
are all well in the Latin 1 range. This is the second most common
character encoding after seven bit ASCII, and many character encodings
encompass Latin 1 in one way or the other (the first 256 Unicode
characters are actually the Latin 1 characters).
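
One way to check in advance whether a charset can represent a string is CharsetEncoder.canEncode. The following is only a sketch, not part of the original reply: the class and method names are hypothetical, and it uses the StandardCharsets constants, which assume Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CanEncodeDemo {
    /** True if every character of s has a byte representation in cs. */
    static boolean canRepresent(String s, Charset cs) {
        return cs.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String original = "A\u00ea\u00f1\u00fcC"; // the string from the question

        System.out.println(canRepresent(original, StandardCharsets.ISO_8859_1)); // true
        System.out.println(canRepresent(original, StandardCharsets.UTF_8));      // true
        System.out.println(canRepresent(original, StandardCharsets.US_ASCII));   // false
        // The euro sign (U+20AC) lies outside Latin 1.
        System.out.println(canRepresent("\u20ac", StandardCharsets.ISO_8859_1)); // false
    }
}
```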

/Thomas
Signature

The comp.lang.java.gui FAQ:
ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq
http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/

opalpa@gmail.com opalinski from opalpaweb - 30 Jan 2006 13:23 GMT
"The Java 2 platform uses the UTF-16 representation in char  arrays and
in the String and StringBuffer  classes"
(http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html)

> From Java's point of view it is even a very trivial operation,
> since the VM uses a modified UTF-8 encoding internally

When one talks about Java using a modified UTF-8, it normally refers to
Java representing UTF-8 a little differently than most implementations.
http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8_in_Java

Java uses UTF-16 interanally.
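
The UTF-16 view is visible through the String API itself. A small sketch (not from the original post; the class name is made up) using a character outside the Basic Multilingual Plane, which occupies two 16-bit chars:

```java
class Utf16Demo {
    /** Number of 16-bit UTF-16 code units in s. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in s. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'A', 'ê', and U+1F600 written as a surrogate pair
        String s = "A\u00ea\uD83D\uDE00";

        System.out.println(codeUnits(s));                          // 4
        System.out.println(codePoints(s));                         // 3
        System.out.println(Character.isHighSurrogate(s.charAt(2))); // true
    }
}
```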

Opalinski
opalpa@gmail.com
http://www.geocities.com/opalpaweb/
Alex Buell - 30 Jan 2006 13:25 GMT
On 30 Jan 2006 05:23:00 -0800 "opalpa@gmail.com opalinski from
opalpaweb" <opalpa@gmail.com> waved a wand and this message magically
appeared:

> Java uses UTF-16 interanally.

"inter-anally"? Teehee.

Signature

http://www.munted.org.uk

"Honestly, what can I possibly say to get you into my bed?" - Anon.

Roedy Green - 30 Jan 2006 14:44 GMT
On 30 Jan 2006 05:23:00 -0800, "opalpa@gmail.com opalinski from
opalpaweb" <opalpa@gmail.com> wrote, quoted or indirectly quoted
someone who said :

>Java uses UTF-16 interanally.

what that a typo or a Freudian slip or a slur?
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

opalpa@gmail.com opalinski from opalpaweb - 30 Jan 2006 20:52 GMT
me> Java uses UTF-16 interanally.
Alex>  "inter-anally"? Teehee.
Roedy> what that a typo or a Freudian slip or a slur?

Too many message windows to too many sexpartners.  All this
simultanallity; poor linear mind gets vexed.

Lol.

Cheers.
Chris Uppal - 30 Jan 2006 14:44 GMT
> the VM uses a modified UTF-8 encoding internally, so there isn't
> much to do when converting to a UTF-8 byte sequence.

This is almost certainly untrue for any given JVM. It's true that some of the
/external interfaces/ to the JVM, notably JNI and the classfile format, do use
the modified version of UTF-8, but that in no way constrains, or (probably)
reflects, the internal representation of Java Strings.

If we are talking about the Sun implementations, then Strings are represented
(quite explicitly at Java level) as char[] arrays which hold Unicode data
represented as UTF-16 sequences of 16-bit integers.  Of course, there might be
other versions of the platform which have different implementations of String.
I suppose it's not impossible that one of them could use byte[] arrays in
not-actually-UTF-8 format, but I find it hard to imagine a convincing
motivation.

BTW, converting Sun's bastardised imitation of UTF-8 into real UTF-8 is /not/
trivial.  Converting not-actually-UTF-8 into UTF-8 involves (logically) the
same steps as converting not-actually-UTF-8 to UTF-16, decoding that to
Unicode, and finally encoding that as UTF-8.
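
The difference between the two formats can be observed with DataOutputStream.writeUTF, which is documented to write modified UTF-8. The sketch below is an illustration only (class and helper names are hypothetical, and StandardCharsets assumes Java 7 or later). It compares both encodings for U+0000 followed by a supplementary character:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    /** Bytes produced by writeUTF, without its 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(s);
        out.flush();
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Space-separated upper-case hex dump. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String s = "\u0000\uD83D\uDE00"; // U+0000 followed by U+1F600

        // Standard UTF-8: one byte for NUL, four bytes for U+1F600
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8))); // 00 F0 9F 98 80
        // Modified UTF-8: NUL becomes C0 80, each surrogate gets three bytes
        System.out.println(hex(modifiedUtf8(s)));                    // C0 80 ED A0 BD ED B8 80
    }
}
```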

   -- chris
Roedy Green - 30 Jan 2006 21:25 GMT
On Mon, 30 Jan 2006 14:45:13 -0000, "Chris Uppal"
<chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly
quoted someone who said :

>  Of course, there might be
>other versions of the platform which have different implementations of String.
>I suppose it's not impossible that one of them could use byte[] arrays in
>not-actually-UTF-8 format, but I find it hard to imagine a convincing
>motivation.

To index and process strings you need them in 16 bit form.  However,
for storage of strings not actively being processed I could imagine
some sort of caching scheme that converts them to UTF-8 for more
compact storage.  All string handling functions would have to be aware
of the two formats and automatically unpack Strings when accessed for
anything other than referencing the string as a whole.

gk - 31 Jan 2006 05:32 GMT
> > what  is platform's default charset ?
>
> Charset.defaultCharset()

This does not exist.

Look here:

http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html
Chris Uppal - 31 Jan 2006 10:48 GMT
> > Charset.defaultCharset()
> this does not exists .

It's new in 1.5.

   -- chris
Roedy Green - 31 Jan 2006 13:45 GMT
On Tue, 31 Jan 2006 10:48:17 -0000, "Chris Uppal"
<chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly
quoted someone who said :

>> > Charset.defaultCharset()
>> this does not exists .
>
>It's new in 1.5.
Prior to that you had to look at a System property.  It might even have
been restricted to signed applets.  See
http://mindprod.com/jgloss/encoding.html  I should have it all
documented there.

Piotr Kobzda - 13 Mar 2006 14:57 GMT
> On Tue, 31 Jan 2006 10:48:17 -0000, "Chris Uppal"
> <chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly
[quoted text clipped - 10 lines]
> http://mindprod.com/jgloss/encoding.html  I should have it all
> documented there.

A less restrictive alternative to querying System properties is:

String defaultEncodingName = new
java.io.OutputStreamWriter(System.out).getEncoding();

Regards,
piotr
gk - 31 Jan 2006 05:45 GMT
> The only way the version which uses the platform's default encoding
> could fail would be if the platform's encoding could not represent a
[quoted text clipped - 5 lines]
> encompass Latin 1 in one way or the other (the first 256 Unicode
> characters are actually the Latin 1 characters).

I'm a bit confused.

Do you mean that the default character set for every platform is
"Unicode"?

Because the doc says:

String(byte[] bytes)
         Constructs a new String by decoding the specified array of
bytes using the platform's default charset.

So when I do the reverse operation without naming the encoding, the
default charset is used, and that may produce different strings on
different platforms.

Do you mean that all platforms have UTF-8 as their default character set?

Do you mean that when I called String defaultTrip = new
String(defaultBytes); UTF-8 was used? How is that possible? Linux may
use one encoding as its default and Solaris another, so this would
produce some other strings. Even if those platforms support UTF-8, how
would UTF-8 be chosen by default (I did not name it in the
constructor)? So aren't they bound to produce different results?

I don't have other platforms, so I can't test this anywhere else.

I tried it only on Windows XP.

It is still confusing.

Please explain.

And who knows what the default charset of other platforms is? This
might produce some other strings.
gk - 31 Jan 2006 05:56 GMT
I discovered this:

import java.nio.charset.Charset;
class StringTest
{
    public static void main(String[] args)
    {
        String defaultEncodingName = System.getProperty( "file.encoding" );
        System.out.println(defaultEncodingName);
    }
}

output:
=====
Cp1252

SO, my platform supports only Cp1252 encoding.

According to DOC >>

byte[]     getBytes()
         Encodes this String into a sequence of bytes using the
platform's default charset, storing the result into a new byte array.

AND

String(byte[] bytes)
         Constructs a new String by decoding the specified array of
bytes using the platform's default charset.

and According to my code here,

byte[] defaultBytes = original.getBytes();
String defaultTrip = new String(defaultBytes);

they should work with the platform's default charset, and that is
"Cp1252" (my discovery).

Note that this is not Unicode!

But when I printed

System.out.println("defaultTrip = " + defaultTrip);

it printed the Unicode string! Shouldn't this have printed some other
complex, odd-looking string?
Roedy Green - 31 Jan 2006 13:46 GMT
>SO, my platform supports only Cp1252 encoding.

unless you specifically ask for something else. That is just the
default for Readers/Writer and String <=> byte[] conversion.

See http://mindprod.com/jgloss/encoding.html

Roedy Green - 31 Jan 2006 20:01 GMT
On Tue, 31 Jan 2006 13:46:47 GMT, Roedy Green
<my_email_is_posted_on_my_website@munged.invalid> wrote, quoted or
indirectly quoted someone who said :

>>SO, my platform supports only Cp1252 encoding.
>
>unless you specifically ask for something else. That is just the
>default for Readers/Writer and String <=> byte[] conversion.
>
>See http://mindprod.com/jgloss/encoding.html

see http://mindprod.com/jgloss/fileio.html
for how to specify a different encoding for Reader/Writer

see http://mindprod.com/jgloss/conversion for how to specify a
different one for String <=> byte[] conversion.
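
As a rough illustration of naming the encoding explicitly instead of relying on the default (a sketch only, not taken from the pages linked above; class and method names are hypothetical, and StandardCharsets plus try-with-resources assume Java 7 or later):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class ExplicitEncodingDemo {
    /** Writes text through a Writer that is told to use UTF-8. */
    static byte[] writeUtf8(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            w.write(text);
        }
        return sink.toByteArray();
    }

    /** Reads bytes back through a Reader that is told to use UTF-8. */
    static String readUtf8(byte[] bytes) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "A\u00ea\u00f1\u00fcC";
        byte[] bytes = writeUtf8(original);
        System.out.println(bytes.length);                      // 8
        System.out.println(readUtf8(bytes).equals(original));  // true
    }
}
```

Because both ends name the same charset, the result no longer depends on the machine the code runs on.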

gk - 01 Feb 2006 06:49 GMT
Here are some points I have noted from your comments:

1) Java strings are simply chars; we can think of them as Unicode
chars.

So String str = "one big string" is a bunch of Unicode chars.

2) There is no encoding involved when we talk about Strings. Encoding
comes into the picture when we do the String <=> byte[] conversion.

3) We can use any encoding to encode this bunch of Unicode chars into a
byte[] array. If that encoding can represent these Unicode chars, then
we are safe, because when we convert back there will be no loss of
data.

4) It is always suggested to use the UTF-8 encoding when converting to
byte[] and vice versa.

BUT, I am not comfortable when I run this code of Roedy Green's
(http://mindprod.com/jgloss/conversion):

String s = "abc";
// string -> byte[]
byte [] b = s.getBytes( "8859_1" /* encoding */ );
// byte[] -> String
String t = new String( b , "Cp1252" /* encoding */ );

This code prints t="abc"!

See, we encoded the string via "8859_1" and decoded it via "Cp1252",
and we got the original string back.

I also tried:

              String s = "abc";
        // string -> byte[]
        byte [] b = s.getBytes( "windows-1250" /* encoding */ );
        // byte[] -> String
        String t = new String( b , "Cp1252" /* encoding */ );
        System.out.println(t);

and again got t="abc".

There is no loss of data.

So this means each encoding recognises the other encodings, and that's
why they are able to convert back.

But this is not good. It is not expected that one encoding would be
recognised by another encoding! If that happens, anybody can hack any
binary document written in an unknown encoding like this. The thief
need not know whether the owner encoded the file in UTF-8, "8859_1",
"Cp1252", "windows-1250", etc., because the thief knows the encodings
are brothers and recognise each other, so he could decode with any
encoding.

P.S.: Mind you, I am talking about cryptography, but in this example we
are losing the meaning of the word "encoding".
gk - 01 Feb 2006 06:55 GMT
Sorry, I meant: I am NOT talking about cryptography and the
different versions of encoding.

I am talking about these simple charset encodings.
Roedy Green - 01 Feb 2006 10:08 GMT
>sorry, i meant ...i am NOT talking abot  Cryptrography  and the
>different versions of encoding.
>
>i am talking about these simple charset encoding .

so am I.

Chris Uppal - 01 Feb 2006 08:11 GMT
> so, this means, each encoding recognises other encoding.....and thats
> why they are able to revert back.

Not quite.  Your argument is sensible but what you don't (yet ;-) know is that
all or nearly all character encodings overlap for a certain range of
characters.  Specifically, the printable ASCII characters have the same
numerical values in CP1252, ISO8859-1, and nearly all other character encodings
(including ASCII).  What's more the Unicode assigned code-points (numbers to
you and me) for those characters are the same too.

So the String "abc" contains the chars with numerical values 0x61 0x62 0x63.  If
we translate that to bytes using ISO8859-1 then we will get bytes with values
0x61 0x62 0x63.  But don't let that mislead you, outside that limited range
(essentially the printable characters in the range 32-126) things become very
different.
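
A short sketch of that overlap and of where it stops (an illustration, not part of the original post; class and method names are made up, and StandardCharsets assumes Java 7 or later):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiOverlapDemo {
    /** True if s encodes to the same byte sequence in both charsets. */
    static boolean sameBytes(String s, Charset a, Charset b) {
        return Arrays.equals(s.getBytes(a), s.getBytes(b));
    }

    public static void main(String[] args) {
        // Printable ASCII: every one of these charsets agrees.
        System.out.println(sameBytes("abc", StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8));    // true
        System.out.println(sameBytes("abc", StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1)); // true

        // Outside ASCII the byte sequences differ: EA in Latin 1, C3 AA in UTF-8.
        System.out.println(sameBytes("\u00ea", StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)); // false

        // Decoding with the wrong charset then yields two wrong characters instead of one right one.
        String wrong = new String("\u00ea".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals("\u00c3\u00aa")); // true
    }
}
```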

In a way that overlap is very handy.  It means that if someone sends me an
old-fashioned, 8-bit, text file (not Unicode) written in English then the
chances are that I'll be able to read it without me having to try to find out
what codepage the author used to create it.  Which is a good thing because (a)
there's a good chance that the author hasn't got the faintest idea what a
code-page /is/ let alone which one s/he used to create the file, and (b) I
don't want to mess around trying to change code-page.  Unfortunately, that only
works for text using the restricted range of characters.  As soon as you start
using accented characters, or characters from non-English orthographies, the
whole thing breaks down and life becomes very awkward.  Which is what Unicode
is /intended/ to avoid.

But in a way, it's a very Bad Thing too.  Because of the overlap, it's very
hard (at least for people handling mostly English text) to see when they've
made a mistake with their programming.  Or when they've carelessly, or
sloppily, made assumptions about the code-page in use.  It would be nice to
have (perhaps as part of the standard JDK) a debugging Charset which mapped
Unicode data to some sort of recognisable gibberish -- case-inverted or even
"rot13" would do.  For all I know, there could be one there already, and I've
missed it...

   -- chris
Thomas Hawtin - 01 Feb 2006 09:45 GMT
> But in a way, it's a very Bad Thing too.  Because of the overlap, it's very
> hard (at least for people handling mostly English text) to see when they've
[quoted text clipped - 4 lines]
> "rot13" would do.  For all I know, there could be one there already, and I've
> missed it...

UTF-16LE should more or less fit the bill. Perhaps UTF-16BE would work
better with single characters (not entirely sure what happens with a
single byte), although it is more common.

export LANG=tr_TR.UTF-16LE

Tom Hawtin
Signature

Unemployed English Java programmer
http://jroller.com/page/tackline/

Chris Uppal - 01 Feb 2006 11:44 GMT
[me:]
> > It would be nice to have (perhaps as part of the standard JDK) a
> > debugging Charset which mapped Unicode data to some sort of
[quoted text clipped - 3 lines]
> UTF-16LE should more or less fit the bill. [...]
> export LANG=tr_TR,UTF-16LE

That's a thought.  Not too sure about those NUL bytes though (haven't tried
it yet).
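
For reference, this is what those NUL bytes look like. The sketch below is illustrative only (class and helper names are hypothetical, StandardCharsets assumes Java 7 or later); it encodes plain ASCII text as UTF-16LE and then decodes it with a mismatched single-byte charset:

```java
import java.nio.charset.StandardCharsets;

class Utf16LeDebugDemo {
    /** Space-separated upper-case hex dump. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "abc".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(hex(bytes)); // 61 00 62 00 63 00

        // A mismatched decode keeps the letters but interleaves NUL characters,
        // so the mistake shows up as a doubled length.
        String wrong = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length()); // 6
    }
}
```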

BTW, for anyone who's interested, I rummaged around the Web a little and found
a rot13 Charset, and the corresponding CharsetProvider, at the website for Ron
Hitchens's "Java NIO" book (which I haven't read).  The website is
   http://www.javanio.info/
the code (which is /not/ free for commercial use) is in:
   filearea/bookexamples/unpacked/com/ronsoft/books/nio/charset
under the above root.  See the files:
   RonsoftCharsetProvider.java
   Rot13Charset.java

The first of those files provides sketchy instructions for installing the new
Charset; note that the instructions contain a typo; the filename
   META-INF/services/java.nio.charsets.spi.CharsetProvider
should be
   META-INF/services/java.nio.charset.spi.CharsetProvider
(no 's' on the end of charset).

   -- chris
Roedy Green - 01 Feb 2006 12:39 GMT
On Wed, 1 Feb 2006 11:46:07 -0000, "Chris Uppal"
<chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly
quoted someone who said :

>BTW, for anyone who's interested, I rummaged around the Web a little and found
>a rot13 Charset, and the corresponding CharsetProvider, at the website for Ron
>Hitchens's "Java NIO" book (which I haven't read).  The website is
>    http://www.javanio.info/

If you feel up to rolling your own, the instructions for how to do it
are at http://mindprod.com/jgloss/encoding.html#ROLLYOUROWN

It is a bunch of mindless housekeeping BS plus writing a decodeLoop
and encodeLoop method to interconvert byte[] <=> char[]
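
A rough sketch of what such a hand-rolled Charset might look like. This is not the code from the linked page or from the "Java NIO" book: the class name, the charset name "X-ROT13-DEBUG", and the decision to handle only 7-bit ASCII are all assumptions made for illustration, and no CharsetProvider registration is shown.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

/** Hypothetical debugging charset: 7-bit ASCII with letters rotated by 13. */
class Rot13Charset extends Charset {

    Rot13Charset() {
        super("X-ROT13-DEBUG", null); // made-up name, no aliases
    }

    @Override
    public boolean contains(Charset cs) {
        return cs instanceof Rot13Charset || "US-ASCII".equals(cs.name());
    }

    /** Rotates ASCII letters by 13 places; everything else is unchanged. */
    static int rot13(int c) {
        if (c >= 'a' && c <= 'z') return 'a' + (c - 'a' + 13) % 26;
        if (c >= 'A' && c <= 'Z') return 'A' + (c - 'A' + 13) % 26;
        return c;
    }

    @Override
    public CharsetEncoder newEncoder() {
        return new CharsetEncoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                    char c = in.get(in.position()); // peek without consuming
                    if (c > 0x7F) return CoderResult.unmappableForLength(1);
                    in.get();                        // consume the char
                    out.put((byte) rot13(c));
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    @Override
    public CharsetDecoder newDecoder() {
        return new CharsetDecoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                    byte b = in.get(in.position()); // peek without consuming
                    if (b < 0) return CoderResult.malformedForLength(1);
                    in.get();                        // consume the byte
                    out.put((char) rot13(b));
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    public static void main(String[] args) {
        Charset rot13 = new Rot13Charset();
        byte[] bytes = "Hello".getBytes(rot13);
        // Looking at the raw bytes as ASCII shows the gibberish.
        System.out.println(new String(bytes, java.nio.charset.StandardCharsets.US_ASCII)); // Uryyb
        // Decoding with the same charset restores the text.
        System.out.println(new String(bytes, rot13));                                      // Hello
    }
}
```

Passing the instance directly to String.getBytes(Charset) avoids the provider installation step, which is enough for a quick experiment.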

Roedy Green - 01 Feb 2006 10:16 GMT
On Wed, 1 Feb 2006 08:11:56 -0000, "Chris Uppal"
<chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly
quoted someone who said :

> Which is a good thing because (a)
>there's a good chance that the author hasn't got the faintest idea what a
>code-page /is/ let alone which one s/he used to create the file, and (b) I
>don't want to mess around trying to change code-page.

And the encoding used is NOT embedded at the head of the document the
way you might imagine it would be handled.  The receiver just has to
KNOW what encoding it is.

This reminds me back in the early 80s I wrote one of the first
electronic medical billing programs for doctors for whom this was a
complete novelty and status symbol.  On a demo, one doctor was
horrified, "You mean you have to TYPE; it doesn't just KNOW?"

Another doctor was furious at my incompetence when he discovered that
he would lose keying when he rebooted his machine in the middle of
data entry.  I tried to explain that he should not reboot. There was
no need to. He replied that he simply LIKED rebooting and he was not
about to change his nervous habit.


Roedy Green - 01 Feb 2006 10:16 GMT
On Wed, 1 Feb 2006 08:11:56 -0000, "Chris Uppal"
<chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly
quoted someone who said :

>It would be nice to
>have (perhaps as part of the standard JDK) a debugging Charset which mapped
>Unicode data to some sort of recognisable gibberish -- case-inverted or even
>"rot13" would do.  For all I know, there could be one there already, and I've
>missed it...

what do you do with this?

Chris Uppal - 01 Feb 2006 14:05 GMT
[me:]
> > It would be nice to
> > have (perhaps as part of the standard JDK) a debugging Charset
> > which mapped Unicode data to some sort of recognisable gibberish --
> > case-inverted or even "rot13" would do.[...]
>
> what do you do with this?

The problem for me, and I think for other programmers, is that you
can't /see/ when something is happening using the wrong Charset.  Since
I'm only an English speaker, the only sample text I can read uses
English characters throughout, and so if I use a wrong Charset there
won't be any obvious differences (as "gk" found).  So I'd like to be
able to either set the default Charset to something that is instantly
recognisable if it gets used when I'm not expecting it, or explicitly
use my debugging charset, so that I can follow the data through and see
that it is used everywhere that I intend.

Just a debugging aid.  I'd have little use for it if I were -- say
-- Korean.

It would probably be helpful as a teaching tool too (although I am not
a teacher), since it would emphasise the difference between the
character sequences in String (or similar) and the byte sequences
produced by encoding -- differences that can be lost on those whose
native language is ASCII-compatible.

   -- chris
Roedy Green - 01 Feb 2006 21:22 GMT
On 01 Feb 2006 14:05:25 GMT, "Chris Uppal"
<chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly
quoted someone who said :

>The problem for me, and I think for other programmers, is that you
>can't /see/ when something is happening using the wrong Charset.  Since
[quoted text clipped - 5 lines]
>use my debugging charset, so that I can follow the data through and see
>that it is used everywhere that I intend.

A very simple one might convert char s -> byte f, or simply one that
implements some ligatures; see
http://mindprod.com/jgloss/ligature.html, to give an early American look
to the page.

It then becomes a fully legit Charset you might use in real life.
It can piggy back on any other charset adding ligaturisation to it.

See http://mindprod.com/jgloss/encoding.html#ROLLYOUROWN

for how to proceed. Even a newbie could tackle this one.

ozgwei - 03 Feb 2006 12:36 GMT
> The problem for me, and I think for other programmers, is that you
> can't /see/ when something is happening using the wrong Charset.  Since
[quoted text clipped - 14 lines]
> produced by encoding -- differences that can be lost on those who's
> native language is ASCII-compatible.

Have you tried EBCDIC? The encoding name is Cp1047. But I don't know
whether it is available in JVMs other than IBM's...
Roedy Green - 03 Feb 2006 22:46 GMT
>Have you tried EBCDIC? The encoding name is Cp1047. But I don't know
>whether it is available in JVMs other than IBM's...

There are scores of national variants for EBCDIC.

Check out my chart at http://mindprod.com/jgloss/encoding.html for
which ones are supported.

Chris Uppal - 04 Feb 2006 12:55 GMT
[me:]
> > The problem for me, and I think for other programmers, is that you
> > can't see when something is happening using the wrong Charset.
[...]
> Have you tried EBCDIC? The encoding name is Cp1047. But I don't know
> whether it is available in JVMs other than IBM's...

Thanks for the suggestion.

   java -Dfile.encoding=Cp1047 my.test.Application

produces satisfying gibberish ;-)

(Actually it's probably /too/ gibberishish, Thomas's suggested UTF16
works a little better.)

   -- chris
Chris Uppal - 31 Jan 2006 11:04 GMT
> bit confused.

I'm not certain,  but I /think/ that you might be misunderstanding the
relationship between Strings and Charsets.

A String has /no/ Charset, and is not associated with any particular byte
encoding.  (Technically this is only true if you are using the right APIs, but
it's close enough to being true to be a good approximation to start from[*]).
That's to say a String contains pure Unicode data, not in any encoding, just
pure characters.  (Compare the way that an int contains pure integer data,
separate from any encoding as big-endian or little-endian, or anything else).
A Charset is only involved when you need to convert a String to bytes (or the
other way around) in order to communicate with external systems or save the
data to file.

So, in your original example, after
   String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");
you have a String, original, which contains pure Unicode.

If you now do:
   byte[] utf8Bytes = original.getBytes("UTF8");
then you have the original data encoded as UTF-8.  And later:
   String roundTrip = new String(utf8Bytes, "UTF8");
which gives you a new String containing pure Unicode data, assembled by
decoding the UTF-8 bytes.  Since UTF-8 is (by design) capable of encoding any
Unicode data, no information will have been lost, and roundTrip will be the
same as original.

When you do the same using the platform-default Charset:
   byte[] defaultBytes = original.getBytes();
   String defaultTrip = new String(defaultBytes);
The only thing that is different is that you are using a different Charset.
So, if that Charset happens to be capable of encoding every character in the
original String, no data will have been lost and defaultTrip will be the same as
original.  If you had used any Unicode characters in original which could /not/
be encoded in the platform default Charset then the round trip would have
failed (in practice the unmappable characters are typically replaced by '?').
Since the platform default Charset is machine-specific, that means
that you don't really know what's going to happen when you convert Strings into
byte[] arrays using it -- which is why using the platform default Charset is
usually a bad idea.
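
To see a failing round trip without changing the machine's default, one can name a deliberately narrow charset. A minimal sketch (not from the original post; names are hypothetical, StandardCharsets assumes Java 7 or later), with US-ASCII standing in for a default Charset that cannot encode the accented letters:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyRoundTripDemo {
    /** Encodes s to bytes with cs and decodes the bytes again with the same cs. */
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String original = "A\u00ea\u00f1\u00fcC";

        // UTF-8 and Latin 1 can both represent every character: nothing is lost.
        System.out.println(roundTrip(original, StandardCharsets.UTF_8).equals(original));      // true
        System.out.println(roundTrip(original, StandardCharsets.ISO_8859_1).equals(original)); // true

        // US-ASCII cannot: String.getBytes(Charset) substitutes '?' for each unmappable char.
        System.out.println(roundTrip(original, StandardCharsets.US_ASCII)); // A???C
    }
}
```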

But the important thing to realise is that Strings don't have Charsets.
Charsets are only used when converting Strings to byte sequences.

   -- chris

([*] We can talk more about that approximation, if you want, but it's best to get
the current confusion cleared up first)
Roedy Green - 30 Jan 2006 12:53 GMT
>what  is platform's default charset ?

see http://mindprod.com/jgloss/encoding.html

for how to find out. Oddly it is a secret for unsigned Applets.

Roedy Green - 30 Jan 2006 12:54 GMT
>    byte[] utf8Bytes = original.getBytes("UTF8");
>        byte[] defaultBytes = original.getBytes();
>        String roundTrip = new String(utf8Bytes, "UTF8");
>        String defaultTrip = new String(defaultBytes);

try dumping out the byte encodings. That will solve your mystery.
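
A sketch of such a dump (not from the original post). The class and helper names are made up, StandardCharsets assumes Java 7 or later, and "windows-1252" is named explicitly on the assumption that the JVM provides it, to mimic the Windows XP default reported elsewhere in this thread:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteDumpDemo {
    /** Space-separated upper-case hex dump. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "A\u00ea\u00f1\u00fcC";

        // UTF-8 needs two bytes for each accented letter.
        System.out.println(hex(original.getBytes(StandardCharsets.UTF_8)));
        // 41 C3 AA C3 B1 C3 BC 43

        // Cp1252 (like Latin 1) needs one byte for each.
        System.out.println(hex(original.getBytes(Charset.forName("windows-1252"))));
        // 41 EA F1 FC 43
    }
}
```

The two byte arrays differ, yet each decodes back to the same String when the matching charset is used, which is why both outputs in the question look identical.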


