Java Forum / General / June 2006
What is the encoding of this String?
howachen@gmail.com - 26 Jun 2006 16:25 GMT
I tried to print a string using Eclipse (console set to UTF8)
byte b[] = {-28, -72, -83};
String str = new String(b);
System.out.println(str);
a Chinese character was shown, but the UTF8 value of that character should not be "{-28, -72, -83}"
so can anyone tell me what exactly {-28, -72, -83} is?
thanks!
Robert Klemme - 26 Jun 2006 16:29 GMT
> I try to print a string using Eclipse (console set to UTF8)
> [quoted text clipped - 6 lines]
>
> so can anyone can tell me what exactly {-28, -72, -83} is?
The default encoding likely converted this byte sequence to something else. You could print the hex value of each char and look it up in the Unicode code charts at http://www.unicode.org/
Kind regards
robert
howachen@gmail.com - 26 Jun 2006 16:40 GMT
> > I try to print a string using Eclipse (console set to UTF8)
> > [quoted text clipped - 14 lines]
> > robert
Hex value = [e, 4, b, 8, a, d] (Using apache commons to convert to the byte array)
I can't find any related character on unicode.org. I heard that Java Unicode is modified from the standard?
thanks...
Mike Schilling - 26 Jun 2006 17:02 GMT
>> > I try to print a string using Eclipse (console set to UTF8)
>> > [quoted text clipped - 17 lines]
> Hex value = [e, 4, b, 8, a, d] (Using apache commons to convert to the
> byte array)
You don't want a byte array here, you want the characters in the string. Try
for (int i = 0; i < str.length(); i++)
    System.out.println((int)str.charAt(i));
howachen@gmail.com - 26 Jun 2006 17:08 GMT Mike Schilling
> >> > I try to print a string using Eclipse (console set to UTF8)
> >> > [quoted text clipped - 23 lines]
> for (int i = 0; i < str.length(); i++)
> System.out.println((int)str.charAt(i));
sorry, this makes no difference...
your code also outputs the hex values: [e, 4, b, 8, a, d]
Thomas Fritsch - 26 Jun 2006 17:13 GMT
> I try to print a string using Eclipse (console set to UTF8)
>
> byte b[] = {-28, -72, -83};
> String str = new String(b);
// You should really specify the wanted UTF-8 encoding here, instead
// of assuming that the system's default encoding is UTF-8:
String str = new String(b, "UTF-8");
> System.out.println(str);
//dump the hex values:
for (int i = 0; i < str.length(); i++)
    System.out.println("["+i+"]=0x"+Integer.toHexString(str.charAt(i)));
> a Chinese character was shown, but the UTF8 value of that character
> should not be "{-28, -72, -83}"
>
> so can anyone can tell me what exactly {-28, -72, -83} is?
With the code above I found that str is "\u4e2d", which means "middle" in Chinese, according to <http://www.unicode.org/charts/unihan.html>
Thomas
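For anyone who wants to reproduce that result, here is a minimal sketch of the same check. It assumes a JDK with java.nio.charset.StandardCharsets (Java 7 or later, so newer than this thread); the class and method names are invented for illustration only.

```java
import java.nio.charset.StandardCharsets;

class DecodeUtf8Demo {
    // Decode the bytes as UTF-8 explicitly instead of relying on the platform default.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Render each 16-bit char of the string in U+XXXX notation, separated by spaces.
    static String dumpChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] b = {-28, -72, -83};
        System.out.println(dumpChars(decodeUtf8(b))); // prints U+4E2D
    }
}
```

Passing a Charset object rather than the name "UTF-8" also avoids the checked UnsupportedEncodingException that the String-name constructor declares.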
margie mago - 26 Jun 2006 17:16 GMT
> I try to print a string using Eclipse (console set to UTF8)
> [quoted text clipped - 8 lines]
>
> thanks!
Try:
String str = new String(b, "UTF-8");
howachen@gmail.com - 26 Jun 2006 17:27 GMT margie mago
> > I try to print a string using Eclipse (console set to UTF8)
> > [quoted text clipped - 12 lines]
> > String str = new String(b, "UTF-8");
you are right, but the question in my post was:
"What is the encoding of this String?" Is it the Java Unicode representation? UTF16, UTF16 LE?
Mike Schilling - 26 Jun 2006 20:08 GMT
> howachen@gmail.com wrote:
> [quoted text clipped - 14 lines]
> > String str = new String(b, "UTF-8");
> you are right, but the problem of my post :
> "What is the encoding of this String?" , is it Java Unicode representation? UTF16, UTF16 LE ?
Strings in Java don't have encodings. They're 16-bit unicode values. Encodings apply to the representation of strings (and individual characters) as bytes. The line shown above:
String str = new String(b, "UTF-8");
means "convert b to a String, assuming that b is encoded in UTF-8."
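A small sketch of that idea in both directions, under the assumption of a modern JDK (StandardCharsets is Java 7+). The helper names toUtf8 and fromUtf8 are hypothetical; the point is only that the charset is named at the boundary between String and byte[], and nowhere else.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {
    // String -> bytes: this is the step where an encoding is chosen.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // bytes -> String: the caller asserts that the bytes are UTF-8.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = String.valueOf((char) 0x4E2D); // the character from this thread
        byte[] encoded = toUtf8(s);
        System.out.println(Arrays.toString(encoded));    // prints [-28, -72, -83]
        System.out.println(fromUtf8(encoded).equals(s)); // prints true
    }
}
```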
howachen@gmail.com - 27 Jun 2006 04:12 GMT Mike Schilling
> margie mago wrote:
> [quoted text clipped - 23 lines]
>
> Strings in Java don't have encodings. They're 16-bit unicode values.
unicode is a kind of encoding, isn't it?
String str = new String(b, "UTF-8"); //UTF8
String str = new String(b); //WHAT IS THIS?
Mike Schilling - 27 Jun 2006 06:14 GMT
>> Strings in Java don't have encodings. They're 16-bit unicode values.
>unicode is a kind of encoding, isn't ?
"Encoding" in Java specifically means "way of representing 16-bit unicode characters in 8-bit bytes". Characters in a Java string *are* 16-bit unicode. In that sense, they're not encoded, because they're in their native form.
>String str = new String(b, "UTF-8"); //UTF8
>String str = new String(b); //WHAT IS THIS?
Every instance of a JVM has a default encoding. As the Sun reference (http://java.sun.com/j2se/corejava/intl/reference/faqs/index.html#default-encoding) says:
The default encoding is selected by the JRE based on the host operating system and its locale. For example, in the US locale on Windows, windows-1252 is used. In the Simplified Chinese locale on Solaris, GB2312, GBK, GB18030, or UTF-8 can be the default encoding, depending on the selection made when logging into Solaris. The default encoding is significant because the JRE commonly exchanges text with the host operating system in the default encoding. The default encoding has to match the encoding used by the host operating system to ensure correct interaction. An application can determine the default encoding by calling the Charset.defaultCharset method, available since J2SE 5. In older versions of the Java platform, you can use the expression (new OutputStreamWriter(new ByteArrayOutputStream())).getEncoding()
Both new String(byte[]) and String.getBytes() use the default encoding, as do constructors like new FileReader(), new FileWriter(), etc. That is,
byte b[] = {-28, -72, -83};
String str = new String(b);
will do different things on different platforms, and possibly different things on the same platform at different times. If I were writing a lint-like tool for Java, I'd make use of the default encoding a warning.
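To make the platform dependence concrete, the sketch below decodes the same three bytes under two explicitly named charsets. ISO-8859-1 is used here only as a stand-in for a Western single-byte default such as windows-1252 (an assumption made because ISO-8859-1 is guaranteed to exist on every JVM); the class and method names are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {
    // Decode the same bytes under an explicitly named charset.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        byte[] b = {-28, -72, -83};

        // Whatever this JVM selected; the value depends on the JDK version, OS and locale.
        System.out.println("default charset: " + Charset.defaultCharset());

        // As UTF-8 the three bytes form one character.
        System.out.println(decode(b, StandardCharsets.UTF_8).length());      // prints 1

        // As a single-byte charset each byte becomes its own character.
        System.out.println(decode(b, StandardCharsets.ISO_8859_1).length()); // prints 3
    }
}
```

A call to new String(b) with no charset behaves like one of these two lines or like neither, depending on where it runs, which is the reason for naming the charset explicitly.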
Chris Uppal - 27 Jun 2006 12:24 GMT
> > unicode is a kind of encoding, isn't ?
>
> "Encoding" in Java specifically means "way of representing 16-bit unicode
> characters in 8-bit bytes". Characters in a Java string *are* 16-bit
> unicode. In that sense, they're not encoded, because they're in their
> native form.
I don't really like that way of looking at it -- I think it's misleading. Here's how I see it:
There are two ways to think of Java Strings.
The first is the way that we are /supposed/ to be able to think about them, and it is usually the best way. But, unfortunately, it is technically incorrect. The second is technically correct but is harder to think about and may cause confusion.
So here's the first way. Strings are collections of characters. Characters are Unicode characters. And as such Strings and chars are pure Unicode data. There is no "encoding" involved at all (since encoding is how you translate pure Unicode data into sequences of bytes -- and Java's Strings are not sequences of bytes). So you manipulate Strings and chars directly without worrying about encodings (which are irrelevant). It's only when you want to convert between Strings and sequences of bytes (e.g. writing to file) that you have to consider what encoding to use (and you always /do/ have to consider it since files don't hold Strings, but only sequences of bytes. If you want to put Strings into a file then you /have/ to choose an encoding -- if you don't then the system will choose one for you, which isn't often what you want it to do).
That's the simple version of the story. Now the second version, which is technically accurate, but much nastier.
Due to an unfortunate set of circumstances Java has hardwired the idea that there are <= 2**16 Unicode characters. That assumption is incorrect. It is unfortunate that Unicode didn't go public on that until a few months after Java became set in stone (although there /must/ have been people working for Sun who knew all about it long before that). It's even more unfortunate that the size of a char /was/ set in stone; and very, very, unfortunate that instead of responding to the problem instantly, the Java designers spent about a decade apparently hoping that the problem would just go away by itself. It didn't and instead the situation grew worse and worse...
Anyway, brickbats aside, what has happened is that since the 16-bit limit on a char cannot be changed, Sun have been forced to redefine what a String /is/. It is no longer considered to be "pure Unicode data", but is now considered to be formally a sequence of 16-bit values which /encode/ a Unicode string using UTF-16. So now, even though Strings are not sequences of bytes, it is now technically correct to say that Java's Strings are encoded in UTF-16.
Fortunately, for many purposes, we can still use the simpler picture ("Strings are pure Unicode"), since that works perfectly well provided we are only using characters in the 16-bit range of Unicode (as the OP's example was). But if we have to deal with characters outside that range, then we have to use the second, more complicated, picture to understand what's going on.
-- chris
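The difference between the two pictures shows up as soon as a character outside the 16-bit range is involved. The following is a rough sketch using the code point methods added in J2SE 5 (Character.toChars, String.codePointCount, String.codePointAt); U+20000 is picked arbitrarily as an example of a supplementary character, and the class name is made up.

```java
class SurrogateDemo {
    // Build a String holding exactly one Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x4E2D);   // fits in one 16-bit char
        String supp = fromCodePoint(0x20000); // needs a surrogate pair

        System.out.println(bmp.length());                              // prints 1
        System.out.println(supp.length());                             // prints 2
        System.out.println(supp.codePointCount(0, supp.length()));     // prints 1
        System.out.println(Character.isHighSurrogate(supp.charAt(0))); // prints true
        System.out.println(Integer.toHexString(supp.codePointAt(0)));  // prints 20000
    }
}
```

For the first string length() and the number of characters agree, so the simple picture holds; for the second, length() counts UTF-16 code units and only codePointCount() gives the number of characters.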
Mike Schilling - 27 Jun 2006 16:49 GMT
> Anyway, brickbats aside, what has happened is that since the 16-bit limit
> on a [quoted text clipped - 6 lines]
> UTF-16. So now, even though Strings are not sequences of bytes, it is now
> technically correct to say that Java's Strings are encoded in UTF-16.
Though, not being byte-oriented, they're none of the usual UTF-16 encodings: not LE, not BE, and no BOM. Converting a string to a byte array using UTF-16 is *not* an identity transformation. So even if your last sentence is technically correct, I think it will cause confusion.
Two notes:
1. AFAICT, the only change that would have to be made to Java to represent all Unicode characters natively would be to change the size of char to three bytes. To say it another way, if char hadn't been defined as a 2-byte integer type in the first place, there would have been no difficulty accommodating the extended Unicode range.
2. .NET, which came along much later, faithfully copied Java's mistakes in this area.
Chris Uppal - 28 Jun 2006 10:54 GMT
> Though, not being byte-oriented, they're none of the usual UTF-16
> encodings: not LE, not BE, and no BOM. Converting a string to a byte
> array using UTF-16 is *not* an identity transformation. So even if your
> last sentence is technically correct, I think it will cause confusion.
Well, I /did/ warn that the technically correct picture was likely to be confusing. There are two notions of encoding being used at the same time :-(
You probably know this, but for the record:
Unicode distinguishes between "encoding forms" and "encoding schemes". The former are ways of representing Unicode data as sequences of logical integers in some bounded range. These integers are called "code units". UTF-16 is an encoding form using 16-bit code units. Encoding /schemes/, otoh, are the physical representation of such encoded integers as sequences of bytes such as can be written to file. For 8-bit encodings like UTF-8 there is no real need to distinguish between encoding schemes and encoding forms, but for UTF-16 we need to specify the byte order before we can translate 16-bit integers into bytes. Hence there are two concrete encoding schemes, UTF-16BE and UTF-16LE, with different byte orders. There's a third encoding scheme in that family, which is also called "UTF-16" (with no adornment), which is used when either the byte order is specified by a BOM, or where it is determined unambiguously from context.
So, Java's strings are Unicode data represented in the encoding /form/ UTF-16 (which may be represented in physical RAM as UTF-16LE or UTF-16BE, as determined by the machine architecture). But there's no reason to know or care which, unless you are working with JNI or some-such -- and almost certainly not even then.
BTW, I don't claim to be able to remember the singularly opaque Unicode terminology for this stuff -- I had to go look it up....
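The three UTF-16 encoding schemes can be seen side by side by encoding the same one-character string with each of Java's corresponding charsets. This is only a sketch: hexBytes is a hypothetical helper, StandardCharsets assumes Java 7+, and the BOM shown for plain "UTF-16" reflects the documented behaviour of Java's encoder (big-endian with a byte-order mark), not a requirement of Unicode itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf16SchemesDemo {
    // Encode the string and render the bytes as space-separated unsigned hex.
    static String hexBytes(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = String.valueOf((char) 0x4E2D); // one 16-bit code unit

        System.out.println(hexBytes(s, StandardCharsets.UTF_16BE)); // prints 4E 2D
        System.out.println(hexBytes(s, StandardCharsets.UTF_16LE)); // prints 2D 4E
        System.out.println(hexBytes(s, StandardCharsets.UTF_16));   // prints FE FF 4E 2D
    }
}
```

The single code unit 0x4E2D is the same in all three cases; only the byte-level scheme differs, which is also why converting a String to bytes with "UTF-16" is not an identity transformation.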
> 1. AFAICT, the only change that would have to be made to Java to represent
> all Unicode characters natively would be to change the size of char to
> three bytes. To say it another way, if char hadn't been defined as a
> 2-byte integer type in the first place, there would have been no
> difficulty accommodating the extended Unicode range.
Agreed.
> 2. .NET, which came along much later, faithfully copied Java's mistakes in
> this area.
Odd that...
;-)
To be fair: the motivation may not have been a lemming-like urge to replicate Java's little mistakes, but a lemming-like urge to maintain compatibility with the Win32 APIs which .NET is supposed to make obsolete...
-- chris
Chris Uppal - 26 Jun 2006 17:26 GMT
> byte b[] = {-28, -72, -83};
> String str = new String(b);
[quoted text clipped - 4 lines]
>
> so can anyone can tell me what exactly {-28, -72, -83} is?
Java's signed bytes are a pain in the arse. /Everyone/ else in the world thinks of bytes as unsigned (including the Unicode consortium) but Java wants to be different....
So for most people, exactly the same pattern of bits would be described as:
0xE4 0xB8 0xAD
which is the UTF-8 encoding of a Unicode string consisting of a single character:
U+4E2D
which is a character in the unified CJK area.
You can use unsigned values to initialise Java byte arrays:
byte[] b = { (byte)0xE4, (byte)0xB8, (byte)0xAD };
which is more verbose, but (IMO) a Hell of a lot clearer.
Also, when you are printing out byte values, if you want them to look like unsigned values, you can write (for instance):
for (int i = 0; i < b.length; i++)
    System.out.println(b[i] & 0xFF);
-- chris
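Putting the masking trick together with hex output gives the 0xE4 0xB8 0xAD form directly. A minimal sketch with invented helper names:

```java
class UnsignedByteDemo {
    // Mask to the low 8 bits so the int result is 0..255 rather than -128..127.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    // Format a byte array as space-separated 0xNN values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", toUnsigned(bytes[i])));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] b = {-28, -72, -83};
        System.out.println(toUnsigned(b[0])); // prints 228
        System.out.println(toHex(b));         // prints 0xE4 0xB8 0xAD
    }
}
```

Without the mask, the byte is sign-extended when widened to int, so -28 would be formatted as FFFFFFE4. On Java 8 and later, Byte.toUnsignedInt(b) does the same job as b & 0xFF.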
howachen@gmail.com - 26 Jun 2006 17:34 GMT Chris Uppal
> > byte b[] = {-28, -72, -83};
> > String str = new String(b);
[quoted text clipped - 27 lines]
> >
> > -- chris
GREAT!
Thanks!
Mike Schilling - 27 Jun 2006 06:20 GMT
>> byte b[] = {-28, -72, -83};
>> String str = new String(b);
[quoted text clipped - 9 lines]
> wants
> to be different....
Ask yourself: how often do I want to do arithmetic on 8-bit quantities (and thus sign-extend them when converting to 16 or 32 bits) vs. how often I want to manipulate 8-bit octets which it would be idiotic to sign-extend?
.NET gets this right. C# has both signed and unsigned bytes, but "byte" means unsigned, and the framework methods that manipulate arrays of bytes all take arrays of unsigned ones.
Chris Uppal - 27 Jun 2006 12:31 GMT
> Ask yourself: how often do I want to do arithmetic on 8-bit quantities
> (and thus sign-extend them when converting to 16 or 32 bits) vs. how
> often I want to manipulate 8-bit octets which it would be idiotic to
> sign-extend?
I have now asked ;-)
My answer (to myself) was that I can't remember /ever/ wanting to work with values in the range -128..+127. (I did think I had one example -- dealing with 8-bit audio on Windows -- but when I checked it turned out that 8-bit is handled specially: it uses unsigned for 8-bit, but signed integers for higher resolution audio).
I'm not saying that no one anywhere has ever needed to do so, but if they have then I would be interested to know what sort of programming problem produced that requirement.
Values from a (domestic) fridge temperature probe perhaps ?
-- chris