cheers.
If I do
byte[] b = chinese.getBytes( "UTF-8" );
b.length is 6. But why 6, when I thought Chinese characters take up 2 bytes
per character?
On Wed, 29 Aug 2007 16:22:45 GMT, "Crouchez"
<blah@bllllllahblllbllahblahblahhh.com> wrote, quoted or indirectly
quoted someone who said :
>b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
>per character?
I suspect your parents punished you for curiosity as a toddler.
EXPERIMENT!
import java.io.UnsupportedEncodingException;

public class Chinese
{
    /**
     * test harness
     *
     * @param args not used
     */
    public static void main ( String[] args ) throws UnsupportedEncodingException
    {
        System.out.println( System.getProperty( "file.encoding" ) );
        String chinese = "\u4e2d\u5c0f";
        // explicit choice of encoding, UTF-8 supports everything including Chinese.
        byte[] b = chinese.getBytes( "UTF-8" );
        for ( int i = 0; i < b.length; i++ )
        {
            System.out.println( Integer.toHexString( 0xff & b[i] ) );
        }
        // prints
        // Cp1252
        // e4
        // b8
        // ad
        // e5
        // b0
        // 8f
        // why those chars?
        // BOM is ef bb bf, so that is not it.
        // see http://mindprod.com/jgloss/utf.html#UTF8ENCODER
        // codes >= 0x800 take 3 bytes to encode.
    }
}
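For anyone wondering where e4 b8 ad actually comes from, here is a minimal sketch of the UTF-8 bit packing for code points up to U+FFFF. The class name Utf8EncodeSketch and the method encodeBmp are made up for illustration; they are not part of the JDK or of the program above, and the method does not check for surrogates.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
class Utf8EncodeSketch {
    /** Encodes one code point in the range U+0000..U+FFFF by hand. */
    static byte[] encodeBmp(int cp) {
        if (cp < 0x80) {
            // 0xxxxxxx: 7 payload bits, one byte
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {
            // 110xxxxx 10xxxxxx: 11 payload bits, two bytes
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else {
            // 1110xxxx 10xxxxxx 10xxxxxx: 16 payload bits, three bytes
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0x4e2d, 0x5c0f }) {
            byte[] manual = encodeBmp(cp);
            // Compare against what the library produces for the same character.
            byte[] library = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            StringBuilder hex = new StringBuilder();
            for (byte b : manual) {
                hex.append(Integer.toHexString(0xff & b)).append(' ');
            }
            System.out.println("U+" + Integer.toHexString(cp) + " -> " + hex.toString().trim()
                + " (same as getBytes: " + Arrays.equals(manual, library) + ")");
        }
    }
}
```

Both characters are above 0x800, so each lands in the three-byte branch, which accounts for the six bytes.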

Signature
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
bugbear - 30 Aug 2007 10:38 GMT
> On Wed, 29 Aug 2007 16:22:45 GMT, "Crouchez"
> <blah@bllllllahblllbllahblahblahhh.com> wrote, quoted or indirectly
[quoted text clipped - 5 lines]
> I suspect your parents punished you for curiosity as a toddler.
> EXPERIMENT!
Or read the manual;
http://unicode.org/
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
I'd always prefer a clear definitive spec
to the results of experiment.
Reverse engineering complex systems
can be time consuming and error prone.
BugBear
Crouchez - 30 Aug 2007 18:12 GMT
>> On Wed, 29 Aug 2007 16:22:45 GMT, "Crouchez"
>> <blah@bllllllahblllbllahblahblahhh.com> wrote, quoted or indirectly
[quoted text clipped - 18 lines]
>
> BugBear
I prefer the experiments personally - those technical manuals are usually
way too wordy.
bugbear - 31 Aug 2007 11:10 GMT
>>>> bytes per character?
>>> I suspect your parents punished you for curiosity as a toddler.
[quoted text clipped - 14 lines]
> I prefer the experiments personally - those technical manuals are usually
> way to wordy
Them's the breaks.
Unless you're patient enough to experiment diligently,
there will probably be cases you haven't considered.
I'm not sure how long you'd have had to experiment
to discover that getBytes() without an argument is locale dependent.
BugBear
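A small sketch of that point about getBytes, assuming nothing beyond the standard library. The class name DefaultCharsetSketch and the helper length are made up for illustration. The no-argument getBytes() uses the platform default charset, so its result can differ between machines, while an explicit charset always gives the same answer.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class DefaultCharsetSketch {
    /** Number of bytes the string occupies in the given charset. */
    static int length(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u5c0f";

        // Depends on the machine and JVM settings, so no fixed output is claimed here.
        System.out.println("default charset:   " + Charset.defaultCharset());
        System.out.println("getBytes() length: " + chinese.getBytes().length);

        // Explicit charsets give the same result everywhere.
        System.out.println("UTF-8 length:      " + length(chinese, StandardCharsets.UTF_8));

        // ISO-8859-1 cannot represent these characters; getBytes substitutes
        // one '?' byte per character instead of throwing.
        System.out.println("ISO-8859-1 length: " + length(chinese, StandardCharsets.ISO_8859_1));
    }
}
```

On a Cp1252 machine like the one in the earlier output, the no-argument call would quietly turn both characters into question marks, which is exactly the sort of case an experiment on a single machine can miss.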
Crouchez - 30 Aug 2007 18:13 GMT
> On Wed, 29 Aug 2007 16:22:45 GMT, "Crouchez"
> <blah@bllllllahblllbllahblahblahhh.com> wrote, quoted or indirectly
[quoted text clipped - 41 lines]
> }
> }
Thanks Roedy, nice site there - often comes in useful for all types of java
stuff
> cheers.
>
[quoted text clipped - 4 lines]
> b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
> per character?
not always.
Steve
Crouchez - 30 Aug 2007 18:15 GMT
>> cheers.
>>
[quoted text clipped - 8 lines]
>
> Steve
When is it not?
Thomas Fritsch - 30 Aug 2007 18:58 GMT
>>> cheers.
>>>
[quoted text clipped - 10 lines]
>
> When is it not?
You can find out yourself, either by experimenting
System.out.println("\u0000".getBytes("UTF-8"));
System.out.println("\u007F".getBytes("UTF-8"));
System.out.println("\u0080".getBytes("UTF-8"));
System.out.println("\u07FF".getBytes("UTF-8"));
System.out.println("\u0800".getBytes("UTF-8"));
System.out.println("\uFFFF".getBytes("UTF-8"));
or more easily by reading the UTF-8 documentation
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

Signature
Thomas
Thomas Fritsch - 30 Aug 2007 19:14 GMT
> You can find out yourself, either by experimenting
> System.out.println("\u0000".getBytes("UTF-8");
[quoted text clipped - 3 lines]
> System.out.println("\u0800".getBytes("UTF-8");
> System.out.println("\uFFFF".getBytes("UTF-8");
Oops, I meant
System.out.println("\u0000".getBytes("UTF-8").length);
....
> or more easily by reading the UTF-8 documentation
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

Signature
Thomas
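The boundary experiment above, written out as a complete program in case anyone wants to paste and run it. This is only a sketch: the class name Utf8Boundaries and the helper utf8Length are made up, and the extra U+20000 case (a character outside the 16-bit range, stored in a Java String as a surrogate pair) is an addition that was not in the original list.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class Utf8Boundaries {
    /** Number of bytes the string occupies in UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The first and last code point of each UTF-8 length class up to 16 bits.
        String[] samples = { "\u0000", "\u007F", "\u0080", "\u07FF", "\u0800", "\uFFFF" };
        for (String s : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", (int) s.charAt(0), utf8Length(s));
        }

        // Added case: a supplementary character, built from its code point.
        String supplementary = new String(Character.toChars(0x20000));
        System.out.printf("U+%X -> %d byte(s)%n",
            supplementary.codePointAt(0), utf8Length(supplementary));

        // prints
        // U+0000 -> 1 byte(s)
        // U+007F -> 1 byte(s)
        // U+0080 -> 2 byte(s)
        // U+07FF -> 2 byte(s)
        // U+0800 -> 3 byte(s)
        // U+FFFF -> 3 byte(s)
        // U+20000 -> 4 byte(s)
    }
}
```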
Lew - 30 Aug 2007 21:31 GMT
>>> byte[] b = chinese.getBytes( "UTF-8" );
>>>
>>> b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
>>> per character?
"steve" wrote:
>> not always.
> When is it not?
When the encoding in use specifies otherwise.

Signature
Lew
> cheers.
>
[quoted text clipped - 4 lines]
> b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
> per character?
So Chinese characters take up 3 bytes with UTF-8 and 2 with 'native
encodings'? Imagine the extra bandwidth for a Chinese server if it uses
UTF-8: 50% more!
John W. Kennedy - 31 Aug 2007 01:19 GMT
>> cheers.
>>
[quoted text clipped - 8 lines]
> encodings'?? Imagine the extra bandwidth for a chinese server if it uses
> UTF-8! +0.5!
Which is why sensible people do not use UTF-8 for Chinese. UTF-8 is
designed to be efficient for text that is mostly, but not entirely,
ASCII. That does not describe Chinese. Use UTF-16.
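The size difference is easy to measure. A minimal sketch follows; the class name EncodingSizeSketch and the helper size are made up for illustration. It uses UTF_16BE for the per-character count because the plain UTF_16 charset prepends a two-byte byte-order mark when encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class EncodingSizeSketch {
    /** Number of bytes the string occupies in the given charset. */
    static int size(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u5c0f"; // two Chinese characters
        String ascii = "hello";          // five ASCII characters

        System.out.println("Chinese, UTF-8:    " + size(chinese, StandardCharsets.UTF_8));    // 6
        System.out.println("Chinese, UTF-16BE: " + size(chinese, StandardCharsets.UTF_16BE)); // 4
        System.out.println("ASCII,   UTF-8:    " + size(ascii, StandardCharsets.UTF_8));      // 5
        System.out.println("ASCII,   UTF-16BE: " + size(ascii, StandardCharsets.UTF_16BE));   // 10

        // Plain UTF_16 adds a 2-byte byte-order mark in front of the text.
        System.out.println("Chinese, UTF-16:   " + size(chinese, StandardCharsets.UTF_16));   // 6
    }
}
```

So the saving only shows up for the Chinese text itself; any ASCII mixed in doubles in size under UTF-16.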

Signature
John W. Kennedy
"The pathetic hope that the White House will turn a Caligula into a
Marcus Aurelius is as naïve as the fear that ultimate power inevitably
corrupts."
-- James D. Barber (1930-2004)