Java Forum / General / August 2007

Unicode chinese

Crouchez - 29 Aug 2007 04:47 GMT
String chinese = "\u4e2d\u5c0f";
System.out.println(chinese.getBytes().length);

Why does this return 2?
Knute Johnson - 29 Aug 2007 04:52 GMT
> String chinese = "\u4e2d\u5c0f";
> System.out.println(chinese.getBytes().length);
>
> Why does this return 2?

The font on the console may not be able to draw it.  Try it with an
appropriate font in a JComponent of some variety.

Signature

Knute Johnson
email s/nospam/knute/

sadiruddin@gmail.com - 29 Aug 2007 08:31 GMT
It returns 6 for me.
bugbear - 29 Aug 2007 10:15 GMT
> String chinese = "\u4e2d\u5c0f";
> System.out.println(chinese.getBytes().length);
>
> Why does this return 2?

http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#getBytes()

"The behavior of this method when this string cannot be encoded in the default charset is unspecified."

  BugBear
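
Since the default-charset behaviour quoted above is unspecified, one way to avoid silent '?' substitution is to encode through a CharsetEncoder that reports errors. This is only a sketch, assuming Java 7 or later; the class name StrictEncode and the helper encodeStrict are made up for illustration, and ISO-8859-1 merely stands in for a Western default charset such as Cp1252.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictEncode {
    /** Encodes text, throwing instead of silently substituting '?'. */
    static byte[] encodeStrict(String text, Charset charset)
            throws CharacterCodingException {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(text));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u5c0f";
        // Ask up front whether the charset can represent the string at all.
        System.out.println(
                StandardCharsets.ISO_8859_1.newEncoder().canEncode(chinese)); // false
        try {
            encodeStrict(chinese, StandardCharsets.ISO_8859_1);
        } catch (CharacterCodingException e) {
            // Reached because neither character has an ISO-8859-1 mapping.
            System.out.println("cannot encode: " + e);
        }
    }
}
```

Whether to fail loudly like this or to accept the substitution depends on the application; the point is only that the choice can be made explicit.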
Andreas Leitgeb - 29 Aug 2007 11:37 GMT
>> String chinese = "\u4e2d\u5c0f";
>> System.out.println(chinese.getBytes().length);
>> Why does this return 2?
> http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html#getBytes()
> "The behavior of this method when this string cannot be encoded in the default charset is unspecified."

While it's not specified, and could theoretically change over time,
the current implementation seems to encode your string as two
questionmarks, which account for length==2.

The other poster, who got "6", most likely has UTF-8 (or a
UTF-8-based encoding) as the system encoding.

On Unix-systems, the system-encoding generally depends on the
environment variable LANG (and possibly overridden by certain
LC_... variables whose names I never remember).
For Windows, I don't know.
Thomas Fritsch - 29 Aug 2007 10:42 GMT
> String chinese = "\u4e2d\u5c0f";
> System.out.println(chinese.getBytes().length);
>
> Why does this return 2?

String.getBytes() uses the platform's default charset. See
<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/String.html#getBytes()>

If the platform's default charset is "Cp1252" (as on my system and maybe
on Crouchez's), then chinese.getBytes() returns 2 bytes. By the way:
the 2 bytes are {63,63}, which is just {'?','?'}, because Cp1252 has no
mapping for these characters and the encoder substitutes '?'.

If the platform's default charset is "UTF-8" (like probably on
sadiruddin's system), then chinese.getBytes() returns 6 bytes.
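
The two cases can be reproduced on any machine by naming the charset explicitly instead of relying on the platform default. A minimal sketch, assuming Java 7 or later; ISO-8859-1 is used here as a stand-in for Cp1252 because it is guaranteed to be present on every JVM, and the class name ExplicitCharset is made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ExplicitCharset {
    /** Encodes with the given charset; unmappable characters become '?'. */
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u5c0f";
        // What getBytes() with no argument would use on this machine.
        System.out.println("default charset: " + Charset.defaultCharset());
        // A Western single-byte charset: each character is replaced by '?' (63).
        System.out.println(
                Arrays.toString(encode(chinese, StandardCharsets.ISO_8859_1))); // [63, 63]
        // UTF-8: three bytes for each of these two characters.
        System.out.println(encode(chinese, StandardCharsets.UTF_8).length); // 6
    }
}
```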

Signature

Thomas

Roedy Green - 29 Aug 2007 12:45 GMT
On Wed, 29 Aug 2007 03:47:16 GMT, "Crouchez"
<blah@bllllllahblllbllahblahblahhh.com> wrote, quoted or indirectly
quoted someone who said :

>String chinese = "\u4e2d\u5c0f";
>System.out.println(chinese.getBytes().length);
>
>Why does this return 2?

I modified your code a little, so it will make the problem clear:

public class Chinese
  {
  /**
   * test harness
   *
   * @param args not used
   */
  public static void main ( String[] args )
     {
     System.out.println( System.getProperty( "file.encoding" ));
     String chinese = "\u4e2d\u5c0f";
     byte[] b = chinese.getBytes();
     for ( int i=0; i<b.length; i++ )
        {
        System.out.println( b[i]);
        }
     // prints
     // Cp1252
     // 63
     // 63
     // in other words ??.  Those two chars are not available in your
     // default encoding.
     }
  }

I further modified your code to choose the encoding explicitly:

import java.io.UnsupportedEncodingException;
public class Chinese
  {
  /**
   * test harness
   *
   * @param args not used
   */
  public static void main ( String[] args ) throws
UnsupportedEncodingException
  {
     System.out.println( System.getProperty( "file.encoding" ));
     String chinese = "\u4e2d\u5c0f";
     // explicit choice of encoding, designed to support Chinese.
     byte[] b = chinese.getBytes( "Big5-HKSCS" );
     for ( int i=0; i<b.length; i++ )
        {
        System.out.println( 0xff & b[i]);
        }
     // prints
     // Cp1252
     // 164
     // 164
     // 164
     // 112  more like you would expect.
  }
  }

Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Crouchez - 29 Aug 2007 17:50 GMT
> On Wed, 29 Aug 2007 03:47:16 GMT, "Crouchez"
> <blah@bllllllahblllbllahblahblahhh.com> wrote, quoted or indirectly
[quoted text clipped - 61 lines]
>   }
>   }

Why have you done an AND on this?
System.out.println( 0xff & b[i]);
Roedy Green - 30 Aug 2007 04:17 GMT
On Wed, 29 Aug 2007 16:50:41 GMT, "Crouchez"
<blah@bllllllahblllbllahblahblahhh.com> wrote, quoted or indirectly
quoted someone who said :

>Why have you done an AND on this?
>System.out.println( 0xff & b[i]);

see http://mindprod.com/jgloss/unsigned.html
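
In short, Java's byte is signed (-128..127), so a byte such as 0xa4 prints as a negative number unless the upper bits are masked off after widening to int. A small sketch of the idea; the class name UnsignedByte and the helper toUnsigned are made up for illustration, and Byte.toUnsignedInt assumes Java 8 or later.

```java
class UnsignedByte {
    /** Widens a signed byte to int and keeps only the low 8 bits. */
    static int toUnsigned(byte b) {
        return b & 0xff;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xa4; // the first Big5 byte printed in the example above
        System.out.println(b);                     // -92, the signed view
        System.out.println(toUnsigned(b));         // 164, the unsigned view
        System.out.println(Byte.toUnsignedInt(b)); // 164, library equivalent
    }
}
```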
Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Crouchez - 30 Aug 2007 18:21 GMT
> On Wed, 29 Aug 2007 16:50:41 GMT, "Crouchez"
> <blah@bllllllahblllbllahblahblahhh.com> wrote, quoted or indirectly
[quoted text clipped - 4 lines]
>
> see http://mindprod.com/jgloss/unsigned.html

A lot of that baffles me. I remember doing floating point and binary
stuff on paper years ago and never used it for real. What's the main use for
bitwise operations and bit shifting?
Roedy Green - 31 Aug 2007 05:11 GMT
On Thu, 30 Aug 2007 17:21:07 GMT, "Crouchez"
<blah@bllllllahblllbllahblahblahhh.com> wrote, quoted or indirectly
quoted someone who said :

>Whats the main use for
>bitwise and bit shifting?

see http://mindprod.com/jgloss/binary.html
http://mindprod.com/jgloss/xor.html
http://mindprod.com/jgloss/masking.html

Consider the Font.BOLD|Font.ITALIC. It lets you combine binary
attributes.
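
To illustrate the flag idea: each attribute gets its own bit, OR combines them, AND tests them. A sketch only; the constants below are hypothetical stand-ins defined locally so the example runs without AWT (they are meant to mirror Font.BOLD and Font.ITALIC, not to replace them), and the class name StyleFlags is made up.

```java
class StyleFlags {
    // Hypothetical local constants, one bit per attribute.
    static final int PLAIN = 0;  // binary 00
    static final int BOLD = 1;   // binary 01
    static final int ITALIC = 2; // binary 10

    /** True if the given flag bit is set in style. */
    static boolean has(int style, int flag) {
        return (style & flag) != 0;
    }

    /** Returns style with the given flag bit cleared. */
    static int without(int style, int flag) {
        return style & ~flag;
    }

    public static void main(String[] args) {
        int style = BOLD | ITALIC;                    // binary 11
        System.out.println(style);                    // 3
        System.out.println(has(style, BOLD));         // true
        System.out.println(has(PLAIN, ITALIC));       // false
        System.out.println(without(style, BOLD));     // 2, only ITALIC left
    }
}
```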
Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

bugbear - 31 Aug 2007 10:42 GMT
> On Thu, 30 Aug 2007 17:21:07 GMT, "Crouchez"
> <blah@bllllllahblllbllahblahblahhh.com> wrote, quoted or indirectly
[quoted text clipped - 9 lines]
> Consider the Font.BOLD|Font.ITALIC. It lets you combine binary
> attributes.

Quite handy when image processing bitonal images too.

  BugBear
Crouchez - 29 Aug 2007 17:22 GMT
cheers.

If I do

byte[] b = chinese.getBytes( "UTF-8" );

b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
per character?
Roedy Green - 30 Aug 2007 04:25 GMT
On Wed, 29 Aug 2007 16:22:45 GMT, "Crouchez"
<blah@bllllllahblllbllahblahblahhh.com> wrote, quoted or indirectly
quoted someone who said :

>b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
>per character?

I suspect your parents punished you for curiosity as a toddler.
EXPERIMENT!

import java.io.UnsupportedEncodingException;
public class Chinese
  {
  /**
   * test harness
   *
   * @param args not used
   */
  public static void main ( String[] args ) throws
UnsupportedEncodingException
  {
     System.out.println( System.getProperty( "file.encoding" ));
     String chinese = "\u4e2d\u5c0f";
     // explicit choice of encoding, UTF-8 supports everything
     // including Chinese.
     byte[] b = chinese.getBytes( "UTF-8" );
     for ( int i=0; i<b.length; i++ )
        {
        System.out.println( Integer.toHexString( 0xff & b[i] ));
        }
     // prints
     // Cp1252
     // e4
     // b8
     // ad
     // e5
     // b0
     // 8f

     // why those bytes?
     // BOM is ef bb bf, so that is not it.
     // see http://mindprod.com/jgloss/utf.html#UTF8ENCODER
     // codes from 0x800 to 0xffff take 3 bytes to encode.
  }
  }
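
For the curious, the three bytes can be derived by hand with shifts and masks, which also answers what bit shifting is good for. A sketch assuming Java 7 or later; the class name Utf8ThreeByte and its helpers are made up for illustration, and it deliberately handles only the three-byte range (0x800 to 0xffff, surrogates excluded).

```java
class Utf8ThreeByte {
    /** Hand-encodes one char in 0x800..0xffff as three UTF-8 bytes. */
    static byte[] encode(char c) {
        if (c < 0x800 || Character.isSurrogate(c)) {
            throw new IllegalArgumentException("not in the three-byte range");
        }
        return new byte[] {
            (byte) (0xe0 | (c >> 12)),         // 1110xxxx: top 4 bits
            (byte) (0x80 | ((c >> 6) & 0x3f)), // 10xxxxxx: middle 6 bits
            (byte) (0x80 | (c & 0x3f))         // 10xxxxxx: low 6 bits
        };
    }

    /** Formats bytes as space-separated lowercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode('\u4e2d'))); // e4 b8 ad
        System.out.println(hex(encode('\u5c0f'))); // e5 b0 8f
    }
}
```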
Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

bugbear - 30 Aug 2007 10:38 GMT
> On Wed, 29 Aug 2007 16:22:45 GMT, "Crouchez"
> <blah@bllllllahblllbllahblahblahhh.com> wrote, quoted or indirectly
[quoted text clipped - 5 lines]
> I suspect your parents punished you for curiosity as a toddler.
> EXPERIMENT!

Or read the manual;

http://unicode.org/
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

I'd always prefer a clear definitive spec
to the results of experiment.

Reverse engineering complex systems
can be time consuming and error prone.

  BugBear
Crouchez - 30 Aug 2007 18:12 GMT
>> On Wed, 29 Aug 2007 16:22:45 GMT, "Crouchez"
>> <blah@bllllllahblllbllahblahblahhh.com> wrote, quoted or indirectly
[quoted text clipped - 18 lines]
>
>   BugBear

I prefer the experiments personally - those technical manuals are usually
way too wordy
bugbear - 31 Aug 2007 11:10 GMT
>>>> bytes per character?
>>> I suspect your parents punished you for curiosity as a toddler.
[quoted text clipped - 14 lines]
> I prefer the experiments personally - those technical manuals are usually
> way to wordy

Them's the breaks.

Unless you're patient enough to experiment diligently,
there will probably be cases you haven't considered.

I'm not sure how long you'd have had to experiment
to discover that getBytes is locale dependent.

    BugBear
Crouchez - 30 Aug 2007 18:13 GMT
> On Wed, 29 Aug 2007 16:22:45 GMT, "Crouchez"
> <blah@bllllllahblllbllahblahblahhh.com> wrote, quoted or indirectly
[quoted text clipped - 41 lines]
>   }
>   }

Thanks Roedy, nice site there - often comes in useful for all types of java
stuff
steve - 30 Aug 2007 13:41 GMT
> cheers.
>
[quoted text clipped - 4 lines]
> b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
> per character?

not always.

Steve
Crouchez - 30 Aug 2007 18:15 GMT
>> cheers.
>>
[quoted text clipped - 8 lines]
>
> Steve

When is it not?
Thomas Fritsch - 30 Aug 2007 18:58 GMT
>>> cheers.
>>>
[quoted text clipped - 10 lines]
>
> When is it not?
You can find out yourself, either by experimenting
  System.out.println("\u0000".getBytes("UTF-8"));
  System.out.println("\u007F".getBytes("UTF-8"));
  System.out.println("\u0080".getBytes("UTF-8"));
  System.out.println("\u07FF".getBytes("UTF-8"));
  System.out.println("\u0800".getBytes("UTF-8"));
  System.out.println("\uFFFF".getBytes("UTF-8"));
or more easily by reading the UTF-8 documentation
  http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

Signature

Thomas


Thomas Fritsch - 30 Aug 2007 19:14 GMT
> You can find out yourself, either by experimenting
>    System.out.println("\u0000".getBytes("UTF-8"));
[quoted text clipped - 3 lines]
>    System.out.println("\u0800".getBytes("UTF-8"));
>    System.out.println("\uFFFF".getBytes("UTF-8"));
Oops, I meant
    System.out.println("\u0000".getBytes("UTF-8").length);
    ....
> or more easily by reading the UTF-8 documentation
>    http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
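
Put together as a runnable program, the experiment might look like the sketch below. It assumes Java 7 or later for StandardCharsets (which also avoids the checked UnsupportedEncodingException); the class name Utf8Lengths and the helper utf8Length are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The boundary values from the post above, one char each.
        String[] samples = {
            "\u0000", "\u007F", "\u0080", "\u07FF", "\u0800", "\uFFFF"
        };
        for (String s : samples) {
            System.out.println(
                    String.format("U+%04X -> %d", (int) s.charAt(0), utf8Length(s)));
        }
        // prints
        // U+0000 -> 1
        // U+007F -> 1
        // U+0080 -> 2
        // U+07FF -> 2
        // U+0800 -> 3
        // U+FFFF -> 3
    }
}
```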

Signature

Thomas

Lew - 30 Aug 2007 21:31 GMT
>>> byte[] b = chinese.getBytes( "UTF-8" );
>>>
>>> b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
>>> per character?

"steve" wrote:
>> not always.

> When is it not?

When the encoding in use specifies otherwise.

Signature

Lew

Crouchez - 30 Aug 2007 18:11 GMT
> cheers.
>
[quoted text clipped - 4 lines]
> b.length = 6. But why 6 when I thought chinese characters take up 2 bytes
> per character?

So Chinese characters take up 3 bytes with UTF-8 and 2 with 'native
encodings'?? Imagine the extra bandwidth for a Chinese server if it uses
UTF-8! 50% more!
John W. Kennedy - 31 Aug 2007 01:19 GMT
>> cheers.
>>
[quoted text clipped - 8 lines]
> encodings'?? Imagine the extra bandwidth for a chinese server if it uses
> UTF-8! +0.5!

Which is why sensible people do not use UTF-8 for Chinese. UTF-8 is
designed to be efficient for text that is mostly ASCII, but sometimes
not. That does not describe Chinese. Use UTF-16.
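
The size trade-off can be measured rather than argued. A sketch assuming Java 7 or later; the class name EncodingSizes and the helper size are made up for illustration. Note that Java's plain "UTF-16" charset writes a 2-byte byte-order mark when encoding, so UTF-16BE is used for the like-for-like comparison.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    /** Encoded size of text in bytes for the given charset. */
    static int size(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u5c0f";
        String ascii = "hello";
        System.out.println(size(chinese, StandardCharsets.UTF_8));    // 6
        System.out.println(size(chinese, StandardCharsets.UTF_16BE)); // 4
        System.out.println(size(chinese, StandardCharsets.UTF_16));   // 6, includes BOM
        System.out.println(size(ascii, StandardCharsets.UTF_8));      // 5
        System.out.println(size(ascii, StandardCharsets.UTF_16BE));   // 10
    }
}
```

Real pages mix Chinese text with ASCII markup, so which encoding comes out smaller depends on the actual content; the numbers above cover only the two pure cases.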

Signature

John W. Kennedy
"The pathetic hope that the White House will turn a Caligula into a
Marcus Aurelius is as naïve as the fear that ultimate power inevitably
corrupts."
  -- James D. Barber (1930-2004)


