
Java Forum / General / March 2006


What is the best charset to choose for binary serialization

mtp - 27 Mar 2006 12:03 GMT
Hello,

I need to binary-serialize some strings in a Java application. Since
there is no restriction at all on the strings, I need to handle all the
characters that java.lang.String handles.

What is the "inner" charset of the String class? Since Java must store
characters in memory, it must use some kind of internal charset. If I
use the same one, I won't have any trouble, I believe... am I right?

So what is the best charset?

Thanks
tom fredriksen - 27 Mar 2006 14:37 GMT
> What is the "innner" charset of String class? Since Java must store
> characters in memory, it must use some kind of internal charset. If i
> use the same, i won't have any trouble, i believe... am i right?

Read the api doc, the answer is there in plain sight.

/tom
lewmania942@yahoo.fr - 27 Mar 2006 18:06 GMT
Hi tom,

> > What is the "innner" charset of String class? Since Java must store
> > characters in memory, it must use some kind of internal charset. If i
> > use the same, i won't have any trouble, i believe... am i right?
>
> Read the api doc, the answer is there in plain sight.

If I check the Java 1.5 String API doc I do indeed see that UTF-16
is used.

What if the OP is using Java 1.4 ?  (many in the real world are still
stuck with pre-1.5 Java)   It certainly isn't "in plain sight" as it
is in 1.5.

What "answer" should he find? UTF-16? I'm 100% sure several
JVMs have used UCS-2 internally in the past.  And UCS-2 is *not*
identical to UTF-16 (even if they're very similar).

AFAIK Java 1.4 only supports "Unicode 3.0 code units", not all
"Unicode 3.1+ code points". So a 1.4 JVM may very well use
the UCS-2 encoding internally and still be compliant with the
1.4 specs. This is *not* the case for a 1.5 JVM: the (older) UCS-2
encoding isn't sufficient.

In the part you quoted, I see two questions.  How does your
post explain whether the OP will have a problem using that
same encoding? (And what would that "same" encoding be:
UTF-16? UCS-2?)

I find the OP's post to be a legitimate question that deserves
more than a "RTFM".  I may have made mistakes in my
explanation, but at least I tried to help him.

And Chris Smith gave a very nice and gentle explanation,
proposing, among other things, to use UTF-8 (like I did), and
even explaining UTF-8 gotchas (which I wasn't aware of).

Now that may be just me, but I find Chris Smith's answer
to be gentle and insightful, not yours...

Moreover, not so long ago on this group (thanks Google),
you insisted that ASCII was an 8 bit encoding... So if I were
the OP I'd take any advice coming from you regarding
character sets, encodings, etc. with a huge grain of salt, for
I wouldn't think you'd be the definitive authority on the
subject.

Good day to you, and sorry if I sound condescending (but note
that I did find your answer to the OP condescending and
that certainly influenced the tone of my reply here)
Roedy Green - 27 Mar 2006 19:53 GMT
On 27 Mar 2006 09:06:45 -0800, "lewmania942@yahoo.fr"
<lewmania942@yahoo.fr> wrote, quoted or indirectly quoted someone who
said :

>If I check the Java 1.5 String API doc I do indeed see that UTF-16
>is used.
>
>What if the OP is using Java 1.4 ?

Then there is no 32-bit support. Strings are composed of 16-bit
Unicode chars. The high/low surrogates are just treated as ordinary characters.

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Chris Smith - 27 Mar 2006 16:43 GMT
> i need to binary serialize some strings in a Java application. Since
> there is no restriction at all on the strings, i need to handle all the
[quoted text clipped - 3 lines]
> characters in memory, it must use some kind of internal charset. If i
> use the same, i won't have any trouble, i believe... am i right?

There are actually a few character sets that meet your requirements.  
They include UTF-16BE, UTF-16LE, and UTF-8.  The difference between the
first two (which differ only in endianness) and the last is that UTF-8
is optimized to reduce the file size of files that contain mostly ASCII
characters, while the UTF-16 encodings will be smaller when the file
contains random characters chosen from throughout the entire Unicode
character set, or if it contains mostly characters not in the ISO Latin
1 (ISO8859-1) range, which is a superset of ASCII.  It's worth noting
that Java's UTF-8 is *not* the same as the UTF-8 used throughout the
remainder of the computing world, so you shouldn't assume compatibility
with UTF-8 character decoders written in other languages.

Internally, Java Strings are stored logically in UTF-16.  The endianness
is unspecified, because the String class will use Java primitive data
types, whose endianness is never observable by a Java application.
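
As a rough illustration of the endianness point, the sketch below encodes the single character "A" with the three UTF-16 charsets that Java provides. It assumes a Java 7 or later runtime for `StandardCharsets`; on older runtimes the same charsets can be requested by name, e.g. `getBytes("UTF-16BE")`. The class name and the `hex` helper are made up for this example.

```java
import java.nio.charset.StandardCharsets;

final class Utf16Endianness {
    // Formats bytes as space-separated upper-case hex, e.g. "00 41".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A"; // U+0041

        // Big-endian, no byte-order mark.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 41
        // Little-endian, no byte-order mark.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 41 00
        // Plain "UTF-16": Java writes a big-endian byte-order mark first.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
    }
}
```

For a private binary format, picking UTF-16BE or UTF-16LE explicitly avoids spending two bytes on a byte-order mark for every string.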


www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation

Mark Thornton - 27 Mar 2006 16:59 GMT
> 1 (ISO8859-1) range, which is a superset of ASCII.  It's worth noting
> that Java's UTF-8 is *not* the same as the UTF-8 used throughout the
> remainder of the computing world, so you shouldn't assume compatibility
> with UTF-8 character decoders written in other languages.

Doesn't the "modified UTF-8" only apply to DataOutputStream,
DataInputStream and related classes plus some JNI related stuff. The
encoding used by java.nio.charset classes should be the true UTF-8.
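
A small sketch of that difference, for anyone who wants to check it: it compares the bytes written by `DataOutputStream.writeUTF` (a two-byte length prefix followed by modified UTF-8) with the bytes from the standard UTF-8 charset. It assumes Java 7 or later for `StandardCharsets` and try-with-resources; the class and method names are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class ModifiedUtf8Demo {
    // Bytes produced by writeUTF: 2-byte length prefix + modified UTF-8 payload.
    static byte[] writeUtf(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        return buf.toByteArray();
    }

    // Bytes produced by the standard UTF-8 charset.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String nul = "\u0000";        // U+0000
        String clef = "\uD834\uDD1E"; // U+1D11E, one supplementary code point

        System.out.println(standardUtf8(nul).length);  // 1 (a single 00 byte)
        System.out.println(writeUtf(nul).length);      // 4 (2 prefix + C0 80)
        System.out.println(standardUtf8(clef).length); // 4 (one 4-byte sequence)
        System.out.println(writeUtf(clef).length);     // 8 (2 prefix + 3 bytes per surrogate)
    }
}
```

The two forms only diverge for U+0000 and for characters outside the Basic Multilingual Plane, which is why the mismatch tends to go unnoticed until such data shows up.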

Mark Thornton
lewmania942@yahoo.fr - 27 Mar 2006 17:09 GMT
Hi,

short answer: you can use UTF-8 and you shouldn't have
any problem.

Now I'll try to answer your questions ;)

> Hello,
>
> i need to binary serialize some strings in a Java application. Since
> there is no restriction at all on the strings, i need to handle all the
> characters that java.lang.String handles.

The characters handled by java.lang.String depend on the version
of Java you're using...  Up to Java 1.4 you'll "only" be able to
correctly handle Unicode 3.0 code points.

From Java 1.5, you can handle "all" the Unicode code points (and
the String class got new methods to this effect, like
codePointAt(...)).

> What is the "innner" charset of String class?

You shouldn't care.  All you should care about is which encodings are
available when serializing and deserializing your strings.

That said, I'll try to answer your question.

The String class is based on the underlying char primitive which,
unfortunately, is 16 bits wide. Java was designed at a time when
Unicode didn't have more than 65536 code points defined yet... And
at that time a Java char was equivalent to a "Unicode code unit"
(check the Character class's API doc for the terminology).

This has very funny implications, like:

"some Unicode 3.1 and above string".length()

not returning the length in "Unicode code points" but in Java chars.
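
To make that concrete, here is a minimal sketch using the Java 1.5 code point methods. The sample string and the class and method names are only illustrative.

```java
final class LengthVsCodePoints {
    // Number of 16-bit Java chars (UTF-16 code units).
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a' followed by U+1D11E, which needs a surrogate pair in UTF-16.
        String s = "a\uD834\uDD1E";

        System.out.println(charCount(s));      // 3
        System.out.println(codePointCount(s)); // 2
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1d11e
    }
}
```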

> Since Java must store  characters in memory, it must use
> some kind of internal charset.

Before Java 1.5 it was known that the internal representation for
several JVMs was UCS-2 (UTF-16 without surrogates).  But AFAIK
this was not specified by the spec (now I may be wrong).

I've read in this group, years ago, that people have used this fact
to do very fast DB to/from JVM string exchanges (eg by configuring
the DB to use UCS-2).

In Java 1.5 both String and Character's API docs mention that
UTF-16 is used (with surrogates support).

> So what is the best charset?

There's not really an answer to that. UTF-8 is pretty common and is
mandated by the spec to be present in every J(2)SE JVM (you'll
still have to catch an exception that, by the spec, can never be
thrown when calling getBytes("UTF-8")).

So usually it's a safe bet to go with UTF-8 encoding.
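
A minimal sketch of such a round trip, including the "impossible" checked exception. The class and method names are invented for the example, and wrapping the exception in IllegalStateException is just one possible choice.

```java
import java.io.UnsupportedEncodingException;

final class Utf8RoundTrip {
    // Encodes a string as UTF-8 bytes.
    static byte[] encode(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should never happen.
            throw new IllegalStateException("UTF-8 is not available", e);
        }
    }

    // Decodes UTF-8 bytes back into a string.
    static String decode(byte[] bytes) {
        try {
            return new String(bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is not available", e);
        }
    }

    public static void main(String[] args) {
        // Latin-1, Japanese and one supplementary character (U+1D11E).
        String original = "na\u00EFve \u65E5\u672C\u8A9E \uD834\uDD1E";
        String restored = decode(encode(original));
        System.out.println(original.equals(restored)); // true
    }
}
```

On Java 7 and later, `getBytes(StandardCharsets.UTF_8)` avoids the checked exception altogether. One caveat: a String holding an unpaired surrogate is not valid Unicode text, and `getBytes` substitutes a replacement for it, so such a string will not survive the round trip.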
Roedy Green - 27 Mar 2006 18:21 GMT
>What is the "innner" charset of String class? Since Java must store
>characters in memory, it must use some kind of internal charset. If i
>use the same, i won't have any trouble, i believe... am i right?

UTF-16.  see http://mindprod.com/jgloss/utf.html

However, there is no way for you to get at that char array directly.
You can of course use the Java's serialisation which will use writeUTF
which uses a bastardised UTF-8.

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

mtp - 28 Mar 2006 10:40 GMT
>>What is the "innner" charset of String class? Since Java must store
>>characters in memory, it must use some kind of internal charset. If i
[quoted text clipped - 5 lines]
> You can of course use the Java's serialisation which will use writeUTF
> which uses a bastardised UTF-8.

Thanks to all for this valuable information. I will use UTF-8 since our
company does not sell a lot in Japan right now ;)
Alex Hunsley - 28 Mar 2006 15:13 GMT
>>> What is the "innner" charset of String class? Since Java must store
>>> characters in memory, it must use some kind of internal charset. If i
[quoted text clipped - 8 lines]
> Thx to all for these valuable informations. I will use UTF-8 since our
> compagny do not sell a lot in Japan right now ;)

Is there really any cost to just doing it correctly now and using UTF-16?
Might save a headache later. Or maybe not, who knows? :]
Oliver Wong - 28 Mar 2006 16:12 GMT
>> Thx to all for these valuable informations. I will use UTF-8 since our
>> compagny do not sell a lot in Japan right now ;)
>
> Is it really any cost just to do it correctly now and use UTF-16? Might
> save a headache later. Or maybe not, who knows? :]

   Yes, there is a cost. If you use only ASCII characters in your document,
then UTF-8 will use 1 byte per character. UTF-16 will use 2 bytes per
character.

   If you mainly use Asian characters (for example), UTF-8 will use 3 bytes
per character, UTF-16 will use 2 bytes per character.

   So the choice between UTF-8 and UTF-16 depends on what you expect to
appear in your documents.
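
The size trade-off is easy to measure. The sketch below counts encoded bytes for an ASCII string and a short Japanese string; it assumes Java 7 or later for `StandardCharsets`, uses UTF-16BE so that no byte-order mark is counted, and the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;

final class EncodedSizes {
    // Size in bytes when encoded as UTF-8.
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Size in bytes when encoded as UTF-16 (big-endian, no byte-order mark).
    static int utf16Size(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";                 // 5 ASCII characters
        String japanese = "\u65E5\u672C\u8A9E"; // 3 kanji ("Japanese language")

        System.out.println(utf8Size(ascii));     // 5
        System.out.println(utf16Size(ascii));    // 10
        System.out.println(utf8Size(japanese));  // 9
        System.out.println(utf16Size(japanese)); // 6
    }
}
```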

   - Oliver
opalpa@gmail.com opalinski from opalpaweb - 28 Mar 2006 16:18 GMT
UTF-8 works well for Japanese too...

Opalinski
opalpa@gmail.com
http://www.geocities.com/opalpaweb/
Oliver Wong - 28 Mar 2006 20:08 GMT
<opalpa@gmail.com> wrote in message
news:1143559093.245680.145550@t31g2000cwb.googlegroups.com...
> UTF-8 works well for Japanese too...

   UTF-16 "works better" though, if the metric used is size of bitstream.
Characters with codepoints between \u0800 and \uFFFF take up 3 bytes in
UTF-8, but only 2 bytes in UTF-16. This includes most Asian scripts
(Chinese, Japanese, Korean, Yi, Mongolian, Tibetan, Thai, etc.).
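
The boundaries mentioned above can be checked per code point with a short sketch like the following. It assumes Java 7 or later for `StandardCharsets`; the class and method names are only illustrative.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Boundaries {
    // Number of bytes needed to encode a single code point in UTF-8.
    static int utf8Length(int codePoint) {
        return new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to encode a single code point in UTF-16 (no byte-order mark).
    static int utf16Length(int codePoint) {
        return new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x007F, 0x0080, 0x07FF, 0x0800, 0xFFFF, 0x10000};
        for (int cp : samples) {
            System.out.printf("U+%04X: UTF-8 %d bytes, UTF-16 %d bytes%n",
                    cp, utf8Length(cp), utf16Length(cp));
        }
    }
}
```

Beyond U+FFFF both encodings need 4 bytes per character, so the size advantage of UTF-16 is limited to the U+0800 to U+FFFF range.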

   - Oliver
opalpa@gmail.com opalinski from opalpaweb - 28 Mar 2006 15:41 GMT
UTF-8 is the best charset.  It is IMO the best design.

Opalinski
opalpa@gmail.com
http://www.geocities.com/opalpaweb/

