Java Forum / General / March 2006
Transmitting strings via tcp from a windows c++ client to a Java server
qqq111 - 19 Feb 2006 20:02 GMT
Hi all,
We have a C++ client which runs on Windows and that needs to transmit char* / wchar* strings to and from a Java server.
The client should correctly handle both 'standard' languages & east Asian languages (i.e. using wchar).
Now, I'm sure there is a best practice for doing so , I just haven't found it yet :-)
My best bet would be always encoding the string in UTF-8 before sending it via the net, but I could be wrong.
Your help will be highly appreciated.
Thanks,
Gilad
Roedy Green - 19 Feb 2006 20:28 GMT
> Now, I'm sure there is a best practice for doing so , I just haven't
> found it yet :-)
How about UTF-8 encoding? It handles all the 16 bit chars. It is reasonably efficient for American English, using just 8-bit chars. It does not have an endian ambiguity. HTTP has heard of it and it tends to be an accepted encoding.
You could use a 1-byte length prefix giving the count of either chars or bytes inside. Or you could use a Java-style big-endian length field compatible with DataInputStream.readUTF; see http://mindprod.com/jgloss/utf.html
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
qqq111 - 20 Feb 2006 09:01 GMT
Hi Roedy,
The only problem I have with UTF-8 is that it's poorly supported in Windows. In fact, I did not manage to find a Win C++ API that converts strings to UTF-8.
My other thought was to use UTF-16/UCS-2 format, internally used by both Win (client) and Java (server), but as you have stated, there's the endian issue.
BTW, your site is at a high position at my Java-best list :-)
Best, Gilad
Roedy Green - 24 Feb 2006 09:00 GMT
>The only problem I have with UTF-8 is its poor supported in Windows.
>In fact, I did not manage to find Win C++ api that converts strings to
>UTF-8.
It is not hard. I posted the code for it at http://mindprod.com/jgloss/utf.html
The code is in Java but I think it would likely compile as C with the right typedefs.
qqq111 - 24 Feb 2006 10:26 GMT
Hi Roedy,
> I posted the code for [ UTF-8 enc/dec ]
Apparently Win does have the API for UTF-8/other formats enc/dec.
encoding: WideCharToMultiByte(CP_UTF8... )
decoding: MultiByteToWideChar(CP_UTF8...)
Note that for the conversions to succeed, your C++ app s/b compiled with a _UNICODE flag.
Best, Gilad
Roedy Green - 24 Feb 2006 12:50 GMT
On Fri, 24 Feb 2006 09:00:32 GMT, Roedy Green <my_email_is_posted_on_my_website@munged.invalid> wrote, quoted or indirectly quoted someone who said :
>It is not hard. I posted the code for it at
>http://mindprod.com/jgloss/utf.html
>
>The code is in Java but I think it would likely compile as C with the
>right typedefs.
I have improved the code to provide both encode and decode, plus a test harness you can use to ensure that they both give the same results as the Sun classes.
Chris Uppal - 24 Feb 2006 13:32 GMT
> http://mindprod.com/jgloss/utf.html
Roedy, I don't want to sound too hostile, but that page is full of errors and is /very badly/ misleading.
UTF-8 is a standard. It has /nothing at all/ to do with the format used in JNI, classfiles, and in the ObjectOutputStream.writeUtf8() method. /Nothing/. You should not conflate the two.
UTF-8 does not include a prepended length count.
UTF-8 takes between 1 and 4 bytes (inclusive) to encode a Unicode character. Your encoder does not work properly for either:
* Unicode characters outside the 16-bit range.
* java.lang.Strings containing logical characters outside that range (for which you have to decode the UTF-16 before you can encode again into UTF-8).
The UTF-8 decoder has similar problems, and in addition does not perform the mandatory checks for illegal uses of non-shortest-form encodings (necessary for security).
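The shortest-form check mentioned above can be seen with the JDK's own decoder. The following is only a sketch, assuming a reasonably modern JDK (Java 7 or later for StandardCharsets); the class and method names are made up for illustration. It asks a CharsetDecoder to report malformed input instead of silently replacing it, so an overlong encoding of U+0000 (0xC0 0x80) and UTF-8-encoded surrogates are both rejected.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: checks whether a byte sequence is well-formed UTF-8.
class StrictUtf8 {
    static boolean isWellFormed(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] overlongNul = {(byte) 0xC0, (byte) 0x80};
        byte[] encodedSurrogates = {(byte) 0xED, (byte) 0xA0, (byte) 0x80,
                                    (byte) 0xED, (byte) 0xBC, (byte) 0x82};
        byte[] fourByteForm = {(byte) 0xF0, (byte) 0x90, (byte) 0x8C, (byte) 0x82};
        System.out.println("overlong NUL well-formed?       " + isWellFormed(overlongNul));
        System.out.println("encoded surrogates well-formed? " + isWellFormed(encodedSurrogates));
        System.out.println("4-byte form well-formed?        " + isWellFormed(fourByteForm));
    }
}
```

Note that new String(bytes, StandardCharsets.UTF_8) does not throw on such input; it substitutes replacement characters, so a receiver that wants to reject bad data probably needs the explicit decoder.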
Unicode characters outside the 16-bit range are /not/ represented as surrogate pairs in UTF-8. That /only/ happens in UTF-16.
I strongly recommend that you review that page, and remove all references to Sun's perversion, except a warning that ObjectOutputStream.writeUtf8() does not write valid UTF-8. Move the description of Sun's encoding onto a different page if you think there's any value in describing it. Also you should either fix the en/decoder code examples, or make it very much more obvious that they don't en/decode standard-compliant UTF-8 (i.e. don't work).
-- chris
Chris Uppal - 24 Feb 2006 14:10 GMT
I wrote:
> [...] warning that ObjectOutputStream.writeUtf8() does not write valid UTF-8.
That should be expanded:
DataOutputStream.writeUTF() does not write valid UTF-8. Nor do the other IO classes implementing java.io.DataOutput, such as ObjectOutputStream and RandomAccessFile. Similarly the corresponding readUTF() methods do not decode UTF-8 correctly.
-- chris
Roedy Green - 24 Feb 2006 20:17 GMT
On Fri, 24 Feb 2006 14:10:34 -0000, "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly quoted someone who said :
> DataOutputStream.writeUTF() does not write valid UTF-8. Nor do the
> other IO class implementing java.io.DataOutput, such as ObjectOutputStream
> and RandomAccessStream. Similarly the corresponding readUTF() methods
> do not decode UTF-8 correctly.
At times I feel like I am at the top of a steep ski hill when I start a little essay. Once you put something out there, you are committed to getting it right, no matter how long it takes you.
The simplest little things turn into black holes for time.
All you said sounded correct except that I am pretty sure I read that UTF-8 had been extended to use surrogate pairs to encode 32-bit. That is not just a Sun thing.
Chris Uppal - 27 Feb 2006 09:09 GMT
> all you said sounded correct except, I am pretty sure I read up that
> UTF-8 had been extended to use surrogate pairs to encode 32-bit. That
> is not just a Sun thing.
It's perfectly possible that you did read that. It's not true, though. A great deal of junk has been written about Unicode.
-- chris
Roedy Green - 27 Feb 2006 12:58 GMT
On 27 Feb 2006 09:09:47 GMT, "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly quoted someone who said :
>It's perfectly possible that you did read that. It's not true, though.
>A great deal of junk has been written about Unicode.
I have rewritten the essay and written an experiment explorer program to back up much of what I say.
see http://mindprod.com/jgloss/utf.html
Chris Uppal - 28 Feb 2006 11:52 GMT
> I have rewritten the essay and written an experiment explorer program
> to back up much of what I say.
>
> see http://mindprod.com/jgloss/utf.html
Thanks for making the changes.
I haven't actually checked the code -- it seems safe to assume it does what you say it does -- but with that proviso it seems pretty much OK. I still think you could usefully make it clearer that your example en/decoding code is not actually useful (because incomplete). I know you /do/ say that, but it's buried away and (IMO) gives the impression that it "doesn't really matter".
However, there is still one major error. It's near the bottom under "Exploring Java's UTF Support". First off, it still isn't plain that 2 out of the four options you mention (1 and 3) have /nothing at all/ to do with UTF-8. The so-called "modified UTF-8" format is not compatible (upwards or downwards) with UTF-8. So I don't think you should mix references to the two together, and certainly not intermingle them as if they were all of comparable relevance.
Specifically, the page states (slightly further up, under "DataOutputStream.writeUTF()") that the length is "followed by a standard UTF-8 byte encoding of the String"; that is simply not true. You note already that Quasi-UTF-8 encodes 0x0 differently from UTF-8, which all by itself is enough to make writeUTF() useless for interoperability with standards compliant encodings. However there is also a major difference in how it encodes characters off the BMP. E.g. the Unicode character:
    U+10302
will encode in UTF-8 as (taken from the Unicode Standard 4.0.1, table 3.3):
    0xF0 0x90 0x8C 0x82
whereas under Sun's scheme it encodes as:
    0xED 0xA0 0x80 0xED 0xBC 0x82
(I'm using unsigned bytes here).
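The byte sequences quoted above can be reproduced from Java itself. This is a small sketch, assuming Java 7 or later; the class and helper names are invented for the example. It encodes U+10302 once with the standard UTF-8 charset and once with DataOutputStream.writeUTF() (dropping the 2-byte length prefix that writeUTF() prepends), and does the same for U+0000.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class comparing standard UTF-8 with writeUTF() output.
class ModifiedUtf8Demo {
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Bytes produced by writeUTF(), without its leading 2-byte length field.
    static byte[] writeUtfPayload(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String offBmp = new String(Character.toChars(0x10302));
        System.out.println("UTF-8:      " + hex(standardUtf8(offBmp)));
        System.out.println("writeUTF(): " + hex(writeUtfPayload(offBmp)));
        System.out.println("NUL, UTF-8:      " + hex(standardUtf8("\u0000")));
        System.out.println("NUL, writeUTF(): " + hex(writeUtfPayload("\u0000")));
    }
}
```

For U+10302 the two calls give F0 90 8C 82 and ED A0 80 ED BC 82 respectively, matching the figures in the post; for U+0000 they give 00 and C0 80.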
BTW, you also express some opinions on the (non-)value of the >16-bit Unicode characters. I have no problem with your expressing your opinions on your own webpages. I just wanted to add that I don't agree with them.
-- chris
Roedy Green - 28 Feb 2006 21:14 GMT
On 28 Feb 2006 11:52:21 GMT, "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly quoted someone who said :
>However, there is still one major error. It's near the bottom under
>"Exploring Java's UTF Support". First off, it still isn't plain that 2
[quoted text clipped - 9 lines]
>make writeUTF() useless for interoperability with standards compliant
>encodings
I disagree. The only difference for 16-bit is the way 0 is encoded, and the Sun encoding comes out in the wash even when you decode making no special provision for it. You are making a mountain out of a null. They behave 99% the same way, so it makes sense to discuss them both on the same page, http://mindprod.com/jgloss/utf.html. It is even less of a difference from a practical point of view than the presence or absence of BOMs.
Personally, I don't see the point of any great rush to support 32-bit Unicode. The new symbols will be rarely used. Consider what's there. The only ones I would conceivably use are musical symbols and Mathematical Alphanumeric symbols (especially the German black letters so favoured in real analysis). The rest I can't imagine ever using unless I took up a career in anthropology, i.e. linear B syllabary (I have not a clue what it is), linear B ideograms (looks like symbols for categorising cave petroglyphs), Aegean Numbers (counting with stones and sticks), Old Italic (looks like Phoenician), Gothic (medieval script), Ugaritic (cuneiform), Deseret (Mormon), Shavian (George Bernard Shaw's phonetic script), Osmanya (Somalian), Cypriot syllabary, Byzantine music symbols (looks like Arabic), Musical Symbols, Tai Xuan Jing Symbols (truncated I-Ching), CJK extensions (Chinese Japanese Korean) and tags (letters with blank price tags).
I think 32-bit Unicode becomes a matter of the tail wagging the dog, spurred by the technical challenge rather than a practical necessity. In the process, ordinary 16-bit character handling is turned into a bleeding mess, for almost no benefit.
I think we should for the most part simply ignore 32-bit and continue using the String class as we always have presuming every character is 16-bits.
Roedy Green - 01 Mar 2006 05:36 GMT
On 28 Feb 2006 11:52:21 GMT, "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly quoted someone who said :
>Eg. the Uncode character:
> U+10302
[quoted text clipped - 4 lines]
> 0xED 0xA0 0x80 0xED 0xBC 0x82
>(I'm using unsigned bytes here).
I have done some tsk tsking over this that should warm your heart cockles. I have also included exploration of codepoints in the test program. I have also shown how 21-bit code points are encoded, though I have not put code into the sample UTF code to handle codepoints by decoding UTF-16 and recoding as UTF-8. I wanted to explain how this worked, not confuse the heck out of people with code they won't likely ever use.
Chris Uppal - 20 Feb 2006 12:10 GMT
> We have a C++ client which runs on Windows and that needs to transmit
> char* / wchar* strings to and from a Java server.
>
> The client should correctly handle both 'standard' languages & east
> Asian
> languages (i.e. using wchar).
The obvious options are:
Use UTF-8.
    Advantages:
        Compact /if/ you send mostly ASCII text.
        Easily readable (for debugging) /if/ you send mostly ASCII text.
        No byte-order issues.
    Disadvantages:
        Consumes more bandwidth if you send mostly non-ASCII.
        Requires explicit en/de-coding on the Windows box (perfectly possible, but you have to write the code for it).
Use UTF16-LE.
    Advantages:
        Compact in the cases where UTF-8 is not.
        Requires no special handling in the Windows code (since that's the native format for a wstring) and you always have to specify an encoding at the Java end so it makes no difference which encoding you use from the Java point-of-view.
    Disadvantages:
        Consumes more bandwidth if you send mostly ASCII text.
Without knowing your requirements, I can't guess which option would be best for you, but I don't think any other options make sense.
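To make the bandwidth trade-off between the two options concrete, here is a minimal sketch (assuming Java 7 or later; the class name and sample strings are made up) that measures the encoded size of the same text under both encodings.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: encoded size of a string in UTF-8 versus UTF-16LE.
class EncodingSizeDemo {
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16leSize(String s) {
        // UTF_16LE writes no BOM, so this is exactly 2 bytes per Java char.
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String japanese = "\u65E5\u672C\u8A9E"; // three CJK characters
        System.out.println("ASCII:    UTF-8 " + utf8Size(ascii)
                + " bytes, UTF-16LE " + utf16leSize(ascii) + " bytes");
        System.out.println("Japanese: UTF-8 " + utf8Size(japanese)
                + " bytes, UTF-16LE " + utf16leSize(japanese) + " bytes");
    }
}
```

The ASCII sample takes 5 bytes in UTF-8 against 10 in UTF-16LE, while the three CJK characters take 9 bytes in UTF-8 against 6 in UTF-16LE.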
Some other points to consider.
If you choose UTF8 then don't use java.io.DataInputStream.readUTF() or the corresponding write method. They don't do what the method names suggest.
If you choose UTF16-LE then you should consider whether a BOM (byte order mark) is forbidden, tolerated, or required by your protocol. Alternatively you could mandate merely UTF16 (either byte order) and /require/ a BOM -- that would give you flexibility if you anticipate creating non Windows clients (which I doubt).
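On the Java side, the difference between "UTF16-LE, no BOM" and "UTF16 with a required BOM" shows up in which charset is used. The sketch below assumes Java 7 or later and uses invented class and method names; it relies on the documented behaviour that the "UTF-16" charset honours a BOM when decoding and writes a big-endian BOM when encoding, whereas "UTF-16LE" treats a leading BOM as an ordinary U+FEFF character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of BOM handling by Java's UTF-16 charsets.
class Utf16BomDemo {
    // Encodes as big-endian UTF-16 preceded by the BOM bytes FE FF.
    static byte[] encodeWithBom(String s) {
        return s.getBytes(StandardCharsets.UTF_16);
    }

    // Decodes either byte order, using the BOM to choose (big-endian if absent).
    static String decodeAnyOrder(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    // Decodes strictly as little-endian; a BOM, if present, is kept as U+FEFF.
    static String decodeLittleEndian(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] leWithBom = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00}; // BOM + "A", little-endian
        System.out.println(decodeAnyOrder(leWithBom).length());     // BOM consumed
        System.out.println(decodeLittleEndian(leWithBom).length()); // BOM kept as a char
    }
}
```

So a protocol that forbids the BOM pairs naturally with UTF_16LE, while one that requires it pairs with UTF_16; mixing the two leaves a stray U+FEFF at the start of the string.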
If you choose UTF8 then you should consider whether a BOM is forbidden or tolerated by your protocol.
If your choice between UTF-8 and -16 is significantly swayed by bandwidth considerations, then it might be worthwhile considering using zlib compression. Java already understands that, and it's easy to use the ZLIB1.DLL from Windows code.
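As a rough illustration of the zlib suggestion on the Java side, the following sketch uses java.util.zip.Deflater and Inflater, which read and write the zlib format by default. The class and method names are invented, and buffer sizes and error handling are kept deliberately simple.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper: zlib round trip for an already-encoded message body.
class ZlibDemo {
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported zlib data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) sb.append("hello ");
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        System.out.println(raw.length + " bytes -> " + packed.length + " bytes");
        System.out.println(new String(decompress(packed), StandardCharsets.UTF_8).length());
    }
}
```

Compression is applied to the bytes after encoding, so it works the same way whichever of UTF-8 or UTF-16 is chosen.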
If your protocol is of the form: <character count><character data> then you should be very clear about what you mean by a "character", especially if you use UTF16 (where there may be more 16-bit wchars / Java chars than actual Unicode characters). Is the BOM (if any) included in the count ?
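The ambiguity of "character" is easy to demonstrate. In this sketch (Java 7 or later assumed; class and method names are illustrative only) the same string yields three different counts depending on the unit chosen.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: three different "lengths" of the same string.
class LengthUnits {
    static int javaChars(String s) {
        return s.length(); // UTF-16 code units
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length()); // Unicode characters
    }

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "a" followed by U+10302, which needs a surrogate pair in UTF-16.
        String s = "a" + new String(Character.toChars(0x10302));
        System.out.println("Java chars:  " + javaChars(s));
        System.out.println("code points: " + codePoints(s));
        System.out.println("UTF-8 bytes: " + utf8Bytes(s));
    }
}
```

For this two-character string the counts are 3 Java chars, 2 Unicode code points and 5 UTF-8 bytes, which is why the protocol document needs to say which one the count field holds.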
-- chris
qqq111 - 21 Feb 2006 18:07 GMT
Very interesting input, Chris. It does seem that UTF-8 is the right way for us...
1. Our data will mainly consist of ASCII text
2. It turns out Windows does have an API for to/from UTF-8 conversions. See WideCharToMultiByte -and- MultiByteToWideChar (code page s/b set to CP_UTF8)
3. Our system does not use DataInputStream, but rather: CharsetEncoder/Decoder.
4. Each of our msgs is indeed preceded by a length field (as fixed-size text field). Length is measured in Java characters and dup by 2 to obtain size in bytes
5. The BOM issue is, frankly, news to me. If I limit myself to UTF-8 strings only, and stick to standard Win/Java api at both client & server end, do I need to worry about BOM ?
Thanks in advance,
Gilad
Chris Uppal - 22 Feb 2006 12:20 GMT
> ....
But first a request. /Please/ follow Usenet etiquette and say who you are replying to and quote selectively from the post as you reply. Normally I just ignore people who don't follow "The Rules"; I'm making an exception in this case on a whim ;-)
> 4. Each of our msgs is indeed preceded by a length field
> (as fixed-size text field). Length is measured in Java
> characters and dup by 2 to obtain size in bytes
That algorithm will not give you the size in bytes of a UTF-8 encoded string. There is no way to compute the length of the UTF-8 encoding of a Unicode sequence that does not involve scanning every character. The easiest thing, of course, is just to let the platform do the encoding and then transmit the length of the resulting byte array. If you want to calculate the length yourself, then it's a bit messy -- the main problem is that in Java or Windows the input data is encoded as UTF-16 so you have to undo that encoding and then re-encode the result as UTF-8. Not especially difficult, but more work than you might expect if you are used to relying on strlen() and the like.
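For the "calculate the length yourself" route described above, a scan over code points might look like the sketch below. The class and method names are made up, and the code assumes the string contains no unpaired surrogates (for those, Java's own encoder substitutes a replacement byte, so the two results would differ).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: UTF-8 length of a Java String without encoding it.
class Utf8Length {
    static int utf8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i); // undoes the UTF-16 encoding: a surrogate pair becomes one code point
            if (cp < 0x80) {
                bytes += 1;
            } else if (cp < 0x800) {
                bytes += 2;
            } else if (cp < 0x10000) {
                bytes += 3;
            } else {
                bytes += 4;
            }
            i += Character.charCount(cp); // advance by 1 or 2 Java chars
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "a\u00E9\u65E5" + new String(Character.toChars(0x10302));
        System.out.println("scanned: " + utf8Length(s));
        System.out.println("encoded: " + s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

The sample string mixes 1-, 2-, 3- and 4-byte characters, so both lines should report the same total; doubling the number of Java chars would not.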
It would work for UTF-16. But if you decide to stick with UTF-8 (which sounds better to me) then I suggest you prototype your receiving code (for both platforms) before you set the protocol in stone.
Whatever you do, make very sure that your documentation (formal or informal) of the protocol is /very/ clear about the meaning of the size field. Remember that the word "character" is ambiguous -- it could mean Java char-s, C++ wchar-s, or (most confusingly) Unicode characters. An inexperienced programmer could even assume it meant "byte".
> 5. The BOM issue is, frankly, news to me. If I limit myself to
> UTF-8 strings only, and stick to standard Win/Java api at
> both client & server end, do I need to worry about BOM ?
I doubt it. The important thing is to have made a conscious (and documented) decision. I would probably decide that a BOM must not be used, unless there's something in your project's requirements that I don't know about.
-- chris
qqq111 - 23 Feb 2006 19:16 GMT
Hi,
> Normally I just ignore people who don't follow "The Rules"
Thanks for not ignoring me ;-)
> That algorithm will not give you the size in bytes of a UTF-8 encoded string
You're right, of course.
> [easiest way to calc utf-8 buffer len ] is just to let the platform
> do the encoding and then transmit the length of the resulting byte array
That is what we'll probably do, in the end.
> make very sure [doc] is /very/ clear about the meaning of the size field
Agree - very important to clearly state 'type of length' .
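The "encode first, then send the byte length" approach might look like this on the Java side. It is only a sketch: the 4-byte big-endian binary length prefix is an assumption made for the example (the protocol discussed in the thread uses a fixed-size text field instead), and the class and method names are invented.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical framing: <4-byte big-endian byte count><UTF-8 bytes>.
class Utf8Framing {
    static byte[] frame(String s) {
        byte[] payload = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(payload.length); // length in BYTES, not chars
            out.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    static String unframe(byte[] frame) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(frame))) {
            int length = in.readInt();
            byte[] payload = new byte[length];
            in.readFully(payload); // fails with EOFException if the frame is short
            return new String(payload, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String message = "h\u00E9llo";
        byte[] framed = frame(message);
        System.out.println(framed.length + " bytes on the wire");
        System.out.println(unframe(framed).equals(message));
    }
}
```

Because the prefix counts bytes of the already-encoded payload, the receiver never has to reason about chars, wchars or code points to know how much to read. A C++ client would need to write the same byte order for the prefix.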
As a side note: you've mentioned zlib in a prior post. We do plan to compress parts of the network-transferred data. We plan, however, on using an open source lib called LZMA ( http://www.7-zip.org), which achieves impressive compression ratios at a reasonable CPU cost (see: http://tukaani.org/lzma/ ). Do you feel we've missed any important considerations here?
Thanks again,
Gilad
Chris Uppal - 24 Feb 2006 11:17 GMT
> > Normally I just ignore people who don't follow "The Rules"
>
> Thanks for not ignoring me ;-)
Thank /you/ for listening!
> We do plan to compress parts to the network-transferred data.
> We plan, however on using an open source lib called LZMA
> ( http://www.7-zip.org),
> which achieves impressive compression ratios at a reasonable CPU cost
> (see: http://tukaani.org/lzma/ ).
> Do you feel we've missed any important considerations here?
I don't know anything about that library or compression scheme myself (beyond what it says on the website). It certainly looks OK, and using the same library for your C++ and Java code would probably make things easier (if only support queries). The only /potential/ issue I'd raise[*] is that the [de]compression times are highly asymmetrical with compression being rather compute-intensive. If the bulk of the compression happens on the clients, leaving the server to do (mostly) only decompression, then that will work very well for you. But if the situation is the other way around, then I'd want to do a bit of measuring and a few sums before committing to LZMA. I'm not suggesting that /would/ be a problem, just something to check (which you may well have done already).
([*] Apart from a suggestion that you get your lawyers to OK the license -- which is my standard line for anything with LGPL.)
-- chris
Roedy Green - 24 Feb 2006 09:04 GMT
On Mon, 20 Feb 2006 12:10:49 -0000, "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly quoted someone who said :
>If you choose UTF8 then you should consider whether a BOM forbidden or
>tolerated by your protocol.
The BOM for UTF-8 looks like this:
EF BB BF
It is a misnomer. You don't need a byte order mark for UTF-8 since there are no lo-hi bytes to order. It is more like a file signature to indicate a UTF-8 encoded file. Otherwise it will at a casual glance look no different from any native platform encoding.
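If a protocol chooses to tolerate that signature, the receiver has to drop it itself, because Java's UTF-8 decoder passes EF BB BF through as the character U+FEFF. A minimal sketch, with hypothetical class and method names and assuming Java 7 or later:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decode UTF-8, discarding a leading EF BB BF if present.
class Utf8Bom {
    static boolean hasBom(byte[] b) {
        return b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
    }

    static String decodeDroppingBom(byte[] bytes) {
        int offset = hasBom(bytes) ? 3 : 0;
        return new String(bytes, offset, bytes.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        // Plain decoding keeps the signature as a leading U+FEFF character.
        System.out.println(new String(withBom, StandardCharsets.UTF_8).length());
        System.out.println(decodeDroppingBom(withBom).length());
    }
}
```

A protocol that forbids the BOM could instead use hasBom() to reject such a message outright.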