I hope I'm missing something obvious here ...
My app reads text data over a socket. It's actually done via libcurl
(http://curl.haxx.se) but that doesn't matter here; the point is that I
am presented with a void * pointing to a block of raw data and a long
giving the number of bytes in the block, and that's it.
However, since my app comprises both server and client, I (the client)
know the data is line-oriented text. All it needs to do is break it into
lines and print them to stdout, skipping certain lines. This is not
hard, using a bunch of <string.h> routines like strchr and strstr, plus
fputs(). I have this all working - when the text is 8-bit ASCII.
But for full generality I'd like the server to deliver text in the UCS-2
charset (Unicode). I figured handling this on the client side would be a
simple matter of transposing char to wchar_t and strlen() to wcslen(),
etc. But it turns out that wchar_t on my platform (and on many,
including Linux and Solaris) is 4 bytes wide. So I've got a stream of
16-bit characters from the server, and mechanisms for handling 8- and
32-bit character streams on the client!
Is there a common/elegant solution here? I could allocate a buffer twice
as big as the incoming data and promote to 4-byte chars before operating
on it but that would be inelegant to say the least. Not to mention the
platforms where wchar_t is 2 bytes. I guess I could pass sizeof(wchar_t)
to the server and have it respond with 2- or 4-byte data based on that,
but that would mean a doubling of bandwidth consumption. What do people
usually do about this "impedance mismatch"?
Thanks,
Henry Townsend
Roedy Green - 25 Sep 2005 21:38 GMT
>I guess I could pass sizeof(wchar_t)
>to the server and have it respond with 2- or 4-byte data based on that,
>but that would mean a doubling of bandwidth consumption. What do people
>usually do about this "impedance mismatch"?
How about telling C you have a stream of bytes, then taking the UTF-16
apart yourself? There was a long discussion here about how UTF-16 is
encoded.
See http://mindprod.com/jgloss/utf.html
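To make "taking the UTF-16 apart yourself" concrete, here is a minimal sketch
in C. It treats the block as plain bytes and widens each 16-bit unit (or
surrogate pair) to a 32-bit code point, so the width of wchar_t never enters
into it. The function names and the big_endian flag are made up for
illustration; the byte order on the wire is an assumption the server and
client would have to agree on (or signal with a BOM).

```c
#include <stddef.h>
#include <stdint.h>

/* Read one 16-bit unit from two bytes in the given byte order. */
static uint32_t read16(const unsigned char *p, int big_endian)
{
    return big_endian ? ((uint32_t)p[0] << 8) | p[1]
                      : ((uint32_t)p[1] << 8) | p[0];
}

/* Decode UTF-16 held in a raw byte buffer into 32-bit code points.
 * out must have room for nbytes / 2 entries. Unpaired surrogates are
 * replaced with U+FFFD; a trailing odd byte is ignored.
 * Returns the number of code points written. */
size_t utf16_bytes_to_utf32(const void *data, size_t nbytes,
                            int big_endian, uint32_t *out)
{
    const unsigned char *p = data;
    size_t i = 0, n = 0;

    while (i + 1 < nbytes) {
        uint32_t u = read16(p + i, big_endian);
        i += 2;

        if (u >= 0xD800 && u <= 0xDBFF) {            /* high surrogate */
            if (i + 1 < nbytes) {
                uint32_t lo = read16(p + i, big_endian);
                if (lo >= 0xDC00 && lo <= 0xDFFF) {  /* valid pair */
                    i += 2;
                    out[n++] = 0x10000 + ((u - 0xD800) << 10)
                                       + (lo - 0xDC00);
                    continue;
                }
            }
            out[n++] = 0xFFFD;                       /* unpaired high */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            out[n++] = 0xFFFD;                       /* stray low surrogate */
        } else {
            out[n++] = u;                            /* BMP character */
        }
    }
    return n;
}
```

With the void * and byte count that the transfer callback hands over, a call
would look like utf16_bytes_to_utf32(ptr, (size_t)len, 1, buf), after which
line splitting is a scan of buf for the value 0x0A. If the data really is
UCS-2, the surrogate branches are simply never taken.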
Is there anything in your C libraries equivalent to Java's UTF-8
encodings, something that will give you an array of 32-bit chars from
an 8-bit stream?
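Failing a library routine, the conversion from an 8-bit UTF-8 stream to an
array of 32-bit chars can be written by hand. What follows is only a sketch
under that assumption, not a reference to any existing library; the name
utf8_to_utf32 is invented here. Malformed input (bad lead bytes, truncated or
overlong sequences, encoded surrogates, values past U+10FFFF) comes out as
U+FFFD.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode UTF-8 bytes into 32-bit code points.
 * out must have room for nbytes entries (the worst case is all ASCII).
 * Returns the number of code points written. */
size_t utf8_to_utf32(const unsigned char *s, size_t nbytes, uint32_t *out)
{
    /* Smallest code point that legitimately needs 0..3 continuation bytes. */
    static const uint32_t min_cp[4] = {0x0, 0x80, 0x800, 0x10000};
    size_t i = 0, n = 0;

    while (i < nbytes) {
        unsigned char b = s[i];
        uint32_t cp;
        size_t extra, k;
        int ok = 1;

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out[n++] = 0xFFFD; i++; continue; }   /* invalid lead byte */

        if (i + extra >= nbytes)
            ok = 0;                                  /* sequence truncated */
        for (k = 1; ok && k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                ok = 0;                              /* not a continuation */
            else
                cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (!ok) { out[n++] = 0xFFFD; i++; continue; }

        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            cp = 0xFFFD;                             /* overlong or illegal */

        out[n++] = cp;
        i += extra + 1;
    }
    return n;
}
```

Sending UTF-8 from the server would also sidestep the bandwidth worry for
mostly-ASCII text, since each ASCII character stays one byte on the wire.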
I did not catch whether this data is mixed binary/text or pure text.
If mixed, you might use LEDataStream to prepare files that look like C
structures. It encodes the strings as counted UTF-8. You could take
those apart yourself.
I think the key to this is realising the encoding is not a big deal.

Canadian Mind Products, Roedy Green.
http://mindprod.com Again taking new Java programming contracts.