Suppose one needs to both store (in a file) and transmit (via sockets)
data that will be mostly text, but with the occasional chunk of binary
(e.g. GIF images).
It seems to me that there are three possible ways:
1) Use a Reader (intended for text) and write the binary data directly,
16 bits to a character.
I assume this _won't_ work, at least not reliably, because various
translations will be done that would mess up the binary data?
2) Use a Reader (intended for text) and encode the binary data as text
in hex, base64 or similar. This would work, though I was hoping for a
more elegant solution.
3) Use a Stream (intended for binary) and write strings as sequences of
16-bit integers.
Is it safe to do this? That is, if you put a Java String through a
channel that treats it as a literal sequence of 16-bit integers, are you
guaranteed to get the same character sequence out the other end? Or are
there Unicode complications, bank switching to squeeze different chunks
of the 32-bit code point space into the space of 16 bit Java characters,
that sort of thing that might mean (char)1234 on system A doesn't mean
the same character as (char)1234 on system B?
In general, what's the recommended way to do this - what do people
normally do if they want to put images in an XML file, say? Is there a
fourth way I haven't thought of?
Thanks,

Signature
"Always look on the bright side of life."
To reply by email, replace no.spam with my last name.
Thomas Hawtin - 19 Nov 2006 20:44 GMT
> Suppose one needs to both store (in a file) and transmit (via sockets)
> data that will be mostly text, but with the occasional chunk of binary
[quoted text clipped - 7 lines]
> I assume this _won't_ work, at least not reliably, because various
> translations will be done that would mess up the binary data?
If you use a Reader you will need to decide how to encode character data
onto the stream. Beware, much of the Java library is booby trapped. For
instance, if you use java.io.InputStreamReader(InputStream), then you
are leaving the library to make the character encoding decision for
you. In this case, it uses the platform default encoding, whatever the
machine happens to be set to. If you choose, say, UTF-8, then every
well-formed char sequence will be preserved (an unpaired surrogate has no
UTF-8 encoding and gets replaced).
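A minimal sketch of that point, not from the original post: name the charset on both the Writer and the Reader so the bytes do not depend on the machine's settings. The class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class ExplicitCharsetDemo {

    /** Encodes text through a Writer whose charset is named explicitly. */
    static byte[] encode(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
        return bytes.toByteArray();
    }

    /** Decodes bytes through a Reader that names the same charset. */
    static String decode(byte[] data) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        }
        return text.toString();
    }
}
```

With the one-argument constructors the same text could come back garbled whenever the two ends have different default encodings.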
> 2) Use a Reader (intended for text) and encode the binary data as text
> in hex, base64 or similar. This would work, though I was hoping for a
> more elegant solution.
No, not elegant.
> 3) Use a Stream (intended for binary) and write strings as sequences of
> 16-bit integers.
>
> Is it safe to do this? That is, if you put a Java String through a
That should work. char is a 16-bit value. UTF-8 would be more conventional.
> channel that treats it as a literal sequence of 16-bit integers, are you
> guaranteed to get the same character sequence out the other end? Or are
> there Unicode complications, bank switching to squeeze different chunks
> of the 32-bit code point space into the space of 16 bit Java characters,
> that sort of thing that might mean (char)1234 on system A doesn't mean
> the same character as (char)1234 on system B?
There are char values that are surrogates, which combine in pairs.
However, the Unicode code points those pairs represent are 0x10000 and
above, so they never collide with ordinary chars. So there should be
no loss of information (although not every sequence of octets represents
valid UTF-8).
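A small sketch to check this, assuming big-endian 16-bit code units as written by DataOutputStream.writeChars. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class SurrogateRoundTrip {

    /** Writes each char as a literal 16-bit value, high byte first. */
    static byte[] toCodeUnits(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeChars(s);
        }
        return bytes.toByteArray();
    }

    /** Rebuilds the String from literal 16-bit values. */
    static String fromCodeUnits(byte[] data) throws IOException {
        StringBuilder s = new StringBuilder(data.length / 2);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            for (int i = 0; i < data.length / 2; i++) {
                s.append(in.readChar());
            }
        }
        return s.toString();
    }

    /** Round-trips the String through standard UTF-8. */
    static String viaUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }
}
```

A string such as "A" followed by U+1D11E (three chars, two code points) should come back unchanged from both routes. The 16-bit route also preserves a lone surrogate, whereas the UTF-8 route replaces it.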
> In general, what's the recommended way to do this - what do people
> normally do if they want to put images in an XML file, say? Is there a
> fourth way I haven't thought of?
I believe XML either uses out-of-channel binary data (XHTML img, for
instance) or Base64 encoding. You can have a perfectly valid XML
document that is just a Base64 blob between a pair of tags. XML does not
necessarily mean interoperable.
Much better is to use a binary data format, and encode Strings as UTF-8.
You could even cheat and use serialisation, if you don't mind a
Java-only protocol.
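One way such a binary format could look, as a sketch only: each chunk is a type byte, a 4-byte length, then the payload, with text payloads encoded as UTF-8. The type codes, class name and method names are assumptions made for this example, not an established format.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical chunk format: [type:1 byte][length:4 bytes][payload].
final class ChunkFormat {
    static final byte TEXT = 1;    // payload is UTF-8 text
    static final byte BINARY = 2;  // payload is raw bytes (e.g. a GIF)

    static void writeText(DataOutputStream out, String text) throws IOException {
        writeChunk(out, TEXT, text.getBytes(StandardCharsets.UTF_8));
    }

    static void writeBinary(DataOutputStream out, byte[] data) throws IOException {
        writeChunk(out, BINARY, data);
    }

    private static void writeChunk(DataOutputStream out, byte type, byte[] payload)
            throws IOException {
        out.writeByte(type);
        out.writeInt(payload.length);
        out.write(payload);
    }

    /** Reads the next chunk: a String for TEXT, a byte[] for BINARY. */
    static Object readChunk(DataInputStream in) throws IOException {
        byte type = in.readByte();
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("Negative chunk length: " + length);
        }
        byte[] payload = new byte[length];
        in.readFully(payload);  // blocks until the whole payload has arrived
        switch (type) {
            case TEXT:
                return new String(payload, StandardCharsets.UTF_8);
            case BINARY:
                return payload;
            default:
                throw new IOException("Unknown chunk type: " + type);
        }
    }
}
```

The same DataOutputStream can wrap a FileOutputStream or a socket's output stream, so one format covers both storage and transmission.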
Tom Hawtin
Arne Vajhøj - 19 Nov 2006 20:55 GMT
> Suppose one needs to both store (in a file) and transmit (via sockets)
> data that will be mostly text, but with the occasional chunk of binary
[quoted text clipped - 14 lines]
> 3) Use a Stream (intended for binary) and write strings as sequences of
> 16-bit integers.
I would suggest:
4) use DataInputStream/DataOutputStream
5) have both InputStream/OutputStream and BufferedReader/PrintWriter and
a protocol that enables both ends to switch between them
Arne
Thomas Hawtin - 19 Nov 2006 21:10 GMT
> 4) use DataInputStream/DataOutputStream
writeUTF/readUTF only allow strings of up to 65535 bytes of modified
UTF-8. (If your String was all NUL characters, which take two bytes each,
it could be at most 65535/2 characters long; with characters from U+0800
upwards, which take three, at most 65535/3.)
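The limit is easy to see with a quick sketch like the one below. The helper names are hypothetical; writeUTF itself is the standard DataOutputStream method and throws UTFDataFormatException when the encoded form is too long.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class WriteUtfLimit {

    /** Bytes produced by writeUTF for s, including the 2-byte length prefix. */
    static int encodedSize(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF(s);  // UTFDataFormatException if the body exceeds 65535 bytes
        out.flush();
        return bytes.size();
    }

    /** Builds a String of n copies of c. */
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(encodedSize("A"));        // ASCII: 1 byte + prefix
        System.out.println(encodedSize("\u0000"));   // NUL: 2 bytes + prefix
        System.out.println(encodedSize("\u20ac"));   // U+0800 and up: 3 bytes + prefix
        encodedSize(repeat('A', 65536));             // one byte too many: throws
    }
}
```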
> 5) have both InputStream/OutputStream and BufferedReader/PrintWriter and
> a protocol that enables both ends to switch between them
You would have to be surprisingly careful getting that to work. The
Readers/Writers will probably over read/under write, so switching will
be difficult.
Tom Hawtin
Arne Vajhøj - 19 Nov 2006 21:51 GMT
>> 4) use DataInputStream/DataOutputStream
>
> writeUTF/readUTF only allows strings of up to 65535 bytes of modified
> UTF-8. (If your String was all NUL characters, then it could be at most
> 65535/2 characters long).
I can usually live with lines smaller than that. Maybe the original
poster can too.
>> 5) have both InputStream/OutputStream and BufferedReader/PrintWriter and
>> a protocol that enables both ends to switch between them
>
> You would have to be surprisingly careful getting that to work. The
> Readers/Writers will probably over read/under write, so switching will
> be difficult.
I am not so worried about the output - flush should do that.
But maybe it would be wise on the input side to only use
InputStreamReader and not BufferedReader.
Arne
Thomas Hawtin - 19 Nov 2006 22:11 GMT
>>> 4) use DataInputStream/DataOutputStream
>>
[quoted text clipped - 4 lines]
> I can usually live with lines smaller than that. Maybe the original
> poster can too.
In this day and age, I'd find it very surprising to come across such a
limitation. Several years ago I became very unpopular because an
implementation of an API I was using couldn't cope with strings of more
than a certain length.
>>> 5) have both InputStream/OutputStream and BufferedReader/PrintWriter and
>>> a protocol that enables both ends to switch between them
[quoted text clipped - 4 lines]
>
> I am not so worried about the output - flush should do that.
So long as you don't mind not inconsiderable inefficiency. And don't
forget to flush every time.
> But maybe it would be wise on the input side to only use
> InputStreamReader and not BufferedReader.
Still won't work consistently.
Tom Hawtin
Arne Vajhøj - 19 Nov 2006 22:35 GMT
>>>> 4) use DataInputStream/DataOutputStream
>>>
[quoted text clipped - 9 lines]
> implementation of an API I was using couldn't cope with strings of more
> than a certain length.
You did notice that I am talking about lines, not text fragments?
>>>> 5) have both InputStream/OutputStream and BufferedReader/PrintWriter
>>>> and
[quoted text clipped - 8 lines]
> So long as you don't mind not inconsiderable inefficiency. And don't
> forget to flush every time.
Only when switching mode.
>> But maybe it would be wise on the input side to only use
>> InputStreamReader and not BufferedReader.
>
> Still won't work consistently.
Why ?
Arne
Thomas Hawtin - 19 Nov 2006 23:12 GMT
>>>> You would have to be surprisingly careful getting that to work. The
>>>> Readers/Writers will probably over read/under write, so switching
[quoted text clipped - 6 lines]
>
> Only when switching mode.
Mode switching may be frequent. Particularly if it is dealing with lots
of small sections of text.
>>> But maybe it would be wise on the input side to only use
>>> InputStreamReader and not BufferedReader.
>>
>> Still won't work consistently.
>
> Why ?
Even if you ask for a single character, Sun's implementation attempts to
grab a block of three bytes.
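A rough way to observe the read-ahead on whatever JDK is at hand (a sketch, with hypothetical names): ask an InputStreamReader for one character, then check how much is left in the underlying stream. How far it reads ahead is an implementation detail; the only point is that it may take more than the one character asked for.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class OverReadDemo {

    /**
     * Reads a single char through an InputStreamReader and returns the
     * number of bytes the underlying stream still has to offer.
     */
    static int bytesLeftAfterOneChar(byte[] data) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        reader.read();          // ask for exactly one character
        return in.available();  // for ByteArrayInputStream: bytes not yet consumed
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello world".getBytes(StandardCharsets.UTF_8);
        // If the reader took only what it needed, this would print 10.
        System.out.println(bytesLeftAfterOneChar(data));
    }
}
```

Anything smaller than 10 means bytes that belonged to the next (possibly binary) section are now sitting in the reader's private buffer.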
Tom Hawtin
Arne Vajhøj - 20 Nov 2006 00:35 GMT
>>>> But maybe it would be wise on the input side to only use
>>>> InputStreamReader and not BufferedReader.
[quoted text clipped - 5 lines]
> Even if you ask for a single character, Sun's implementation attempts to
> grab a block of three bytes.
Ouch.
That is not very friendly towards others using the same stream.
#5 is out.
Arne
Mike Schilling - 20 Nov 2006 05:32 GMT
>>>>> But maybe it would be wise on the input side to only use
>>>>> InputStreamReader and not BufferedReader.
[quoted text clipped - 11 lines]
>
> #5 is out.
The thing is, any text data within the stream should be clearly delimited
(either with markers or by length). It's simple enough to read it into a
byte array, wrap that with a ByteArrayInputStream, and read *that* with an
InputStreamReader.
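A sketch of that approach, assuming the text section is delimited by a 4-byte length prefix (the framing and the names are assumptions for this example). Because the Reader only ever sees the private byte array, even a BufferedReader cannot swallow bytes belonging to the next section.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
final class DelimitedText {

    /** Reads one length-prefixed UTF-8 text section and returns its lines. */
    static List<String> readTextSection(DataInputStream in) throws IOException {
        int length = in.readInt();  // assumed framing: byte count of the section
        if (length < 0) {
            throw new IOException("Negative section length: " + length);
        }
        byte[] section = new byte[length];
        in.readFully(section);      // consume exactly this section, nothing more

        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(section), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

After the call returns, the shared stream is positioned at the first byte after the text, ready for binary reads.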
Russell Wallace - 19 Nov 2006 23:27 GMT
Thanks to everyone who replied!
> I can usually live with lines smaller than that. Maybe the original
> poster can too.
I'm of the school of thought that says hardcoded limits are at best a
venial sin; but I can easily roll my own UTF-8 methods (or just use a
Deflater for better compression), so DataInputStream/DataOutputStream
look like the way to go.
Hmm, suppose one were to wrap said UTF-8 methods in a class
ModifiedDataOutputStream extends DataOutputStream... is there a
convention as to what actual name to use for ModifiedDataOutputStream?

Thomas Hawtin - 19 Nov 2006 23:38 GMT
> Hmm, suppose one were to wrap said UTF-8 methods in a class
> ModifiedDataOutputStream extends DataOutputStream... is there a
> convention as to what actual name to use for ModifiedDataOutputStream?
Privately, ObjectOutputStream uses writeLongUTF for an eight-byte
(long) length followed by a modified UTF-8 body, and writeString to decide
whether to use that or the old format (signified by a type byte).
Tom Hawtin
Chris - 19 Nov 2006 20:57 GMT
> Suppose one needs to both store (in a file) and transmit (via sockets)
> data that will be mostly text, but with the occasional chunk of binary
[quoted text clipped - 26 lines]
> normally do if they want to put images in an XML file, say? Is there a
> fourth way I haven't thought of?
You should encode the text and transmit everything as a stream of bytes.
There are methods built in to Java to handle this for you. The methods
are reliable, reversible, and they will encode ASCII characters
in a single byte each (accented Western letters take two). The best
encoding to use is UTF-8, because it works with all characters in
Unicode in a clean way.
Example:
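import java.io.FileOutputStream;
import java.io.OutputStreamWriter;

String str = "My String";
FileOutputStream fos = new FileOutputStream("/myfile.out");
OutputStreamWriter writer = new OutputStreamWriter(fos, "UTF-8");
writer.write(str);
// flush so the encoded text reaches fos before any image bytes
writer.flush();
// write your images to the FileOutputStream directly and
// bypass the writer, then close fos when done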