Java Forum / General / September 2006
Mixing text and binary I/O
Ivan Voras - 25 Aug 2006 22:43 GMT In implementing a network protocol, there's a text (ASCII) phase and a binary phase. The ideal thing to use would be BufferedReader, but it doesn't allow reading raw bytes. The next best thing (though slower) would be DataInputStream, but its readLine() method is deprecated for silly reasons (IMO). Any other suggestions?
Mike Schilling - 25 Aug 2006 23:11 GMT
> In implementing a network protocol, there's a text (ASCII) phase and a
> binary phase. The ideal thing to use would be BufferedReader, but it
> doesn't allow reading raw bytes. The next best thing (though slower)
> would be DataInputStream, but its readLine() method is deprecated for
> silly reasons (IMO). Any other suggestions?
The simplest thing would be to read each message into a byte array and convert the bytes in the text portion appropriately (using, for instance, an InputStreamReader on top of a ByteArrayInputStream).
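A minimal sketch of the byte-array approach, under assumptions that are not in the thread: the text portion is preceded by a 4-byte big-endian length, and the class and method names (MessageTextDemo, readTextPortion) are made up for illustration. DataInputStream does no read-ahead of its own, so the underlying stream is left positioned at the first binary byte.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class MessageTextDemo {
    /** Reads a length-prefixed ASCII text portion (the 4-byte prefix is an assumed framing). */
    static String readTextPortion(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int length = data.readInt();
        if (length < 0) {
            throw new IOException("negative text length: " + length);
        }
        byte[] textBytes = new byte[length];
        data.readFully(textBytes); // consumes exactly 'length' bytes, no more

        // Decode only the bytes known to be text; the reader can buffer freely
        // because it sits on a private byte array, not on the network stream.
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(textBytes), StandardCharsets.US_ASCII);
        StringBuilder text = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            text.append((char) c);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = {0, 0, 0, 5, 'H', 'E', 'L', 'L', 'O', (byte) 0xCA, (byte) 0xFE};
        InputStream in = new ByteArrayInputStream(wire);
        System.out.println(readTextPortion(in));            // HELLO
        System.out.println(Integer.toHexString(in.read())); // ca (first binary byte)
    }
}
```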
Ivan Voras - 26 Aug 2006 00:12 GMT Mike Schilling wrote:
> The simplest thing would be to read each message into a byte array and
> convert the bytes in the text portion appropriately (using, for instance, an
> InputStreamReader on top of a ByteArrayInputStream.)
Rolling my own is not a problem, it just seems like it should belong in the basic library.
Mike Schilling - 26 Aug 2006 01:01 GMT > Mike Schilling wrote: > [quoted text clipped - 5 lines] > Rolling my own is not a problem, it just seems like it should belong in > the basic library. You'd need to define a standard way to express the boundary between the binary and text portions.
frankgerlach@gmail.com - 26 Aug 2006 01:06 GMT When converting a java char or String to bytes, you should *always* specify the encoding, which can be "UTF-8", "ISO-8859-1", "ASCII" etc. *Never* use the default encoding - this is system dependent. Use String.getBytes("ASCII"), do not use String.getBytes() !
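As an illustration of the advice above, here is a small sketch that always names the charset. It uses the StandardCharsets constants, which arrived in Java 7 (later than this thread); on older JDKs the string form getBytes("ASCII") does the same job but throws a checked UnsupportedEncodingException. The class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ExplicitCharsetDemo {
    /** Encodes text as ASCII bytes, independent of the platform default encoding. */
    static byte[] encodeAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    /** Decodes ASCII bytes back into a String, again naming the charset explicitly. */
    static String decodeAscii(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] wire = encodeAscii("HELO\r\n");
        System.out.println(Arrays.toString(wire));     // [72, 69, 76, 79, 13, 10]
        System.out.println(decodeAscii(wire).trim());  // HELO
    }
}
```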
Ivan Voras - 26 Aug 2006 19:05 GMT Mike Schilling wrote:
> You'd need to define a standard way to express the boundary between the > binary and text portions. Um, "this byte position here" (i.e. ftell()) is good enough, no need to overengineer it. :)
In case of complex encodings like UTF-8, I'd expect (and will probably create for my case) its behaviour to be like this:
- Backed by a buffer (the usual way, probably byte[])
- readByte() reads from the buffer, handles buffering of new data, etc.
- readChar() reads as many bytes as it needs to reconstitute a character, in case of UTF-8 it could be one or several - it doesn't matter. If it encounters an invalid byte (by the expectations set by used encoding), raise proper exception because it's an encoding error in the stream.
- Introduce private or protected pushByte() and pushChar() that do the reverse of readXXX, on the buffer. "Fixup" the fact that one character can have more bytes by initially making the buffer 4+ bytes longer, but don't use this extra space when filling the buffer in readByte(). Like in C, make pushXXX work only for a single byte/character.
- Modify readLine() to use readChar(), reads characters until CR+LF; can use existing logic that reads one char after CR to see if it's LF and push it back if it isn't.
- Every other readXXX method uses readByte() as usual.
The intended result: freely mix bytes and characters. In the extreme (but supported!) case, the stream can have a UTF-8 character (encoded by one or several bytes) followed by a "raw" byte, followed by a UTF-8 character, etc. The programmer is responsible to know how the stream is formatted.
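A rough sketch of such a class, with several assumptions of my own: it is limited to UTF-8, it leans on java.io.PushbackInputStream for the single-byte push-back instead of managing its own byte[] buffer, and it returns whole code points so that 4-byte sequences need no special casing. The names MixedInputStream and readCodePoint are hypothetical, and the decoder is simplified (it does not reject overlong forms or encoded surrogates).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.CharacterCodingException;

/** Hypothetical reader that lets a caller alternate between raw bytes and UTF-8 text. */
class MixedInputStream {
    private final PushbackInputStream in; // one byte of push-back is enough for CR handling

    MixedInputStream(InputStream source) {
        this.in = new PushbackInputStream(source);
    }

    /** Returns the next raw byte as 0..255, or -1 at end of stream. */
    int readByte() throws IOException {
        return in.read();
    }

    /** Decodes one UTF-8 sequence and returns its code point, or -1 at end of stream. */
    int readCodePoint() throws IOException {
        int first = in.read();
        if (first < 0x80) {
            return first; // ASCII, or -1 at end of stream
        }
        int extra;
        int cp;
        if ((first & 0xE0) == 0xC0) { extra = 1; cp = first & 0x1F; }
        else if ((first & 0xF0) == 0xE0) { extra = 2; cp = first & 0x0F; }
        else if ((first & 0xF8) == 0xF0) { extra = 3; cp = first & 0x07; }
        else throw new CharacterCodingException(); // stray continuation or invalid lead byte
        for (int i = 0; i < extra; i++) {
            int next = in.read();
            if (next < 0 || (next & 0xC0) != 0x80) {
                throw new CharacterCodingException(); // truncated or malformed sequence
            }
            cp = (cp << 6) | (next & 0x3F);
        }
        return cp;
    }

    /** Reads text up to CR, LF, or CRLF; returns null if the stream is already at its end. */
    String readLine() throws IOException {
        int cp = readCodePoint();
        if (cp < 0) {
            return null;
        }
        StringBuilder line = new StringBuilder();
        while (cp >= 0 && cp != '\n') {
            if (cp == '\r') {
                int next = in.read();
                if (next >= 0 && next != '\n') {
                    in.unread(next); // not part of the terminator: give it back
                }
                break;
            }
            line.appendCodePoint(cp);
            cp = readCodePoint();
        }
        return line.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = {'h', (byte) 0xC3, (byte) 0xA9, '\r', '\n', 0x01, (byte) 0xFF, 'o', 'k', '\n'};
        MixedInputStream in = new MixedInputStream(new ByteArrayInputStream(wire));
        System.out.println(in.readLine());  // text line
        System.out.println(in.readByte());  // raw byte
        System.out.println(in.readByte());  // raw byte
        System.out.println(in.readLine());  // text line again
    }
}
```

Because nothing is buffered beyond the single pushed-back byte, the caller can switch between readLine(), readCodePoint() and readByte() at any point; the price is one read() call per byte, so the wrapped stream should itself be buffered.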
Soren Kuula - 26 Aug 2006 22:23 GMT > Mike Schilling wrote:
> In case of complex encodings like UTF-8, I'd expect (and will probably > create for my case) its behaviour to be like this: [quoted text clipped - 6 lines] > used encoding), raise proper exception because it's an encoding error in > the stream. I think that java.nio.CharsetEncoder and CharsetDecoder do just that.
BTW, I agree with Frank that you should take character encoding seriously!! Do not assume anything, and do not use defaults. Otherwise, you will end up with something that never really works -- in other places than yours, on other computers than yours.
Søren
Mike Schilling - 27 Aug 2006 07:51 GMT > In case of complex encodings like UTF-8, I'd expect (and will probably > create for my case) its behaviour to be like this: > > - Backed by a buffer (the usual way, probably byte[]) In fact, I think you can build it on top of an InputStream, which is more flexible and more general, since all you need is a source of bytes.
> - readByte() reads from the buffer, handles buffering of new data, etc. Let the underlying stream handle buffering.
> - readChar() reads as many bytes as it needs to reconstitute a
> character, in case of UTF-8 it could be one or several - it doesn't
> matter. If it encounters an invalid byte (by the expectations set by
> used encoding), raise proper exception because it's an encoding error in
> the stream.
I don't know how to build this in general. It's mostly straightforward to build for a specific encoding, say UTF-8, but CharsetDecoder has no method that means "decode exactly one character". (I suppose you could give it one byte, then two, then three, etc. until it stops returning a failure status, but that seems inelegant.) Even in UTF-8, you get oddities where a code point above U+FFFF returns two characters; returning the first consumes 4 bytes, and returning the second consumes 0 bytes. In other words, you'd have to be careful with logic like "I know that this set of characters occupies bytes 3-10, and I've processed all of them, so I'll switch to reading bytes again."
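To make the oddity concrete, this small sketch counts the UTF-8 bytes and the Java chars for one code point. The class and method names are invented for the example; U+1D11E (MUSICAL SYMBOL G CLEF) is just a convenient code point above U+FFFF.

```java
import java.nio.charset.StandardCharsets;

class SurrogateDemo {
    /** Number of bytes the code point occupies when encoded as UTF-8. */
    static int utf8Length(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of Java chars (UTF-16 code units) needed for the code point. */
    static int utf16Length(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // above U+FFFF, so it needs a surrogate pair in Java
        System.out.println(utf8Length(clef));  // 4
        System.out.println(utf16Length(clef)); // 2
        System.out.println(utf8Length('A'));   // 1
        System.out.println(utf16Length('A'));  // 1
    }
}
```

So a char-by-char reader that has just returned the high surrogate has already consumed all four bytes, which is exactly why byte-position bookkeeping on the client side gets delicate.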
> - Introduce private or protected pushByte() and pushChar() that do the > reverse of readXXX, on the buffer. "Fixup" the fact that one character [quoted text clipped - 4 lines] > use existing logic that reads one char after CR to see if it's LF and > push it back if it isn't. More precisely, reads until CR, LF, or CRLF. You're right that pushing back a non-LF after CR is easy enough.
> - Every other readXXX method uses readByte() as usual.
>
[quoted text clipped - 3 lines]
> character, etc. The programmer is responsible to know how the stream is
> formatted.
Ivan Voras - 27 Aug 2006 10:33 GMT
> ivoras wrote:
>> - readChar() reads as many bytes as it needs to reconstitute a
[quoted text clipped - 6 lines]
> build for a specific encoding, say UTF-8, but CharsetDecoder has no method
> that means "decode exactly one character". (I suppose you could give it one
Hmm, ok. This is a slight problem (and IMO a candidate for rectifying), but for my current purpose, I can limit it to UTF-8 and use the sort-of implementation in DataInputStream. I can always point people to file a problem report with Java if they need 4-byte characters :) (just kidding)
Chris Uppal - 27 Aug 2006 11:19 GMT > > I don't know how to build this in general. It's mostly straightforward > > to build for a specific encoding, say UTF-8, but CharsetDecoder has no [quoted text clipped - 4 lines] > but for my current purpose, I can limit it to UTF-8 and use the sort-of > implementation in DataInputStream. Based on a previous attempt to use CharsetDecoder "raw", I suggest that you at least consider not using one at all, but doing your own UTF-8 {en/de}code logic instead.
-- chris
Mike Schilling - 27 Aug 2006 16:21 GMT >> > I don't know how to build this in general. It's mostly straightforward >> > to build for a specific encoding, say UTF-8, but CharsetDecoder has no [quoted text clipped - 10 lines] > logic > instead. Here's a place where I wish Java had output parameters; the signature I'd want for readChar is
/**
 * @returns a character in the range 0 to 65535, or -1 at EOF
 * @param moreDecoded returns true if another character is available
 *        without consuming more bytes
 */
int readChar(out boolean moreDecoded);
As it is, I suppose a moreDecoded() method is the least of evils, i.e. better than forcing the client to check that the returned character is in the range D800-DBFF.
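One way the moreDecoded() variant could look, as a sketch only. It assumes some lower layer already yields whole code points; that layer is modelled here by a hypothetical CodePointSource interface, and the class name Utf16CharReader is likewise invented. Character.highSurrogate and Character.lowSurrogate require Java 7 or later.

```java
import java.io.IOException;

/** Hypothetical char-level reader that reports when a low surrogate is still pending. */
class Utf16CharReader {
    /** Assumed lower layer: returns the next code point, or -1 at end of stream. */
    interface CodePointSource {
        int nextCodePoint() throws IOException;
    }

    private final CodePointSource source;
    private int pendingLow = -1; // low surrogate already decoded but not yet returned

    Utf16CharReader(CodePointSource source) {
        this.source = source;
    }

    /** Returns a char in the range 0 to 65535, or -1 at EOF. */
    int readChar() throws IOException {
        if (pendingLow >= 0) {
            int low = pendingLow;
            pendingLow = -1;
            return low; // consumes no further bytes from the stream
        }
        int cp = source.nextCodePoint();
        if (cp >= 0 && Character.isSupplementaryCodePoint(cp)) {
            pendingLow = Character.lowSurrogate(cp);
            return Character.highSurrogate(cp);
        }
        return cp;
    }

    /** True if another char is available without consuming more bytes. */
    boolean moreDecoded() {
        return pendingLow >= 0;
    }

    public static void main(String[] args) throws IOException {
        int[] codePoints = {'A', 0x1D11E};
        int[] index = {0};
        Utf16CharReader reader = new Utf16CharReader(
                () -> index[0] < codePoints.length ? codePoints[index[0]++] : -1);
        int c;
        while ((c = reader.readChar()) != -1) {
            System.out.println(Integer.toHexString(c) + " moreDecoded=" + reader.moreDecoded());
        }
    }
}
```

A client would only switch back to reading raw bytes when moreDecoded() is false, which spares it from checking for D800-DBFF itself.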
Stefan Ram - 27 Aug 2006 17:18 GMT > /** @returns a character in the range 0 to 65535, or -1 at EOF > * @param moreDecoded returns true if another character is available [quoted text clipped - 4 lines] >better than forcing the client to check that the returned character is in >the range D800-DBFF. Possibly, the Java SE way would be to implement the following interface.
http://download.java.net/jdk7/docs/api/java/util/Iterator.html
Mike Schilling - 28 Aug 2006 02:24 GMT >> /** @returns a character in the range 0 to 65535, or -1 at EOF >> * @param moreDecoded returns true if another character is available [quoted text clipped - 6 lines] > > Possibly, the Java SE way would be to implement the following interface. I don't think so. An Iterator knows when there's nothing more to return. Part of the assumption here is that the client knows (based on the protocol definition) how many characters to ask for before switching back to binary. There's nothing to tell the Iterator that.
Unless you mean that each call to readChar() returns an Iterator<Character>. But that seems awfully cumbersome, both for the implementation, which has to wrap each decoded char into a Character and then wrap *that* in an Iterator, and for the client which has to unwrap each of those..
Stefan Ram - 28 Aug 2006 03:05 GMT >>>/** @returns a character in the range 0 to 65535, or -1 at EOF >>>* @param moreDecoded returns true if another character is available [quoted text clipped - 6 lines] >definition) how many characters to ask for before switching back to binary. >There's nothing to tell the Iterator that. I have not read the whole thread, but was just responding to what I have quoted. If one wants something like
int readChar( out boolean moreDecoded )
, then this can be done with an iterator.
The »out« means that »readChar« tells its client whether there are more characters by this out-parameter. When you now say that the client already knows how many characters will be coming, you might be talking about something else beyond the scope of my answer. I was just referring to
int readChar( out boolean moreDecoded )
in isolation.
>Unless you mean that each call to readChar() returns an >Iterator<Character>. But that seems awfully cumbersome, both >for the implementation, which has to wrap each decoded char >into a Character and then wrap *that* in an Iterator, and for >the client which has to unwrap each of those.. For a sequence of multiple iterations using an iterator, there is no need to create a new iterator object for each iteration. The same iterator object might be reused using "set" instead of "wrap". This, however, is not possible with java.lang.Character, because it is immutable.
An iterator object also might implement other methods than those of the interface "Iterator" to read information from its client.
Ivan Voras - 27 Aug 2006 16:48 GMT > Based on a previous attempt to use CharsetDecoder "raw", I suggest that you at > least consider not using one at all, but doing your own UTF-8 {en/de}code logic > instead. Agreed. The "sort-of" implementation in DataInputStream is good enough for me to adapt it.
Chris Uppal - 26 Aug 2006 10:43 GMT > > The simplest thing would be to read each message into a byte array and > > convert the bytes in the text portion appropriately (using, for > > instance, an InputStream Reader on top of a ByteArrayInputStream.) > > Rolling my own is not a problem, it just seems like it should belong in > the basic library. It would require a fairly major redesign of the standard library -- there is no way for the current application of the Decorator pattern to express the idea that a decorator is responsible for pushing buffered-but-unused data back onto the underlying stream.
A pity, really. If the design were fixed (and they might as well get buffering and random access fixed too, while they were at it), then many things would become much easier.
-- chris
Dale King - 01 Sep 2006 03:48 GMT > In implementing a network protocol, there's a text (ASCII) phase and a > binary phase. The ideal thing to use would be BufferedReader, but it > doesn't allow reading raw bytes. The next best thing (though slower) > would be DataInputStream, but its readLine() method is deprecated for > silly reasons (IMO). Any other suggestions? In general there is no good way to do this using an InputStreamReader wrapping the raw InputStream. The only way would require your protocol to contain information that tells you how many of the following bytes are part of the ASCII phase. You then read out those bytes, wrap it in a ByteArrayInputStream and InputStreamReader. You cannot just read from the InputStreamReader and then go back to InputStream. The problem is that the character decoder can buffer up a few bytes and read ahead into the InputStream.
You have to look beyond the original I/O classes and look into the new I/O (NIO) classes that were introduced in JDK 1.4. They will be able to handle this better because the buffering can be more explicit. You can use a ByteBuffer from which a CharsetDecoder extracts bytes. But it will not have the problem of reading too far, because it can look into the ByteBuffer without gobbling up the bytes.
See the NIO documentation.
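A small sketch of the NIO idea, with assumptions of my own: the text phase is ASCII lines ending in LF (optionally CRLF), all data is already in the ByteBuffer, and the names NioMixedDemo and readAsciiLine are invented. The line end is located with absolute get(), which does not move the buffer position, and the decoder only sees a duplicate limited to the text bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class NioMixedDemo {
    /**
     * Decodes the bytes up to the next LF as ASCII and leaves buf positioned
     * just after the LF. Returns null if no complete line is buffered yet.
     */
    static String readAsciiLine(ByteBuffer buf) throws CharacterCodingException {
        int end = buf.position();
        while (end < buf.limit() && buf.get(end) != '\n') {
            end++; // absolute get: peeks without consuming
        }
        if (end == buf.limit()) {
            return null; // a real client would read more data into the buffer here
        }
        ByteBuffer text = buf.duplicate(); // shares the bytes, has its own position and limit
        text.limit(end);
        // A fresh decoder reports malformed input, so non-ASCII bytes raise an exception.
        String line = StandardCharsets.US_ASCII.newDecoder().decode(text).toString();
        buf.position(end + 1); // consume the line and its LF, nothing more
        return line.endsWith("\r") ? line.substring(0, line.length() - 1) : line;
    }

    public static void main(String[] args) throws CharacterCodingException {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.put("LEN 2\r\n".getBytes(StandardCharsets.US_ASCII)).put((byte) 0x7F).put((byte) 0x80);
        buf.flip();
        System.out.println(readAsciiLine(buf)); // LEN 2
        System.out.println(buf.get());          // 127  (binary phase starts here)
        System.out.println(buf.get());          // -128
    }
}
```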
-- Dale King