Java Forum / General / November 2005
Dual binary/character streams?
Adam Warner - 06 Nov 2005 09:30 GMT
Hi all,
Suppose a stream contains text and binary data. The text will describe how many bytes to read as binary data before switching back to reading text. It appears Java provides no library upon which to reasonably build this functionality!
Let's make up an example:
"character data" #10 __________"character data continues"
                     ^        ^
                     |10octets|
The token #10 means: read 10 bytes of binary data. Thereafter continue reading characters in the default character set.
An InputStream supports reading binary data. But an InputStreamReader is permitted to act like a BufferedReader: "To enable the efficient conversion of bytes to characters, more bytes may be read ahead from the underlying stream than are necessary to satisfy the current read operation."
Thus an InputStreamReader cannot be relied upon to just read a character. It may read ahead, removing the binary data from the InputStream.
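To make the read-ahead concrete, here is a minimal sketch (the class name ReadAheadDemo and the 11-byte payload are made up for illustration). It reads a single character through an InputStreamReader and then asks the underlying stream how many bytes are still available. The documentation only says the reader may read ahead, so the number printed depends on the JDK in use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class ReadAheadDemo {
    /** Reads one char via InputStreamReader; returns the bytes still available in the raw stream. */
    static int bytesLeftAfterOneChar() throws IOException {
        byte[] data = new byte[11];
        data[0] = 'A'; // one ASCII char; the 10 zero bytes stand in for binary data
        InputStream raw = new ByteArrayInputStream(data);
        Reader reader = new InputStreamReader(raw, StandardCharsets.US_ASCII);
        int c = reader.read(); // ask for exactly one character
        if (c != 'A') throw new IllegalStateException("unexpected char: " + c);
        // Anything the reader buffered internally is no longer visible here.
        return raw.available();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("bytes left for binary reads: " + bytesLeftAfterOneChar());
    }
}
```

If the printed count is below 10, the reader has swallowed bytes that were meant to be read as binary data.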
Is there a character reader for Java that only reads the number of bytes necessary to satisfy a read() request?
Regards, Adam
Roedy Green - 06 Nov 2005 09:39 GMT
On Sun, 06 Nov 2005 22:30:37 +1300, Adam Warner <usenet@consulting.net.nz> wrote, quoted or indirectly quoted someone who said :
>Let's make up an example:
>
> "character data" #10 __________"character data continues"
>                      ^        ^
>                      |10octets|
You use a DataInputStream. You read the binary with readInt, readDouble etc.
You read the character data, presumably 8-bit encoded, as bytes, then convert the byte array to a string using the desired encoding.
// byte[] -> String
String t = new String( b , "Cp1252" /* encoding */ );
If you have control over the stream, you get the person sending it to you to encode the strings in counted UTF-8 format. Then you can read them easily with readUTF.
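A minimal sketch of the counted-string approach, assuming both ends are Java and agree on the layout (the class name CountedUtfDemo and the choice of an int length prefix for the binary part are illustrative assumptions, not part of the original format):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

class CountedUtfDemo {
    /** Writes text followed by counted binary data, reads both back, and reports what was read. */
    static String roundTrip(String text, byte[] binary) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);
        out.writeUTF(text);          // 2-byte length prefix, then modified UTF-8
        out.writeInt(binary.length); // explicit byte count, big-endian
        out.write(binary);
        out.flush();

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(baos.toByteArray()));
        String textBack = in.readUTF();        // consumes exactly the counted bytes
        byte[] binaryBack = new byte[in.readInt()];
        in.readFully(binaryBack);              // blocks until all bytes are read
        return textBack + "/" + Arrays.toString(binaryBack);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("h\u00e9llo", new byte[] {1, 2, 3}));
    }
}
```

Because every field carries its own length, the reader never has to guess where text stops and binary starts.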
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Roedy Green - 06 Nov 2005 09:42 GMT
On Sun, 06 Nov 2005 09:39:05 GMT, Roedy Green <my_email_is_posted_on_my_website@munged.invalid> wrote, quoted or indirectly quoted someone who said :
>> "character data" #10 __________"character data continues"
>>                      ^        ^
>>                      |10octets|
>
>you use a DataInputStream. You read the binary with readInt
>readDouble etc.
If these are little-endian, use LEDataInputStream; see http://mindprod.com/products1.html#LEDATASTREAM
or talk the other end into generating the binary in network order (big-endian).
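For a quick illustration of the byte-order issue without a third-party class, the standard java.nio.ByteBuffer can decode either order. This is a sketch using a made-up class name (EndianDemo); it is not the LEDataInputStream mentioned above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianDemo {
    /** Interprets four bytes as a little-endian int. */
    static int littleEndianInt(byte[] four) {
        return ByteBuffer.wrap(four).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    /** Interprets four bytes in network order; big-endian is ByteBuffer's default. */
    static int bigEndianInt(byte[] four) {
        return ByteBuffer.wrap(four).getInt();
    }

    public static void main(String[] args) {
        byte[] raw = {0x78, 0x56, 0x34, 0x12};
        System.out.println(Integer.toHexString(littleEndianInt(raw)));
        System.out.println(Integer.toHexString(bigEndianInt(raw)));
    }
}
```

DataInputStream.readInt always uses the big-endian interpretation, which is why little-endian data needs separate handling.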
Adam Warner - 06 Nov 2005 11:20 GMT
> On Sun, 06 Nov 2005 22:30:37 +1300, Adam Warner
> <usenet@consulting.net.nz> wrote, quoted or indirectly quoted someone
[quoted text clipped - 11 lines]
> You read the character data, presumably 8-bit encoded as bytes. then
> convert the byte array to a string using the desired encoding.
Thanks for the suggestion Roedy. I'm attempting to avoid any presumption about the default character set (it could for example be UTF-8 or UTF-16) so this isn't a general solution.
> // byte[] -> String
> String t = new String( b , "Cp1252" /* encoding */ );
At this point one doesn't know where the characters terminate and the binary data begins.
One could say InputStreamReader is missing a readByte() method. It is permitted to read ahead bytes yet it provides no way to access those subsequent bytes.
By reading ahead and not providing a readByte method the Java standard library appears to provide no reasonable way to decode a char (in an arbitrary character encoding) within a binary stream while preserving the rest of the binary data.
> If you have control over the stream, you get the person sending it you
> you to encode the strings in counted UTF-8 format. Then you can read
> them easily with readUTF.
The character encoding is not fixed. And readUTF is Java-specific junk. It's impressive how Sun managed to come up with a way to waste 50% more space than four-byte encoded UTF-8 code points.
Regards, Adam
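The size complaint about readUTF/writeUTF can be checked directly. The sketch below (class name ModifiedUtf8SizeDemo is made up) compares the payload written by DataOutputStream.writeUTF, which encodes each surrogate separately, with standard UTF-8 for one supplementary code point.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8SizeDemo {
    /** Payload size written by writeUTF, excluding its 2-byte length prefix. */
    static int modifiedUtf8Length(String s) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        new DataOutputStream(baos).writeUTF(s);
        return baos.size() - 2;
    }

    /** Size of the same string in standard UTF-8. */
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) throws IOException {
        String s = new String(Character.toChars(0x10302)); // one code point outside the BMP
        System.out.println("writeUTF payload bytes: " + modifiedUtf8Length(s));
        System.out.println("standard UTF-8 bytes:   " + standardUtf8Length(s));
    }
}
```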
Roedy Green - 06 Nov 2005 11:50 GMT
On Mon, 07 Nov 2005 00:20:29 +1300, Adam Warner <usenet@consulting.net.nz> wrote, quoted or indirectly quoted someone who said :
>At this point one doesn't know where the characters terminate and the
>binary data begins.
If you can't tell, your protocol is broken. You will have to do something to fix it. I suggest using counted UTF strings.
Maybe you mean you have to KNOW the lengths in your code to read the stream, that they are not embedded in the stream and there is no format description in the stream.
There is no way you can process a stream without knowing the encoding. The encoding may be 7-bit ASCII, but you still have to know what it is.
You can COPY such a stream, but you can't process it.
The beauty of UTF-8 is that it works for any platform and you don't have to customize it for different locales.
If this stream is a legacy, and you can't change its format at all, and this stream was actually read and processed at one point in history, there must be some hidden assumptions you can take advantage of. e.g. null terminated strings, the encoding used, fixed lengths of fields, a file header ...
Adam Warner - 06 Nov 2005 12:46 GMT
> On Mon, 07 Nov 2005 00:20:29 +1300, Adam Warner
> <usenet@consulting.net.nz> wrote, quoted or indirectly quoted someone
[quoted text clipped - 4 lines]
>
> If you can't tell, your protocol is broken.
No. You are simply unable to solve the stated issue: "By reading ahead and not providing a readByte method the Java standard library appears to provide no reasonable way to decode a char (in an arbitrary character encoding) within a binary stream while preserving the rest of the binary data."
> You will have to do something to fix it. I suggest using counted UTF
> strings.
[quoted text clipped - 5 lines]
> There is no way you can process a stream without knowing the encoding.
> The encoding may be 7-bit ASCII, but you still have to know what it is.
This is not the issue. InputStreamReader has a default encoding and a named encoding can also be specified. Unfortunately it may read extra bytes from the underlying binary stream without providing a way to access them as binary data.
> You can COPY such a stream, but you can't process it.
>
[quoted text clipped - 6 lines]
> null terminated strings, the encoding used, fixed lengths of fields, a
> file header ...
There is no hidden assumption. The decoding information is contained in the character stream. Conceptually it's a type of bivalent stream: <http://www.franz.com/support/documentation/7.0/doc/socket.htm#socket-characteristics-1> ("Bivalent means that the stream will accept text and binary stream functions. That is, you can write-byte or write-char, read-byte or read-char.")
The protocol is: Decode and interpret a string token. The interpretation of the token determines whether the next datum in the stream will be read as a character or a byte.
Given the specification of InputStreamReader this protocol appears to be difficult to implement. A simple solution is unlikely.
Regards, Adam
Chris Uppal - 06 Nov 2005 13:33 GMT
> One could say InputStreamReader is missing a readByte() method. It is
> permitted to read ahead bytes yet it provides no way to access
> those subsequent bytes.
One other problem -- more than just being unable to retrieve bytes that it has read ahead -- is that those bytes might form an invalid or illegal sequence for the given encoding. Logically it should not throw an error until it was asked for the "character" at the illegal position, but I bet it's not implemented that way.
-- chris
Chris Uppal - 06 Nov 2005 12:43 GMT
> Suppose a stream contains text and binary data. The text will describe how
> many bytes to read as binary data before switching back to reading text.
> It appears Java provides no library upon which to reasonably build this
> functionality!
Does your format have a reliable way of spotting the end of a stream of character data /without/ decoding it? E.g. in HTTP the headers can specify the length of the (binary) body, but the headers can be separated reliably from the body before they are decoded. Or, failing that, is there a hard limit to how many bytes of character data are allowed in one "chunk" (so that you can make a copy of that data and decode it independently)?
If not then the format is rather awkwardly designed, and you will have to mess around with more complicated code to unravel it character-by-character. I suggest using a java.nio.charset.CharsetDecoder directly.
BTW, since you will have to work character-by-character, even if you were able to use a stock InputStreamReader (if it didn't read ahead), it wouldn't be buying you much at all compared with using your own CharsetDecoder.
BTW2. don't forget that Unicode characters, unlike Java chars, are not limited to 16bits. So one logical character of input may require two actual chars of output.
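A small sketch of that last point, using only java.lang.Character (the class name SurrogateDemo is made up for illustration): a code point above U+FFFF occupies two Java chars, while one inside the Basic Multilingual Plane occupies one.

```java
class SurrogateDemo {
    /** Number of Java chars (UTF-16 code units) needed for one code point. */
    static int charsFor(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10302));
        System.out.println("chars for U+4E8C:  " + charsFor(0x4E8C));
        System.out.println("chars for U+10302: " + charsFor(0x10302));
        System.out.println("String.length():   " + s.length());
        System.out.println("codePointCount():  " + s.codePointCount(0, s.length()));
    }
}
```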
-- chris
Adam Warner - 06 Nov 2005 13:42 GMT
>> Suppose a stream contains text and binary data. The text will describe
>> how many bytes to read as binary data before switching back to reading
[quoted text clipped - 3 lines]
> Does your format have a reliable way of spotting the end of a stream of
> character data /without/ decoding it ?
No. While I can come up with a different format (e.g. encoding the binary data in base 64) I'd like to solve the problem as specified.
> E.g. in HTTP the headers can specify the length of the (binary) body,
> but the headers can be separated reliably from the body before they are
> decoded. Or, failing that, is there a hard limit to how many bytes of
> character data are allowed in one "chunk" (so that you can make a copy
> of that data and decode it independently) ?
Since I'll be supporting arbitrary precision integers I guess the character data is effectively unlimited.
> If not then the format is rather awkwardly designed, and you will have
> to mess around with more complicated code to unravel it
> character-by-character I suggest using a
> java.nio.charset.CharsetDecoder directly.
The NIO could be helpful. But I still wouldn't know where to cut off a chunk from the stream without potentially splitting a character and breaking the decoding.
> BTW, since you will have to work character-by-character, even if you
> were able to use a stock InputStreamReader (if it didn't read ahead), it
[quoted text clipped - 4 lines]
> limited to 16bits. So one logical character of input may require two
> actual chars of output.
Indeed. Java chars are not only sufficient for building code points but also serve as input for decoding graphemes via IBM's ICU4J library: <http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/text/BreakIterator.html>
Thanks for the ideas Chris.
Regards, Adam
Chris Uppal - 06 Nov 2005 17:17 GMT
> No. While I can come up with a different format (e.g. encoding the binary
> data in base 64) I'd like to solve the problem as specified.
Hmm. I'm starting to think that you might want to take that option...
> > If not then the format is rather awkwardly designed, and you will have
> > to mess around with more complicated code to unravel it
[quoted text clipped - 4 lines]
> chunk from the stream without potentially splitting a character and
> breaking the decoding.
What I had in mind was a simple loop where, at each step, you feed 1 byte to the CharsetDecoder and get back 0, 1, or 2 chars.
Unfortunately I was wrong. Although the documentation doesn't say so, and although the design is clearly set up to be used like that, it doesn't work. At least the UTF-8 decoder doesn't work if used like that. It doesn't retain enough state to remember that it has seen the start of an encoded character, and so it cannot be trusted to decode successfully across buffer boundaries (I don't know whether that's a bug or simply that it isn't expected to be able to do so). So I think that the loop has to look more like
0) clear a small buffer
1) get the next byte
2) append it to the small buffer
3) attempt to decode that into up to 2 chars
4) if that works[*] then process the chars and goto (0)
5) goto (1)
and that -- when expressed using the magic of nio ByteBuffers and CharBuffers -- looks as if it'd be extremely messy...
([*] by "works" I mean produces at least 1 char)
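The loop above can be sketched roughly as follows. This is one possible reading, not code from the thread: the class name ByteAtATimeDecoder and the MAX_BYTES bound are assumptions, and the sketch passes endOfInput=false to decode() so that an incomplete sequence shows up as a plain underflow with no chars produced. The decoder is reset before every attempt and the whole small buffer is re-decoded, so nothing relies on state carried between calls. Because bytes are pulled one at a time with InputStream.read(), nothing beyond the current character is consumed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

class ByteAtATimeDecoder {
    // Assumed upper bound on the encoded length of one character.
    private static final int MAX_BYTES = 8;

    /** Returns the next code point, or -1 if the stream is already exhausted. */
    static int readCodePoint(InputStream in, CharsetDecoder decoder) throws IOException {
        byte[] small = new byte[MAX_BYTES];          // step 0: fresh small buffer
        CharBuffer out = CharBuffer.allocate(2);     // room for a surrogate pair
        int count = 0;
        while (count < MAX_BYTES) {
            int b = in.read();                       // step 1: next byte, no read-ahead
            if (b == -1) {
                if (count == 0) return -1;
                throw new CharacterCodingException(); // stream ended mid-character
            }
            small[count++] = (byte) b;               // step 2: append
            decoder.reset();
            out.clear();
            // step 3: try to decode everything collected so far
            CoderResult r = decoder.decode(ByteBuffer.wrap(small, 0, count), out, false);
            if (r.isError()) r.throwException();     // malformed or unmappable: give up
            if (out.position() > 0) {                // step 4: at least one char produced
                return Character.codePointAt(out.array(), 0, out.position());
            }
        }                                            // step 5: otherwise fetch another byte
        throw new CharacterCodingException();
    }
}
```

After readCodePoint returns, the same InputStream can be read directly for binary data, since the decoder never saw more bytes than the character needed.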
-- chris
Adam Warner - 07 Nov 2005 07:35 GMT
>> No. While I can come up with a different format (e.g. encoding the
>> binary data in base 64) I'd like to solve the problem as specified.
[quoted text clipped - 12 lines]
> What I had in mind was a simple loop where, at each step, you feed 1
> byte to the CharsetDecoder and get back 0, 1, or 2 chars.
Nice idea that's unfortunately necessary because CharsetDecoder omits a decodeChar() method.
> Unfortunately I was wrong. Although the documentation doesn't say so,
> and although the design is clearly set up to be used like that, it
[quoted text clipped - 3 lines]
> successfully across buffer boundaries (I don't know whether that's a bug
> or simply that it isn't expected to be able to do so).
I've worked around this.
> So I think that the loop has to look more like
>
[quoted text clipped - 9 lines]
>
> ([*] by "works" I mean produces at least 1 char)
My approach is similar. I fill a byte buffer (currently 1024 bytes) using a bulk read operation. Without any further copying I supply successive windows of the byte buffer to the charset decoder which places the decoded result into a char buffer of size 2. If there is no result the byte buffer position is reset to its previous value and the buffer limit is increased by 1 (leading to a visible byte window of 1, 2, 3, ... bytes). If the limit exceeds the size of the byte buffer then the few bytes yet to be decoded are copied back to the start of the byte buffer and the rest of the byte buffer is filled via another bulk read operation. Eventually a character is read.
There will be bugs in the implementation below. I have successfully run a test class that includes a selection of Unicode code points and 256 bytes of binary data. To execute:
javac BivalentInputStream.java BivalentInputStreamTest.java && java BivalentInputStreamTest
Many thanks for the feedback Chris.
Regards, Adam
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

public class BivalentInputStream {
  public static int bufSize=1024;
  private InputStream in;
  private ByteBuffer bb=ByteBuffer.allocate(bufSize);
  private byte[] ba=bb.array();
  private int maxLimit;
  private CharsetDecoder decoder;
  private CharBuffer cb=CharBuffer.allocate(2); //support surrogate chars
  private char[] ca=cb.array();

  /** @return The number of bytes read into the buffer. */
  private int saneBulkRead(byte[] b, int offset) throws java.io.IOException {
    int brokenNumBytesRead=in.read(b, offset, b.length-offset);
    if (brokenNumBytesRead==-1) return 0;
    return brokenNumBytesRead;
  }

  public BivalentInputStream(InputStream in) throws java.io.IOException {
    this.in=in;
    maxLimit=saneBulkRead(ba, 0);
    bb.limit(1);
    decoder=Charset.defaultCharset().newDecoder();
  }

  public BivalentInputStream(InputStream in, Charset cs) throws java.io.IOException {
    this.in=in;
    maxLimit=saneBulkRead(ba, 0);
    bb.limit(1);
    this.decoder=cs.newDecoder();
  }

  private char cachedSurrogate;
  private boolean storedSurrogate=false;

  /** @return '\uFFFF' if the stream is exhausted or the remaining bytes do not comprise a 16-bit char. */
  public char readChar() throws java.io.IOException {
    if (storedSurrogate==true) {
      storedSurrogate=false;
      return cachedSurrogate;
    }
    int codePoint=readCodePoint();
    if (codePoint==-1) return '\uFFFF';
    if (codePoint>0xFFFF) {
      char[] chars=Character.toChars(codePoint);
      storedSurrogate=true;
      cachedSurrogate=chars[1];
      return chars[0];
    }
    return (char) codePoint;
  }

  /** @return -1 if the stream is exhausted or the remaining bytes do not comprise a Unicode code point. */
  public int readCodePoint() throws java.io.IOException {
    //Buffer refill logic
    if (bb.position()==maxLimit) {
      if (maxLimit==0) return -1;
      //refill the byte buffer after moving the remaining bytes up to position 0
      int remainingBytes=maxLimit-bb.position();
      System.arraycopy(ba, bb.position(), ba, 0, remainingBytes);
      maxLimit=saneBulkRead(ba, remainingBytes);
      if (maxLimit==0) return -1; //remaining bytes do not comprise a code point
      maxLimit+=remainingBytes;
      bb.position(0);
      bb.limit(remainingBytes+1);
    }
    cb.position(0);
    int bbStartPos=bb.position();
    decoder.reset();
    CoderResult result=decoder.decode(bb, cb, true);
    decoder.flush(cb);
    if (result==CoderResult.UNDERFLOW) {
      if (bb.limit()<maxLimit) bb.limit(bb.limit()+1);
      return Character.codePointAt(ca, 0);
    }
    bb.position(bbStartPos);
    bb.limit(bb.limit()+1);
    return readCodePoint();
  }

  /** @return -1 if the stream is exhausted. */
  public int readByte() throws java.io.IOException {
    if (bb.position()==maxLimit) {
      if (maxLimit==0) return -1;
      //refill the byte buffer
      maxLimit=saneBulkRead(ba, 0);
      bb.position(0);
      bb.limit(1);
    }
    if (bb.limit()<maxLimit) bb.limit(bb.limit()+1);
    return ((int) bb.get()) & 0xFF;
  }
}
//////////////////////////////////////////////////////////////////////////////
import java.io.*;

public class BivalentInputStreamTest {
  static int numCharUnits=0;

  public static byte[] buildTestArray() throws java.io.IOException {
    ByteArrayOutputStream baos=new ByteArrayOutputStream();
    DataOutputStream dos=new DataOutputStream(baos);
    //write code points
    String intro="Hello, World";
    dos.writeChars(intro);
    numCharUnits+=intro.length();
    for (int i=0; i<0x110000; i+=128) {
      //avoid writing lone surrogates
      if (((i>=0xD800 && i<=0xDBFF) || (i>=0xDC00 && i<=0xDFFF))!=true) {
        char[] chars=Character.toChars(i);
        dos.writeChars(new String(chars));
        numCharUnits+=chars.length;
      }
    }
    //write binary data
    for (int i=0; i<256; ++i) {
      dos.writeByte(i);
    }
    dos.flush();
    dos.close();
    return baos.toByteArray();
  }

  public static void printByteArrayDifferences(byte[] array1, byte[] array2) {
    System.out.println("array1.length="+array1.length+
                       "; array2.length="+array2.length);
    byte[] smaller=array1, larger=array2;
    if (array1.length>array2.length) {
      smaller=array2;
      larger=array1;
    }
    for(int i=0; i<smaller.length; ++i) {
      if (array1[i]!=array2[i])
        System.out.println("position "+i+": "+(((int) array1[i]) & 0xFF)+
                           " "+(((int) array2[i]) & 0xFF));
    }
    for (int i=smaller.length; i<larger.length; ++i) {
      System.out.println("position "+i+": "+(((int) larger[i]) & 0xFF));
    }
  }

  public static void main(String[] args) throws java.io.IOException {
    byte[] ba=buildTestArray();
    ByteArrayInputStream bais=new ByteArrayInputStream(ba);
    BivalentInputStream in=new BivalentInputStream(bais, java.nio.charset.Charset.forName("UTF-16"));
    ByteArrayOutputStream baos=new ByteArrayOutputStream();
    DataOutputStream dos=new DataOutputStream(baos);
    //read char data (for testing purposes using the stored number of char units)
    for (int i=0; i<numCharUnits; ++i) {
      char c=in.readChar();
      dos.writeChar(c);
    }
    //read binary data
    for (int i=0; i<256; ++i) {
      dos.writeByte(in.readByte());
    }
    //Compare the arrays by content (ba.equals(newBA) would only compare references)
    dos.flush();
    dos.close();
    byte[] newBA=baos.toByteArray();
    if (java.util.Arrays.equals(ba, newBA)!=true) printByteArrayDifferences(ba, newBA);
  }
}
Chris Uppal - 07 Nov 2005 11:59 GMT
> There will be bugs in the implementation below.
You might like a couple of test inputs. The following byte array defines a sequence of 4 Unicode code points, or 5 Java chars (sorry about the layout mangling).
Charset utf8 = Charset.forName("UTF-8");
byte[] bytes = new byte[] {
    0x32,                                           // = U+000032
    (byte)0xD0, (byte)0xB0,                         // = U+000430
    (byte)0xE4, (byte)0xBA, (byte)0x8C,             // = U+004E8C
    (byte)0xF0, (byte)0x90, (byte)0x8C, (byte)0x82  // = U+010302
};
Also this sequence defines an /invalid/ UTF-8 sequence:
byte[] bytes = new byte[] {
    (byte)0xB0, (byte)0xD0                          // = invalid
};
A couple of comments, if you want 'em:
> private int saneBulkRead(byte[] b, int offset) throws java.io.IOException {
> int brokenNumBytesRead=in.read(b, offset,
I see you prefer self-documenting code ;-) Nice...
> public int readCodePoint() throws java.io.IOException {
> [...]
[quoted text clipped - 6 lines]
> return readCodePoint();
> }
If the input data is mangled, then 'result' will be isMalformed() and no amount of extra data added to the end will fix it, so in that case the recursion will continue more-or-less indefinitely.
I /think/ you may also have a problem with the bb.limit(..) line. It assumes that there is enough space in bb which I don't think is necessarily the case.
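The isMalformed() point can be illustrated with a small sketch (the class name MalformedDemo is made up). It also shows one way to tell the two cases apart, which is a suggestion rather than anything from the posted code: with endOfInput=false an incomplete but valid prefix reports underflow, whereas the invalid sequence reports a malformed-input error no matter how many bytes follow.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class MalformedDemo {
    /** Decodes the bytes with a fresh UTF-8 decoder and returns the raw CoderResult. */
    static CoderResult decode(byte[] bytes, boolean endOfInput) {
        CharsetDecoder d = StandardCharsets.UTF_8.newDecoder(); // reports errors by default
        return d.decode(ByteBuffer.wrap(bytes), CharBuffer.allocate(4), endOfInput);
    }

    public static void main(String[] args) {
        byte[] invalid = {(byte) 0xB0, (byte) 0xD0};         // continuation byte first
        byte[] invalidPlus = {(byte) 0xB0, (byte) 0xD0, 0x41, 0x42};
        byte[] incomplete = {(byte) 0xD0};                    // valid lead byte, tail missing

        System.out.println("invalid:                  " + decode(invalid, true));
        System.out.println("invalid + more bytes:     " + decode(invalidPlus, true));
        System.out.println("incomplete, end of input: " + decode(incomplete, true));
        System.out.println("incomplete, more to come: " + decode(incomplete, false));
    }
}
```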
-- chris
Adam Warner - 07 Nov 2005 22:33 GMT
>> There will be bugs in the implementation below.
>
> You might like a couple of test inputs, The following byte array defines
> a sequence of 4 Unicode code points, or 5 Java chars (sorry about the
> layout mangling).
Many thanks. I do have to improve handling of malformed data.
> A couple of comments, if you want 'em:
>
[quoted text clipped - 4 lines]
>
> I see you prefer self-documenting code ;-) Nice...
An Enterprise API isn't complete until the documentation for x.plus(y) reads:
/** @return The sum of x and y, unless the sum is 42 then -1 is returned. */
java.io.InputStream.read(byte[] b, int off, int len) returns the number of bytes written to the byte array. Except when it doesn't. A better language would support seamless multiple return values and their efficient implementation. If Java had multiple return values the first return value for this method could simply be the number of bytes written to the byte array. The second return value, to be optionally captured, could be a boolean denoting the end of stream. Instead of conflating two return values there could also be a separate isEndofStream() method.
As JVMs become capable of stack allocating many new objects via escape analysis there's potential for the efficient return of multiple values within an explicit new array. If Java the language is changed to support seamless multiple return values (like the recent introduction of variable arguments on the input side) then more consistent libraries are likely.
Regards, Adam
Roedy Green - 08 Nov 2005 02:57 GMT
On Tue, 08 Nov 2005 11:33:40 +1300, Adam Warner <usenet@consulting.net.nz> wrote, quoted or indirectly quoted someone who said :
>As JVMs become capable of stack allocating many new objects via escape
>analysis there's potential for the efficient return of multiple values
>within an explicit new array. If Java the language is changed to support
>seamless multiple return values (like the recent introduction of variable
>arguments on the input side) then more consistent libraries are likely.
Java the language is fine. The Jet people automatically allocate some objects on the stack. Allocating objects there would likely require an overhaul of the JVM.
Roedy Green - 06 Nov 2005 23:14 GMT
On Mon, 07 Nov 2005 02:42:44 +1300, Adam Warner <usenet@consulting.net.nz> wrote, quoted or indirectly quoted someone who said :
>> Does your format have a reliable way of spotting the end of a stream of
>> character data /without/ decoding it ?
>
>No. While I can come up with a different format (e.g. encoding the binary
>data in base 64) I'd like to solve the problem as specified.
You say you CAN tell the end in the DECODED stream but not in the byte stream. How do you notice the end in the DECODED stream?
Adam Warner - 07 Nov 2005 08:15 GMT
> On Mon, 07 Nov 2005 02:42:44 +1300, Adam Warner
> <usenet@consulting.net.nz> wrote, quoted or indirectly quoted someone
[quoted text clipped - 8 lines]
> You say you CAN tell the end in the DECODED stream but not in the byte
> stream. How do you notice the end in the DECODED stream?
If I call readCodePoint() upon a BivalentInputStream with valid character data then a Unicode code point is returned or -1 to signal the end of the stream. This is the first way of noticing the end of the decoded stream.
Alternatively I could decide that a newline code point terminates the end of decoding. Again this is easy to detect.
More complicated protocols are possible. A programming language could provide syntax to switch to binary decoding to reduce overhead when transferring code and data over a network. A kind of Binary XML could use this approach to switch to binary encoding. A tag such as <binary octets="12345"/> could read 12345 octets of binary data immediately following the closing > before switching back to reading text.
A bivalent approach avoids the overhead of encoding binary data in the current character set and the high CPU burden of compressing that data for transmission and decompressing it again at the other end and finally translating the characters back to binary data. One clearly needs control over the whole communication process because the transformed data is unlikely to be legal text unless the character set is a legacy encoding such as ISO-8859-1. And even if the resulting text is legal the binary data will be corrupted by different operating system newline conventions.
Regards, Adam
Roedy Green - 07 Nov 2005 09:08 GMT
On Mon, 07 Nov 2005 21:15:06 +1300, Adam Warner <usenet@consulting.net.nz> wrote, quoted or indirectly quoted someone who said :
> BivalentInputStream
I am not familiar with that class. Further I have never heard the term bivalent used outside the chemistry or genetics contexts.
What do you mean by "bivalent" in terms of datastreams? Do you just mean having two different encodings, e.g. encoded char and binary and some mechanism to toggle?
Roedy Green - 07 Nov 2005 08:23 GMT
On Mon, 07 Nov 2005 02:42:44 +1300, Adam Warner <usenet@consulting.net.nz> wrote, quoted or indirectly quoted someone who said :
>No. While I can come up with a different format (e.g. encoding the binary
>data in base 64) I'd like to solve the problem as specified.
If you use counted UTF, the problem goes away. You don't have a slow Mickey Mouse solution. The String is handled with equal ease to any binary field. Why goof around with baling wire?
See DataOutputStream.writeUTF and DataInputStream.readUTF.
Knute Johnson - 06 Nov 2005 18:28 GMT
> Hi all,
>
[quoted text clipped - 25 lines]
> Regards,
> Adam
I don't know why anybody would create a data file in this format but you are going to have to read it with an InputStream, not a Reader. So the answer to your question is no! There must be some method of determining when you have found a 'binary is coming' tag or nobody could decode this data. Use an InputStream and look for the tag, collect your data and proceed. What are you going to do with the binary data? Is it images or something like that? Or is it going to be converted to characters too?
 Signature Knute Johnson email s/nospam/knute/
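A rough sketch of the "use an InputStream and look for the tag" idea follows. It leans on assumptions the original poster explicitly did not want to make: the text is in an ASCII-compatible encoding (UTF-8 here), the byte '#' never occurs inside ordinary text, and the tag has the hypothetical shape "#<digits>" followed by exactly one delimiter byte. The class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class TagScanner {
    /**
     * Appends the text before the next tag to textOut and returns the binary
     * bytes announced by the tag, or null if the stream ends without a tag.
     */
    static byte[] readBinaryAfterTag(InputStream in, StringBuilder textOut) throws IOException {
        ByteArrayOutputStream textBytes = new ByteArrayOutputStream();
        int b;
        // Collect raw text bytes up to the tag marker; decode only afterwards.
        while ((b = in.read()) != -1 && b != '#') {
            textBytes.write(b);
        }
        textOut.append(new String(textBytes.toByteArray(), StandardCharsets.UTF_8));
        if (b == -1) return null;

        // Parse the decimal byte count; the loop also consumes the one delimiter byte.
        int count = 0;
        while ((b = in.read()) >= '0' && b <= '9') {
            count = count * 10 + (b - '0');
        }
        byte[] binary = new byte[count];
        new DataInputStream(in).readFully(binary); // DataInputStream does not buffer ahead
        return binary;
    }
}
```

Since the stream is only ever read one byte at a time or by exact count, the caller can resume reading text from the same InputStream right after the binary block.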