Java Forum / General / September 2006

Mixing text and binary I/O

Ivan Voras - 25 Aug 2006 22:43 GMT
In implementing a network protocol, there's a text (ASCII) phase and a
binary phase. The ideal thing to use would be BufferedReader, but it
doesn't allow reading raw bytes. The next best thing (though slower)
would be DataInputStream, but its readLine() method is deprecated for
silly reasons (IMO). Any other suggestions?
Mike Schilling - 25 Aug 2006 23:11 GMT
> In implementing a network protocol, there's a text (ASCII) phase and a
> binary phase. The ideal thing to use would be BufferedReader, but it
> doesn't allow reading raw bytes. The next best thing (though slower)
> would be DataInputStream, but its readLine() method is deprecated for
> silly reasons (IMO). Any other suggestions?

The simplest thing would be to read each message into a byte array and
convert the bytes in the text portion appropriately (using, for instance, an
InputStreamReader on top of a ByteArrayInputStream).
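
One way this suggestion might look in code. This is only a sketch under an assumed framing that the thread does not specify: a 4-byte big-endian length, then that many bytes of ASCII text, then the binary payload. The names FramedMessage and readTextPart are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical framing (an assumption, not from the thread):
// 4-byte big-endian length, that many bytes of ASCII text, then binary data.
final class FramedMessage {
    /** Reads the text portion; afterwards 'in' is positioned at the first binary byte. */
    static String readTextPart(DataInputStream in) throws IOException {
        int textLength = in.readInt();
        byte[] text = new byte[textLength];
        in.readFully(text); // consumes exactly the text bytes, nothing more

        // Decode from the private copy, so reader buffering never touches 'in'.
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(text), StandardCharsets.US_ASCII));
        StringBuilder lines = new StringBuilder();
        for (String line; (line = reader.readLine()) != null; ) {
            lines.append(line).append('\n'); // line ends normalised to LF
        }
        return lines.toString();
    }
}
```

Because the reader only ever sees the copied array, any read-ahead it does is harmless, and the caller can keep using the same DataInputStream for the binary phase.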
Ivan Voras - 26 Aug 2006 00:12 GMT
Mike Schilling wrote:

> The simplest thing would be to read each message into a byte array and
> convert the bytes in the text portion appropriately (using, for instance, an
> InputStreamReader on top of a ByteArrayInputStream).

Rolling my own is not a problem, it just seems like it should belong in
the basic library.
Mike Schilling - 26 Aug 2006 01:01 GMT
> Mike Schilling wrote:
>
[quoted text clipped - 5 lines]
> Rolling my own is not a problem, it just seems like it should belong in
> the basic library.

You'd need to define a standard way to express the boundary between the
binary and text portions.
frankgerlach@gmail.com - 26 Aug 2006 01:06 GMT
When converting a java char or String to bytes, you should *always*
specify the encoding, which can be "UTF-8", "ISO-8859-1", "ASCII" etc.
*Never*  use the default encoding - this is system dependent. Use
String.getBytes("ASCII"), do not use String.getBytes() !
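
A small sketch of the same advice. It assumes Java 7 or later, where java.nio.charset.StandardCharsets is available; that class postdates this thread, and on older JDKs the string form String.getBytes("US-ASCII") does the same job. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class ExplicitEncoding {
    /** Encodes with a named charset, so the result is the same on every platform. */
    static byte[] asciiBytes(String s) {
        // The Charset overload throws no checked exception.
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /** Decodes with the same named charset rather than the platform default. */
    static String fromAscii(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }
}
```

Note that characters outside ASCII are silently replaced during encoding, so a protocol that must reject them needs its own check.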
Ivan Voras - 26 Aug 2006 19:05 GMT
Mike Schilling wrote:

> You'd need to define a standard way to express the boundary between the
> binary and text portions.

Um, "this byte position here" (i.e. ftell()) is good enough, no need to
overengineer it. :)

In case of complex encodings like UTF-8, I'd expect (and will probably
create for my case) its behaviour to be like this:

- Backed by a buffer (the usual way, probably byte[])
- readByte() reads from the buffer, handles buffering of new data, etc.
- readChar() reads as many bytes as it needs to reconstitute a
character; in the case of UTF-8 that could be one or several, it doesn't
matter. If it encounters a byte that is invalid for the encoding in use,
it raises an appropriate exception, because that is an encoding error in
the stream.
- Introduce private or protected pushByte() and pushChar() that do the
reverse of readXXX, on the buffer. "Fixup" the fact that one character
can have more bytes by initially making the buffer 4+ bytes longer, but
don't use this extra space when filling the buffer in readByte(). Like
in C, make pushXXX work only for a single byte/character.
- Modify readLine() to use readChar(), reads characters until CR+LF; can
use existing logic that reads one char after CR to see if it's LF and
push it back if it isn't.
- Every other readXXX method uses readByte() as usual.

The intended result: freely mix bytes and characters. In the extreme
(but supported!) case, the stream can have a UTF-8 character (encoded by
one or several bytes) followed by a "raw" byte, followed by a UTF-8
character, etc. The programmer is responsible for knowing how the stream
is formatted.
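
A rough sketch of the readByte()/readChar() part of this design, limited to UTF-8. It is not the poster's code: the class name MixedInput is made up, the method returns a whole code point instead of a Java char, the push-back and readLine() parts are left out, and overlong or surrogate encodings are not rejected.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UTFDataFormatException;

final class MixedInput {
    private final InputStream in; // any byte source; buffering is left to it

    MixedInput(InputStream in) { this.in = in; }

    /** Returns the next raw byte as 0..255; throws EOFException at end of stream. */
    int readByte() throws IOException {
        int b = in.read();
        if (b < 0) throw new EOFException();
        return b;
    }

    /** Decodes exactly one UTF-8 sequence (1 to 4 bytes) and returns its code point. */
    int readCodePoint() throws IOException {
        int first = readByte();
        if (first < 0x80) return first; // single-byte (ASCII) case
        int extra;
        int cp;
        if ((first & 0xE0) == 0xC0)      { extra = 1; cp = first & 0x1F; }
        else if ((first & 0xF0) == 0xE0) { extra = 2; cp = first & 0x0F; }
        else if ((first & 0xF8) == 0xF0) { extra = 3; cp = first & 0x07; }
        else throw new UTFDataFormatException("invalid lead byte");
        for (int i = 0; i < extra; i++) {
            int next = readByte();
            if ((next & 0xC0) != 0x80) {
                throw new UTFDataFormatException("invalid continuation byte");
            }
            cp = (cp << 6) | (next & 0x3F); // append 6 payload bits
        }
        return cp; // simplification: overlong forms and surrogates are not rejected
    }
}
```

Since each call consumes only the bytes of one sequence, calls to readByte() and readCodePoint() can be interleaved in any order the protocol dictates.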
Soren Kuula - 26 Aug 2006 22:23 GMT
> Mike Schilling wrote:

> In case of complex encodings like UTF-8, I'd expect (and will probably
> create for my case) its behaviour to be like this:
[quoted text clipped - 6 lines]
> used encoding), raise proper exception because it's an encoding error in
> the stream.

I think that java.nio.CharsetEncoder and CharsetDecoder do just that.

BTW, I agree with Frank that you should take character encoding
seriously!! Do not assume anything, and do not use defaults. Otherwise,
you will end up with something that never really works -- in other
places than yours, on other computers than yours.

Søren
Mike Schilling - 27 Aug 2006 07:51 GMT
> In case of complex encodings like UTF-8, I'd expect (and will probably
> create for my case) its behaviour to be like this:
>
> - Backed by a buffer (the usual way, probably byte[])

In fact, I think you can build it on top of an InputStream, which is more
flexible and more general, since all you need is a source of bytes.

> - readByte() reads from the buffer, handles buffering of new data, etc.

Let the underlying stream handle buffering.

> - readChar() reads as much bytes as it needs to reconstitute a
> character, in case of UTF-8 it could be one or several - it doesn't
> matter. If it encounters an invalid byte (by the expectations set by
> used encoding), raise proper exception because it's an encoding error in
> the stream.

I don't know how to build this in general.  It's mostly straightforward to
build for a specific encoding, say UTF-8, but CharsetDecoder has no method
that means "decode exactly one character".  (I suppose you could give it one
byte, then two, then three, etc. until it stops returning a failure status,
but that seems inelegant.)  Even in UTF-8, you get oddities where a
codepoint > FFFF returns two characters; returning the first consumes 4
bytes, and returning the second consumes 0 bytes.  In other words, you'd
have to be careful with logic like "I know that this set of characters
occupies bytes 3-10, and I've processed all of them, so I'll switch to
reading bytes again."
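
The mismatch between bytes consumed and chars produced can be checked with standard library calls. A small sketch follows; the class and method names are made up, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateDemo {
    /** Number of UTF-8 bytes needed for one code point. */
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of Java chars (UTF-16 code units) needed for one code point. */
    static int charCount(int codePoint) {
        return Character.charCount(codePoint);
    }
}
```

For U+1D11E, for example, utf8Length gives 4 and charCount gives 2, so a reader that hands out one char at a time has consumed all four bytes before the second char is returned.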

> - Introduce private or protected pushByte() and pushChar() that do the
> reverse of readXXX, on the buffer. "Fixup" the fact that one character
[quoted text clipped - 4 lines]
> use existing logic that reads one char after CR to see if it's LF and
> push it back if it isn't.

More precisely, reads until CR, LF, or CRLF.  You're right that pushing back
a non-LF after CR is easy enough.
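
A possible sketch of that line-reading rule, using java.io.PushbackInputStream for the one-byte push-back. It assumes the text phase is plain ASCII; the class name LineReading is made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

final class LineReading {
    /** Reads one ASCII line ended by CR, LF, or CRLF; returns null at end of stream. */
    static String readLine(PushbackInputStream in) throws IOException {
        int b = in.read();
        if (b < 0) return null; // nothing left to read
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        while (b >= 0 && b != '\n' && b != '\r') {
            line.write(b);
            b = in.read();
        }
        if (b == '\r') {
            // Look one byte ahead: swallow an LF, push anything else back.
            int next = in.read();
            if (next >= 0 && next != '\n') in.unread(next);
        }
        return new String(line.toByteArray(), StandardCharsets.US_ASCII);
    }
}
```

One caveat: after a bare CR, the look-ahead read can block on a network stream until the peer sends another byte, so a protocol that always ends lines with CRLF is easier to handle.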

> - Every other readXXX method uses readByte() as usual.
>
[quoted text clipped - 3 lines]
> character, etc. The programmer is responsible to know how the stream is
> formatted.
Ivan Voras - 27 Aug 2006 10:33 GMT
> ivoras wrote:
>> - readChar() reads as much bytes as it needs to reconstitute a
[quoted text clipped - 6 lines]
> build for a specific encoding, say UTF-8, but CharsetDecoder has no method
> that means "decode exactly one character".  (I suppose you could give it one

Hmm, ok. This is a slight problem (and IMO a candidate for rectifying),
but for my current purpose, I can limit it to UTF-8 and use the sort-of
implementation in DataInputStream. I can always point people to file a
problem report with Java if they need 4-byte characters :)  (just kidding)
Chris Uppal - 27 Aug 2006 11:19 GMT
> > I don't know how to build this in general.  It's mostly straightforward
> > to build for a specific encoding, say UTF-8, but CharsetDecoder has no
[quoted text clipped - 4 lines]
> but for my current purpose, I can limit it to UTF-8 and use the sort-of
> implementation in DataInputStream.

Based on a previous attempt to use CharsetDecoder "raw", I suggest that you at
least consider not using one at all, but doing your own UTF-8 {en/de}code logic
instead.

   -- chris
Mike Schilling - 27 Aug 2006 16:21 GMT
>> > I don't know how to build this in general.  It's mostly straightforward
>> > to build for a specific encoding, say UTF-8, but CharsetDecoder has no
[quoted text clipped - 10 lines]
> logic
> instead.

Here's a place where I wish Java had output parameters; the signature I'd
want for readChar is

   /** @returns a character in the range 0 to 65535, or -1 at EOF
   *    @param moreDecoded returns true if another character is available
   *           without consuming more bytes
   */
   int readChar(out boolean moreDecoded);

As it is, I suppose a moreDecoded() method is the least of evils, i.e.
better than forcing the client to check that the returned character is in
the range D800-DBFF.
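
A sketch of the moreDecoded() variant in legal Java. To stay self-contained it takes its input from a java.util.function.IntSupplier (Java 8 or later) that yields code points, or -1 at EOF; that source, like the class name CharUnitReader, is an assumption made for illustration and stands in for a real byte-level decoder.

```java
import java.util.function.IntSupplier;

final class CharUnitReader {
    private final IntSupplier codePoints; // hypothetical source: next code point, or -1 at EOF
    private int pendingLow = -1;          // low surrogate waiting to be returned, or -1

    CharUnitReader(IntSupplier codePoints) { this.codePoints = codePoints; }

    /** Returns a char value in the range 0 to 65535, or -1 at EOF. */
    int readChar() {
        if (pendingLow >= 0) {
            int low = pendingLow; // second half of a pair: consumes no input
            pendingLow = -1;
            return low;
        }
        int cp = codePoints.getAsInt();
        if (cp < 0) return -1;
        if (Character.isSupplementaryCodePoint(cp)) {
            pendingLow = Character.lowSurrogate(cp);
            return Character.highSurrogate(cp);
        }
        return cp;
    }

    /** True if the next readChar() will return a char without consuming more input. */
    boolean moreDecoded() { return pendingLow >= 0; }
}
```

A client that is about to switch back to reading bytes would check moreDecoded() first and drain the pending char if there is one.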
Stefan Ram - 27 Aug 2006 17:18 GMT
>    /** @returns a character in the range 0 to 65535, or -1 at EOF
>    *    @param moreDecoded returns true if another character is available
[quoted text clipped - 4 lines]
>better than forcing the client to check that the returned character is in
>the range D800-DBFF.

 Possibly, the Java SE way would be to implement the following interface.

http://download.java.net/jdk7/docs/api/java/util/Iterator.html
Mike Schilling - 28 Aug 2006 02:24 GMT
>>    /** @returns a character in the range 0 to 65535, or -1 at EOF
>>    *    @param moreDecoded returns true if another character is available
[quoted text clipped - 6 lines]
>
>  Possibly, the Java SE way would be to implement the following interface.

I don't think so. An Iterator knows when there's nothing more to return.
Part of the assumption here is that the client knows (based on the protocol
definition) how many characters to ask for before switching back to binary.
There's nothing to tell the Iterator that.

Unless you mean that each call to  readChar() returns an
Iterator<Character>.  But that seems awfully cumbersome, both for the
implementation, which has to wrap each decoded char into a Character and
then wrap *that* in an Iterator, and for the client which has to unwrap each
of those.
Stefan Ram - 28 Aug 2006 03:05 GMT
>>>/** @returns a character in the range 0 to 65535, or -1 at EOF
>>>*    @param moreDecoded returns true if another character is available
[quoted text clipped - 6 lines]
>definition) how many characters to ask for before switching back to binary.
>There's nothing to tell the Iterator that.

 I have not read the whole thread, but was just responding
 to what I have quoted. If one wants something like

int readChar( out boolean moreDecoded )

 , then this can be done with an iterator.

 The »out« means that »readChar« tells its client whether there
 are more characters by this out-parameter. When you now say
 that the client already knows how many characters will be
 coming, you might be talking about something else beyond the
 scope of my answer. I was just referring to

int readChar( out boolean moreDecoded )

 in isolation.

>Unless you mean that each call to  readChar() returns an
>Iterator<Character>.  But that seems awfully cumbersome, both
>for the implementation, which has to wrap each decoded char
>into a Character and then wrap *that* in an Iterator, and for
>the client which has to unwrap each of those..

 For a sequence of multiple iterations using an iterator, there
 is no need to create a new iterator object for each iteration.
 The same iterator object might be reused using "set" instead
 of "wrap". This, however, is not possible with
 java.lang.Character, because it is immutable.

 An iterator object also might implement other methods than
 those of the interface "Iterator" to read information from
 its client.
Ivan Voras - 27 Aug 2006 16:48 GMT
> Based on a previous attempt to use CharsetDecoder "raw", I suggest that you at
> least consider not using one at all, but doing your own UTF-8 {en/de}code logic
> instead.

Agreed. The "sort-of" implementation in DataInputStream is good enough
for me to adapt it.
Chris Uppal - 26 Aug 2006 10:43 GMT
> > The simplest thing would be to read each message into a byte array and
> > convert the bytes in the text portion appropriately (using, for
> > instance, an InputStreamReader on top of a ByteArrayInputStream).
>
> Rolling my own is not a problem, it just seems like it should belong in
> the basic library.

It would require a fairly major redesign of the standard library -- there is no
way for the current application of the Decorator pattern to express the idea
that a decorator is responsible for pushing buffered-but-unused data back
onto the underlying stream.

A pity, really.  If the design were fixed (and they might as well get buffering
and random access fixed too, while they were at it), then many things would
become much easier.

   -- chris
Dale King - 01 Sep 2006 03:48 GMT
> In implementing a network protocol, there's a text (ASCII) phase and a
> binary phase. The ideal thing to use would be BufferedReader, but it
> doesn't allow reading raw bytes. The next best thing (though slower)
> would be DataInputStream, but its readLine() method is deprecated for
> silly reasons (IMO). Any other suggestions?

In general there is no good way to do this using an InputStreamReader
wrapping the raw InputStream. The only way would require your protocol
to contain information that tells you how many of the following bytes
are part of the ASCII phase. You then read out those bytes, wrap it in a
ByteArrayInputStream and InputStreamReader. You cannot just read from
the InputStreamReader and then go back to InputStream. The problem is
that the character decoder can buffer up a few bytes and read ahead into
the InputStream.
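
The read-ahead can be observed directly. The sketch below (class and method names made up) reads one line through a BufferedReader and then asks the raw stream how many bytes it still has.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class ReadAheadDemo {
    /** Returns how many bytes remain in the raw stream after reading one text line. */
    static int bytesLeftAfterFirstLine(byte[] data) throws IOException {
        ByteArrayInputStream raw = new ByteArrayInputStream(data);
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(raw, StandardCharsets.US_ASCII));
        reader.readLine();      // the application asked for the first line only
        return raw.available(); // bytes the reader layers did not pull in
    }
}
```

With a short input such as "OK\r\n" followed by two binary bytes, the reader layers will typically have pulled everything into their own buffers, so the binary bytes are no longer available from the raw stream.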

You have to look beyond the original I/O classes and look into the new
I/O (NIO) classes that were introduced in JDK1.4. They will be able to
handle this better because the buffering can be more explicit. You can
use a ByteBuffer from which a CharsetDecoder extracts bytes. But it will
not have the problem of reading too far, because it can look into the
ByteBuffer without gobbling up the bytes.
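
A minimal sketch of that idea, assuming an ASCII text phase with LF or CRLF line ends and a ByteBuffer that already holds the received bytes. The class name NioMixedRead is made up, and Charset.decode is used here instead of a hand-driven CharsetDecoder to keep the example short.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class NioMixedRead {
    /**
     * Returns the next line without its line end, leaving buf positioned at the
     * byte after the LF. Returns null if the buffer holds no complete line yet.
     */
    static String readLine(ByteBuffer buf) {
        for (int i = buf.position(); i < buf.limit(); i++) {
            if (buf.get(i) == '\n') {               // absolute get: position is unchanged
                ByteBuffer lineView = buf.duplicate();
                lineView.limit(i);                  // view covers the line, excluding the LF
                String line = StandardCharsets.US_ASCII.decode(lineView).toString();
                buf.position(i + 1);                // consume the line and its LF
                return line.endsWith("\r")
                        ? line.substring(0, line.length() - 1)
                        : line;
            }
        }
        return null; // caller should read more bytes into the buffer and retry
    }
}
```

After the last text line has been read, the same buffer can be used for the binary phase, for example with buf.getInt(), because only the bytes of complete lines were consumed.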

See the NIO documentation.

 Dale King
