.read() returns a char why?

JM - 12 Dec 2007 14:45 GMT
Why do the Java Reader classes (File/Buffered/Stream) etc .read()
methods return an int not a char?

For example the javadoc for BufferedReader

... jdk1.6.0_03/docs/api/java/io/BufferedReader.html declares

"public int read()"

and then the javadoc indicates return value as:

"The character read, as an integer in the range 0 to 65535
(0x00-0xffff), or -1 if the end of the stream has been reached"

The only reason I have come up with is that the class wants to
indicate end-of-stream with a -1. Incidentally when did the character
(singular) become two bytes?

I am an engineer, not a computer scientist, so I'd appreciate some
patience in your reply.

Jonathan
Chris Dollin - 12 Dec 2007 14:53 GMT
> The only reason I have come up with is that the class wants to
> indicate end-of-stream with a -1. Incidentally when did the character
> (singular) become two bytes?

Java's chars have always been two bytes, so as to store 16-bit
Unicode characters.

(We'll pass quietly over the problems with Unicode now needing more than
16 bits for an unpacked character.)

Chris "whistling, but not in the dark" Dollin

Hewlett-Packard Limited registered office:                Cain Road, Bracknell,
registered no: 690597 England                                    Berks RG12 1HN

Mike Schilling - 12 Dec 2007 14:58 GMT
> Why do the Java Reader classes (File/Buffered/Stream) etc .read()
> methods return an int not a char?
[quoted text clipped - 12 lines]
> The only reason I have come up with is that the class wants to
> indicate end-of-stream with a -1.

That's exactly right.  If it returned a char, there would be no
"illegal" value left to indicate EOF.

> Incidentally when did the character
> (singular) become two bytes?

A char in Java is a 16-bit Unicode character (technically a UTF-16
code unit), not a byte.
JM - 15 Dec 2007 07:41 GMT
On Dec 12, 2:58 pm, "Mike Schilling" <mscottschill...@hotmail.com>
wrote:
> > Why do the Java Reader classes (File/Buffered/Stream) etc .read()
> > methods return an int not a char?
[quoted text clipped - 21 lines]
> A char in Java is a 16-bit unicode (technically UTF-16) character, not
> a byte.

Many thanks for everyone's replies. Now what does not make sense is
that when I call BufferedWriter.write(int), only one 8-bit byte gets
written.

            BufferedWriter bw = new BufferedWriter(new FileWriter("a"));
            bw.write(1);
            bw.write(256);
            bw.close();
            System.exit(0);

Creates a file of length 2 (bytes) containing
            01
            3F
in file "a" and not 16 bits.

Makes no sense to me.

Jonathan
Lew - 15 Dec 2007 08:04 GMT
>             BufferedWriter bw = new BufferedWriter(new FileWriter("a"));

Don't use TAB characters in Usenet listings.  It makes them very hard to read.

>             bw.write(1);
>             bw.write(256);
[quoted text clipped - 7 lines]
>
> Makes no sense to me.

What is the default character encoding for your platform?

The Writer will translate the characters into that encoding unless you specify a
different one.  Many encodings use only one byte per character, or one byte for
each of the most common characters.  It seems that UTF-16 is not your default
encoding for files, eh?
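
One way to answer that question is to ask the JVM directly. This is only a small sketch; the class and method names are made up for illustration, and the value printed depends on the platform and JDK version (recent JDKs default to UTF-8, while older ones follow the operating system's locale settings).

```java
import java.nio.charset.Charset;

public class DefaultCharsetSketch {
    // Returns the name of the charset that FileWriter and friends fall back on
    // when no encoding is given explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The output is system-dependent, so no expected value is shown here.
        System.out.println("Default charset: " + defaultCharsetName());
    }
}
```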

Google for "character encoding" and "Unicode", and read the material about
these concepts on java.sun.com, then ask about what is left out in those
references.

Lew

Mike Schilling - 15 Dec 2007 18:09 GMT
> On Dec 12, 2:58 pm, "Mike Schilling" <mscottschill...@hotmail.com>
> wrote:
[quoted text clipped - 38 lines]
> 01
> 3F

Note that 3F isn't 256; it's an ASCII question mark (?).  I'll explain
why below.

> in file "a" and not 16 bits.
>
> Makes no sense to me.

Internally, (that is, in memory), Java represents characters as
Unicode.  Externally (in files, on the wire, etc.), characters are
"encoded" into one or more  bytes, using some encoding.  The most
common ones are:

UTF-16: two bytes for each character.  Includes all of Unicode.
UTF-8: one byte for ASCII characters (0-127); two or three bytes for
other characters.  Includes all of Unicode.
ASCII: one byte per character.  Includes only the first 128 Unicode
characters.
CP-1252: one byte per character, including all the ASCII characters
plus some Microsoft-specific extensions.  Includes 256 Unicode characters.
ISO-LATIN-1: one byte per character, including all the ASCII characters
plus some special characters used in European languages. Includes 256
Unicode characters.

There are many others.  If you don't specify an encoding, as in your
example, Java chooses a default one which is system-dependent.
Encodings will, in general, replace characters they don't contain by a
question mark, which is what you're seeing.  (I don't know what your
system's default encoding is.  If you're on Windows, it's probably
CP-1252, but ASCII would do the same thing, since neither of them
contains the character 256.)
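
To make the replacement behaviour concrete, here is a rough sketch that writes the same two values (1 and 256) through a Writer with several explicitly chosen encodings and dumps the resulting bytes as hex. The class and method names are invented for the example, and it uses an in-memory stream rather than a file. It also relies on StandardCharsets and try-with-resources, which need Java 7 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {
    // Writes chars 1 and 256 through a Writer using the given charset
    // and returns the encoded bytes as space-separated hex.
    static String encodeToHex(Charset cs) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, cs)) {
            w.write(1);     // U+0001, present in every encoding used here
            w.write(256);   // U+0100, absent from ASCII and ISO-8859-1
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes.toByteArray()) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(encodeToHex(StandardCharsets.US_ASCII));   // 01 3F
        System.out.println(encodeToHex(StandardCharsets.ISO_8859_1)); // 01 3F
        System.out.println(encodeToHex(StandardCharsets.UTF_8));      // 01 C4 80
        System.out.println(encodeToHex(StandardCharsets.UTF_16BE));   // 00 01 01 00
    }
}
```

The one-byte encodings produce the same 01 3F seen in the file, while the Unicode encodings keep character 256 intact.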

This is a complicated subject, and I've omitted many issues (including
the fact that Unicode now requires 21 bits to represent all of its
characters, not 16).  I hope that this helped, but to really
understand it you'll need to find a more detailed writeup.  Here's a
start: http://en.wikipedia.org/wiki/Unicode#Mapping_and_encodings
Lew - 15 Dec 2007 18:32 GMT
> UTF-8: one byte for ASCII characters (0-127); two or three
or four
> bytes for other characters  Includes all of Unicode.

Lew

Mike Schilling - 15 Dec 2007 18:52 GMT
>> UTF-8: one byte for ASCII characters (0-127); two or three
> or four
>> bytes for other characters  Includes all of Unicode.

I was trying to keep things simple by pretending that Unicode is still
16 bits.  Time enough to introduce surrogate pairs later on.
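
For anyone curious what a surrogate pair looks like from Java, here is a small illustrative sketch. The class and method names are hypothetical, and the choice of U+1D11E (MUSICAL SYMBOL G CLEF) as the sample character is just an example of a code point above 16 bits.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateSketch {
    // Number of 16-bit char values (UTF-16 code units) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points, counting a surrogate pair as one.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Build a one-character string from a code point that does not fit in 16 bits.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(charCount(clef));                                // 2
        System.out.println(codePointCount(clef));                           // 1
        System.out.println(Character.isHighSurrogate(clef.charAt(0)));      // true
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));       // true
        System.out.println(clef.getBytes(StandardCharsets.UTF_8).length);   // 4
    }
}
```

The last line is the "or four" case for UTF-8: a character outside the 16-bit range takes four bytes.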
Patricia Shanahan - 12 Dec 2007 15:01 GMT
> Why do the Java Reader classes (File/Buffered/Stream) etc .read()
> methods return an int not a char?
[quoted text clipped - 13 lines]
> indicate end-of-stream with a -1. Incidentally when did the character
> (singular) become two bytes?

Yes, read returns a wider type than char so that there is a spare value
to represent end-of-stream.
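
In practice this leads to the usual read loop, sketched here with a StringReader standing in for a file so the example is self-contained. The class and method names are made up for illustration. The point is that the result is held in an int until it has been compared with -1, and only then cast to char.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReadLoopSketch {
    // Reads every character from the reader until read() reports end of stream.
    static String readAll(Reader in) {
        StringBuilder sb = new StringBuilder();
        try {
            int c; // int, not char, so -1 stays distinct from all 65536 char values
            while ((c = in.read()) != -1) {
                sb.append((char) c); // safe cast: c is in the range 0..65535 here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The largest char value (hex FFFF) sits in the middle of the input
        // and must not be mistaken for end of stream.
        String text = "ab" + (char) 0xFFFF + "cd";
        System.out.println(readAll(new StringReader(text)).length()); // 5
    }
}
```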

One of the continuing trends in computing has been increasing numbers of
bits to represent a character, from 6 to 7 to 8 to 16... Java char is 16
bits.

Patricia
John W. Kennedy - 13 Dec 2007 00:03 GMT
> One of the continuing trends in computing has been increasing numbers of
> bits to represent a character, from 6 to 7 to 8 to 16... Java char is 16
> bits.

Not if you go back far enough, though. The IBM 650 took 14 bits to
represent a character (double bi-quinary), and its market successor, the
707x series, took 10 (double 2-of-5).

John W. Kennedy
"The grand art mastered the thudding hammer of Thor
And the heart of our lord Taliessin determined the war."
  -- Charles Williams.  "Mount Badon"

Roedy Green - 13 Dec 2007 21:39 GMT
On Wed, 12 Dec 2007 19:03:54 -0500, "John W. Kennedy"
<jwkenne@attglobal.net> wrote, quoted or indirectly quoted someone who
said :

>Not if you go back far enough, though. The IBM 650 took 14 bits to
>represent a character (double bi-quinary), and its market successor, the
>707x series, took 10 (double 2-of-5).

In the olden days, each site would invent its own private 6-bit
encoding. I recall sitting with Vern Dettwiler (later of MacDonald
Dettwiler) looking at this newfangled 7-bit ASCII code and playing
with how we might make UBC's 6-bit code somewhat ASCII compatible for
the new IBM 7044.  We had to decide what characters to include.  Back
then popular characters included the word mark and record mark.

Later with the IBM 360 we had ENORMOUS 8-bit EBCDIC character sets
that came in a zillion variants. You still constrained yourself mainly
to upper case because printers used a rotating chain or band of
pre-formed characters, and extra chars slowed it down drastically.

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Lew - 12 Dec 2007 15:14 GMT
> Why do the Java Reader classes (File/Buffered/Stream) etc .read()
> methods return an int not a char?
[quoted text clipped - 12 lines]
> The only reason I have come up with is that the class wants to
> indicate end-of-stream with a -1.

It allows any value in the range of char to be represented as a positive
value.  -1 is therefore guaranteed to be distinct from any valid value.

If you return a char, you cannot get the value 32768 or larger.

> Incidentally when did the character (singular) become two bytes?

In Java's case, with the invention of Java.

Lew

Lew - 12 Dec 2007 15:15 GMT
> If you return a char, you cannot get the value 32768 or larger.

Oops, that's wrong.  If you return a *short* you cannot get such values.
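
A quick sketch of the difference, with hypothetical class and method names: char is an unsigned 16-bit type and short is a signed one, so the same bit pattern widens to different int values.

```java
public class CharRangeSketch {
    // Widening a char to int always gives a value in 0..65535.
    static int widenChar(char c) {
        return c;
    }

    // Widening a short to int gives a value in -32768..32767.
    static int widenShort(short s) {
        return s;
    }

    public static void main(String[] args) {
        char maxChar = (char) 0xFFFF;
        short sameBits = (short) 0xFFFF;

        System.out.println(widenChar(maxChar));    // 65535
        System.out.println(widenShort(sameBits));  // -1
        // Forcing -1 into a char yields the legitimate char value FFFF (hex),
        // which is why a char return type would leave no spare value for end of stream.
        System.out.println((char) -1 == maxChar);  // true
    }
}
```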

Lew

Roedy Green - 12 Dec 2007 15:51 GMT
>Incidentally when did the character
>(singular) become two bytes?

With Java 1.0.  C++ is in transition from 8 to 16.

It is now much more common to have a document containing multiple
languages.  You can't encode it with only 8 bits per char.  So Java
from day one used Unicode, which at the time had 16 bits per char.
16-bit Unicode was even big enough to include Chinese.  However,
Unicode has since been extended beyond 16 bits (code points now need
up to 21 bits) to allow Ugaritic (cuneiform), musical symbols,
Cypriot etc. Java has somewhat baling-wire support for these
supplementary characters.

See http://mindprod.com/jgloss/unicode.html

Of course this would make documents on average twice as big as they
used to be.  So UTF-8 was invented to make simple documents almost as
compact as if they had been encoded with an 8-bit national encoding.
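
A rough illustration of the size difference, using made-up class and method names and sample strings chosen only for the example. UTF-16BE is used instead of plain UTF-16 so that no byte-order mark is added to the count.

```java
import java.nio.charset.StandardCharsets;

public class SizeSketch {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies as big-endian UTF-16 without a byte-order mark.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "hello world";
        System.out.println(utf8Length(ascii));   // 11
        System.out.println(utf16Length(ascii));  // 22

        // An accented letter (e with acute, hex E9) costs two bytes in either encoding.
        String accented = String.valueOf((char) 0xE9);
        System.out.println(utf8Length(accented));  // 2
        System.out.println(utf16Length(accented)); // 2
    }
}
```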

see http://mindprod.com/jgloss/utf.html

Encoding is about how documents are stored externally, which is very
complicated and varied in order to deal with interchange with other
computer languages and legacy applications.   Internally, in Java,
text is all stored simply as UTF-16.

See http://mindprod.com/jgloss/encoding.html

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com


