On Dec 12, 2:58 pm, "Mike Schilling" <mscottschill...@hotmail.com>
wrote:
> > Why do the Java Reader classes (File/Buffered/Stream) etc .read()
> > methods return an int not a char?
[quoted text clipped - 21 lines]
> A char in Java is a 16-bit Unicode (technically UTF-16) character, not
> a byte.
Many thanks for everyone's replies. Now what does not make sense is
when I call BufferedWriter.write(int) only one 8 bit byte gets
written.
BufferedWriter bw = new BufferedWriter(new FileWriter("a"));
bw.write(1);
bw.write(256);
bw.close();
System.exit(0);
Creates a file of length 2 (bytes) containing
01
3F
in file "a" and not 16 bits.
Makes no sense to me.
Jonathan
Lew - 15 Dec 2007 08:04 GMT
> BufferedWriter bw = new BufferedWriter(new FileWriter("a"));
Don't use TAB characters in Usenet listings. It makes them very hard to read.
> bw.write(1);
> bw.write(256);
[quoted text clipped - 7 lines]
>
> Makes no sense to me.
What is the default character encoding for your platform?
The Writer will translate the characters into that encoding unless you specify a
different one. Many encodings use only one byte per character, or one byte for
each of the most common characters. It seems that UTF-16 is not your default
encoding for files, eh?
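To make the point concrete, here is a minimal sketch of the same two write(int) calls sent through a Writer with an explicitly chosen encoding. The class name ExplicitEncodingDemo and the helper writeChars are made up for illustration, and a ByteArrayOutputStream stands in for the file "a" so the bytes can be inspected directly.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original post.
class ExplicitEncodingDemo {

    // Writes the chars 1 and 256 through a Writer that encodes with the
    // given charset, and returns the bytes that came out the other end.
    static byte[] writeChars(Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            w.write(1);
            w.write(256);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The charset a plain FileWriter would typically fall back to.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("US-ASCII bytes:  " + writeChars(StandardCharsets.US_ASCII).length);
        System.out.println("UTF-8 bytes:     " + writeChars(StandardCharsets.UTF_8).length);
        System.out.println("UTF-16BE bytes:  " + writeChars(StandardCharsets.UTF_16BE).length);
    }
}
```

With US-ASCII the second char has no mapping and comes out as a single 3F byte, which matches the file described above. UTF-16BE is used rather than UTF-16 because the latter also writes a byte order mark.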
Google for "character encoding" and "Unicode", and read the material about
these concepts on java.sun.com, then ask about what is left out in those
references.

Signature
Lew
Mike Schilling - 15 Dec 2007 18:09 GMT
> On Dec 12, 2:58 pm, "Mike Schilling" <mscottschill...@hotmail.com>
> wrote:
[quoted text clipped - 38 lines]
> 01
> 3F
Note that 3F isn't 256; it's an ASCII question mark (?). I'll explain
why below.
> in file "a" and not 16 bits.
>
> Makes no sense to me.
Internally (that is, in memory), Java represents characters as
Unicode. Externally (in files, on the wire, etc.), characters are
"encoded" into one or more bytes, using some encoding. The most
common ones are:
UTF-16: two bytes for each character. Includes all of Unicode.
UTF-8: one byte for ASCII characters (0-127); two or three bytes for
other characters. Includes all of Unicode.
ASCII: one byte per character. Includes only the first 128 Unicode
characters (0-127).
CP-1252: one byte per character, including all the ASCII characters
plus some Microsoft-specific extensions. Includes up to 256 Unicode
characters.
ISO-LATIN-1: one byte per character, including all the ASCII characters
plus some special characters used in European languages. Includes 256
Unicode characters.
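The byte counts in the list above can be checked with a short sketch like the following. The class name EncodedLengthDemo and the helper encodedLength are invented for illustration; the three sample characters (A, e-acute, euro sign) are my own choice, written as Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for comparing encoded sizes.
class EncodedLengthDemo {

    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign)
        String[] samples = {"A", "\u00E9", "\u20AC"};
        Charset[] charsets = {
            StandardCharsets.UTF_16BE,
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.US_ASCII
        };
        for (String s : samples) {
            for (Charset cs : charsets) {
                System.out.printf("U+%04X in %s: %d byte(s)%n",
                        (int) s.charAt(0), cs.name(), encodedLength(s, cs));
            }
        }
    }
}
```

Note that the one-byte encodings still report one byte for a character they do not contain, because String.getBytes substitutes a replacement byte instead of failing.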
There are many others. If you don't specify an encoding, as in your
example, Java chooses a default one which is system-dependent.
Encodings will, in general, replace characters they don't contain by a
question mark, which is what you're seeing. (I don't know what your
system's default encoding is. If you're on Windows, it's probably
CP-1252, but ASCII would do the same thing, since neither of them
contains the character 256.)
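If the silent question mark is a concern, one option is to ask the encoder up front, or to make it fail instead of substituting. This is only a sketch: the class name StrictEncodingDemo and its helpers are made up, and US-ASCII is used as the example because it is guaranteed to be available on every Java platform.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for detecting unmappable characters.
class StrictEncodingDemo {

    // True if the charset has a byte sequence for this char.
    static boolean canEncode(char c, Charset charset) {
        return charset.newEncoder().canEncode(c);
    }

    // Encodes with REPORT actions, so an unmappable character raises an
    // exception instead of being replaced; returns false in that case.
    static boolean encodesCleanly(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(text));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        char c = (char) 256;
        System.out.println("ASCII can encode 256: " + canEncode(c, StandardCharsets.US_ASCII));
        System.out.println("UTF-8 can encode 256: " + canEncode(c, StandardCharsets.UTF_8));
        // The lenient path, which is what a Writer does: replacement byte 3F.
        byte[] lenient = String.valueOf(c).getBytes(StandardCharsets.US_ASCII);
        System.out.printf("lenient ASCII byte: %02X%n", lenient[0] & 0xFF);
    }
}
```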
This is a complicated subject, and I've omitted many issues (including
the fact that Unicode now requires 21 bits to represent all of its
characters, not 16). I hope that this helped, but to really
understand it you'll need to find a more detailed writeup. Here's a
start: http://en.wikipedia.org/wiki/Unicode#Mapping_and_encodings
Lew - 15 Dec 2007 18:32 GMT
> UTF-8: one byte for ASCII characters (0-127); two or three
or four
> bytes for other characters. Includes all of Unicode.

Signature
Lew
Mike Schilling - 15 Dec 2007 18:52 GMT
>> UTF-8: one byte for ASCII characters (0-127); two or three
> or four
>> bytes for other characters. Includes all of Unicode.
I was trying to keep things simple by pretending that Unicode is still
16 bits. Time enough to introduce surrogate pairs later on.
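For anyone who wants a preview of surrogate pairs, here is a minimal sketch. The class name SurrogatePairDemo and the helper fromCodePoint are invented for illustration, and U+1D11E (MUSICAL SYMBOL G CLEF) is just one convenient example of a character outside the 16-bit range.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for a character beyond U+FFFF.
class SurrogatePairDemo {

    // Builds a String holding the single Unicode code point given.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String clef = fromCodePoint(0x1D11E);
        // One character to a reader, but two Java chars (a surrogate pair).
        System.out.println("chars:       " + clef.length());
        System.out.println("code points: " + clef.codePointCount(0, clef.length()));
        System.out.println("high surrogate first: "
                + Character.isHighSurrogate(clef.charAt(0)));
        // This is where the "or four" bytes in UTF-8 comes from.
        System.out.println("UTF-8 bytes:    " + clef.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("UTF-16BE bytes: " + clef.getBytes(StandardCharsets.UTF_16BE).length);
    }
}
```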