Java Forum / General / May 2007

Query:different coding systems

Jack Dowson - 08 May 2007 10:46 GMT
Hello everybody,
As we all know, FileReader and FileWriter are both character stream
classes. I use FileReader to read a text file that combines letters
and Chinese characters encoded in ANSI's ASCII. I know that each letter
takes one byte of disk space while every Chinese character
occupies two. When the file has been read, what is printed on the
screen corresponds exactly to its content!
Now, here is my question: how does the JVM tell a one-byte letter from a
two-byte Chinese character?
Here is my demo program:
import java.io.*;

class FileReaderDemo {
    public static void main(String[] args) throws Exception {
        // FileReader decodes the file's bytes into chars
        // using the platform's default encoding.
        FileReader fr = new FileReader("text.txt");
        int ch = 0;
        int words = 0; // counts chars read (line breaks included), not words
        while ((ch = fr.read()) != -1) {
            System.out.print((char) ch);
            words++;
        }
        fr.close();
        System.out.println("\nThere are totally " + words + " characters in this file!");
    }
}

And the text.txt is:
This is a test file!
这是一个测试文件!

The outcome is:
This is a test file!
这是一个测试文件!
There are totally 31 characters in this file!

Thanks!
Dowson.
Thomas Fritsch - 08 May 2007 12:44 GMT
> Hello everybody,
> As we all know, FileReader and FileWriter are both character stream
> classes.
Yes!

> I use FileReader to read a text file that combines letters
> and Chinese characters encoded in ANSI's ASCII.
No, you don't. Chinese simply cannot be encoded in ASCII. Maybe your text
file is encoded in UTF-8 (see below).

> I know that each letter
> takes one byte of disk space while every Chinese character
> occupies two. When the file has been read, what is printed on the
> screen corresponds exactly to its content!
There is already a misconception on your side:
(1) It is correct that ASCII requires one byte per character, because
   ASCII can only encode the characters from 0x0000 to 0x007F (into
   bytes 0x00 .. 0x7F), nothing more.
(2) ASCII simply cannot encode the Chinese chars (0x4E00 .. 0x9FFF).
The key is to understand that there is a difference between *byte*
streams (InputStream, OutputStream) and *char* streams (Reader, Writer).
A byte is in the range 0x00..0xFF, a char is in the range 0x0000..0xFFFF.
Files are always sequences of bytes, but in your Java code you want to
deal with chars. Therefore Java has to translate between bytes and
chars: turning chars into bytes is called "encoding", and turning bytes
back into chars is called "decoding".
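
A minimal sketch of that translation, assuming UTF-8 as the encoding. The class and method names are made up for illustration, and the Chinese character is written as a Unicode escape so the example does not depend on how the source file itself is saved:

```java
import java.nio.charset.StandardCharsets;

class EncodeDecodeDemo {
    // Encoding: chars -> bytes (what a Writer does when it writes a file).
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decoding: bytes -> chars (what a Reader does when it reads a file).
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'A' followed by the Chinese character U+8FD9
        String text = "A\u8FD9";
        byte[] bytes = encode(text);
        System.out.println(text.length() + " chars -> " + bytes.length + " bytes");
        System.out.println("round trip ok: " + decode(bytes).equals(text));
    }
}
```

Here two chars become four bytes, because UTF-8 needs one byte for 'A' and three for the Chinese character.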

Unfortunately there are many different encodings. "ASCII" is
just one of them; others are "ISO-8859-1", "UTF-16", "UTF-8" and many more.
Some encodings ("UTF-8", "UTF-16") are able to encode all possible 65536
chars into bytes. Others can encode only a subset of chars
(ASCII: only chars from 0x0000 to 0x007F, ISO-8859-1: only chars
from 0x0000 to 0x00FF). "UTF-16" always encodes 1 char into 2 bytes.
"UTF-8" encodes 1 char into 1, 2 or 3 bytes (depending on the char).

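The byte counts can be checked with a small sketch like the one below. The class name and helper are hypothetical. It uses UTF-16BE rather than plain "UTF-16" because Java's "UTF-16" encoder also writes a two-byte byte-order mark, which would distort the count. Note also that `String.getBytes(Charset)` silently substitutes a replacement byte (`?`) for chars the charset cannot represent, so ASCII and ISO-8859-1 still report one byte for a Chinese character even though the character is lost.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Number of bytes the given text occupies in the given charset.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // 'A', e with acute accent (U+00E9), Chinese character U+8FD9
        String[] samples = { "A", "\u00E9", "\u8FD9" };
        Charset[] charsets = {
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16BE
        };
        for (String s : samples) {
            for (Charset cs : charsets) {
                System.out.println("U+" + Integer.toHexString(s.charAt(0)).toUpperCase()
                        + " in " + cs.name() + ": " + byteCount(s, cs) + " byte(s)");
            }
        }
    }
}
```

The two bytes per Chinese character described in the original post would fit a legacy double-byte encoding such as GBK better than UTF-8, but that is only a guess: the thread never establishes how the file was actually saved.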
You find more info and more links at
<http://mindprod.com/jgloss/encoding.html>

> Now, here is my question: how does the JVM tell a one-byte letter from a
> two-byte Chinese character?
*You* tell it which encoding algorithm will be used. For example you can
write:
 FileReader fr = new FileReader("text.txt", "UTF-8");
When you write:
 FileReader fr = new FileReader("text.txt");
that actually means
 FileReader fr = new FileReader("text.txt",
                     System.getProperty("file.encoding"));
If you choose the wrong encoding (for example: if you choose "UTF-16",
but your input file is actually encoded with "UTF-8"), then your program
will simply produce the wrong characters.
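
To illustrate what a wrong choice does, here is a small sketch with hypothetical names. It decodes UTF-8 bytes as ISO-8859-1 instead of the UTF-16 mentioned above, only because the result of that particular mismatch is easy to predict: every byte turns into one unrelated char.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongEncodingDemo {
    // Decodes raw bytes into a String using the given charset.
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Two Chinese characters, U+8FD9 and U+662F
        String original = "\u8FD9\u662F";
        // Pretend these bytes are the content of a UTF-8 encoded file.
        byte[] fileBytes = original.getBytes(StandardCharsets.UTF_8);

        String right = decodeAs(fileBytes, StandardCharsets.UTF_8);
        String wrong = decodeAs(fileBytes, StandardCharsets.ISO_8859_1);

        System.out.println("bytes in file:       " + fileBytes.length);
        System.out.println("right charset, same: " + right.equals(original));
        System.out.println("wrong charset, same: " + wrong.equals(original));
        System.out.println("wrong charset chars: " + wrong.length());
    }
}
```

With the right charset the six bytes come back as the two original chars. With the wrong one they come back as six chars, none of them Chinese, and no exception is thrown to warn you.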
> Here is my program demo:
> import java.io.*;
[quoted text clipped - 20 lines]
> 这是一个测试文件!
> There are totally 31 characters in this file!
No, files always contain *bytes*, not *chars*.
Chars only occur within your Java program.

Thomas

Thomas Fritsch - 08 May 2007 15:50 GMT
> [...]
>> Now,here is my question:How does JVM identify one byte letter and two
[quoted text clipped - 7 lines]
>   FileReader fr = new FileReader("text.txt",
>                       System.getProperty("file.encoding"));
Sorry, the above was wrong.

There is no constructor FileReader(String fileName, String encoding).
Hence there is no way to explicitly specify an encoding with FileReader.
When you write:
 new FileReader("text.txt");
that essentially means
 new InputStreamReader(new FileInputStream("text.txt"))
which in turn means
 new InputStreamReader(new FileInputStream("text.txt"),
                       System.getProperty("file.encoding"))

Therefore I would strongly recommend *not* using FileReader at all.
Instead use for example:
 new InputStreamReader(new FileInputStream("text.txt"),
                       "UTF-8")
so that the encoding you get is really the encoding you want.
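
Put together as a complete program, the recommendation might look like the sketch below. The class and method names are hypothetical, and UTF-8 is an assumption about how text.txt was saved. It uses `StandardCharsets.UTF_8` and try-with-resources, both of which need Java 7 or later. Java 11 and later also offer a `FileReader(String, Charset)` constructor, which did not exist when this thread was written.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitEncodingReader {
    // Reads the whole file, decoding its bytes with the given charset.
    static String readAll(String fileName, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream)
        // even if read() throws.
        try (Reader reader = new InputStreamReader(new FileInputStream(fileName), charset)) {
            int ch;
            while ((ch = reader.read()) != -1) {
                sb.append((char) ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: text.txt is saved as UTF-8.
        String content = readAll("text.txt", StandardCharsets.UTF_8);
        System.out.print(content);
        System.out.println("\nRead " + content.length() + " chars");
    }
}
```

If the file was saved in a different encoding, pass the matching charset instead, for example one obtained with `Charset.forName(...)`.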

Thomas

Greg R. Broderick - 08 May 2007 19:57 GMT
> Hello everybody,
> As we all know, FileReader and FileWriter are both character stream
> classes. I use FileReader to read a text file that combines letters
> and Chinese characters encoded in ANSI's ASCII.

Chinese characters cannot be encoded in ASCII.

Some links to get you started in the wonderful world of international
character sets:

http://czyborra.com/
http://www.i18nguy.com/unicode/codepages.html
http://www.unicode.org/
http://www.faqs.org/rfcs/rfc2044.html

Cheers
GRB

---------------------------------------------------------------------
Greg R. Broderick                  usenet200705@blackholio.dyndns.org

A. Top posters.
Q. What is the most annoying thing on Usenet?
---------------------------------------------------------------------


