How to read unicode

JR - 02 Jul 2007 23:23 GMT
I have a Java program that parses text files of metadata and performs
various operations on them.  I was recently asked to start working with
Japanese Unicode characters, but I am not sure where to begin or whether
I need to do anything specific for this.  The program runs in a DOS
window on a PC with a Western character set.  Some questions that come
to mind that I was hoping to get input on:

1. Would it just work as is if I was running in a DOS window on a
Japanese version of Windows XP?
2. If I am in the US, do I have to convert the characters from their
graphical representation to their Unicode numeric equivalent?
3. If so, is there some way to parse the source data and convert it
from something like MS Mincho to Unicode?
4. Can I save this data, once converted, as a standard text file?

Thanks.

JR
stefanomnn - 03 Jul 2007 08:33 GMT
Hi!
To read a text file, I think what you need is to know the right
encoding.
E.g., suppose it is UTF-16:

[code]
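// requires: import java.io.*;
FileInputStream fileStream = new FileInputStream("yourFile");
// Decode the bytes as UTF-16; substitute the file's real encoding.
BufferedReader reader = new BufferedReader(
        new InputStreamReader(fileStream, "UTF-16"));
String line = reader.readLine(); // first line, or null if the file is empty
reader.close();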
[/code]

Now you have the correct representation of your String.
I hope this helps.
Roedy Green - 03 Jul 2007 14:13 GMT
>I have a java program that parses text files of metadata and does
>various activities on it.  

If you display characters in a GUI, you just use Unicode, and it is the
GUI's problem to display them.  The only tricky part is selecting
fonts which support the Unicode characters you are using.
See http://mindprod.com/applets/fontshower.html

If you display characters on the console, it typically uses an 8-bit
encoding of some kind.  See http://mindprod.com/applets/fileio.html
for how to convert to various 8-bit encodings.

The default encoding should be suitable.

Lie to Windows and tell it you live in Japan to find out what that
default encoding is.
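
A minimal sketch of how one might ask the JVM what its default encoding
is, rather than guessing. The class and method names are illustrative
and not from the thread. On JDK 18 and later the default is UTF-8
unless overridden, so the result may not follow the Windows locale.

```java
import java.nio.charset.Charset;

class ShowDefaultEncoding {
    /** Returns the name of the charset the JVM uses when none is specified. */
    static String defaultEncodingName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultEncodingName());
        // The file.encoding system property usually reflects the same setting.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

Running this once under the Western locale and once under the Japanese
one shows whether the default actually changes on a given machine.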

See http://mindprod.com/jgloss/encoding.html
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
Chris Smith - 04 Jul 2007 20:52 GMT
> I have a java program that parses text files of metadata and does
> various activities on it.  I recently was asked to start working with
[quoted text clipped - 5 lines]
> 1. Would it just work as is if I was running in a DOS window on a
> Japanese version of Windows XP?

There are two ways to approach I/O.  One is to use the system default
character encoding.  The other is to specify a character encoding.  If
you've used the system default character encoding, then it would
probably work on a Japanese system with Japanese characters.  If you've
specified an encoding, then it probably won't.

You should always prefer specifying an encoding when possible.  However,
the encoding you use has to match the encoding of the "metadata text
files" you are reading.  If you can't control those, then your choice is
made for you.  You need to find out from whomever writes these files
what encoding they use.
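
To illustrate why the encoding has to match, here is a small sketch
that decodes the same bytes with a matching and a non-matching charset.
The class name, the method name, and the sample text are assumptions
made for the example; a real program would read from the metadata file
instead of a byte array.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitEncodingRead {
    /** Reads the first line of the data, decoding its bytes with the given charset. */
    static String firstLine(byte[] data, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u65E5\u672C\u8A9E"; // three Japanese characters
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // Matching charset: the original text comes back.
        System.out.println(text.equals(firstLine(utf8, StandardCharsets.UTF_8)));
        // Wrong charset: each of the 9 bytes becomes its own (wrong) character.
        System.out.println(firstLine(utf8, StandardCharsets.ISO_8859_1).length());
    }
}
```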

> 2. If in US, do I have to convert the characters from their graphical
> representation to their Unicode numeric equivalent?

You can't draw characters to the console that aren't in the character
set for that console.  So you'll either need to convert your code to a
GUI, or give up on drawing Japanese characters on a non-Japanese
terminal.
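
One way to check this up front is to ask the charset whether it can
encode the text at all. This is only a sketch: the class and method
names are made up, and the charset passed in is an assumption about
what the console uses, which the program has to find out separately.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ConsoleEncodable {
    /** True if every character of the text exists in the given charset. */
    static boolean canShow(String text, Charset consoleCharset) {
        return consoleCharset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String japanese = "\u65E5\u672C\u8A9E";
        System.out.println(canShow("abc", StandardCharsets.US_ASCII));       // true
        System.out.println(canShow(japanese, StandardCharsets.ISO_8859_1));  // false
        System.out.println(canShow(japanese, StandardCharsets.UTF_8));       // true
    }
}
```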

> 3. If so is there some way to parse the source data and convert it
> from like MS Mincho to Unicode?

I don't know what MS Mincho is.  Sorry.

> 4.Can I save this data if converted as a standard text file?

Sure you can save it.  Again, you can save it either in a specific
encoding, or with the platform default.  If the text contains characters
that can't be encoded with that encoding, they will appear as '?'
characters.
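
The '?' substitution can be reproduced in memory. The sketch below
encodes a string to bytes and decodes it again with the same charset,
which is roughly what saving and re-reading a file does; the class and
method names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossySave {
    /** Encodes the text to bytes and decodes it again with the same charset. */
    static String saveAndReload(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset); // unmappable characters are replaced
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "A\u65E5\u672C"; // 'A' followed by two Japanese characters
        System.out.println(saveAndReload(text, StandardCharsets.ISO_8859_1)); // A??
        System.out.println(text.equals(saveAndReload(text, StandardCharsets.UTF_8))); // true
    }
}
```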

Chris Smith

Oliver Wong - 04 Jul 2007 23:51 GMT
>> 2. If in US, do I have to convert the characters from their graphical
>> representation to their Unicode numeric equivalent?
[quoted text clipped - 8 lines]
>
> I don't know what MS Mincho is.  Sorry.

   It's the name of a font which contains glyphs for Japanese characters
(and perhaps CJK characters in general) made by Microsoft. It comes with
Windows and usually when you're using a font that otherwise doesn't
support CJK characters (e.g. Times or Arial), Windows will silently
substitute the Mincho font instead, so it's one of the most common fonts
used for displaying CJK characters (at least in the Windows world).

   The poster also made this post, which implies that (s)he is pretty
confused about how Unicode, fonts, and related topics work:
http://groups.google.ca/group/comp.lang.java.programmer/browse_thread/thread/853bd25f432f9df5/8804136f5c810c41


<quote>
I have some text files with western characters in english, and
japanese fonts in them.
</quote>

   I saw that post before seeing this one, so I thought it was just
sloppy wording or mixed up terminology, but now it really sounds like the
OP is conflating fonts and text at the conceptual level.
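
   To make the distinction concrete: the text in a Java String is a
sequence of Unicode code points, and no font is involved until
something draws it. A small sketch (class and method names invented for
the example) that lists the code points of a string:

```java
class CodePoints {
    /** Lists the Unicode code points of the text, e.g. "U+0041 U+65E5". */
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The same code points are printed whichever font the display uses.
        System.out.println(describe("A\u65E5")); // U+0041 U+65E5
    }
}
```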

   - Oliver
Roedy Green - 05 Jul 2007 00:24 GMT
>I recently was asked to start working with
>Japanese Unicode characters but not sure where to begin if I need to
>do anything specific for this.

The first thing is to find out how this file is encoded.

Possibilities include:

Cp930            Japanese Katakana-Kanji mixed with 4370 UDC, superset of 5026
Cp939            Japanese Latin Kanji mixed with 4370 UDC, superset of 5035
Cp942            Japanese (OS/2), superset of 932
Cp942C           Variant of Cp942. Japanese (OS/2), superset of Cp932
Cp943            Japanese (OS/2), superset of Cp932 and Shift-JIS
Cp943C           Variant of Cp943. Japanese (OS/2), superset of Cp932 and Shift-JIS
Cp33722          IBM-eucJP, Japanese (superset of 5050)

JIS              Japanese
JIS0201          JIS 0201, Japanese
JIS0212          JIS 0212, Japanese
JISAutoDetect    Detects and converts from Shift-JIS, EUC-JP, ISO-2022-JP
                 (conversion to Unicode only)
JIS_X0201        Japanese
JIS_X0212-1990f  Japanese

Shift_JIS        Shift JIS. Japanese. A Microsoft code that extends
                 csHalfWidthKatakana to include kanji by adding a second
                 byte when the value of the first byte is in the ranges
                 81-9F or E0-EF.
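
The lead-byte rule in the Shift_JIS entry can be sketched as below. It
uses only the two ranges quoted above (81-9F and E0-EF); vendor
variants may use more, and the class and method names are illustrative.

```java
class ShiftJisLeadByte {
    /** True if the byte falls in the lead-byte ranges 81-9F or E0-EF. */
    static boolean isLeadByte(byte b) {
        int v = b & 0xFF; // treat the byte as unsigned
        return (v >= 0x81 && v <= 0x9F) || (v >= 0xE0 && v <= 0xEF);
    }

    /** Counts characters, treating a lead byte plus the next byte as one character. */
    static int countCharacters(byte[] data) {
        int count = 0;
        for (int i = 0; i < data.length; i++) {
            if (isLeadByte(data[i]) && i + 1 < data.length) {
                i++; // skip the second byte of a two-byte character
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // One ASCII byte, one two-byte sequence, one single-byte katakana value.
        byte[] sample = {0x41, (byte) 0x93, (byte) 0xFA, (byte) 0xB1};
        System.out.println(countCharacters(sample)); // 3
    }
}
```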

See http://mindprod.com/jgloss/encoding.html

I am working on a little utility called EncodingRecogniser which
should help you. All it does is display any given file presuming any
of Java's supported encodings, telling you about BOMs.

I hope to post it some time tonight.
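
In the meantime, the idea of viewing the same bytes under several
candidate encodings can be sketched in a few lines. This is not the
EncodingRecogniser source, just an illustration with made-up names;
charsets that the running JVM does not support are skipped.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class CandidateDecodings {
    /** Decodes the same bytes under each named charset the JVM supports. */
    static Map<String, String> decodeAll(byte[] data, String... charsetNames) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : charsetNames) {
            if (Charset.isSupported(name)) {
                results.put(name, new String(data, Charset.forName(name)));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Sample bytes; a real run would load them from the file in question.
        byte[] sample = "\u65E5\u672C\u8A9E".getBytes(StandardCharsets.UTF_8);
        Map<String, String> views =
                decodeAll(sample, "UTF-8", "Shift_JIS", "EUC-JP", "ISO-8859-1");
        for (Map.Entry<String, String> e : views.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

The decoding that reads as sensible Japanese is the likely encoding;
the others show up as garbage or replacement characters.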
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
Roedy Green - 05 Jul 2007 08:04 GMT
On Wed, 04 Jul 2007 23:24:28 GMT, Roedy Green
<see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :

>I am working on a little utility called EncodingRecogniser which
>should help you. All it does is display any given file presuming any
>of Java's supported encodings, telling you about BOMs.

The utility is now posted with Java source.  You can use it online at
http://mindprod.com/applets/encodingrecogniser.html
or download it at
http://mindprod.com/products1.html#ENCODINGRECOGNISER

I added some whistles -- hex bytes and hex chars, and notification
where BOMs are detected.
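
For readers who only need the BOM part, a rough sketch of BOM detection
follows. It is not taken from the utility; the names are invented, and
it only knows the UTF-8, UTF-16 and UTF-32 marks. A file without a BOM
returns null and still needs to be identified some other way.

```java
class BomSniffer {
    /** Returns the encoding implied by a leading byte order mark, or null if none. */
    static String detectBom(byte[] data) {
        // UTF-32 is checked first because its LE mark begins with the UTF-16LE mark.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41};
        System.out.println(detectBom(utf8WithBom));        // UTF-8
        System.out.println(detectBom(new byte[] {0x41}));  // null
    }
}
```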

--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
Greg R. Broderick - 05 Jul 2007 17:41 GMT
JR <jriker1@yahoo.com> wrote in news:1183415031.233708.186300@q69g2000hsb.googlegroups.com:

> I have a java program that parses text files of metadata and does
> various activities on it.  I recently was asked to start working with
[quoted text clipped - 10 lines]
> from like MS Mincho to Unicode?
> 4.Can I save this data if converted as a standard text file?

First, I would recommend that you spend some time learning the difference
between character sets (e.g. Unicode), encodings (e.g. UTF-8) and fonts (e.g.
MS Mincho).  Several web pages that I've found useful for this include:

http://czyborra.com/
http://www.i18nguy.com/unicode/codepages.html
http://www.unicode.org/
http://www.faqs.org/rfcs/rfc2044.html
http://www.faqs.org/rfcs/rfc2781.html
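
As a quick illustration of character set versus encoding, the sketch
below takes one Unicode character and prints the bytes that two
different encodings produce for it. The class and method names are
made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingBytes {
    /** Returns the encoded bytes of the text as space-separated hex pairs. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ch = "\u65E5"; // one character, code point U+65E5
        System.out.println(hex(ch, StandardCharsets.UTF_8));    // E6 97 A5
        System.out.println(hex(ch, StandardCharsets.UTF_16BE)); // 65 E5
    }
}
```

The character is the same in both cases; only its byte representation
differs, and neither says anything about which font will draw it.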

Cheers!
GRB

---------------------------------------------------------------------
Greg R. Broderick                  usenet200705@blackholio.dyndns.org

A. Top posters.
Q. What is the most annoying thing on Usenet?
---------------------------------------------------------------------


