I have some code that imports data containing various Asian characters
into another system. If I hard-code a string in the program, say
"\u4f55\u304c\u3042\u308b\u304b\u3002", and compile it,
the characters show on the screen of the destination application as
Japanese characters, exactly as I would expect.
If I put those same characters in an external file and process them,
the text shows up in the system exactly as it is in the text file, with
the literal \u... sequences instead of the Asian characters.
Does anyone know why this would be? Why would hard-coding a string in the
app be any different from reading the same string from a text
file? How do I resolve this?
Reading it in like:
in = new BufferedReader(new FileReader("c:\\test.txt"));
String[] header = readColumns(in);
** Some logic goes on here as part of reading the first block
of text in the file **
Then I start reading in the rest
String[] record = readRecord(in, header);
Which calls readRecord:
String[] record = new String[columns.length];
for (int i=0;i<columns.length;i++)
record[i] = in.readLine().trim();
Thanks for any input.
JR
Joshua Cranmer - 05 Jul 2007 21:39 GMT
> I have some code that is importing data into another system with various
> asian characters. If I hard code in the program for a variable say
[quoted text clipped - 8 lines]
> app be any different than reading in the same string from a text file?
> How do I resolve this?
Because the Java compiler, before it tokenizes the source, translates every
"\unnnn" escape into the Unicode character it names (so that Unicode can be
used in ASCII-limited settings), but nothing does this for text files. So
"\u4f55\u304c\u3042\u308b\u304b\u3002" is processed by the compiler as if it
were /the actual characters/ U+4F55 U+304C U+3042... Your text file merely
contains the characters "\","u","4","f",etc.
In any case, your text file would look like this through a hex dump:
5c 75 34 66 35 35 5c 75 33 30 34 63 ...
javac sees the same bytes but internally converts them into:
4f 55 30 4c ...
I refer you to JLS 3, §3.3 Unicode Escapes
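
To make the difference concrete, here is a minimal sketch. The class and
field names are made up for illustration. The second literal doubles each
backslash so that the compiler leaves the escape alone, which should
reproduce what a reader hands back from the text file.

```java
final class EscapeDemo {
    // The compiler translates each escape before tokenizing, so this
    // string holds six Japanese characters.
    static final String COMPILED = "\u4f55\u304c\u3042\u308b\u304b\u3002";

    // A doubled backslash is not translated. The resulting string is the
    // literal backslash, 'u', and four hex digits, six times over.
    static final String FROM_FILE = "\\u4f55\\u304c\\u3042\\u308b\\u304b\\u3002";

    public static void main(String[] args) {
        System.out.println(COMPILED.length());           // 6
        System.out.println(FROM_FILE.length());          // 36
        System.out.println(FROM_FILE.charAt(0) == '\\'); // true
    }
}
```

The two strings look alike in the source, but only the first one ever went
through the escape translation.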
Jeff Higgins - 06 Jul 2007 01:38 GMT
>I have some code that is importing data into another system with
> various asian characters. If I hard code in the program for a
[quoted text clipped - 27 lines]
>
> Thanks for any input.
:-) Paul Harvey Moment
Joshua Cranmer gave you the why for.
After you do your record[i] = in.readLine().trim();
You could scan record[i] for occurrences of "\uhhhh"
Pattern.compile("[uU]\\\\\\p{XDigit}\\p{XDigit}\\p{XDigit}\\p{XDigit}");
when you find one you could trim the "\u" part and convert the remaining
four hex digits to a UTF8 String using something like:
String str = new String(fromHexString("4f55"), "UTF8");
where fromHexString() is from Roedy Green's web site:
<http://mindprod.com/jgloss/hex.html>
Jeff Higgins - 06 Jul 2007 03:43 GMT
> After you do your record[i] = in.readLine().trim();
> You could scan record[i] for occurrences of "\uhhhh"
[quoted text clipped - 4 lines]
> where fromHexString() is from Roedy Green's web site:
> <http://mindprod.com/jgloss/hex.html>
Oops! should be:
Pattern.compile("\\\\[uU]\\p{XDigit}\\p{XDigit}\\p{XDigit}\\p{XDigit}");
Jeff Higgins - 06 Jul 2007 04:56 GMT
>>I have some code that is importing data into another system with
>> various asian characters. If I hard code in the program for a
[quoted text clipped - 27 lines]
>>
>> Thanks for any input.
Please disregard: this won't work!
> After you do your record[i] = in.readLine().trim();
> You could scan record[i] for occurrences of "\uhhhh"
[quoted text clipped - 4 lines]
> where fromHexString() is from Roedy Green's web site:
> <http://mindprod.com/jgloss/hex.html>
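
One likely reason the UTF8 step fails, sketched below under the assumption
that fromHexString("4f55") returns the two raw bytes 0x4f and 0x55: decoded
as UTF-8, those bytes are the ASCII letters "O" and "U", not the character
U+4F55. The four hex digits are a code unit value, not encoded bytes. The
class and method names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class HexToCharDemo {
    // Decodes the bytes 0x4f 0x55 as UTF-8 text. Both bytes are plain
    // ASCII, so the result is the two-letter string "OU".
    static String viaUtf8Bytes() {
        return new String(new byte[] {0x4f, 0x55}, StandardCharsets.UTF_8);
    }

    // Parses the four hex digits as a number and uses it as one UTF-16
    // code unit, which is what the escape actually denotes.
    static String viaCodeUnit(String hex) {
        return String.valueOf((char) Integer.parseInt(hex, 16));
    }

    public static void main(String[] args) {
        System.out.println(viaUtf8Bytes());      // OU
        System.out.println(viaCodeUnit("4f55")); // the character U+4F55
    }
}
```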
JR - 06 Jul 2007 11:50 GMT
Thanks for the replies. There has to be a way to deal with this. Why
else would Sun include the native2ascii.exe program with their JDK to
convert data to \u values for use in text files if there
weren't a way to properly deal with them? Very frustrating. I am also
working on an InputStreamReader approach that is supposedly able to read
the original Japanese characters in unchanged. However,
since it is byte based and my entire app is string/line based,
it is kind of a pain, assuming it will even work.
JR
JR - 07 Jul 2007 00:51 GMT
InputStreamReader was the ticket. I had to create multiple loops to
emulate the readLine() method that was used; however, Japanese,
Chinese, Hindi, you name it, it's dealing with it.
Jeff Higgins - 07 Jul 2007 03:08 GMT
>> Thanks for the replies. There has to be a way to deal with this. Why
> InputStreamReader was the ticket. I had to create multiple loops to
> emulate the readLine() method that was used, however Japanese,
> Chinese, Hindi, you name it, it's dealing with it.
Good deal! :-) Glad you got it worked out.
Thanks for the inspiration. I've learned a few
new things over the last day or two.
JH
Greg R. Broderick - 07 Jul 2007 18:51 GMT
JR <jriker1@yahoo.com> wrote in news:1183667276.083719.83510@w5g2000hsg.googlegroups.com:
> I have some code that is importing data into another system with
> various asian characters. If I hard code in the program for a
[quoted text clipped - 8 lines]
> app be any different than reading in the same string from a text
> file? How do I resolve this?
This is because it is the Java compiler that processes these Unicode escapes,
not the Java runtime. Your source code passes through the Java compiler, but
your text file does not.
Cf. puzzle #2 in <http://www.javapuzzlers.com/java-puzzlers-sampler.pdf>.
Writing a parser that converts the Unicode escapes in a text file into their
corresponding Unicode characters should be a relatively simple exercise.
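
A minimal sketch of such a parser, assuming every escape is a single
backslash, a lowercase "u", and exactly four hex digits. The class and method
names are hypothetical. It is deliberately simpler than the compiler's own
rules: it does not handle repeated "u" characters or doubled backslashes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeEscapeDecoder {
    // One literal backslash, then 'u', then four hex digits (captured).
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Returns the text with each escape replaced by the UTF-16 code unit
    // it names; everything else is copied through unchanged.
    static String decode(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }
}
```

In the original loop this could be applied as
record[i] = UnicodeEscapeDecoder.decode(in.readLine().trim());
If the file was produced by native2ascii, running native2ascii with the
-reverse option on it is another way to get the original characters back.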
> Reading it in like:
>
> in = new BufferedReader(new FileReader("c:\\test.txt"));
Don't use FileReader if you care about international character support! Read
the FileReader class javadoc for an explanation of why and a suggested
alternative implementation.
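
A sketch of that alternative, with assumptions labeled: the file's actual
encoding is not stated anywhere in this thread, so the caller has to supply
it (UTF-8 would be a common guess), and the class and method names are
hypothetical. A BufferedReader can wrap an InputStreamReader, so readLine()
remains available and the line-based code does not need to change.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class CharsetLineReader {
    // Reads every line of the file, decoding bytes with the given charset
    // instead of the platform default, and trims each line.
    static List<String> readLines(Path file, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line.trim());
            }
        }
        return lines;
    }
}
```

This only helps when the file holds the real characters in a known encoding.
A file that holds literal backslash-u sequences still needs those sequences
decoded separately.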
Cheers!
GRB
