Java Forum / General / July 2007

processing of codepoints

JR - 05 Jul 2007 21:27 GMT
I have some code that imports data containing various Asian characters
into another system.  If I hard-code a string in the program, say
"\u4f55\u304c\u3042\u308b\u304b\u3002", and compile it,
the characters show on the screen of the destination application as
Japanese characters, exactly as I would expect.
If I have those same characters in an external file and process them,
the system shows them exactly as they appear in the text file, with
the \u... escapes, instead of showing the Asian characters.

Anyone know why this would be?  Why would hard coding a string in the
app be any different than reading in the same string from a text
file?  How do I resolve this?

Reading it in like:

            in = new BufferedReader(new FileReader("c:\\test.txt"));
            String[] header = readColumns(in);
      ** Some logic goes on here as part of reading the first block
of text in the file **

Then I start reading in the rest

String[] record = readRecord(in, header);

Which calls readRecord:

        String[] record = new String[columns.length];
        for (int i = 0; i < columns.length; i++)
            record[i] = in.readLine().trim();

Thanks for any input.

JR
Joshua Cranmer - 05 Jul 2007 21:39 GMT
> I have some code that is importing data into another system with various
> asian characters.  If I hard code in the program for a variable say
[quoted text clipped - 8 lines]
> app be any different than reading in the same string from a text file?
> How do I resolve this?

Because the Java compiler translates every "\unnnn" escape in source
code into the corresponding Unicode character before it tokenizes (so
that Unicode can be used in ASCII-limited settings), but nothing does
that for text files. So "\u4f55\u304c\u3042\u308b\u304b\u3002" is
processed by the compiler as if it were the actual characters
U+4F55 U+304C U+3042... Your text file merely
contains the characters "\","u","4","f",etc.

In any case, your text file would look like this through a hex dump:

5c 75 34 66 35 35 5c 75  33 30 34 63 ...

javac sees the same string but internally converts into:

4f 55 30 4c ...

I refer you to JLS 3, §3.3 Unicode Escapes
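
To make the difference concrete, here is a small sketch. The class and field names are made up for illustration. The first literal is what the compiler sees in source code; the second, with doubled backslashes, stands in for what readLine() would hand back from the text file.

```java
public class EscapeDemo {
    // Compiled literal: javac has already turned each escape into one char.
    static final String COMPILED = "\u4f55\u304c\u3042\u308b\u304b\u3002";

    // Stand-in for the line read from the file: the doubled backslashes
    // produce a literal backslash followed by the letter u and hex digits.
    static final String FROM_FILE = "\\u4f55\\u304c\\u3042\\u308b\\u304b\\u3002";

    public static void main(String[] args) {
        System.out.println(COMPILED.length());                  // 6 chars
        System.out.println(FROM_FILE.length());                 // 36 chars
        System.out.println(COMPILED.charAt(0) == (char) 0x4f55); // true
        System.out.println(FROM_FILE.charAt(0) == '\\');         // true
    }
}
```

Six characters in one case, thirty-six in the other: the file contents never went through the compiler, so nothing converted them.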
Jeff Higgins - 06 Jul 2007 01:38 GMT
>I have some code that is importing data into another system with
> various asian characters.  If I hard code in the program for a
[quoted text clipped - 27 lines]
>
> Thanks for any input.

:-) Paul Harvey Moment

Joshua Cranmer gave you the why for.
After you do your record[i] = in.readLine().trim();
You could scan record[i] for occurrences of "u\hhhh"
Pattern.compile("[uU]\\\\\\p{XDigit}\\p{XDigit}\\p{XDigit}\\p{XDigit}");
when you find one you could trim the "u\" part and convert the remaining
four hex digits to a UTF8 String using something like:
String str = new String(fromHexString("4f55"), "UTF8");
where fromHexString() is from Roedy Green's web site:
<http://mindprod.com/jgloss/hex.html>
Jeff Higgins - 06 Jul 2007 03:43 GMT
> After you do your record[i] = in.readLine().trim();
> You could scan record[i] for occurrences of "u\hhhh"
[quoted text clipped - 4 lines]
> where fromHexString() is from Roedy Green's web site:
> <http://mindprod.com/jgloss/hex.html>

Oops! should be:
Pattern.compile("\\\\[uU]\\p{XDigit}\\p{XDigit}\\p{XDigit}\\p{XDigit}");
Jeff Higgins - 06 Jul 2007 04:56 GMT
>>I have some code that is importing data into another system with
>> various asian characters.  If I hard code in the program for a
[quoted text clipped - 27 lines]
>>
>> Thanks for any input.

Please disregard: this won't work!

> After you do your record[i] = in.readLine().trim();
> You could scan record[i] for occurrences of "u\hhhh"
[quoted text clipped - 4 lines]
> where fromHexString() is from Roedy Green's web site:
> <http://mindprod.com/jgloss/hex.html>
JR - 06 Jul 2007 11:50 GMT
> >>I have some code that is importing data into another system with
> >> various asian characters.  If I hard code in the program for a
[quoted text clipped - 42 lines]

Thanks for the replies.  There has to be a way to deal with this.  Why
else would Sun include the native2ascii.exe program with the JDK to
convert data to \u values for use in text files if there
weren't a way to properly deal with them?  Very frustrating.  I am also
working on an InputStreamReader approach that supposedly can read
the original Japanese characters unchanged.  However,
since it works from a byte stream and my entire app is string/line based,
it's kind of a pain, assuming it will even work.

JR
JR - 07 Jul 2007 00:51 GMT
> > >>I have some code that is importing data into another system with
> > >> various asian characters.  If I hard code in the program for a
[quoted text clipped - 55 lines]

InputStreamReader was the ticket.  I had to create multiple loops to
emulate the readLine() method that was used; however, Japanese,
Chinese, Hindi, you name it, it's dealing with it.
Jeff Higgins - 07 Jul 2007 03:08 GMT
>> Thanks for the replies.  There has to be a way to deal with this.  Why

> InputStreamReader was the ticket.  I had to create multiple loops to
> emulate the readline() method that was used, however Japanese,
> Chinese, Hindi, you name it, it's dealing with it.

Good deal! :-) Glad you got it worked out.
Thanks for the inspiration. I've learned a few
new things over the last day or two.
JH
Greg R. Broderick - 07 Jul 2007 18:51 GMT
JR <jriker1@yahoo.com> wrote in news:1183667276.083719.83510
@w5g2000hsg.googlegroups.com:

> I have some code that is importing data into another system with
> various asian characters.  If I hard code in the program for a
[quoted text clipped - 8 lines]
> app be any different than reading in the same string from a text
> file?  How do I resolve this?

This is because it is the Java compiler that processes these Unicode escapes,
not the Java runtime.  Your source code passes through the Java compiler, but
your text file does not.

Cf. puzzle #2 in <http://www.javapuzzlers.com/java-puzzlers-sampler.pdf>.

Writing a parser that converts the Unicode escapes in a text file into their
corresponding Unicode characters should be a relatively simple exercise.
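
A minimal sketch of such a parser follows, assuming the file holds escapes of the form backslash, u, four hex digits. The class and method names are hypothetical. It reuses the corrected pattern posted earlier in the thread (which also accepts an uppercase U, something javac itself does not), and turns the hex digits into a char with Integer.parseInt rather than decoding them as UTF-8 bytes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescape {
    // A backslash, then u or U, then exactly four hex digits (captured).
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\[uU](\\p{XDigit}{4})");

    // Replaces every escape in s with the single char it names.
    public static String unescape(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // The four hex digits are a UTF-16 code unit, not UTF-8 bytes.
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against c being '\\' or '$'.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Doubled backslashes stand in for text read from a file.
        String fromFile = "\\u4f55\\u304c\\u3042\\u308b\\u304b\\u3002";
        System.out.println(unescape(fromFile).length());
    }
}
```

In the original poster's loop this would be applied to each line, for example record[i] = UnicodeUnescape.unescape(in.readLine().trim()). Text without escapes passes through unchanged.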

> Reading it in like:
>
>              in = new BufferedReader(new FileReader("c:\\test.txt"));

Don't use FileReader if you care about international character support!  Read
the FileReader class javadoc for an explanation of why and a suggested
alternative implementation.
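
Roughly, the alternative is an InputStreamReader on a FileInputStream with the charset named explicitly. Here is a sketch, assuming the file is saved as UTF-8 (an assumption; the thread never says which encoding the file uses). The class and method names are made up. Wrapping the InputStreamReader in a BufferedReader keeps readLine() available, so line-based code should not need hand-written loops.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class CharsetRead {
    // Opens a line-oriented reader that decodes the file with the named
    // charset instead of the platform default that FileReader would use.
    static BufferedReader open(String path, String charsetName) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(path), charsetName));
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java CharsetRead <file>");
            return;
        }
        BufferedReader in = open(args[0], "UTF-8"); // assumed encoding
        try {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line.trim());
            }
        } finally {
            in.close();
        }
    }
}
```

This only helps when the file contains the real characters in a known encoding; a file full of \u escapes would still need to be unescaped separately.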

Cheers!
GRB

---------------------------------------------------------------------
Greg R. Broderick                  usenet200707@blackholio.dyndns.org

A. Top posters.
Q. What is the most annoying thing on Usenet?
---------------------------------------------------------------------


