Java Forum / General / July 2006
strings - reading utf8 characters such as japanese. how?
stefoid - 03 Jul 2006 08:39 GMT
Hi. I've got a problem. I have some code that takes a text file and breaks it into an array of substrings, so that the text can be displayed truncated to fit the screen width on word boundaries.
It just looks for the spaces.
Trouble is, it crashes out with Japanese text. There is a part of the code that looks at the next character to see if it is a space:
ch = str.substring(offset, offset + 1);
isSpace = false;
// return when a new line is reached
if (ch.equals("\n"))
    return offset+1;
currentWidth += font.stringWidth(ch);
if (ch.equals(" "))
    isSpace = true;
and if it isn't a space, it adds the width of the character (in pixels), and keeps going until it does find a space.
The problem with this is that it assumes each byte is a character. In UTF-8, up to 3 bytes could be one character, so this code is trying to find the widths of characters representing each byte in a UTF-8 sequence, rather than the width of the UTF-8 character as a whole.
My additional problem is that this is iAppli code, so I am limited to a 30K codebase, and I have hit the limit, so I can't write any more lines of code - I just have to change the existing code such that it doesn't generate any more bytecode.
What can I do to the above code so that I can count widths of UTF-8 characters instead of ASCII characters, without writing too much extra code? I need existing Java library functions to do it for me, but I don't know what that functionality is.
Damian Driscoll - 03 Jul 2006 09:31 GMT
> Hi. I've got a problem. I have some code that takes a text file and
> breaks it into an array of substrings for displaying the text truncated
[quoted text clipped - 34 lines]
> code - I need existing java library functions to do it for me, but I
> don't know what that functionality is.
Have a look at: http://javaalmanac.com/egs/java.nio.charset/ConvertChar.html
Chris Uppal - 03 Jul 2006 10:09 GMT
> what can I do to the above code so that I can count widths of utf8
> characters instead of asc characters, without writing too much extra
> code - I need existing java library functions to do it for me, but I
> don't know what that functionality is.
Why are you working in UTF-8 using Java Strings? Indeed /how/ are you doing it -- I would put it somewhere between impossible and dangerously difficult and confusing.
If you want to load your information into /text/, let Java decode the external UTF-8 into Strings (of characters, already decoded as they are read in). If, possibly for space reasons, you have to work in UTF-8 internally, then you'd be far better off keeping the data in byte[] arrays.
-- chris
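A minimal sketch of the first option Chris describes (decode once, then work on characters), assuming a Java SE style String constructor that accepts a charset name. Whether the iAppli runtime offers that constructor is not established here, and the class name and sample bytes are made up for illustration. Once the bytes are decoded, an expression like str.substring(offset, offset + 1) yields a whole character instead of a fragment of a UTF-8 sequence.

```java
import java.io.UnsupportedEncodingException;

class Utf8DecodeSketch {
    // Hypothetical sample: two kanji, a space, then "go", encoded as UTF-8.
    static final byte[] SAMPLE = {
        (byte) 0xE6, (byte) 0x97, (byte) 0xA5,  // U+65E5
        (byte) 0xE6, (byte) 0x9C, (byte) 0xAC,  // U+672C
        0x20, 0x67, 0x6F                        // " go"
    };

    // Decode external UTF-8 bytes into a Java String (a sequence of UTF-16 chars).
    static String decode(byte[] utf8) {
        try {
            return new String(utf8, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Java SE always supports UTF-8; a cut-down runtime might not (assumption).
            throw new IllegalStateException("UTF-8 not supported: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        String str = decode(SAMPLE);
        // Compare the byte count of the file data with the char count of the String.
        System.out.println(SAMPLE.length + " bytes, " + str.length() + " chars");
        // The space is located by char offset, not by byte offset.
        System.out.println("space at char offset " + str.indexOf(' '));
    }
}
```

The design point is that decoding happens exactly once, at the boundary where the bytes enter the program; everything after that deals only in chars.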
stefoid - 04 Jul 2006 01:10 GMT
Good question. An iAppli is something like an applet, designed to go into a cut-down Java virtual machine to fit inside mobile devices. The available Java libraries are greatly restricted - I have java.lang.String and java.lang.Character to choose from (that relate to this problem). On top of that there is the 30K codebase limit, which I have reached - seriously, I am like 2 bytes off the maximum.
This is the only part of the code where I have to recognize individual characters. Everything else is just read a string and output it to the screen, which works fine for UTF-8, cos it's null terminated.
> > what can I do to the above code so that I can count widths of utf8
> > characters instead of asc characters, without writing too much extra
[quoted text clipped - 11 lines]
> > -- chris
Chris Uppal - 04 Jul 2006 10:48 GMT
[reordered to remove top-posting]
[me:]
> > Why are you working in UTF8 using Java Strings ? Indeed /how/ are you
> > doing it -- I would put it somewhere between impossible and dangerously
[quoted text clipped - 10 lines]
> characters. everything else is just read a string and output it to the
> screen, which works fine for utf8, cos its null terminated.
But you haven't really answered my question. I'll try again:
Are you saying that your iAppli doesn't support byte[] arrays ? I find that impossible to believe.
Are you handling your UTF-8 data as binary (in byte[] arrays) or are you somehow stuffing UTF-8 encoded data into Java Strings ? If the latter then (a) why ? and (b) how ?
When you read your data in, why don't you use the Java-provided stuff to decode the UTF-8 into native (decoded) Java Strings ? I could understand that you might want to stick with UTF-8 encoded data for space reasons, but then it doesn't make sense that you'd put that data into Strings (16 bits per character), which would double the space requirement over byte[] arrays for the same data. (Unless you stuffed two bytes into each Java char -- which would be downright perverse ;-)
Maybe this implementation lacks the character encoding stuff found everywhere in real Java? If not, then why are you not using it? If it does, then I suspect you are hosed.
-- chris
Oliver Wong - 04 Jul 2006 15:08 GMT
> good question. An iAppli is something like an applet, designed to go
> into a cutdown java virtual machine to fit inside mobile devices. The
> available java libraries are greatly restricted - I have lang.string
> and lang.character to choose from (that relate to this problem).
Maybe you should have mentioned this when you wrote
<quote> I need existing java library functions to do it for me, but I don't know what that functionality is. </quote>
else you're wasting people's time coming up with solutions that won't solve your problem.
> In
> addition to the 30K codebase limit, which i have reached - seriously, I
[quoted text clipped - 3 lines]
> characters. everything else is just read a string and output it to the
> screen, which works fine for utf8, cos its null terminated.
My concern right now is that you might not know what you're talking about. Where are you getting the string data from? What is the type of the parameter of that string data? Is it String? Byte[]? byte[]? Something else?
What makes you believe it is UTF-8 encoded? What makes you think it's null terminated?
I don't want to start explaining how to convert UTF-8 binary data stuffed into Java Strings into "normal" Java Strings, unless I'm sure that's what is necessary to solve your problem.
- Oliver
stefoid - 05 Jul 2006 02:45 GMT
Yeah, you're right, sorry I didn't mention that. I think you're also right in that I don't have a firm grasp of Java strings, internal coding, etc...
This is the code that is used to read the utf8 text resources into strings:
"
dis = Connector.openDataInputStream(resourcePath);
text = new byte[bytes];
dis.readFully(text, 0, bytes);
dis.close();
return new String(text);
"
I didn't write it, but I wrote the code that uses the strings, and since the strings passed to my stuff seemed to print OK, I was happy to ignore where they came from. Now that guy has gone, and the strings are in Japanese, and the problems begin.
Actually I have rewritten the code that truncates the strings and solved my original problem. It's very inefficient, but it uses fewer lines of code than the original and still works, so I save bytes of code, which is a godsend.
However, I have noticed another problem - every UTF-8 encoded string resource starts with an unwanted 'dot' character which does not appear in the original text files (whether it has passed through my truncating code or not - it still happens). I have tracked this down to (I think) the fact that Java uses a modified UTF-8 encoding scheme, and the text files I am inputting are generated with Word, which will be writing them in normal UTF-8. I assume that's the problem, anyway. I have yet to work out how to fix it. I am looking for a conversion program that will convert the UTF-8 text files to modified UTF-8 format... seems easiest and preserves precious bytes of code.
Any help appreciated.
> > good question. An iAppli is something like an applet, designed to go
> > into a cutdown java virtual machine to fit inside mobile devices. The
[quoted text clipped - 31 lines]
> > - Oliver
stefoid - 05 Jul 2006 03:08 GMT
I should add, here is what the CLDC has available (cut-down Java for wireless devices and PDAs):
java.io:
Interfaces
--------
DataInput DataOutput
Classes
-------
ByteArrayInputStream ByteArrayOutputStream DataInputStream DataOutputStream InputStream InputStreamReader OutputStream OutputStreamWriter PrintStream Reader Writer
java.lang:
Classes
---------
Boolean Byte Character Class Double Float Integer Long Math Object Runtime Short String StringBuffer System Thread Throwable
and something called microedition connectors API:
Interfaces
---------
Connection ContentConnection Datagram DatagramConnection InputConnection OutputConnection StreamConnection StreamConnectionNotifier
Classes
----------
Connector
Oliver Wong - 05 Jul 2006 15:31 GMT
"stefoid" <spambucket666au@yahoo.com.au> wrote in message news:1152063931.018410.223650@b68g2000cwa.googlegroups.com...
> This is the code that is used to read the utf8 text resources into
> strings:
[quoted text clipped - 9 lines]
> ignore where they came from. Now that guy has gone, and the strings
> are in japanese and problems begin.
The problem is that you're using the default encoding instead of specifying the encoding to be UTF-8.
> Actually I have re-written the code that truncates the strings and
> solved my original problem. Its very inefficient, but it uses less
> lines of code than the original and still works, so I save bytes of
> code which is a godsend.
I don't know if it's relevant, but I haven't seen "the code that truncates the string".
> However, I have noticed another problem - the start of every utf8
> encoded string resource starts with an unwanted 'dot' character which
[quoted text clipped - 6 lines]
> convert program that will convert the the utf8 text files to modified
> utf8 format .. seems easiest and preserves precious bytes of code.
UTF-8 encoded files sometimes have a byte order mark (BOM) at the beginning. Incidentally, Java doesn't use UTF-8 internally for Strings; it uses UTF-16. The two formats are significantly different. I think if you use a reader, and specify the encoding as UTF-8, it'll take care of handling the BOM for you.
> any help appreciated.
>
[quoted text clipped - 38 lines]
>>
>> - Oliver
> I should add, here is what the cldc has available (cut down java for
> wireless devices and pdas)
[most of it snipped]
> InputStreamReader
Right, so after you get your DataInputStream, you should wrap it in an InputStreamReader. I don't know if the constructors on CLDC are the same as JavaSE, but in JavaSE, it'd look like this:
<code>
// Get your input stream somehow. In your case, it looks like this:
InputStream is = Connector.openDataInputStream(resourcePath);
InputStreamReader isr = new InputStreamReader(is, "UTF-8");
</code>
From there, you use the isr.read() method to read 1 character at a time (note that a character is a 16 bit value, and not an 8 bit value). If .read() returns -1, that means it reached the end of the stream.
Normally, in JavaSE, you'd also wrap your InputStreamReader into a BufferedReader. In addition to improving performance via buffering, BufferedReader also provides a convenience method readLine() which will return a whole line of text to you, instead of only 1 character at a time. Unfortunately, BufferedReader wasn't in the list of classes you provided, so you might have to construct the string manually from the individual characters.
- Oliver
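The read loop described above could be sketched as follows, assuming the Java SE constructor InputStreamReader(InputStream, String). A ByteArrayInputStream stands in for the stream that Connector.openDataInputStream(resourcePath) would return, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

class ReaderSketch {
    // Read a whole UTF-8 stream one char at a time, without BufferedReader.
    static String readAll(InputStream in) throws IOException {
        InputStreamReader isr = new InputStreamReader(in, "UTF-8");
        StringBuffer sb = new StringBuffer();
        int c;
        while ((c = isr.read()) != -1) {  // read() returns -1 at end of stream
            sb.append((char) c);          // each value is one 16-bit char
        }
        isr.close();
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data: U+3042 (hiragana "a") as UTF-8, followed by '!'.
        byte[] data = { (byte) 0xE3, (byte) 0x81, (byte) 0x82, 0x21 };
        String s = readAll(new ByteArrayInputStream(data));
        System.out.println(data.length + " bytes read as " + s.length() + " chars");
    }
}
```

StringBuffer is used rather than StringBuilder because it is the one that appears in the class list posted for the CLDC.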
Oliver Wong - 05 Jul 2006 15:36 GMT
> I don't know if it's relevant, but I haven't seen "the code that
> truncates the string".
Cancel that. I just realized that you're referring to the code in your first post, where you play around with fonts and string widths.
- Oliver
Oliver Wong - 05 Jul 2006 16:13 GMT
> "stefoid" <spambucket666au@yahoo.com.au> wrote in message
> news:1152063931.018410.223650@b68g2000cwa.googlegroups.com...
[quoted text clipped - 6 lines]
>> dis.close();
>> return new String(text); "
Actually, I took another look at the String API. Again, this is from J2SE, so I don't know if it'll work for you, but apparently you can specify the charset to use in the String constructor as well. So you might be able to replace the last line with:
return new String(text, "UTF-8");
- Oliver
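To illustrate the difference the charset argument makes, here is a small sketch. ISO-8859-1 is used as a stand-in for a single-byte platform default; that is an assumption, since the handset's actual default encoding is not stated. The class name and sample bytes are hypothetical.

```java
import java.io.UnsupportedEncodingException;

class CharsetArgumentSketch {
    // U+3042 and U+3044 (hiragana "a" and "i") encoded as UTF-8: 6 bytes in total.
    static final byte[] TEXT = {
        (byte) 0xE3, (byte) 0x81, (byte) 0x82,
        (byte) 0xE3, (byte) 0x81, (byte) 0x84
    };

    // Decode with a named charset, converting the checked exception to an unchecked one.
    static String decode(byte[] data, String charsetName) {
        try {
            return new String(data, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("unsupported charset: " + charsetName);
        }
    }

    public static void main(String[] args) {
        // A single-byte charset turns every byte into its own char.
        String wrong = decode(TEXT, "ISO-8859-1");
        // UTF-8 folds each 3-byte sequence into one char.
        String right = decode(TEXT, "UTF-8");
        System.out.println("single-byte decode: " + wrong.length() + " chars");
        System.out.println("UTF-8 decode: " + right.length() + " chars");
    }
}
```

With a single-byte decode, a per-char loop like the one in the first post ends up measuring the width of each byte's stand-in character, which matches the symptom described there.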
stefoid - 06 Jul 2006 03:08 GMT
Thanks Oliver.
I did find some example code somewhere that suggested using a reader and specifying "UTF-8". I tried that, and it didn't make any difference - I still get the weird character at the start of every string.
I think it makes sense that there could be something weird at the start of the text file. I may have to get a hex editor onto it. I printed out the hex bytes I obtained from the string in the code and it looks like UTF-8 to me (roughly).
> > "stefoid" <spambucket666au@yahoo.com.au> wrote in message
> > news:1152063931.018410.223650@b68g2000cwa.googlegroups.com...
[quoted text clipped - 15 lines]
> > - Oliver
Chris Uppal - 06 Jul 2006 11:48 GMT
> I think it makes sense that there could be something weird at the start
> of the text file. I may have to get a hex editor onto it. I printed
> out the hex bytes I obtained from the string in the code and it looks
> like UTF-8 to me (roughly).
Can you post the byte values?
It could be a BOM (Byte Order Mark) -- they are not recommended for use with 8-bit encodings like UTF-8, but some software adds one to the beginning of each file anyway.
If it is a BOM, U+FEFF, then the first three bytes of the UTF-8 file will be
0xEF 0xBB 0xBF
That's the bytes of the /file/, not whatever ends up in Java after it's been decoded.
If it is a BOM, then the easiest thing to do is just ignore it.
-- chris
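Ignoring the BOM could be sketched in two ways: on the raw bytes before decoding, or on the decoded String. As far as I know, the Java SE UTF-8 decoder passes a leading BOM through as the char U+FEFF rather than dropping it, which would fit the stray leading character reported earlier in the thread; whether the CLDC decoder behaves the same way is an assumption. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;

class BomSketch {
    // Number of leading bytes to skip: 3 if the data starts with EF BB BF, otherwise 0.
    static int bomLength(byte[] data) {
        if (data.length >= 3
                && data[0] == (byte) 0xEF
                && data[1] == (byte) 0xBB
                && data[2] == (byte) 0xBF) {
            return 3;
        }
        return 0;
    }

    // Drop a leading U+FEFF from an already decoded String, if present.
    static String stripBom(String s) {
        if (s.length() > 0 && s.charAt(0) == (char) 0xFEFF) {
            return s.substring(1);
        }
        return s;
    }

    // Decode UTF-8 bytes, skipping the BOM at the byte level.
    static String decodeSkippingBom(byte[] data) {
        int skip = bomLength(data);
        try {
            return new String(data, skip, data.length - skip, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 not supported");
        }
    }

    public static void main(String[] args) {
        byte[] file = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x48, 0x69 };
        System.out.println(decodeSkippingBom(file));
    }
}
```

Skipping at the byte level costs one comparison and an offset, which may matter when the bytecode budget is as tight as described.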
Oliver Wong - 03 Jul 2006 16:47 GMT
> Hi. Ive got a problem. I have some code that takes a text file and
> breaks it into an array of substrings for displaying the text truncated
[quoted text clipped - 19 lines]
> and if it isnt a space, it adds the width of the character (in pixels)
> , and keeps going until it does find a space.
How about something like:
<pseudoCode>
StringTokenizer st = new StringTokenizer(str, " \n", true);
int offset = 0;
while (st.hasMoreTokens()) {
  String token = st.nextToken();
  if (token.equals(" ")) {
    /*do whatever you gotta do with spaces here.*/
    offset++;
  } else if (token.equals("\n")) {
    return offset;
  } else {
    currentWidth = font.stringWidth(token);
    offset += token.length();
  }
}
</pseudoCode>
You'll avoid breaking up the string into individual chars, which could split a character in two.
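A runnable variation on the tokenizer idea, offered as a sketch only. The fixed-width stringWidth below is a hypothetical stand-in for the iAppli font.stringWidth call, lineEnd and maxWidth are made-up names, and the choice to return the offset just after a newline follows the code in the first post. Note also that java.util.StringTokenizer does not appear in the class list posted for the CLDC, so this may only be usable on Java SE.

```java
import java.util.StringTokenizer;

class WrapSketch {
    static final int CHAR_WIDTH = 6;  // hypothetical fixed-width font, in pixels

    // Stand-in for font.stringWidth(token).
    static int stringWidth(String s) {
        return s.length() * CHAR_WIDTH;
    }

    // Returns the char offset at which the first display line of str should end.
    static int lineEnd(String str, int maxWidth) {
        StringTokenizer st = new StringTokenizer(str, " \n", true);
        int offset = 0;
        int currentWidth = 0;
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.equals("\n")) {
                return offset + 1;  // break just after the newline
            }
            int w = stringWidth(token);
            // Break before a token that does not fit, unless it is the first one.
            if (currentWidth + w > maxWidth && offset > 0) {
                return offset;
            }
            currentWidth += w;
            offset += token.length();
        }
        return offset;  // everything fits on one line
    }

    public static void main(String[] args) {
        String text = "hello big world";
        int end = lineEnd(text, 60);
        System.out.println("[" + text.substring(0, end) + "]");
    }
}
```

Measuring whole tokens means the width function is only ever handed complete words, at the cost of not being able to break inside a word that is wider than the screen.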
> the problem with this is it assumes that each byte is a character. In
> utf8, up to 3 bytes could be one character, so this code is trying to
> find the widths of characters representing each byte in a utf8
> sequence, rather than the width of the utf8 character as a whole.
Actually, it assumes each (Java) char is a (semantic) character. A Java char is 16 bits long, and Java Strings are internally stored in UTF-16, so a semantic character might be spread over 2 Java chars (32 bits).
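The char versus character distinction can be seen with the code point methods added in Java SE 5 (codePointCount, codePointAt, Character.charCount). These are assumed to be unavailable on the CLDC, so the sketch is illustrative rather than a drop-in fix; the class name and the sample code point are hypothetical choices.

```java
class CodePointSketch {
    // Number of Unicode code points in s, as opposed to the number of UTF-16 chars.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Length in chars of the character starting at index i: 1, or 2 for a surrogate pair.
    static int charLengthAt(String s, int i) {
        return Character.charCount(s.codePointAt(i));
    }

    public static void main(String[] args) {
        // U+20B9F is a kanji outside the Basic Multilingual Plane, followed by 'a'.
        String s = new String(Character.toChars(0x20B9F)) + "a";
        System.out.println(s.length() + " chars, " + codePoints(s) + " code points");
    }
}
```

Most everyday Japanese text sits inside the Basic Multilingual Plane, where one char is one code point, so this mainly matters for rarer characters.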
> my additional problem is this is iAppli code, so I am limited to a 30K
> codebase, and I have hit the limit, so I cant write any more lines of
> code - I just have to change the existing code such that it doesnt
> generate any more bytecode.
Sounds rough. Can't really help you with this.
> what can I do to the above code so that I can count widths of utf8
> characters instead of asc characters, without writing too much extra
> code - I need existing java library functions to do it for me, but I
> dont know what that functionality is.
See above. Since you're working with Unicode, you might want to use the Character.isWhitespace() method, instead of the String.equals(" ") method. I believe the Japanese whitespace (the ideographic space, U+3000) has a different Unicode value than the ASCII whitespace.
- Oliver
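A short sketch of the whitespace check, using the Java SE method Character.isWhitespace(char); whether the CLDC version of Character includes it is not confirmed in this thread. Because that method also reports '\n' as whitespace, the newline is excluded explicitly, mirroring the order of checks in the first post. The class and method names are hypothetical.

```java
class WhitespaceSketch {
    // True for characters a line may be broken at, excluding the newline itself.
    static boolean isBreakableSpace(char ch) {
        return ch != '\n' && Character.isWhitespace(ch);
    }

    public static void main(String[] args) {
        char ascii = ' ';
        char ideographic = (char) 0x3000;  // the full-width space used in Japanese text
        System.out.println("ASCII space: " + isBreakableSpace(ascii));
        System.out.println("ideographic space: " + isBreakableSpace(ideographic));
        // A plain string comparison treats the two spaces as different.
        System.out.println("equals check: " + " ".equals(String.valueOf(ideographic)));
    }
}
```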
ddimitrov - 06 Jul 2006 13:10 GMT
I haven't done mobile Java for a long time, but as far as I remember the encoding for iApplis is ShiftJIS. Internally Java still uses a Unicode representation, but you have to make sure that all your resources are encoded in ShiftJIS and you might have to specify the proper encoding when you read and write the strings to the scratchpad.