Java Forum / General / February 2006
writing (char) 129 to file
leov - 20 Feb 2006 16:21 GMT I write a string containing the character (char) 129 or hex 0x81 to a FileWriter instance. The default character encoding is Cp1252. Immediately before writing it to the file, my String contains "\u0081". In the output file appears the char 0x3F instead. So far I figured out I probably have to set a different character encoding for the FileWriter. - how can I set another char encoding for FileWriter, it supports the method 'getEncoding()' , but no setEncoding() - what encoding set will support the 0x81 (1byte) character?
thx leo
Thomas Fritsch - 20 Feb 2006 16:47 GMT > I write a string containing the character (char) 129 or hex 0x81 to a > FileWriter instance. [quoted text clipped - 4 lines] > - how can I set another char encoding for FileWriter, it supports the > method 'getEncoding()' , but no setEncoding() And FileWriter doesn't have a constructor taking an encoding, too.
Instead of using Writer writer = new FileWriter(...); you should use Writer writer = new OutputStreamWriter(new FileOutputStream(...), encoding);
> - what encoding set will support the 0x81 (1byte) character? What do you mean with an 1byte character 0x81 ? (1) The 2byte char '\u0081'. Its meaning is defined by the Unicode spec. See www.unicode.org (2) The 1byte byte 0x81. Its meaning varies from encoding to encoding. See http://mindprod.com/jgloss/encoding.html
 Signature "Thomas:Fritsch$ops:de".replace(':','.').replace('$','@')
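For reference, a minimal sketch of the OutputStreamWriter approach suggested above, assuming Java 7 or later (try-with-resources, java.nio.file). The class name EncodedWriteDemo, its helper methods and the temp-file prefix are made up for illustration. It writes "\u0081" with ISO-8859-1 and reads the raw bytes back to see what landed in the file.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of the original thread.
class EncodedWriteDemo {

    // Writes text to a file using an explicitly named encoding
    // instead of the platform default that FileWriter would pick.
    static void write(Path file, String text, String encoding) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file.toFile()), encoding)) {
            writer.write(text);
        }
    }

    // Writes text to a temporary file and returns the bytes that ended up on disk.
    static byte[] writeAndReadBack(String text, String encoding) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                write(file, text, encoding);
                return Files.readAllBytes(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = writeAndReadBack("\u0081", "ISO-8859-1");
        // prints "1 byte(s), first byte 0x81"
        System.out.printf("%d byte(s), first byte 0x%02X%n", bytes.length, bytes[0] & 0xFF);
    }
}
```

Passing a different encoding name to writeAndReadBack is a quick way to compare what each charset does with the same character.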
Oliver Wong - 20 Feb 2006 16:56 GMT >> - what encoding set will support the 0x81 (1byte) character? > What do you mean with an 1byte character 0x81 ?
> (1) The 2byte char '\u0081'. Its meaning is defined by the > Unicode spec. See www.unicode.org To be precise, I don't think the unicode spec defines a byte-length for their characters. That is, the 129th character in the Unicode standard (where 129 in decimal = 81 in hexadecimal) does not intrinsically have a length of 2 bytes.
Particular encodings of the characters have length, but the character itself doesn't have a length. In UTF-16, '\u0081' has a length of 2 bytes. In other encodings, it might have other lengths.
To the OP, are you asking "Which encoding will encode the Unicode character '\u0081' as the byte 0x81?"?
- Oliver
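The point that length belongs to the encoding rather than to the character can be checked directly. This is only a sketch, assuming Java 7 or later for StandardCharsets; the class and method names are invented for the example. UTF-16BE is used rather than plain UTF-16 because the latter prepends a byte-order mark when encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original thread.
class EncodedLengthDemo {

    // Number of bytes the given string occupies in the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "\u0081";
        System.out.println(encodedLength(s, StandardCharsets.ISO_8859_1)); // prints 1
        System.out.println(encodedLength(s, StandardCharsets.UTF_8));      // prints 2
        System.out.println(encodedLength(s, StandardCharsets.UTF_16BE));   // prints 2
    }
}
```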
Thomas Fritsch - 20 Feb 2006 18:16 GMT >> (1) The 2byte char '\u0081'. Its meaning is defined by the >> Unicode spec. See www.unicode.org [quoted text clipped - 3 lines] > (where 129 in decimal = 81 in hexadecimal) does not intrinsically have a > length of 2 bytes. Agreed! Unicode-characters are just abstract numbers without any length. And there are actually characters defined beyond 0x10000 (Cuneiform, Gothic, Linear B, ...). BTW: I suspect, that Sun now regrets the Java-1.0 design-decision that a char is 2 bytes long.
 Signature "Thomas:Fritsch$ops.de".replace(':', '.').replace('$', '@')
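To make the consequence of the 2-byte char concrete, here is a small sketch using a character beyond U+FFFF. It assumes Java 5 or later (the code point methods arrived in 1.5); the class name and the choice of U+10330 (a Gothic letter) are only for illustration.

```java
// Hypothetical demo class, not part of the original thread.
class SupplementaryDemo {

    // U+10330, a letter from the Gothic block, which lies beyond U+FFFF.
    static final int GOTHIC_AHSA = 0x10330;

    // Builds a String holding exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(GOTHIC_AHSA);
        System.out.println(s.length());                          // prints 2 (char values)
        System.out.println(s.codePointCount(0, s.length()));     // prints 1 (code points)
        System.out.println(Integer.toHexString(s.charAt(0)));    // prints d800 (high surrogate)
        System.out.println(Integer.toHexString(s.charAt(1)));    // prints df30 (low surrogate)
    }
}
```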
Oliver Wong - 20 Feb 2006 19:06 GMT > BTW: I suspect, that Sun now regrets the Java-1.0 design-decision that a > char is 2 bytes long. Yes. They allude to this regret in the Javadocs too: http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html
- Oliver
leov - 21 Feb 2006 11:43 GMT Thanks for the hints all, I've got it working now for Latin-1
leo
John O'Conner - 22 Feb 2006 08:43 GMT >> BTW: I suspect, that Sun now regrets the Java-1.0 design-decision that a >> char is 2 bytes long. [quoted text clipped - 3 lines] > > - Oliver I think that given the situation, we came up with the most reasonable solution for 1.5. Unicode had evolved past 65k characters for a long time...frankly, we ignored it as long as possible. With 1.5, the demand was overwhelming...and legitimate, real characters had shown up in the Unicode 4.0 specification. We had to find some way to move Java up to the new 4.0 spec. We considered practically everything...making a new char32 type, using ints exclusively as characters, changing the definition of char to be 32 bits wide, etc. Finally, we have what we have now...after much debate. It isn't perfect, but it works.
Best of luck, John O'Conner
Roedy Green - 24 Feb 2006 14:52 GMT On Mon, 20 Feb 2006 18:16:05 GMT, Thomas Fritsch <i.dont.like.spam@invalid.com> wrote, quoted or indirectly quoted someone who said :
>BTW: I suspect, that Sun now regrets the Java-1.0 design-decision that a >char is 2 bytes long. I don't think so. Going to 32-bit chars would double the RAM requirement for character processing. That is mostly what I do with Java. It would cut my effective RAM heap in two. This would mean more frequent GC. Those characters are mainly needed for Chinese, and even then I understand they are optional.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Oliver Wong - 24 Feb 2006 15:09 GMT > On Mon, 20 Feb 2006 18:16:05 GMT, Thomas Fritsch > <i.dont.like.spam@invalid.com> wrote, quoted or indirectly quoted [quoted text clipped - 8 lines] > GC. Those characters are mainly needed for Chinese, and even then I > understand they are optional. I'm not sure if one of the specifications forbid this, but perhaps Java could *appear* to be using 32-bit chars, but the VM actually internally uses UTF-16 or even UTF-8 encoding.
I think it'd be more elegant (though perhaps less practical) if the char data type was not considered a numeric type at all, and did not have any bit-size. As Unicode expands, so would the implementations of the char data type, without breaking existing code (since existing code shouldn't be depending on char being of size 16-bit or anything like that).
- Oliver
Stefan Ram - 24 Feb 2006 15:18 GMT >I'm not sure if one of the specifications forbid this, but >perhaps Java could *appear* to be using 32-bit chars, but the >VM actually internally uses UTF-16 or even UTF-8 encoding. This (with UTF-8) is done in Perl 5.
Roedy Green - 24 Feb 2006 20:10 GMT >>I'm not sure if one of the specifications forbid this, but >>perhaps Java could *appear* to be using 32-bit chars, but the >>VM actually internally uses UTF-16 or even UTF-8 encoding. > > This (with UTF-8) is done in Perl 5. the problem with that is charAt, indexOf etc all greatly slow down. Even substring could be a beast if you actually try to figure out the length in bytes.
Oliver Wong - 24 Feb 2006 20:26 GMT >>>I'm not sure if one of the specifications forbid this, but >>>perhaps Java could *appear* to be using 32-bit chars, but the [quoted text clipped - 5 lines] > Even substring could be a beast if you actually try to figure out the > length in bytes. When you're dealing with unicode characters above \uffff, charAt() doesn't do what one would expect it to do... Is better to have a fast implementation that works some of the time, or a slow implementation that works all the time?
Actually, perhaps we could have multiple implementations of the String interface. You could have an 8-bit-per-character String implementation for strings which consist mostly of English characters, a 16-bit implementation for String for European languages and mathematical symbols, and a 32-bit implementation to handle everything else (for now).
Since most Java programs use strings like so:
<example> String foo = "Hello world"; </example>
instead of
<example> String foo = new String("Hello world"); </example>
the compiler could actually, at compile time, look at what kind of string it is dealing with, and use the appropriate subclass. Similar intelligence (except at runtime instead of compile time) could be built into BufferedReader, and other classes which act as factories for Strings.
- Oliver
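A rough sketch of what the 8-bit variant of this idea could look like in today's Java, using the existing CharSequence interface as the common "String interface". The class Latin1Sequence and its compact factory are hypothetical names invented here; java.lang.String itself is final and cannot be subclassed, so this only illustrates the principle.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical 8-bit-per-character string representation.
final class Latin1Sequence implements CharSequence {

    private final byte[] bytes; // one byte per character, values 0x00..0xFF

    private Latin1Sequence(byte[] bytes) {
        this.bytes = bytes;
    }

    // Factory: picks the compact form when every char fits in 0..FF,
    // otherwise falls back to the ordinary 16-bit String.
    static CharSequence compact(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return s;
            }
        }
        return new Latin1Sequence(s.getBytes(StandardCharsets.ISO_8859_1));
    }

    @Override
    public int length() {
        return bytes.length;
    }

    @Override
    public char charAt(int index) {
        return (char) (bytes[index] & 0xFF); // undo sign extension of byte
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > bytes.length || start > end) {
            throw new IndexOutOfBoundsException();
        }
        return new Latin1Sequence(Arrays.copyOfRange(bytes, start, end));
    }

    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Code written against CharSequence would not need to know which representation it received, which is the property the post is after.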
Chris Uppal - 25 Feb 2006 10:05 GMT > Actually, perhaps we could have multiple implementations of the String > interface. You could have an 8-bit-per-character String implementation for > strings which consist mostly of English characters, a 16-bit > implementation for String for European languages and mathematical > symbols, and a 32-bit implementation to handle everything else (for now). I put together an implementation of the same basic idea (for Smalltalk -- where the absence of static typing allows such things to work a lot better).
There's a separation between the interface to my strings (which are intersubstitutable with the implementation's built-in String class), and their physical representation. One of the physical classes represents its data as an internal Array of UnicodeCharacters (this is mainly meant as a simple-as-possible implementation for sanity checking and unit tests). Most of the other implementations keep their data as a ByteArray internally and use one or another UnicodeByteEncoding to interpret it. There are encodings for UTF-8/16/32, plus the obvious-but-doesn't-actually-exist "UTF-24", and Java's weird encoding.
One of the features I plan, but haven't got around to implementing yet, is for the variable-width encoded strings to keep a record of the first "glitch" in the encoding -- the first position where there's a character which doesn't fit in the encoding's minimum width. That should (I hope) mean that UTF-8 can be used efficiently in space /and/ time for data which is predominantly ASCII.
Writing about it here reminds me that I really ought to get that stuff finished...
-- chris
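The "first glitch" idea described above might be sketched in Java roughly as follows. This is not Chris's Smalltalk code; the class Utf8Text and its methods are invented for illustration, and the slow path simply decodes the tail rather than walking the UTF-8 bytes by hand. Input is assumed to be well-formed (no unpaired surrogates).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical UTF-8 backed text that remembers where the pure-ASCII prefix ends.
final class Utf8Text {

    private final byte[] utf8;
    // Byte index of the first non-ASCII byte, or utf8.length if there is none.
    private final int firstGlitch;

    Utf8Text(String s) {
        this.utf8 = s.getBytes(StandardCharsets.UTF_8);
        int i = 0;
        // ASCII bytes are 0x00..0x7F, i.e. non-negative as signed Java bytes.
        while (i < utf8.length && utf8[i] >= 0) {
            i++;
        }
        this.firstGlitch = i;
    }

    int firstGlitch() {
        return firstGlitch;
    }

    // Returns the code point at character index n (zero-based).
    int codePointAt(int n) {
        if (n < 0) {
            throw new IndexOutOfBoundsException("negative index: " + n);
        }
        if (n < firstGlitch) {
            // Fast path: before the glitch, character index == byte index.
            return utf8[n];
        }
        // Slow path: decode everything from the glitch onwards and count code points.
        String tail = new String(utf8, firstGlitch, utf8.length - firstGlitch,
                StandardCharsets.UTF_8);
        return tail.codePointAt(tail.offsetByCodePoints(0, n - firstGlitch));
    }
}
```

For data that is entirely ASCII the fast path is always taken, so indexing costs the same as in a fixed-width representation.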
Roedy Green - 25 Feb 2006 10:06 GMT > Actually, perhaps we could have multiple implementations of the String >interface. You could have an 8-bit-per-character String implementation for >strings which consist mostly of English characters, a 16-bit implementation >for String for European languages and mathematical symbols, and a 32-bit >implementation to handle everything else (for now) that makes sense. Internally they could all be treated as the same type to the programmer.
You could do it like this:
A string literal could have a two bits marker
00 stored as 8-bits per char NO MULTICHAR STRINGS Unicode 0..FF (greater range than UTF single char)
01 stored as 16-bits per char no multichars
10 stored as 32-bits per char no multichars.
A string then has many possible internal and hidden representations:
It would even be possible for a string to be a list of the calls to append that created it, an array of a hodge podge of the three sizes.
The String class would be at liberty to reorganise Strings, collapsing pieces, making them all one piece of the largest size, or splitting them to isolate just a few difficult characters leaving the rest in narrower strings.
This sounds horribly complicated, but even a newbie could implement such a string class. It is just a lot of bookkeeping. In the cases where a string has a single segment, the code is almost as fast as the code we use today, and it would actually use LESS RAM, since so many strings are made completely of characters in the range 0..FF.
The difficult part comes in optimising. When to split, when to join. Actually splitting and joining are trivial.
Any JVM maker or AOT maker could implement this idea today with 16- and 8-bit Strings and you would never know unless you peeked inside. The big payoff for mixed width strings internally would come if Java started using 32-bit Strings as the default.
Similarly, optimisers might internally use arrays of byte or int instead of long when the optimiser determines that in actuality that suffices.
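The two-bit marker scheme above boils down to finding the narrowest width that can hold every code point in the string. A minimal sketch, with invented names (WidthMarker, markerFor) and the marker values taken from the list in the post:

```java
// Hypothetical helper choosing the marker for a string's storage width.
class WidthMarker {

    static final int BITS_8 = 0b00;  // every code point in 0..FF
    static final int BITS_16 = 0b01; // every code point in 0..FFFF
    static final int BITS_32 = 0b10; // at least one code point above FFFF

    static int markerFor(String s) {
        int marker = BITS_8;
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            if (codePoint > 0xFFFF) {
                return BITS_32; // cannot get any wider, stop scanning
            }
            if (codePoint > 0xFF) {
                marker = BITS_16;
            }
            i += Character.charCount(codePoint); // 1 or 2 char values
        }
        return marker;
    }

    public static void main(String[] args) {
        System.out.println(markerFor("Hello world"));   // prints 0
        System.out.println(markerFor("\u20ac"));        // prints 1
        System.out.println(markerFor("a\ud800\udf30")); // prints 2
    }
}
```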
Stefan Ram - 24 Feb 2006 20:34 GMT >> This (with UTF-8) is done in Perl 5. > the problem with that is charAt, indexOf etc all greatly slow down. This is what we've got today in Java to get the nth character from a string (because of surrogate pairs used): One can not just skip (n-1) char values, but has to analyze each char value for it surrogate property.
So the current Java solution combines problems from both worlds: It needs more complicated algorithms to care for surrogate pairs (so getting to the nth character is slower), but this is not even hidden by a layer from the client, so he needs to be aware of it.
It is not obvious that UTF-8 algorithms are slow, because the data is so small that it might often fit into a cache memory. Using UCS4 might simplify algorithms, but more strings might not fit into cache memory completely, which might slow down operations.
Perl 5 might have the suspected slowdown, but at least it has a layer over its internal UTF-8, so that the client does not have to be aware of it. His algorithms on strings look simple and encode the intentions of the programmer, not distorted by having to care for surrogate pairs. In the long run, code that expresses the programmer's intention more cleanly might even lead to more chances for optimization. For example: Perl might change its internal representation to UCS4 later, while Java must keep surrogate pairs, because clients have been written that expect them.
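The per-char surrogate analysis described above can be spelled out by hand. This is just a sketch with an invented class and method name; the index n is zero-based here, and an unpaired surrogate is counted as a code point of its own.

```java
// Hypothetical helper: finds the nth code point by scanning char values.
class NthCodePointScan {

    static int nthCodePoint(String s, int n) {
        int i = 0; // index in char values
        for (int seen = 0; i < s.length(); seen++) {
            // A code point takes two chars only if a high surrogate
            // is immediately followed by a low surrogate.
            boolean pair = Character.isHighSurrogate(s.charAt(i))
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1));
            if (seen == n) {
                return pair
                        ? Character.toCodePoint(s.charAt(i), s.charAt(i + 1))
                        : s.charAt(i);
            }
            i += pair ? 2 : 1;
        }
        throw new IndexOutOfBoundsException("no code point at index " + n);
    }
}
```

The loop makes the cost visible: reaching code point n takes time proportional to n, whereas charAt(n) is a plain array access.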
Stefan Ram - 25 Feb 2006 03:03 GMT >This is what we've got today in Java to get the nth character >from a string (because of surrogate pairs used): One can not >just skip (n-1) char values, but has to analyze each char >value for it surrogate property. One might use:
final java.lang.String chString = string.substring( n - 1, n );
final int ch = java.lang.Character.codePointAt( chString, 0 );
Stefan Ram - 28 Feb 2006 04:54 GMT >>This is what we've got today in Java to get the nth character >>from a string (because of surrogate pairs used): One can not [quoted text clipped - 3 lines] >final java.lang.String chString = string.substring( n - 1, n ); >final int ch = java.lang.Character.codePointAt( chString, 0 ); No! It seems as if the substring index is not the number of code points, just of char values.
public class Main {
  public static void main( final java.lang.String[] args ) {
    java.lang.System.out.println( "\udb40\udc50a".substring( 1 ));
  }
}
The above string literal should contain only two code points, the second one being "a". But substring( 1 ) seems to give "\udc50a", which contains two chars, but is possibly no meaningful Unicode code point sequence at all.
So how does one get the second code point?
public class Main {
  public static void main( final java.lang.String[] args ) {
    final java.lang.String text = "\udb40\udc50a";
    java.lang.System.out.println( text.substring( text.offsetByCodePoints( 0, 1 )));
  }
}
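Building on offsetByCodePoints, a small helper can return the nth code point itself rather than the remaining substring. A sketch only, assuming Java 5 or later; the class and method names are invented, and n is zero-based.

```java
// Hypothetical helper built on String.offsetByCodePoints and String.codePointAt.
class NthCodePointApi {

    // Returns the code point with zero-based code point index n.
    static int nthCodePoint(String s, int n) {
        int charIndex = s.offsetByCodePoints(0, n); // translate code point index to char index
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        String text = "\udb40\udc50a";
        System.out.println(text.length());                           // prints 3
        System.out.println(text.codePointCount(0, text.length()));   // prints 2
        System.out.println(Integer.toHexString(nthCodePoint(text, 0))); // prints e0050
        System.out.println((char) nthCodePoint(text, 1));            // prints a
    }
}
```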
Roedy Green - 24 Feb 2006 14:39 GMT >- how can I set another char encoding for FileWriter, it supports the >method 'getEncoding()' , but no setEncoding() see http://mindprod.com/applets/fileio.html
Tell it you want encoded chars with a non-locale-default encoding, and it will generate sample code for you.
See http://mindprod.com/jgloss/encoding.html for background on various ways to do encoding and decoding.
Roedy Green - 24 Feb 2006 14:50 GMT >- what encoding set will support the 0x81 (1byte) character? in Unicode it is a control character. It is supported, but will not encode into one byte in UTF-8.
see http://mindprod.com/jgloss/utf.html
I think you might enjoy our special this evening, ISO-8859-1, it was a great year, pure, elegant, easy to understand.
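What makes ISO-8859-1 convenient here is that its 256 byte values correspond one-to-one to the Unicode characters U+0000 through U+00FF, so "\u0081" comes out as the byte 0x81. The following sketch checks that round trip; the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original thread.
class Latin1RoundTrip {

    // True if every byte value b decodes to the char with the same number
    // and the decoded text encodes back to the identical bytes.
    static boolean isIdentity() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        for (int i = 0; i < 256; i++) {
            if (decoded.charAt(i) != i) {
                return false;
            }
        }
        return Arrays.equals(all, decoded.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        System.out.println(isIdentity());                     // prints true
        System.out.println(Character.isISOControl('\u0081')); // prints true
    }
}
```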