Java Forum / General / February 2006

writing (char) 129 to file

leov - 20 Feb 2006 16:21 GMT
I write a string containing the character (char) 129  or hex 0x81 to a
FileWriter instance.
The default character encoding is Cp1252. Immediately before writing it
to the file, my String contains "\u0081". In the output file appears
the char 0x3F instead. So far I figured out I probably have to set a
different character encoding for the FileWriter.
- how can I set another char encoding for FileWriter, it supports the
method 'getEncoding()' , but no setEncoding()
- what encoding set will support the 0x81 (1byte) character?

thx
leo
Thomas Fritsch - 20 Feb 2006 16:47 GMT
> I write a string containing the character (char) 129  or hex 0x81 to a
> FileWriter instance.
[quoted text clipped - 4 lines]
> - how can I set another char encoding for FileWriter, it supports the
> method 'getEncoding()' , but no setEncoding()
And FileWriter doesn't have a constructor taking an encoding, either.

Instead of using
  Writer writer = new FileWriter(...);
you should use
  Writer writer =
    new OutputStreamWriter(new FileOutputStream(...), encoding);
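
A minimal, self-contained sketch of that approach, assuming a JDK of version 7 or later (for try-with-resources and StandardCharsets) and ISO-8859-1 as the target encoding. The class name, method name and temp file are made up for illustration.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of any library.
final class Latin1WriteSketch {

    /** Writes text to path as ISO-8859-1 and returns the bytes found in the file. */
    static byte[] writeLatin1(Path path, String text) throws IOException {
        // OutputStreamWriter takes the encoding explicitly instead of
        // falling back on the platform default.
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(path.toFile()), StandardCharsets.ISO_8859_1)) {
            writer.write(text);
        }
        return Files.readAllBytes(path);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("latin1-sketch", ".txt");
        byte[] bytes = writeLatin1(tmp, "\u0081");
        // ISO-8859-1 maps U+0000..U+00FF one-to-one onto byte values.
        System.out.printf("%d byte(s), first = 0x%02X%n", bytes.length, bytes[0] & 0xFF);
        Files.delete(tmp);
    }
}
```

With ISO-8859-1 the file should contain the single byte 0x81 rather than 0x3F.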

> - what encoding set will support the 0x81 (1byte) character?
What do you mean by a 1-byte character 0x81?
(1) The 2byte char '\u0081'. Its meaning is defined by the
    Unicode spec. See www.unicode.org
(2) The 1byte byte 0x81. Its meaning varies from encoding to
    encoding. See http://mindprod.com/jgloss/encoding.html

Signature

"Thomas:Fritsch$ops:de".replace(':','.').replace('$','@')

Oliver Wong - 20 Feb 2006 16:56 GMT
>> - what encoding set will support the 0x81 (1byte) character?
> What do you mean with an 1byte character 0x81 ?

> (1) The 2byte char '\u0081'. Its meaning is defined by the
>     Unicode spec. See www.unicode.org

   To be precise, I don't think the unicode spec defines a byte-length for
their characters. That is, the 129th character in the Unicode standard
(where 129 in decimal = 81 in hexadecimal) does not intrinsically have a
length of 2 bytes.

   Particular encodings of the characters have length, but the character
itself doesn't have a length. In UTF-16, '\u0081' has a length of 2 bytes.
In other encodings, it might have other lengths.
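
The point can be checked with a small sketch that encodes the same character under three charsets and counts the bytes. The class and method names are invented for illustration; only charsets from StandardCharsets (JDK 7 or later) are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class EncodedLengths {

    /** Number of bytes the given charset needs to encode s. */
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "\u0081";
        Charset[] charsets = {
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16BE
        };
        for (Charset cs : charsets) {
            System.out.println(cs.name() + ": " + encodedLength(s, cs) + " byte(s)");
        }
    }
}
```

For U+0081 this reports 1 byte in ISO-8859-1, 2 bytes in UTF-8 (0xC2 0x81) and 2 bytes in UTF-16BE.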

   To the OP, are you asking "Which encoding will encode the Unicode
character '\u0081' as the byte 0x81?"?

   - Oliver
Thomas Fritsch - 20 Feb 2006 18:16 GMT
>> (1) The 2byte char '\u0081'. Its meaning is defined by the
>>     Unicode spec. See www.unicode.org
[quoted text clipped - 3 lines]
> (where 129 in decimal = 81 in hexadecimal) does not intrinsically have a
> length of 2 bytes.

Agreed! Unicode-characters are just abstract numbers without any length.
And there are actually characters defined beyond 0x10000 (Cuneiform, Gothic,
Linear B, ...).
BTW: I suspect, that Sun now regrets the Java-1.0 design-decision that a
char is 2 bytes long.

Signature

"Thomas:Fritsch$ops.de".replace(':', '.').replace('$', '@')

Oliver Wong - 20 Feb 2006 19:06 GMT
> BTW: I suspect, that Sun now regrets the Java-1.0 design-decision that a
> char is 2 bytes long.

   Yes. They allude to this regret in the Javadocs too:
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html

   - Oliver
leov - 21 Feb 2006 11:43 GMT
Thanks for the hints all, I've got it working now for Latin-1

leo
John O'Conner - 22 Feb 2006 08:43 GMT
>> BTW: I suspect, that Sun now regrets the Java-1.0 design-decision that a
>> char is 2 bytes long.
[quoted text clipped - 3 lines]
>
>    - Oliver

I think that given the situation, we came up with the most reasonable
solution for 1.5. Unicode had evolved past 65k characters for a long
time...frankly, we ignored it as long as possible. With 1.5, the demand
was overwhelming...and legitimate, real characters had shown up in the
Unicode 4.0 specification. We had to find some way to move Java up to
the new 4.0 spec. We considered practically everything...making a new
char32 type, using ints exclusively as characters, changing the
definition of char to be 32 bits wide, etc. Finally, we have what we
have now...after much debate. It isn't perfect, but it works.

Best of luck,
John O'Conner
Roedy Green - 24 Feb 2006 14:52 GMT
On Mon, 20 Feb 2006 18:16:05 GMT, Thomas Fritsch
<i.dont.like.spam@invalid.com> wrote, quoted or indirectly quoted
someone who said :

>BTW: I suspect, that Sun now regrets the Java-1.0 design-decision that a
>char is 2 bytes long.

I don't think so.  Going to 32-bit chars would double the RAM requirement
for character processing. That is mostly what I do with Java.  It
would cut my effective ram  heap in two. This would mean more frequent
GC.  Those characters are mainly needed for Chinese, and even then I
understand they are optional.

Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Oliver Wong - 24 Feb 2006 15:09 GMT
> On Mon, 20 Feb 2006 18:16:05 GMT, Thomas Fritsch
> <i.dont.like.spam@invalid.com> wrote, quoted or indirectly quoted
[quoted text clipped - 8 lines]
> GC.  Those characters are mainly needed for Chinese, and even then I
> understand they are optional.

   I'm not sure if one of the specifications forbid this, but perhaps Java
could *appear* to be using 32-bit chars, but the VM actually internally uses
UTF-16 or even UTF-8 encoding.

   I think it'd be more elegant (though perhaps less practical) if the char
data type was not considered a numeric type at all, and did not have any
bit-size. As Unicode expands, so would the implementations of the char data
type, without breaking existing code (since existing code shouldn't be
depending on char being of size 16-bit or anything like that).

   - Oliver
Stefan Ram - 24 Feb 2006 15:18 GMT
>I'm not sure if one of the specifications forbid this, but
>perhaps Java could *appear* to be using 32-bit chars, but the
>VM actually internally uses UTF-16 or even UTF-8 encoding.

 This (with UTF-8) is done in Perl 5.
Roedy Green - 24 Feb 2006 20:10 GMT
>>I'm not sure if one of the specifications forbid this, but
>>perhaps Java could *appear* to be using 32-bit chars, but the
>>VM actually internally uses UTF-16 or even UTF-8 encoding.
>
>  This (with UTF-8) is done in Perl 5.

the problem with that is charAt, indexOf etc all greatly slow down.
Even substring could be a beast if you actually try to figure out the
length in bytes.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Oliver Wong - 24 Feb 2006 20:26 GMT
>>>I'm not sure if one of the specifications forbid this, but
>>>perhaps Java could *appear* to be using 32-bit chars, but the
[quoted text clipped - 5 lines]
> Even substring could be a beast if you actually try to figure out the
> length in bytes.

   When you're dealing with unicode characters above \uffff, charAt()
doesn't do what one would expect it to do... Is it better to have a fast
implementation that works some of the time, or a slow implementation that
works all the time?
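
   A small sketch of the mismatch, assuming Java 5 or later (where codePointAt and codePointCount exist). The sample string is one supplementary character, U+E0050, followed by 'a'; the class name is made up.

```java
// Hypothetical demo class, not part of any library.
final class CharAtVersusCodePoint {

    // One supplementary character (a surrogate pair) followed by 'a'.
    static final String SAMPLE = "\udb40\udc50a";

    public static void main(String[] args) {
        // length() counts 16-bit char values, not characters.
        System.out.println("chars:       " + SAMPLE.length());
        System.out.println("code points: " + SAMPLE.codePointCount(0, SAMPLE.length()));

        // charAt(0) returns only the high surrogate of the pair.
        System.out.printf("charAt(0)      = U+%04X%n", (int) SAMPLE.charAt(0));
        // codePointAt(0) combines both surrogates into the real code point.
        System.out.printf("codePointAt(0) = U+%04X%n", SAMPLE.codePointAt(0));
    }
}
```

   Here length() is 3 while the string holds only 2 code points, and charAt(0) yields U+DB40 where codePointAt(0) yields U+E0050.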

   Actually, perhaps we could have multiple implementations of the String
interface. You could have an 8-bit-per-character String implementation for
strings which consist mostly of English characters, a 16-bit implementation
for String for European languages and mathematical symbols, and a 32-bit
implementation to handle everything else (for now).

   Since most Java programs use strings like so:

<example>
String foo = "Hello world";
</example>

instead of

<example>
String foo = new String("Hello world");
</example>

   the compiler could actually, at compile time, look at what kind of
string it is dealing with, and use the appropriate subclass. Similar
intelligence (except at runtime instead of compile time) could be built into
BufferedReader, and other classes which act as factories for Strings.
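
   A rough sketch of what that could look like, written as ordinary library code rather than a compiler change. Everything here (Text, NarrowText, WideText, Texts.of) is a hypothetical name, and only the 8-bit and 16-bit variants are shown; a 32-bit variant would follow the same pattern.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical string interface: callers never see the storage width.
interface Text {
    int length();
    int charAt(int index);
}

// 8 bits per character, for strings whose chars all fit in 0..FF.
final class NarrowText implements Text {
    private final byte[] data;
    NarrowText(byte[] data) { this.data = data; }
    public int length() { return data.length; }
    public int charAt(int index) { return data[index] & 0xFF; }
}

// 16 bits per character, the layout java.lang.String exposes today.
final class WideText implements Text {
    private final char[] data;
    WideText(char[] data) { this.data = data; }
    public int length() { return data.length; }
    public int charAt(int index) { return data[index]; }
}

// Factory that inspects the content and picks the narrowest storage.
final class Texts {
    static Text of(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return new WideText(s.toCharArray());
            }
        }
        // Every char is in 0..FF, so ISO-8859-1 encodes each one as one byte.
        return new NarrowText(s.getBytes(StandardCharsets.ISO_8859_1));
    }
}
```

   With this sketch, Texts.of("Hello world") would be stored at one byte per character, while a string containing, say, the euro sign would fall back to the 16-bit form.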

   - Oliver
Chris Uppal - 25 Feb 2006 10:05 GMT
>     Actually, perhaps we could have multiple implementations of the String
> interface. You could have an 8-bit-per-character String implementation for
> strings which consist mostly of English characters, a 16-bit
> implementation for String for European languages and mathematical
> symbols, and a 32-bit implementation to handle everything else (for now).

I put together an implementation of the same basic idea (for Smalltalk -- where
the absence of static typing allows such things to work a lot better).

There's a separation between the interface to my strings (which are
intersubstitutable with the implementation's built-in String class), and their
physical representation.  One of the physical classes represents its data as an
internal Array of UnicodeCharacters (this is mainly meant as a
simple-as-possible implementation for sanity checking and unit tests).  Most of
the other implementations keep their data as a ByteArray internally and use one
or another UnicodeByteEncoding to interpret it.  There are encodings for
UTF-8/16/32, plus the obvious-but-doesn't-actually-exist "UTF-24", and Java's
weird encoding.

One of the features I plan, but haven't got around to implementing yet, is for
the variable-width encoded strings to keep a record of the first "glitch" in
the encoding -- the first position where there's a character which doesn't fit
in the encoding's minimum width.  That should (I hope) mean that UTF-8 can be
used efficiently in space /and/ time for data which is predominantly ASCII.
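
The "first glitch" idea could be sketched roughly as follows. This is a Java illustration under assumptions, not the Smalltalk implementation described above: the class name and methods are invented, the input is assumed to be a well-formed string, and the UTF-8 decoding is done by hand to keep the example self-contained.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class: UTF-8 storage plus the position of the first non-ASCII byte.
final class GlitchIndexedUtf8 {
    private final byte[] utf8;
    private final int firstGlitch; // index of first non-ASCII byte, or utf8.length

    GlitchIndexedUtf8(String s) {
        utf8 = s.getBytes(StandardCharsets.UTF_8);
        int i = 0;
        while (i < utf8.length && utf8[i] >= 0) { // ASCII bytes are non-negative
            i++;
        }
        firstGlitch = i;
    }

    int firstGlitch() { return firstGlitch; }

    /** Returns the n-th code point (zero-based). */
    int codePointAt(int n) {
        if (n < 0) throw new IndexOutOfBoundsException("index " + n);
        if (n < firstGlitch) return utf8[n]; // fast path: one byte per character
        int pos = firstGlitch;               // slow path: scan from the glitch
        for (int seen = firstGlitch; seen < n; seen++) {
            if (pos >= utf8.length) throw new IndexOutOfBoundsException("index " + n);
            pos += sequenceLength(utf8[pos]);
        }
        if (pos >= utf8.length) throw new IndexOutOfBoundsException("index " + n);
        return decodeAt(pos);
    }

    // Length in bytes of the UTF-8 sequence that starts with this lead byte.
    private static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;
        if (b < 0xE0) return 2;
        if (b < 0xF0) return 3;
        return 4;
    }

    private int decodeAt(int pos) {
        int len = sequenceLength(utf8[pos]);
        int b = utf8[pos] & 0xFF;
        // Strip the length marker bits from the lead byte.
        int cp = (len == 1) ? b : (b & (0xFF >> (len + 1)));
        for (int k = 1; k < len; k++) {
            cp = (cp << 6) | (utf8[pos + k] & 0x3F); // 6 payload bits per continuation byte
        }
        return cp;
    }
}
```

For data that is entirely ASCII the glitch position equals the length, so every lookup takes the constant-time path.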

Writing about it here reminds me that I really ought to get that stuff
finished...

   -- chris
Roedy Green - 25 Feb 2006 10:06 GMT
>    Actually, perhaps we could have multiple implementations of the String
>interface. You could have an 8-bit-per-character String implementation for
>strings which consist mostly of English characters, a 16-bit implementation
>for String for European languages and mathematical symbols, and a 32-bit
>implementation to handle everything else (for now)

that makes sense. Internally they could all be treated as the same
type to the programmer.  

You could do it like this:

A string literal could have a two-bit marker

00 stored as 8-bits per char NO MULTICHAR STRINGS Unicode 0..FF
(greater range than a single-byte UTF-8 char)

01 stored as 16-bits per char no multichars

10 stored as 32-bits per char no multichars.
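
Choosing the marker for a given string could be sketched like this. The class, constants and method are hypothetical names, and the code assumes Java 7 or later for binary literals.

```java
// Hypothetical helper that picks the two-bit width marker for a string.
final class WidthMarker {
    static final int BITS_8 = 0b00;   // every character in 0..FF
    static final int BITS_16 = 0b01;  // every character in 0..FFFF
    static final int BITS_32 = 0b10;  // at least one character above FFFF

    static int markerFor(String s) {
        int marker = BITS_8;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp > 0xFFFF) {
                return BITS_32;           // nothing wider exists, stop early
            }
            if (cp > 0xFF) {
                marker = BITS_16;
            }
            i += Character.charCount(cp); // advance by 1 or 2 char values
        }
        return marker;
    }
}
```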

A string then has many possible internal and hidden representations:

It would even be possible for a string to be a list of the calls to
append that created it, an array of a hodge podge of the three sizes.

The String class would be at liberty to reorganise Strings, collapsing
pieces, making them all one piece of the largest size, or splitting
them to isolate just a few difficult characters leaving the rest in
narrower strings.

This sounds horribly complicated, but even a newbie could implement
such a string class. It is just a lot of bookkeeping. In the cases
where a string has a single segment, the code is almost as fast as the
code we use today, and it would actually use LESS RAM, since so many
strings are made completely of characters in the range 0..FF.

The difficult part comes in optimising. When to split, when to join.
Actually splitting and joining are trivial.

Any JVM maker or AOT maker could implement this idea today with 16- and
8-bit Strings and you would never know unless you peeked inside. The big
payoff for mixed width strings internally  would come if Java started
using 32-bit Strings as the default.

Similarly optimisers might internally use arrays of byte or int
instead of long when the optimiser determines that in actuality that
suffices.  
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Stefan Ram - 24 Feb 2006 20:34 GMT
>>  This (with UTF-8) is done in Perl 5.
> the problem with that is charAt, indexOf etc all greatly slow down.

 This is what we've got today in Java to get the nth character
 from a string (because of surrogate pairs used): One can not
 just skip (n-1) char values, but has to analyze each char
 value for its surrogate property.

 So the current Java solution combines problems from both
 worlds: It needs more complicated algorithms to care for
 surrogate pairs (so getting to the nth character is slower),
 but this is not even hidden by a layer from the client, so he
 needs to be aware of it.

 It is not obvious that UTF-8 algorithms are slow, because
 the data is so small that it might often fit into a
 cache memory. Using UCS4 might simplify algorithms, but
 more strings might not fit into cache memory completely,
 which might slow down operations.

 Perl 5 might have the suspected slowdown, but at least it has
 a layer over its internal UTF-8, so that the client does not
 have to be aware of it. His algorithms on strings look simple
 and encode the intentions of the programmer, not distorted by
 having to care for surrogate pairs. In the long run, code that
 expresses the programmer's intention more cleanly might even
 lead to more chances for optimization. For example: Perl might
 change its internal representation to UCS4 later, while Java
 must keep surrogate pairs, because clients are written, which
 expect them.
Stefan Ram - 25 Feb 2006 03:03 GMT
>This is what we've got today in Java to get the nth character
>from a string (because of surrogate pairs used): One can not
>just skip (n-1) char values, but has to analyze each char
>value for its surrogate property.

 One might use:

final java.lang.String chString = string.substring( n - 1, n );
final int ch = java.lang.Character.codePointAt( chString, 0 );
Stefan Ram - 28 Feb 2006 04:54 GMT
>>This is what we've got today in Java to get the nth character
>>from a string (because of surrogate pairs used): One can not
[quoted text clipped - 3 lines]
>final java.lang.String chString = string.substring( n - 1, n );
>final int ch = java.lang.Character.codePointAt( chString, 0 );

 No! It seems as if the substring index is not the number
 of code points, just of char values.

public class Main
{ public static void main( final java.lang.String[] args )
 { java.lang.System.out.println( "\udb40\udc50a".substring( 1 )); }}

 The above string literal should contain only two code points,
 the second one being "a". But substring( 1 ) seems to give
 "\udc50a", which contains two chars, but is possibly no
 meaningful Unicode code point sequence at all.

 So how does one get the second code point?

public class Main
{ public static void main( final java.lang.String[] args )
 { final java.lang.String text = "\udb40\udc50a";
   java.lang.System.out.println
   ( text.substring( text.offsetByCodePoints( 0, 1 ))); }}
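
 The same two calls can be wrapped into a small helper that returns the nth code point directly. A sketch only: the class and method names are invented, and n is taken to be zero-based.

```java
// Hypothetical helper class, not part of any library.
final class NthCodePoint {

    /** Returns the code point at zero-based code point position n. */
    static int codePointAtPosition(String text, int n) {
        // Translate a code point position into a char index, then read
        // the whole code point (one or two char values) at that index.
        int charIndex = text.offsetByCodePoints(0, n);
        return text.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        String text = "\udb40\udc50a";
        System.out.printf("U+%04X%n", codePointAtPosition(text, 0));
        System.out.printf("U+%04X%n", codePointAtPosition(text, 1));
    }
}
```

 Note that offsetByCodePoints still has to walk the string from the start, so this is linear in n, not constant time.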
Roedy Green - 24 Feb 2006 14:39 GMT
>- how can I set another char encoding for FileWriter, it supports the
>method 'getEncoding()' , but no setEncoding()

see http://mindprod.com/applets/fileio.html

Tell it you want encoded chars, non-locale default, and it will
generate sample code for you.

See http://mindprod.com/jgloss/encoding.html
for background on various ways to do encoding and decoding.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Roedy Green - 24 Feb 2006 14:50 GMT
>- what encoding set will support the 0x81 (1byte) character?
in Unicode it is a control character. It is supported, but will not
encode into one byte in UTF-8.

see http://mindprod.com/jgloss/utf.html

I think you might enjoy  our special this evening, ISO-8859-1, it was
a great year, pure, elegant, easy to understand.

Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.


