Wow! I am utterly confused...
In MS Windows XP Notepad I created an xml file. I pasted a single
tibetan character '\u0F40' as the text part of a certain element:
<body>ཀ</body>
I saved this file using notepad's Save as... -> encoding = UTF-8.
If I use a hex editor and view the document I can see that the
character is stored as I would expect in utf-8 encoding '\u0F40' -> E0
BD 80.
Next, I use dom4j to read in and parse the file. dom4j should be using
the xerces parser. I assume that the parser knows how to read the utf-8
file. After all, it prepends the xml file with:
<?xml version="1.0" encoding="UTF-8"?>
Question 1:
At this point, is the character stored in memory as '\u0f40'?
Maybe not, because if I print my xml as a string and view it in hex I
can see my utf-8 characters in there 'E0 BD 80'.
Next, I want to post my xml to a webserver using jakarta commons
httpclient. I add a header declaring the encoding as utf-8:
content-type=text/xml; charset=UTF-8. This action has the same effect
as taking my xml string and using the String.getBytes("UTF-8")
function. The bytes are pushed through the utf-8 encoding algorithm
again and are sent as 'c3 a0 c2 bd e2 82 ac'.
Question 2:
Is that how it should be done?
Question 3:
'c3 a0 c2 bd' translates back to 'E0 BD', but I have no idea where 'e2
82 ac' comes from... Any ideas?
> In MS Windows XP Notepad I created an xml file. I pasted a single
> tibetan character '\u0F40' as the text part of a certain element:
> <body>?</body>
I was amazed to find that the Ka letter in that paragraph is rendered correctly
by my newsreader !
> I saved this file using notepad's Save as... -> encoding = UTF-8.
Check whether Notepad has added a Byte Order Mark. It shouldn't (for UTF-8)
but I seem to remember that it usually does anyway.
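One way to check is to look at the first three bytes of the file: a UTF-8 Byte Order Mark is the sequence EF BB BF. The following is only a minimal sketch; the class name BomCheck and the helper startsWithUtf8Bom are made up for illustration, and the file path is whatever you pass on the command line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

final class BomCheck {
    /** Returns true if the data begins with the UTF-8 BOM bytes EF BB BF. */
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        // With an argument: inspect that file. Without: use a small in-memory sample.
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x3C};
        System.out.println(startsWithUtf8Bom(data) ? "UTF-8 BOM present" : "no UTF-8 BOM");
    }
}
```

The hex editor you already used will show the same thing: look for EF BB BF right before the first '<'.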
> If I use a hex editor and view the document I can see that the
> character is stored as I would expect in utf-8 encoding '\u0F40' -> E0
[quoted text clipped - 4 lines]
> file. After all, it prepends the xml file with:
> <?xml version="1.0" encoding="UTF-8"?>
It's not clear at this point whether you mean that the file you created has
such a charset declaration ?
> Question 1:
> At this point, is the character stored in memory as '\u0f40'?
Why don't you try printing out the integer value of the character(s) ? If it
is 0x0F40 then all is well so far, if not then something has already gone wrong
(presumably the parser didn't realise that it was parsing UTF-8).
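Something along these lines would do for the check. It is a sketch only: CodePointDump and dump are hypothetical names, and the two strings in main stand in for whatever text you get back from the parsed document.

```java
final class CodePointDump {
    /** Returns every code point of s in U+XXXX form, separated by spaces. */
    static String dump(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // What you hope to see if the parser decoded the UTF-8 properly.
        System.out.println(dump("\u0F40"));
        // What a string looks like if each UTF-8 byte became its own char.
        System.out.println(dump("\u00E0\u00BD\u20AC"));
    }
}
```

Because it works on the chars in memory, this avoids any question about which encoding the console or a file writer applies on the way out.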
> Maybe not, because if I print my xml as a string and view it in hex I
> can see my utf-8 characters in there 'E0 BD 80'.
The problem with that is that you don't know how the process of printing the
string is converting characters into binary.
> Next, I want to post my xml to a webserver using jakarta commons
> httpclient. I add a header declaring the encoding as utf-8:
[quoted text clipped - 9 lines]
> 'c3 a0 c2 bd' translates back to 'E0 BD', but I have no idea where 'e2
> 82 ac' comes from... Any ideas?
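It sounds as if the Ka character's UTF-8 representation hasn't been de-UTF-8-ed
as it was read in by the parser, but decoded one byte per char instead,
probably in the platform default charset. On Windows that is normally
windows-1252, which turns the bytes E0 BD 80 into the chars 0x00E0 0x00BD
0x20AC (byte 0x80 is the euro sign there). That String has then been encoded
as UTF-8 /again/, resulting in the gibberish you see: c3 a0, c2 bd, and
e2 82 ac for the euro sign.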
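The round trip can be reproduced in a few lines. This is a sketch under one assumption: that the file's bytes were decoded as windows-1252, the usual default charset on a Western Windows XP machine. The names DoubleEncodingDemo, hex and doubleEncode are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DoubleEncodingDemo {
    /** Formats bytes as lowercase hex pairs separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes text as UTF-8, decodes it with the wrong charset, encodes as UTF-8 again. */
    static String doubleEncode(String text, Charset wrongCharset) {
        byte[] onDisk = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(onDisk, wrongCharset);
        return hex(misread.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String ka = "\u0F40";
        // prints "e0 bd 80"
        System.out.println(hex(ka.getBytes(StandardCharsets.UTF_8)));
        // prints "c3 a0 c2 bd e2 82 ac"
        System.out.println(doubleEncode(ka, Charset.forName("windows-1252")));
        // prints "c3 a0 c2 bd c2 80"
        System.out.println(doubleEncode(ka, StandardCharsets.ISO_8859_1));
    }
}
```

The last line is there for comparison: had the bytes been misread as ISO-8859-1, the tail would be c2 80 rather than e2 82 ac, so the euro-sign bytes point at windows-1252.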
I don't know much about dom4j (or Xerces, come to that), but it might be
worth posting the code you use to open the XML file. I suspect it's not
decoding the UTF-8.
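For comparison, here is a sketch of the pattern that normally works: hand the parser the raw bytes and let it pick the encoding up from the XML declaration, rather than wrapping the file in a Reader that decodes with the default charset first. I have used the JDK's own JAXP DocumentBuilder so the example is self-contained; as far as I know dom4j's SAXReader has read(File) and read(InputStream) overloads that behave the same way, but that part is an assumption on my side. ParseFromBytes and bodyText are made-up names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class ParseFromBytes {
    /** Parses raw XML bytes and returns the text content of the root element. */
    static String bodyText(byte[] xmlBytes) {
        try {
            // The parser sees bytes, so it applies the declared encoding itself.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the bytes Notepad wrote to disk (without a BOM).
        byte[] file = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><body>\u0F40</body>"
                .getBytes(StandardCharsets.UTF_8);
        String text = bodyText(file);
        // prints "f40"
        System.out.println(Integer.toHexString(text.codePointAt(0)));
    }
}
```

If your code instead builds a FileReader or an InputStreamReader without naming a charset and passes that to the parser, the bytes have already been decoded with the default charset before the parser gets a say.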
-- chris
Alex Buell - 20 Dec 2005 12:46 GMT
> > In MS Windows XP Notepad I created an xml file. I pasted a single
> > tibetan character '\u0F40' as the text part of a certain element:
> > <body>?</body>
>
> I was amazed to find that the Ka letter in that paragraph is rendered correctly
> by my newsreader !
No it isn't. It is shown as a ? in your post, but I can do this: ཀ.
Perfect.

Signature
http://www.munted.org.uk
Anyone that thinks an imaginary deity is going to protect them against
earthquakes and hurricanes needs psychiatric help.
Chris Uppal - 20 Dec 2005 13:14 GMT
> > > In MS Windows XP Notepad I created an xml file. I pasted a single
> > > tibetan character '\u0F40' as the text part of a certain element:
[quoted text clipped - 5 lines]
> No it isn't. It is shown as a ? in your post, but I can do this: ?.
> Perfect.
Well, it was /rendered/ correctly (even in the reply composition window), it's
just that it throws the character away before actually sending the post...
-- chris
Alex Buell - 20 Dec 2005 13:20 GMT
> > > > In MS Windows XP Notepad I created an xml file. I pasted a single
> > > > tibetan character '\u0F40' as the text part of a certain element:
[quoted text clipped - 8 lines]
> Well, it was /rendered/ correctly (even in the reply composition window), it's
> just that it throws the character away before actually sending the post...
I actually posted it as a UTF-8 enabled message, which might be why I
can do ཀ. I strongly suggest you have a look at Sylpheed, there's a
version for Windows (http://www.sylpheed.good-day.net). The author is
Japanese and very much aware of those issues and that's why it's
excellent.

Chris Uppal - 21 Dec 2005 09:37 GMT
> I actually posted it as an UTF-8 enabled message which might be why I
> can do ?. I strongly suggest you have a look at Sylpheed, there's a
> version for Windows (http://www.sylpheed.good-day.net). The author is
> Japanese and very much aware of those issues and that's why it's
> excellent.
The URL seems to be:
http://www.sylpheed.good-day.je/
Looks interesting. I'll probably try it out when the Windows version leaves
beta. (I'd rather not use Outlook Express, but -- for all its many defects --
I still haven't found anything like an acceptable replacement.)
-- chris
Alex Buell - 21 Dec 2005 11:08 GMT
On Wed, 21 Dec 2005 09:37:04 -0000 "Chris Uppal"
<chris.uppal@metagnostic.REMOVE-THIS.org> waved a wand and this message
magically appeared:
> > I actually posted it as an UTF-8 enabled message which might be why I
> > can do ?. I strongly suggest you have a look at Sylpheed, there's a
[quoted text clipped - 4 lines]
> The URL seems to be:
> http://www.sylpheed.good-day.je/
Correction: http://sylpheed.good-day.net
> Looks interesting. I'll probably try it out when the Windows version leaves
> beta. (I'd rather not use Outlook Express, but -- for all its many defects --
> I still haven't found anything like an acceptable replacement.)
Anything but Outlook, please ;o)
