Java Forum / General / December 2005

xml, windows, utf-8, and httpclient

gernlearner@yahoo.com - 19 Dec 2005 22:09 GMT
Wow! I am utterly confused...

In MS Windows XP Notepad I created an xml file. I pasted a single
tibetan character '\u0F40' as the text part of a certain element:
<body>ཀ</body>

I saved this file using notepad's Save as... -> encoding = UTF-8.

If I use a hex editor and view the document I can see that the
character is stored as I would expect in utf-8 encoding '\u0F40' -> E0
BD 80.
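
The byte sequence quoted above can be confirmed from Java. This is a minimal sketch, assuming a modern JDK; the class name `Utf8Bytes` and the helper `hex` are illustrative and do not come from the thread.

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes {
    // Formats bytes as upper-case hex pairs separated by single spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0F40 needs three bytes in UTF-8: 1110xxxx 10xxxxxx 10xxxxxx
        byte[] encoded = "\u0F40".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(encoded)); // prints "E0 BD 80"
    }
}
```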

Next, I use dom4j to read in and parse the file. dom4j should be using
the xerces parser. I assume that the parser knows how to read the utf-8
file. After all, it prepends the xml file with:
<?xml version="1.0" encoding="UTF-8"?>

Question 1:
At this point, is the character stored in memory as '\u0f40'?

Maybe not, because if I print my xml as a string and view it in hex I
can see my utf-8 characters in there 'E0 BD 80'.

Next, I want to post my xml to a webserver using jakarta commons
httpclient. I add a header declaring the encoding as utf-8:
content-type=text/xml; charset=UTF-8. This action has the same effect
as taking my xml string and using the String.getBytes("UTF-8")
function. The bytes are pushed through the utf-8 encoding algorithm
again and are sent as 'c3 a0 c2 bd e2 82 ac'.

Question 2:
Is that how it should be done?

Question 3:
'c3 a0 c2 bd' translates back to 'E0 BD', but I have no idea where 'e2
82 ac' comes from... Any ideas?
Chris Uppal - 20 Dec 2005 12:13 GMT
> In MS Windows XP Notepad I created an xml file. I pasted a single
> tibetan character '\u0F40' as the text part of a certain element:
> <body>?</body>

I was amazed to find that the Ka letter in that paragraph is rendered correctly
by my newsreader !

> I saved this file using notepad's Save as... -> encoding = UTF-8.

Check whether Notepad has added a Byte Order Mark.  It shouldn't (for UTF-8)
but I seem to remember that it usually does anyway.
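
A UTF-8 Byte Order Mark is the three bytes EF BB BF at the very start of the file, so the check suggested here is easy to script. The following is a sketch only; `BomCheck` and its method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class BomCheck {
    // True if the data begins with EF BB BF, the UTF-8 encoding of U+FEFF.
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    // Convenience wrapper for a file on disk.
    static boolean fileStartsWithUtf8Bom(Path file) throws IOException {
        return startsWithUtf8Bom(Files.readAllBytes(file));
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'b', '/', '>'};
        byte[] withoutBom = {'<', 'b', '/', '>'};
        System.out.println(startsWithUtf8Bom(withBom));    // prints "true"
        System.out.println(startsWithUtf8Bom(withoutBom)); // prints "false"
    }
}
```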

> If I use a hex editor and view the document I can see that the
> character is stored as I would expect in utf-8 encoding '\u0F40' -> E0
[quoted text clipped - 4 lines]
> file. After all, it prepends the xml file with:
> <?xml version="1.0" encoding="UTF-8"?>

It's not clear at this point whether you mean that the file you created has
such a charset declaration ?

> Question 1:
> At this point, is the character stored in memory as '\u0f40'?

Why don't you try printing out the integer value of the character(s) ?  If it
is 0x0F40 then all is well so far, if not then something has already gone wrong
(presumably the parser didn't realise that it was parsing UTF-8).
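
One way to do that check, sketched under the assumption of Java 8 or later (for `String.codePoints()`). The class `CodePoints` and the method `describe` are hypothetical names. The second call in `main` shows what a wrongly decoded string might look like.

```java
class CodePoints {
    // Lists every code point of the text in U+XXXX notation.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // A correctly decoded Ka is a single char.
        System.out.println(describe("\u0F40")); // prints "U+0F40"
        // Three chars in place of one mean the UTF-8 bytes were never decoded as UTF-8.
        System.out.println(describe("\u00E0\u00BD\u20AC")); // prints "U+00E0 U+00BD U+20AC"
    }
}
```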

> Maybe not, because if I print my xml as a string and view it in hex I
> can see my utf-8 characters in there 'E0 BD 80'.

The problem with that is that you don't know how the process of printing the
string is converting characters into binary.
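
To illustrate the point: the same one-character String turns into different bytes depending on which charset does the conversion, and `System.out` applies a charset of its own, typically a platform default. A small sketch, with `PrintingPitfall` and `hex` as illustrative names:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PrintingPitfall {
    // Encodes the text with the given charset and formats the bytes as hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ka = "\u0F40";
        System.out.println(hex(ka, StandardCharsets.UTF_8));      // prints "E0 BD 80"
        System.out.println(hex(ka, StandardCharsets.UTF_16BE));   // prints "0F 40"
        // ISO-8859-1 cannot represent U+0F40, so getBytes substitutes '?' (0x3F).
        System.out.println(hex(ka, StandardCharsets.ISO_8859_1)); // prints "3F"
    }
}
```

So a hex dump of printed output says more about the charset used for printing than about the chars held in memory.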

> Next, I want to post my xml to a webserver using jakarta commons
> httpclient. I add a header declaring the encoding as utf-8:
[quoted text clipped - 9 lines]
> 'c3 a0 c2 bd' translates back to 'E0 BD', but I have no idea where 'e2
> 82 ac' comes from... Any ideas?

It sounds as if the Ka character's UTF-8 representation hasn't been de-UTF-8-ed
as it was read in by the parser, thus resulting in a String containing the
chars 0x00E0 0x00BD 0x0080. Which has then been encoded as UTF-8 /again/
resulting in the gibberish you see.
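
The double encoding can be reproduced in a few lines, and doing so also accounts for the trailing 'e2 82 ac' from Question 3. In windows-1252 (a common default on Western Windows XP installs, which is an assumption about the poster's machine) the byte 0x80 decodes to the euro sign U+20AC, and U+20AC encodes to E2 82 AC in UTF-8. A strict ISO-8859-1 decode would have given U+0080 and the bytes C2 80 instead. The names `DoubleEncoding` and `misdecodeThenEncode` below are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DoubleEncoding {
    // Formats bytes as lower-case hex pairs, matching the notation in the question.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes UTF-8 bytes with the wrong charset, then encodes the result as UTF-8.
    static byte[] misdecodeThenEncode(byte[] utf8Bytes, Charset wrongCharset) {
        String mangled = new String(utf8Bytes, wrongCharset);
        return mangled.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] ka = "\u0F40".getBytes(StandardCharsets.UTF_8); // e0 bd 80

        // Matches the bytes reported in the question.
        System.out.println(hex(misdecodeThenEncode(ka, Charset.forName("windows-1252"))));
        // prints "c3 a0 c2 bd e2 82 ac"

        // For comparison: ISO-8859-1 keeps 0x80 as U+0080.
        System.out.println(hex(misdecodeThenEncode(ka, StandardCharsets.ISO_8859_1)));
        // prints "c3 a0 c2 bd c2 80"
    }
}
```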

I don't know much about dom4j (or Xerces, come to that), but it might be
worth posting the code you use to open the XML file.  I suspect it's not
decoding the UTF-8.
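
For comparison, here is a sketch of handing the raw bytes to an XML parser so that the parser itself picks the encoding from the XML declaration. It uses the JDK's built-in DOM parser rather than dom4j, so it only approximates the original setup, and `ParseFromBytes` and `bodyText` are hypothetical names. A common way to get the wrong result instead is to read the file into a String through a Reader that uses the platform default charset before parsing.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class ParseFromBytes {
    // Parses raw bytes and returns the text content of the root element.
    // No charset is supplied here; the parser works it out from the document.
    static String bodyText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><body>\u0F40</body>";
        byte[] asStoredOnDisk = xml.getBytes(StandardCharsets.UTF_8);

        String text = bodyText(asStoredOnDisk);
        System.out.printf("U+%04X%n", (int) text.charAt(0)); // prints "U+0F40"
    }
}
```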

   -- chris
Alex Buell - 20 Dec 2005 12:46 GMT
> > In MS Windows XP Notepad I created an xml file. I pasted a single
> > tibetan character '\u0F40' as the text part of a certain element:
> > <body>?</body>
>
> I was amazed to find that the Ka letter in that paragraph is rendered correctly
> by my newsreader !

No it isn't. It is shown as a ? in your post, but I can do this: ཀ.
Perfect.

Signature

http://www.munted.org.uk

Anyone that thinks an imaginary deity is going to protect them against
earthquakes and hurricanes needs psychiatric help.

Chris Uppal - 20 Dec 2005 13:14 GMT
> > > In MS Windows XP Notepad I created an xml file. I pasted a single
> > > tibetan character '\u0F40' as the text part of a certain element:
[quoted text clipped - 5 lines]
> No it isn't. It is shown as a ? in your post, but I can do this: ?.
> Perfect.

Well, it was /rendered/ correctly (even in the reply composition window), it's
just that it throws the character away before actually sending the post...

   -- chris
Alex Buell - 20 Dec 2005 13:20 GMT
> > > > In MS Windows XP Notepad I created an xml file. I pasted a single
> > > > tibetan character '\u0F40' as the text part of a certain element:
[quoted text clipped - 8 lines]
> Well, it was /rendered/ correctly (even in the reply composition window), it's
> just that it throws the character away before actually sending the post...

I actually posted it as a UTF-8 enabled message which might be why I
can do ཀ. I strongly suggest you have a look at Sylpheed, there's a
version for Windows (http://www.sylpheed.good-day.net). The author is
Japanese and very much aware of those issues and that's why it's
excellent.

Chris Uppal - 21 Dec 2005 09:37 GMT
> I actually posted it as an UTF-8 enabled message which might be why I
> can do ?. I strongly suggest you have a look at Sylpheed, there's a
> version for Windows (http://www.sylpheed.good-day.net). The author is
> Japanese and very much aware of those issues and that's why it's
> excellent.

The URL seems to be:
   http://www.sylpheed.good-day.je/

Looks interesting.  I'll probably try it out when the Windows version leaves
beta.  (I'd rather not use Outlook Express, but -- for all its many defects --
I still haven't found anything like an acceptable replacement.)

   -- chris
Alex Buell - 21 Dec 2005 11:08 GMT
On Wed, 21 Dec 2005 09:37:04 -0000 "Chris Uppal"
<chris.uppal@metagnostic.REMOVE-THIS.org> waved a wand and this message
magically appeared:

> > I actually posted it as an UTF-8 enabled message which might be why I
> > can do ?. I strongly suggest you have a look at Sylpheed, there's a
[quoted text clipped - 4 lines]
> The URL seems to be:
>     http://www.sylpheed.good-day.je/

Correction: http://sylpheed.good-day.net

> Looks interesting.  I'll probably try it out when the Windows version leaves
> beta.  (I'd rather not use Outlook Express, but -- for all its many defects --
> I still haven't found anything like an acceptable replacement.)

Anything but Outlook, please ;o)

©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.