Java Forum / General / June 2006

Xml parser and character encoding

Ghislain Benrais - 26 Jun 2006 15:51 GMT
Hello,
   I am new to Java and I am running a short program that processes XML files.
   Everything worked well until I received XML files containing the character
itself instead of its numeric character reference (for instance 'é' instead of
'&#xe9;'). I thought Java would handle it, but unexpectedly it works
under DOS and fails under Linux!
   Do you have any explanation?
Input file :
=======
<?xml version="1.0" encoding="ISO-8859-1" ?>
<values>
   <value>détail</value>
   <value>d&#xe9;tail</value>
</values>
Java program :
==========
package javaapplication2;
import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.util.*;
public class Main extends DefaultHandler  {
   private String CData;
   // Encoding
   static String encoding;
   private Writer out;
   public Main(String[] args) {
       super();
       encoding = "ISO-8859-15";
       try {
           XMLReader xr = XMLReaderFactory.createXMLReader();
           xr.setContentHandler( this );
           out = new OutputStreamWriter(new FileOutputStream("out.txt"), encoding);
           InputSource input = null;
           input =  new InputSource(new FileReader("file.xml"));
           xr.parse(input);
           out.close();
       }catch ( Exception e ) {
           e.printStackTrace();
       }
   }
   public static void main(String[] args) {
       // TODO code application logic here
       Main main = new Main(args);
   }
   //--------------------------------------------------------------------------------------
   // Parser methods
   //--------------------------------------------------------------------------------------
   public void startElement( String namespaceURI, String localName, String qName, Attributes attr ) throws SAXException {
       CData = new String("");
   }
   public void characters(char[] chars, int iStart, int iLen) {
       CData = CData + new String(chars, iStart, iLen);
   }
   public void endElement( String namespaceURI, String localName, String qName ) throws SAXException {
       if (localName.equals( "value" )) {
           try{
               out.write(CData+"\n");
           }catch ( Exception e ) {
              e.printStackTrace();
           }
           return;
       }
   }
}
Result if run from DOS
================
détail
détail
Result if run from Linux
=================
d?tail
détail

   Thanks in advance,
Ghislain
Oliver Wong - 26 Jun 2006 16:12 GMT
> Hello,
>    I am new to java and I run a short program processing xml files.
[quoted text clipped - 6 lines]
> =======
> <?xml version="1.0" encoding="ISO-8859-1" ?>
[most of the code snipped]
>            input =  new InputSource(new FileReader("file.xml"));

   From http://java.sun.com/j2se/1.5.0/docs/api/java/io/FileReader.html:

<quote>
The constructors of this class assume that the default character encoding
and the default byte-buffer size are appropriate. To specify these values
yourself, construct an InputStreamReader on a FileInputStream.
</quote>

   In other words, you're not specifying the encoding in the reader, so it
uses the platform default encoding, and under Linux that default evidently
doesn't match the encoding used in your XML file.

   Did you try using the constructor of InputSource which takes a byte
stream instead of a character stream?
http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/InputSource.html#InputSource(java.io.InputStream)
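
   A minimal sketch of the byte-stream approach suggested above. It is not
the OP's program: the class and method names (ByteStreamParseDemo,
ValueCollector, parseValues, sampleXml) are made up for illustration, the
sample document is built in memory instead of being read from file.xml, and
it assumes a reasonably recent JDK (SAXParserFactory and StandardCharsets).
Because the parser receives raw bytes, it can read the encoding declaration
itself before decoding any text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ByteStreamParseDemo {

    /** Collects the text content of every <value> element. */
    static class ValueCollector extends DefaultHandler {
        final List<String> values = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            text.setLength(0); // start a fresh buffer for each element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // The default SAXParserFactory is not namespace-aware, so compare qName.
            if (qName.equals("value")) {
                values.add(text.toString());
            }
        }
    }

    /** Parses raw bytes; the parser decides how to decode them. */
    static List<String> parseValues(byte[] xmlBytes) {
        ValueCollector handler = new ValueCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new ByteArrayInputStream(xmlBytes)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return handler.values;
    }

    /** The input file from the original post, encoded as ISO-8859-1 bytes. */
    static byte[] sampleXml() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n"
                + "<values>\n"
                + "   <value>d\u00e9tail</value>\n"
                + "   <value>d&#xe9;tail</value>\n"
                + "</values>\n";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        for (String value : parseValues(sampleXml())) {
            System.out.println(value);
        }
    }
}
```

   With a real file, the same idea would be
new InputSource(new FileInputStream("file.xml")), which is the constructor
linked above.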

   - Oliver
Ghislain Benrais - 26 Jun 2006 16:42 GMT
I tried :
           input =  new InputSource(new FileInputStream("file.xml"));
and it works !
Thank you Oliver !
Chris Uppal - 26 Jun 2006 16:58 GMT
> I tried :
>             input =  new InputSource(new FileInputStream("file.xml"));
> and it works !

But now you are overriding the encoding specified in the input file with the
one used by the FileInputStream -- and that will be whatever your Java system
default is.

As far as I can see, your earlier code would have used the charset specified in
the XML file, and -- as far as I can tell that /ought/ to work correctly.  I
have no idea why it doesn't.

   -- chris
Oliver Wong - 26 Jun 2006 17:29 GMT
>> I tried :
>>             input =  new InputSource(new FileInputStream("file.xml"));
[quoted text clipped - 11 lines]
> I
> have no idea why it doesn't.

   The original code specifies the *OUTPUT* encoding, but not the input
one.

   - Oliver
Oliver Wong - 26 Jun 2006 17:54 GMT
>>> I tried :
>>>             input =  new InputSource(new FileInputStream("file.xml"));
[quoted text clipped - 14 lines]
>    The original code specifies the *OUTPUT* encoding, but not the input
> one.

   Oops, sorry, I misread your post, Chris.

   Here's what I suspect is happening in the original code: A FileReader is
created with no specified encoding. A FileReader doesn't know anything about
XML, so it's not going to look for an XML declaration and check its
encoding attribute. Instead, the FileReader just uses the system default
encoding: it reads a stream of bytes from the disk, transforms them into a
stream of characters, and passes these characters to the XMLReader. By the
time the XMLReader receives these characters, they've already been decoded
under some specific encoding, so it's "too late" for the XMLReader to try to
use the encoding information specified in the XML file.
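
   The effect described above can be reproduced without any XML parser. The
sketch below rests on assumptions the thread does not confirm: that the Linux
default encoding was UTF-8 and the Windows ("DOS") default was windows-1252.
The class and method names are hypothetical. It decodes the ISO-8859-1 bytes
of "détail" the way a reader with a fixed charset would, then re-encodes the
result in ISO-8859-15, as the OP's writer does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DefaultCharsetMismatchDemo {

    /** Decodes bytes with one charset and re-encodes the resulting text with another. */
    static byte[] transcode(byte[] input, Charset readAs, Charset writeAs) {
        // Malformed input becomes U+FFFD; unmappable characters become '?' on output.
        String decoded = new String(input, readAs);
        return decoded.getBytes(writeAs);
    }

    public static void main(String[] args) {
        // "détail" as stored in the ISO-8859-1 file: 64 E9 74 61 69 6C
        byte[] fileBytes = "d\u00e9tail".getBytes(StandardCharsets.ISO_8859_1);
        Charset outputCharset = Charset.forName("ISO-8859-15");

        // Assumed Linux default: the lone byte E9 is not valid UTF-8.
        byte[] linuxResult = transcode(fileBytes, StandardCharsets.UTF_8, outputCharset);
        System.out.println(new String(linuxResult, StandardCharsets.US_ASCII)); // prints "d?tail"

        // Assumed Windows default: E9 decodes to 'é', which survives the round trip.
        byte[] windowsResult = transcode(fileBytes, Charset.forName("windows-1252"), outputCharset);
        System.out.println(Arrays.equals(windowsResult, fileBytes)); // prints "true"
    }
}
```

   The character reference &#xe9; is plain ASCII in the file, so it is
unaffected by the reader's charset, which would explain why the second line
of the output was correct on both systems.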

   That's why I suggested the OP use the constructor which takes in a
stream of bytes instead. The XMLReader will probably decode the first few
bytes using ASCII or UTF-8, until it finds an encoding specified in the
file, in which case it does whatever magic it needs to do to switch encoding
mid-stream.

   And it turns out that's what the OP actually did. FileInputStream
processes files as a stream of bytes, and not as a stream of characters, so
no encoding/decoding is done by FileInputStream.

   - Oliver
Dale King - 28 Jun 2006 06:41 GMT
>    That's why I suggested the OP use the constructor which takes in a
> stream of bytes instead. The XMLReader will probably decode the first
> few bytes using ASCII or UTF-8, until it finds an encoding specified in
> the file, in which case it does whatever magic it needs to do to switch
> encoding mid-stream.

It is UTF-8 by the way. XML can get encoding information from:

- an external transport protocol (e.g. HTTP or MIME), which is really the
only reason to use a Reader as input to XMLReader
- an encoding declaration, as in <?xml version='1.0' encoding='UTF-8'?>
- a byte order mark

If none of the above are present it is a fatal error for the XML to be
in anything but UTF-8.
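
For the first case in the list, where the encoding is known only from outside
the document, a Reader with an explicit charset is one way to hand that
knowledge to the parser. This is a rough sketch under stated assumptions: the
names ExternalEncodingDemo, parseText and transportCharset are invented for
illustration, and the charset would in practice come from something like an
HTTP Content-Type header rather than being hard-coded.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ExternalEncodingDemo {

    /**
     * Parses xmlBytes after decoding them with a charset supplied from outside
     * the document, and returns all character data found in it.
     */
    static String parseText(byte[] xmlBytes, Charset transportCharset) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // The Reader does the decoding, so the parser only ever sees characters.
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), transportCharset)) {
            SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(reader), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // No encoding declaration and no byte order mark: the bytes alone do not
        // say that they are ISO-8859-1, so the caller has to.
        byte[] body = "<value>d\u00e9tail</value>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseText(body, StandardCharsets.ISO_8859_1));
    }
}
```

The trade-off is the one discussed earlier in the thread: once a Reader is
used, a wrong charset cannot be corrected by the parser.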
 Dale King

Chris Uppal - 26 Jun 2006 17:58 GMT
> > As far as I can see, your earlier code would have used the charset
> > specified in
[quoted text clipped - 4 lines]
>     The original code specifies the *OUTPUT* encoding, but not the input
> one.

Yes, precisely.  And if the input encoding is not specified from code, then (as
I understand it) the SAX implementation is /supposed/ to take it from the XML
(where, in the OP's example, it was declared as "ISO-8859-1").  Using a
FileInputStream means that the input is decoded by that stream before the XML
parser sees it -- which may not be what is desired.  More specifically, the
code I commented on uses the Java system default decoder (whatever that happens
to be) -- which is almost certainly not what is desired.

   -- chris
Chris Uppal - 26 Jun 2006 18:08 GMT
I wrote:

> Yes, precisely.  And if the input encoding is not specified from code,
> then (as I understand it) the SAX implementation is /supposed/ to take it
[quoted text clipped - 4 lines]
> decoder (whatever that happens to be) -- which is almost certainly not
> what is desired.

Oops, sorry, I misread your post Oliver.

;-)  (But the "sorry" is real)

I misread both your post and the OP, in fact.  I was under the impression that
he was originally using a FileInputStream, and you were "correcting" that to a
FileReader.  My mistake.

   -- chris

