>>> I tried :
>>> input = new InputSource(new FileInputStream("file.xml"));
[quoted text clipped - 14 lines]
> The original code specifies the *OUTPUT* encoding, but not the input
> one.
Oops, sorry, I misread your post, Chris.
Here's what I suspect is happening in the original code: A FileReader is
created with no specified encoding. A FileReader doesn't know anything about
XML, so it's not like the file reader is going to look for an XML
declaration and check its encoding attribute. Instead, the FileReader
just uses the system default encoding: it reads a stream of bytes from the
disk, transforms them into a stream of characters, and passes these
characters to the XMLReader. By the time the XMLReader receives these
characters, they've already been decoded under some specific encoding, so
it's "too late" for the XMLReader to try to use the encoding information
specified in the XML file.
That's why I suggested the OP use the constructor which takes in a
stream of bytes instead. The XMLReader will probably decode the first few
bytes using ASCII or UTF-8, until it finds an encoding specified in the
file, in which case it does whatever magic it needs to do to switch encoding
mid-stream.
And it turns out that's what the OP actually did. FileInputStream
processes files as a stream of bytes, and not as a stream of characters, so
no encoding/decoding is done by FileInputStream.
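To make the difference concrete, here is a minimal sketch (not the OP's code).
It assumes an in-memory document declared as ISO-8859-1 instead of file.xml,
and it uses an InputStreamReader with a deliberately wrong charset (UTF-8) to
stand in for a FileReader whose system default doesn't match the file. The
class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the names are illustrative, not from the thread.
class EncodingDemo {
    // A tiny document that declares ISO-8859-1 and contains one non-ASCII character.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><word>caf\u00e9</word>";

    // The document as it would sit on disk: raw ISO-8859-1 bytes.
    static byte[] latin1Bytes() {
        return XML.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Byte stream: the parser sees the raw bytes and can honour the declaration.
    static String parseFromBytes(byte[] data) throws Exception {
        return parse(new InputSource(new ByteArrayInputStream(data)));
    }

    // Character stream: the bytes are decoded before the parser sees them.
    // UTF-8 is assumed here as the (wrong) "system default" a FileReader might use.
    static String parseFromReader(byte[] data) throws Exception {
        Reader reader = new InputStreamReader(
            new ByteArrayInputStream(data), StandardCharsets.UTF_8);
        return parse(new InputSource(reader));
    }

    // Collects the text content reported through SAX character events.
    private static String parse(InputSource source) throws Exception {
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = latin1Bytes();
        System.out.println("from bytes:  " + parseFromBytes(data));
        System.out.println("from reader: " + parseFromReader(data));
    }
}
```

With the byte stream the accented character should survive. With the
mis-decoding reader it should already be damaged by the time the parser sees
it, and the declaration in the file can't repair that.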
- Oliver
Dale King - 28 Jun 2006 06:41 GMT
> That's why I suggested the OP use the constructor which takes in a
> stream of bytes instead. The XMLReader will probably decode the first
> few bytes using ASCII or UTF-8, until it finds an encoding specified in
> the file, in which case it does whatever magic it needs to do to switch
> encoding mid-stream.
It is UTF-8 by the way. XML can get encoding information from:
- an external transport protocol (e.g. HTTP or MIME), which is really the
only reason to use a Reader as input to XMLReader.
- from an encoding declaration as in <?xml version='1.0' encoding='UTF-8'?>
- or from a byte order mark
If none of the above are present it is a fatal error for the XML to be
in anything but UTF-8.
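
For the first case in that list, here is a rough sketch of passing along an
encoding that came from outside the document. It assumes the charset name
arrived from something like an HTTP Content-Type header (here it is just a
string argument) and uses InputSource.setEncoding() on a byte stream rather
than a Reader. The class name, method names and sample document are made up
for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the names are illustrative, not from the thread.
class ExternalEncodingDemo {
    // ISO-8859-1 bytes with no XML declaration and no byte order mark,
    // so nothing inside the document tells the parser how to decode it.
    static byte[] undeclaredLatin1() {
        return "<word>caf\u00e9</word>".getBytes(StandardCharsets.ISO_8859_1);
    }

    // externalEncoding stands in for a charset learned from the transport;
    // pass null to let the parser fall back to its own detection.
    static String parse(byte[] data, String externalEncoding) throws Exception {
        InputSource source = new InputSource(new ByteArrayInputStream(data));
        if (externalEncoding != null) {
            source.setEncoding(externalEncoding);
        }
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = undeclaredLatin1();
        System.out.println("with hint: " + parse(data, "ISO-8859-1"));
        try {
            System.out.println("without hint: " + parse(data, null));
        } catch (Exception e) {
            // Expected with a conforming parser: the bytes are not valid UTF-8.
            System.out.println("without hint: rejected (" + e + ")");
        }
    }
}
```

Without the hint, a conforming parser has to assume UTF-8 and should reject
these bytes, which is the fatal error mentioned above.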

Dale King
> > As far as I can see, your earlier code would have used the charset
> > specified in
[quoted text clipped - 4 lines]
> The original code specifies the *OUTPUT* encoding, but not the input
> one.
Yes, precisely. And if the input encoding is not specified from code, then (as
I understand it) the SAX implementation is /supposed/ to take it from the XML
(where, in the OP's example, it was declared as "ISO-8859-1"). Using a
FileReader means that the input is decoded by that reader before the XML
parser sees it -- which may not be what is desired. More specifically, the
code I commented on uses the Java system default decoder (whatever that happens
to be) -- which is almost certainly not what is desired.
-- chris
Chris Uppal - 26 Jun 2006 18:08 GMT
I wrote:
> Yes, precisely. And if the input encoding is not specified from code,
> then (as I understand it) the SAX implementation is /supposed/ to take it
[quoted text clipped - 4 lines]
> decoder (whatever that happens to be) -- which is almost certainly not
> what is desired.
Oops, sorry, I misread your post Oliver.
;-) (But the "sorry" is real)
I misread both your post and the OP, in fact. I was under the impression that
he was originally using a FileInputStream, and you were "correcting" that to a
FileReader. My mistake.
-- chris