Hi:
I've got some XHTML documents that I'm searching for certain tags
using the classes in javax.xml.xpath. These documents contain a DTD
declaration for XHTML, with a public identifier. Since my application
needs to work without a network connection, I've downloaded the DTD
and associated entities and made them available to my application as
resources. I then set an EntityResolver on the DocumentBuilder that I
get from DocumentBuilderFactory.newInstance().newDocumentBuilder().
Here's the relevant code from the resolveEntity method:
    url = getClass().getResource (identifierMap.get(publicId));
    return new InputSource (url.toString());
When I run the application, I get the following message from the
parser:
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException:
Invalid byte 1 of 1-byte UTF-8 sequence.
After browsing around a bit, I tried:
    url = getClass().getResource (identifierMap.get(publicId));
    FileReader reader = new FileReader (new File (url.toURI()));
    return new InputSource (reader);
but this had the same problem.
I downloaded the files from the W3C site, both by using Firefox and by
using wget. In both cases I get the same behavior.
I don't know much about character encodings, so I'm at a loss as to
what to try next. Any suggestions would be greatly appreciated.
Ryan
Lew - 13 Jun 2007 18:05 GMT
> Hi:
>
[quoted text clipped - 3 lines]
> com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException:
> Invalid byte 1 of 1-byte UTF-8 sequence.
Ideally, all XML documents should be in UTF-8 encoding, which is what a parser
assumes when nothing says otherwise. Apparently the DTD or your XML file isn't.
When a document uses another encoding, its XML declaration should specify that
encoding.
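
To illustrate the point about the XML declaration, here is a minimal sketch. The class and method names are made up for this example, and ISO-8859-1 is only an assumed encoding; the thread never says what the file was actually saved as. The same bytes parse cleanly when the declaration names the encoding they were stored in, and fail with a UTF-8 decoding error when it is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, for illustration only.
class EncodingDeclarationDemo {

    // Encodes the XML text with the given charset, hands the raw bytes to the
    // parser, and returns the text content of the root element.
    static String parseText(String xml, Charset storedAs) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        byte[] bytes = xml.getBytes(storedAs);
        // With a byte stream, the parser decides the encoding itself, using
        // the XML declaration if present and UTF-8 otherwise.
        Document doc = builder.parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }
}
```

Calling parseText with "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><p>caf\u00e9</p>" and StandardCharsets.ISO_8859_1 should return the text unchanged, while the same call without the declaration should throw.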
> After browsing around a bit, I tried:
>
[quoted text clipped - 3 lines]
>
> but this had the same problem.
Have you considered using the InputStreamReader(InputStream, Charset)
constructor?
<http://java.sun.com/javase/6/docs/api/java/io/InputStreamReader.html#InputStreamReader(java.io.InputStream,%20java.nio.charset.Charset)>
This will let you specify the document encoding to match how it's stored.
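
A rough sketch of how that could look inside an EntityResolver, assuming the entities are classpath resources looked up by public identifier as in the original post. The class name LocalDtdResolver, the decode helper, and the storedAs parameter are invented for this example; only identifierMap comes from the original code. Note that once an InputSource carries a character stream, the parser reads it directly and disregards any encoding declaration in the entity, so the charset passed in has to match how the file is really stored.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver, for illustration only.
class LocalDtdResolver implements EntityResolver {
    private final Map<String, String> identifierMap; // public ID -> resource name
    private final Charset storedAs;                  // assumed encoding of the resources

    LocalDtdResolver(Map<String, String> identifierMap, Charset storedAs) {
        this.identifierMap = identifierMap;
        this.storedAs = storedAs;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws IOException {
        String resourceName =
            (publicId == null) ? null : identifierMap.get(publicId);
        if (resourceName == null) {
            return null; // unknown entity: let the parser use its default lookup
        }
        InputStream in = getClass().getResourceAsStream(resourceName);
        if (in == null) {
            throw new IOException("Missing resource: " + resourceName);
        }
        return decode(in, storedAs, publicId, systemId);
    }

    // Wraps the byte stream in a Reader that decodes with an explicit charset
    // instead of the platform default that FileReader would use.
    static InputSource decode(InputStream in, Charset storedAs,
                              String publicId, String systemId) {
        InputSource source = new InputSource(new InputStreamReader(in, storedAs));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }
}
```

It would be installed with builder.setEntityResolver(new LocalDtdResolver(identifierMap, charset)) before parsing.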

Lew
Ryan McFall - 13 Jun 2007 18:08 GMT
Pardon my stupidity - the XML file was saved by someone else, and
apparently it was saved in some encoding other than UTF-8. Re-saving it
as UTF-8 solved my problem.
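
For anyone who needs to do the re-saving programmatically rather than in an editor, something along these lines should work. This is only a sketch: the class and method names are made up, and the caller has to know the encoding the file was originally saved in, which this thread never identifies. If the file carries an XML declaration naming the old encoding, that declaration would also need to be updated by hand.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical utility, for illustration only.
class ResaveAsUtf8 {

    // Reads the source file using the encoding it was actually saved in and
    // writes the same text to the target file encoded as UTF-8.
    static void transcode(Path source, Charset sourceCharset, Path target)
            throws IOException {
        String text = new String(Files.readAllBytes(source), sourceCharset);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }
}
```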
Ryan