I'm writing a little app that uses RSS feeds from a website, and I've found
that some feeds contain items in different languages. So far, the only such
feeds I've had are in Japanese (EUC-JP). I've managed to get the feed to
save and display properly by adding the charset to my InputStream (or
InputStreamReader, I forget which). Anyway, the problem I'm having is that
I'd like to be able to read the XML and then figure out the language that
each item is in. It seems that only a few are in Japanese, and I wouldn't be
surprised if they are sometimes mixed with items in other languages.
I've found the "Java port of Mozilla charset detector" and it works OK, but
it still can't handle what I'm trying to do.
I'm using an RSS library to parse the XML and give me simple objects to work
with. I'd hate to parse the XML manually by looking at bytes and then
feeding byte arrays to the charset detection library; this seems like a
dumb way to go (plus it means a lot more work).
Has anybody dealt with this in the past? I can't seem to find any solutions
on the net.
Thanks.
- Miguel

Oliver Wong - 29 Jun 2006 20:19 GMT
> I'm writing a little app that uses RSS feeds from a website, and I've
> found
[quoted text clipped - 22 lines]
>
> Thanks.
RSS is XML. My understanding is that XML is encoded in UTF-8 by default,
and an XML parser should assume it's receiving UTF-8 data unless an
encoding declaration states otherwise. In other words, this should all
work automatically.
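To illustrate what "automatically" would look like, here is a minimal sketch that hands the parser the raw bytes rather than a Reader with a hard-coded charset, so the parser can act on the encoding declaration itself. It uses the standard JAXP DOM parser as a stand-in for whatever RSS library is in use; the class name DeclaredEncodingDemo, the method firstTitle and the sample document are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; a real RSS library would do the parsing itself.
class DeclaredEncodingDemo {

    /** Parses raw XML bytes and returns the text of the first <title> element. */
    static String firstTitle(byte[] xmlBytes) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Passing a byte stream (not a Reader) leaves the decoding to the
            // parser, which reads the encoding declaration on its own.
            Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
            return doc.getElementsByTagName("title").item(0).getTextContent();
        } catch (Exception e) {
            // Also covers a document that has no <title> element at all.
            throw new IllegalStateException("could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        // Sample document declared and encoded as ISO-8859-1; a feed declared
        // as EUC-JP should behave the same way if the JRE ships that charset.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<rss><channel><title>caf\u00e9</title></channel></rss>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(firstTitle(bytes));
    }
}
```

If the library only accepts a Reader or a String, the decoding has already happened before the parser sees the data, and the declaration can no longer help.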
Possible reasons why it might not be working:
(1) The RSS library is buggy.
(2) The author of the RSS feed set the encoding declaration incorrectly.
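On the per-item language question: once the feed has been decoded correctly, each item is an ordinary Java String, so the charset no longer matters and the text itself can be inspected. The following is only a rough heuristic sketch, assuming Java 7 or later for Character.UnicodeScript. The class name ItemScriptGuess and the labels "ja", "cjk" and "other" are invented for the example, and it cannot tell apart languages that share a script.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper: guesses a coarse language label from Unicode scripts.
class ItemScriptGuess {

    /**
     * Returns "ja" if the text contains any hiragana or katakana,
     * "cjk" if it contains Han ideographs but no kana, and "other" otherwise.
     */
    static String guess(String text) {
        boolean sawHan = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HIRAGANA || script == UnicodeScript.KATAKANA) {
                return "ja"; // kana is a strong hint for Japanese
            }
            if (script == UnicodeScript.HAN) {
                sawHan = true; // Han alone could be Chinese or Japanese
            }
        }
        return sawHan ? "cjk" : "other";
    }

    public static void main(String[] args) {
        System.out.println(guess("\u3053\u3093\u306b\u3061\u306f"));
        System.out.println(guess("Hello"));
    }
}
```

Calling guess() on each item's title or description would give a per-item label without touching the raw bytes.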
- Oliver
John W. Kennedy - 29 Jun 2006 20:53 GMT
>> I'm writing a little app that uses RSS feeds from a website, and I've
>> found
[quoted text clipped - 28 lines]
> it receives an encoding declaration stating otherwise. In other words,
> this should all work automatically.
It's a little more complicated.
A) The encoding can be specified externally (e.g., by an HTTP header).
B) If it is not specified externally, a document without an encoding
declaration may be either UTF-8 or UTF-16.
C) And if there /is/ an encoding declaration, the parser has to make at
least an approximate guess at the encoding in order to read the
declaration in the first place.
The matter is discussed at
<URL:http://www.w3.org/TR/2004/REC-xml-20040204/#sec-guessing>.
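As a rough illustration of points A) to C), the sketch below checks an external charset first, then a byte order mark or UTF-16 byte pattern, and finally reads the declaration as ASCII-compatible text. It is a simplification of the procedure the appendix describes, not a full implementation: UTF-32 and EBCDIC are left out, and treating the HTTP charset as overriding the declaration is an assumption. The class name XmlEncodingSniffer and the method guess are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; simplified, ignores UTF-32 and EBCDIC.
class XmlEncodingSniffer {
    private static final Pattern HTTP_CHARSET = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern XML_ENCODING = Pattern.compile(
            "encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** contentType may be null; head holds the first bytes of the document. */
    static String guess(String contentType, byte[] head) {
        // A) externally specified, e.g. "application/rss+xml; charset=EUC-JP"
        if (contentType != null) {
            Matcher m = HTTP_CHARSET.matcher(contentType);
            if (m.find()) return m.group(1);
        }
        // B) byte order mark, or "<?" laid out as UTF-16 without a BOM
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF) || startsWith(head, 0xFF, 0xFE)) return "UTF-16";
        if (startsWith(head, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16BE";
        if (startsWith(head, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16LE";
        // C) "<?xm" in an ASCII-compatible encoding: the declaration can be
        //    read byte-for-byte as ISO-8859-1 to find the encoding name
        if (startsWith(head, 0x3C, 0x3F, 0x78, 0x6D)) {
            String text = new String(head, StandardCharsets.ISO_8859_1);
            int end = text.indexOf("?>");
            if (end > 0) {
                Matcher m = XML_ENCODING.matcher(text.substring(0, end));
                if (m.find()) return m.group(1);
            }
        }
        return "UTF-8"; // nothing found: fall back to UTF-8
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = "<?xml version=\"1.0\" encoding=\"EUC-JP\"?><rss/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(guess(null, head));
    }
}
```

A conforming XML parser already does all of this internally, so a helper like this would only be useful for diagnosing a feed whose declared and actual encodings disagree.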

John W. Kennedy
"The blind rulers of Logres
Nourished the land on a fallacy of rational virtue."
-- Charles Williams. "Taliessin through Logres: Prelude"