> Depending on the OS, why wouldn't filenames "be unicode"[*] as well?
Yes, in some environments. My statements were not careful enough.
> [*] the phrase is in quotes because comparing ASCII and Unicode doesn't make
> sense, as ASCII is both an encoding system and a character set, while Unicode
> is a character set without an encoding system.
After spending some time reviewing definitions of Unicode, I see that
many, including the one from unicode.org
(http://www.unicode.org/faq/basic_q.html#a), describe Unicode as an
encoding as well as a character set.
The Unicode.org technical introduction starts with:
"The Unicode Standard is the universal character encoding standard used
for representation of text for computer processing."
Perhaps I am completely misunderstanding your point. There is plenty
of content on the internet where ASCII and Unicode are compared. Even
if that comparison is not between two things of the same type, isn't it
fairly clear that when comparing a filename's ASCII byte sequence to a
classname's Unicode byte sequence one cannot perform a byte by byte
match?
Regards,
Opalinski
opalpa@gmail.com
http://www.geocities.com/opalpaweb/
Oliver Wong - 13 May 2006 18:36 GMT
<opalpa@gmail.com> wrote in message
news:1147538377.052199.15000@d71g2000cwd.googlegroups.com...
>> [*] the phrase is in quotes because comparing ASCII and Unicode doesn't
>> make
[quoted text clipped - 6 lines]
> (http://www.unicode.org/faq/basic_q.html#a) talk about unicode being an
> encoding as well as character set.
Hmm, maybe my terminology is a bit loose. I meant that ASCII encodes
characters directly to byte sequences, whereas Unicode is a mapping from
characters to natural numbers (code points, which are not tied to any fixed
byte width; the codespace runs from 0 to 0x10FFFF), and then you need a
separate encoding, like UTF-8, to map from those natural numbers to
byte sequences of finite size.
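The two-step mapping described above can be sketched in Java. This is only an illustration: the class name `CodePointVsBytes`, the helper `hex`, and the choice of the euro sign are assumptions made for the example, not anything from the thread.

```java
import java.nio.charset.StandardCharsets;

public class CodePointVsBytes {
    // Render a byte array as space-separated uppercase hex, e.g. "E2 82 AC".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // EURO SIGN, chosen arbitrarily for the example

        // Step 1: character -> natural number (the Unicode code point).
        int codePoint = euro.codePointAt(0);
        System.out.println("code point: U+" + Integer.toHexString(codePoint).toUpperCase());

        // Step 2: number -> bytes. The result depends on the encoding chosen.
        System.out.println("UTF-8:    " + hex(euro.getBytes(StandardCharsets.UTF_8)));
        System.out.println("UTF-16BE: " + hex(euro.getBytes(StandardCharsets.UTF_16BE)));
    }
}
```

The same code point yields three bytes in UTF-8 and two in UTF-16BE, which is why the code point alone does not determine a byte sequence.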
> The Unicode.org technical introduction starts with:
> "The Unicode Standard is the universal character encoding standard used
[quoted text clipped - 6 lines]
> classname's Unicode byte sequence one cannot perform a byte by byte
> match?
A lot of people (subconsciously?) think Unicode maps directly from
characters to byte sequences; it's a common misconception, so it wouldn't
surprise me if there were a large amount of content on the Internet
which makes this mistake, or glosses over it. AFAIK, there's no such thing as
a "Unicode byte sequence". You could talk about comparing ASCII byte
sequences to UTF-8 byte sequences, but not to "Unicode byte sequences".
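A rough Java sketch of that comparison, assuming a file name and a class name arrive as bytes in possibly different encodings. The class `NameComparison`, its two methods, and the sample names are hypothetical and exist only for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NameComparison {
    // Byte-by-byte comparison of two names, each encoded in its own charset.
    static boolean sameBytes(String a, Charset ca, String b, Charset cb) {
        return Arrays.equals(a.getBytes(ca), b.getBytes(cb));
    }

    // Decode each byte sequence with the charset it was written in,
    // then compare the resulting strings instead of the raw bytes.
    static boolean sameText(byte[] a, Charset ca, byte[] b, Charset cb) {
        return new String(a, ca).equals(new String(b, cb));
    }

    public static void main(String[] args) {
        // ASCII-only name: the ASCII and UTF-8 byte sequences are identical.
        System.out.println(sameBytes("Widget", StandardCharsets.US_ASCII,
                                     "Widget", StandardCharsets.UTF_8));

        // Non-ASCII name: ISO-8859-1 and UTF-8 produce different bytes.
        String cafe = "Caf\u00E9";
        System.out.println(sameBytes(cafe, StandardCharsets.ISO_8859_1,
                                     cafe, StandardCharsets.UTF_8));

        // Decoding first makes the comparison independent of the encoding.
        System.out.println(sameText(cafe.getBytes(StandardCharsets.ISO_8859_1),
                                    StandardCharsets.ISO_8859_1,
                                    cafe.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.UTF_8));
    }
}
```

A byte-by-byte match only works when both sides are known to use the same encoding, or when the names are restricted to ASCII and both encodings are ASCII-compatible.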
- Oliver
Roedy Green - 14 May 2006 03:57 GMT
> A lot of people (subconsciously?) think Unicode maps directly from
>characters to byte sequences
if you just use 16 bit Unicode, it does.

Signature
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
Oliver Wong - 15 May 2006 17:40 GMT
>> A lot of people (subconsciously?) think Unicode maps directly from
>>characters to byte sequences
>
> if you just use 16 bit Unicode, it does.
Only in UTF-16, AFAIK. In UTF-8 (which, unlike UTF-16, is backwards
compatible with ASCII when using only characters from the ASCII character
set, and so is one of the most commonly used encodings), the Unicode
character \u00A9 (the copyright symbol), for example, maps onto the byte
sequence C2 A9, and not 00 A9.
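The copyright-symbol case can be checked with a few lines of Java. The class name `CopyrightBytes` and the helper `hex` are made up for this sketch.

```java
import java.nio.charset.StandardCharsets;

public class CopyrightBytes {
    // Render a byte array as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String copyright = "\u00A9";
        // UTF-8 needs two bytes for U+00A9 and neither is 0x00.
        System.out.println("UTF-8:    " + hex(copyright.getBytes(StandardCharsets.UTF_8)));
        // UTF-16BE writes the code point value as two big-endian bytes.
        System.out.println("UTF-16BE: " + hex(copyright.getBytes(StandardCharsets.UTF_16BE)));
        // An ASCII character is a single, identical byte in ASCII and UTF-8.
        System.out.println("'A' ASCII: " + hex("A".getBytes(StandardCharsets.US_ASCII))
                + ", UTF-8: " + hex("A".getBytes(StandardCharsets.UTF_8)));
    }
}
```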
- Oliver
Roedy Green - 16 May 2006 20:18 GMT
>>> A lot of people (subconsciously?) think Unicode maps directly from
>>>characters to byte sequences
>>
>> if you just use 16 bit Unicode, it does.
Let me restate that in a more lawyerly way.
If you use only the codepoints 0..0xffff and you use UTF-16BE
encoding, you can think of Unicode as mapping directly to byte
sequences.
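A small Java sketch of that restated claim, under the assumption that "mapping directly" means the two UTF-16BE bytes are simply the code point value in big-endian order. The class `Utf16Direct` and the method `mapsDirectly` are hypothetical names for this example.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Direct {
    // True if the UTF-16BE encoding of the code point is exactly two bytes
    // whose big-endian value equals the code point itself.
    static boolean mapsDirectly(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        byte[] bytes = s.getBytes(StandardCharsets.UTF_16BE);
        if (bytes.length != 2) {
            return false; // encoded as a surrogate pair (four bytes)
        }
        int value = ((bytes[0] & 0xFF) << 8) | (bytes[1] & 0xFF);
        return value == codePoint;
    }

    public static void main(String[] args) {
        System.out.println(mapsDirectly(0x00A9));  // copyright sign, within 0..0xFFFF
        System.out.println(mapsDirectly(0x20AC));  // euro sign, within 0..0xFFFF
        System.out.println(mapsDirectly(0x1F600)); // above 0xFFFF, needs a surrogate pair
    }
}
```

Code points above 0xFFFF are the case the restriction to 0..0xffff rules out: in UTF-16 they take four bytes, so the bytes no longer spell out the code point value.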