How to identify cyrillic characters in String?
t.javast@gmail.com - 21 Jun 2007 07:05 GMT
Hi everyone,
How can I identify whether my String contains Cyrillic characters?
Thanks in advance :)
Karl Uppiano - 21 Jun 2007 07:34 GMT
> Hi everyone,
>
> How can I identify whether my String contains Cyrillic characters?
>
> Thanks in advance :)
I know it's a little weird, but you might iterate through your string, testing each character, using something like this:
boolean isCyrillic(char c) {
    return Character.UnicodeBlock.CYRILLIC.equals(Character.UnicodeBlock.of(c));
}
Perhaps not the most efficient, but you don't have to recreate and maintain parts of the Unicode character database this way.
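A minimal sketch of the iteration Karl describes, wrapping his per-character test in a loop over the string. The class name CyrillicCheck and the method containsCyrillic are hypothetical names chosen for illustration, not from the thread. Note the assumption that only the basic Cyrillic block counts; characters in other Cyrillic-related blocks would not be detected by this test.

```java
// Hypothetical helper class; the names here are illustrative, not from the thread.
class CyrillicCheck {

    // Karl's per-character test: true if c lies in the basic Cyrillic block.
    static boolean isCyrillic(char c) {
        return Character.UnicodeBlock.CYRILLIC.equals(Character.UnicodeBlock.of(c));
    }

    // Scans the string and stops at the first Cyrillic character found.
    static boolean containsCyrillic(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (isCyrillic(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of source-file encoding.
        System.out.println(containsCyrillic("Hello"));                      // false
        System.out.println(containsCyrillic("Hello, \u043C\u0438\u0440"));  // true
    }
}
```

Returning at the first match avoids scanning the rest of the string, which should soften the efficiency concern for typical input.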
t.javast@gmail.com - 21 Jun 2007 07:40 GMT
Thanks a lot, Karl!! This will help. :)
Andreas Leitgeb - 21 Jun 2007 10:40 GMT
> Thanks a lot karl !!
> This will help. :)
Excuse me, I'm just curious as to why one would need this: recognizing only Cyrillic but not, e.g., Chinese, etc. Does Cyrillic include both Greek and Russian?
If it's about language specifics, would you also recognize e.g. "greeklish" (greek written with roman letters, like 'ellhnika') ?
Please help me widen my horizon.
Thomas Fritsch - 21 Jun 2007 11:10 GMT
Andreas Leitgeb wrote:
> Excuse me, I'm just curious, as to why one would need
> this. Recognizing only cyrillic, but e.g. not chinese,etc.
> Does cyrillic include both greek and russian?
No, Cyrillic is just Cyrillic, and has nothing to do with Greek characters. The Cyrillic characters are '\u0400' to '\u04FF'. The Greek characters are '\u0370' to '\u03FF'. However, the 'A' in Latin, Greek, and Cyrillic all look the same, although they are 3 different characters ('\u0041', '\u0391', '\u0410'). It seems you are confusing character ranges (like Cyrillic) with the languages (like Russian, Bulgarian, Serbian) that use these characters. Maybe the confusion arises because there is only one language in the world (the Greek language) which uses Greek characters.
> If it's about language specifics, would you also
> recognize e.g. "greeklish" (greek written with roman
> letters, like 'ellhnika') ?
Recognizing "greeklish" is a completely different story (probably much more difficult).
> Please help me widen my horizon.
 Signature Thomas
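Thomas's point about the three look-alike 'A' characters can be checked directly. This is only a sketch: the class name LookAlikeA and the helper blockOf are hypothetical, introduced for illustration. It shows that the three characters compare unequal and that each one reports a different Unicode block.

```java
// Hypothetical demo class; names are illustrative, not from the thread.
class LookAlikeA {

    // Thin wrapper around the standard lookup of a char's Unicode block.
    static Character.UnicodeBlock blockOf(char c) {
        return Character.UnicodeBlock.of(c);
    }

    public static void main(String[] args) {
        char latinA = '\u0041';    // Latin capital A
        char greekA = '\u0391';    // Greek capital Alpha
        char cyrillicA = '\u0410'; // Cyrillic capital A

        // The glyphs look alike, but the characters are different values.
        System.out.println(latinA == greekA);
        System.out.println(greekA == cyrillicA);

        // Each character belongs to its own block.
        System.out.println(blockOf(latinA) == Character.UnicodeBlock.BASIC_LATIN);
        System.out.println(blockOf(greekA) == Character.UnicodeBlock.GREEK);
        System.out.println(blockOf(cyrillicA) == Character.UnicodeBlock.CYRILLIC);
    }
}
```

This is why a visual comparison cannot replace a block or range test: a string that looks like plain Latin text may still contain Cyrillic characters.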
Andreas Leitgeb - 28 Jun 2007 18:58 GMT
>> Excuse me, I'm just curious, as to why one would need
>> this. Recognizing only cyrillic, but e.g. not chinese,etc.
>> Does cyrillic include both greek and russian?
> No, Cyrillic is just Cyrillic, and has nothing to do with Greek
> characters.
I think there is a misunderstanding... Greek and Russian character sets are generally referred to as "cyrillic", and so are the scripts of some more countries (you named Serbia yourself), whose people use a script whose "R" looks like a Latin "P" (*)
>> Please help me widen my horizon.
Now that it's clear that you mean the subset of Unicode chars used for the Russian language, I'd still, out of pure curiosity, like to know what difference it makes in your application if a user types Russian letters as opposed to Chinese, Vietnamese, Xhosa, accented Latin letters or just plain US Latin. ... unless, of course, telling me that would conflict with any non-disclosure agreements...
(*): yes, I'm aware that this is not a language science worthy definition of cyrillic scripts :-)
Alan Morgan - 28 Jun 2007 19:32 GMT
>>> Excuse me, I'm just curious, as to why one would need
>>> this. Recognizing only cyrillic, but e.g. not chinese,etc.
[quoted text clipped - 4 lines]
>I think there is a misunderstanding... Greek and Russian
>character sets are generally referred to as "cyrillic",
By whom? I can see the Russian alphabet being referred to as Greek or Hellenic as it is derived mostly from Greek, but I've never heard of the Greek alphabet referred to as Cyrillic. That would be like referring to the Latin alphabet as Turkic.
Alan
 Signature Defendit numerus
Andreas Leitgeb - 28 Jun 2007 20:03 GMT
>>>> Excuse me, I'm just curious, as to why one would need
>>>> this. Recognizing only cyrillic, but e.g. not chinese,etc.
[quoted text clipped - 5 lines]
> By whom? I can see the Russian alphabet being referred to as
> Greek or Hellenic as it is derived mostly from Greek, ...
So it's two to one in this thread...
I'll do some research on what "cyrillic" really is. Maybe I'm wrong, after all. You know, there are always things that one believes one knows for sure, and many years later one might find out one was wrong all along... Whether Greek has anything to do with Cyrillic wasn't really my point...
Anyway, I'd like to know any reason why one would try to detect characters of a particular alphabet in a user's input. What difference in an application's behaviour would such a detection reasonably trigger?
John W. Kennedy - 28 Jun 2007 20:31 GMT
>>>>> Excuse me, I'm just curious, as to why one would need
>>>>> this. Recognizing only cyrillic, but e.g. not chinese,etc.
[quoted text clipped - 8 lines]
> So it's two to one in this thread...
> I'll do some research on what "cyrillic" really is.
The Cyrillic alphabet is the alphabet used for Russian, Serbian, Bulgarian, Ukrainian, Belarusian, and some other Slavic languages, plus some non-Slavic languages in the traditional area of Russian hegemony. The Greek alphabet is something different. The Cyrillic alphabet is named for St. Cyril, who, along with his brother, St. Methodius, first brought Christianity to the Slavs.
 Signature John W. Kennedy "The first effect of not believing in God is to believe in anything...." -- Emile Cammaerts, "The Laughing Prophet"
Andreas Leitgeb - 02 Jul 2007 11:52 GMT
> [ about cyrill and his script ]
Fine, learnt a thing.
Still I'm eager to learn about possible reasons to determine existence of any particular script within a user's input, as the original poster wanted to do.
paulgor@compuserve.com - 21 Jul 2007 18:33 GMT
1:4 :)
No, the Greek alphabet is never called "Cyrillic".
-- Regards, Paul Gorodyansky "Cyrillic (Russian): instructions for Windows and Internet": http://RusWin.net
Andreas Leitgeb - 23 Jul 2007 09:08 GMT
> 1:4 :)
> No, Greek alphabet is never called "Cyrillic"
No, that's only 1:3.5, for the incredible lateness of that posting. :-p
Roedy Green - 21 Jun 2007 12:41 GMT
>How can I identify whether my String contains Cyrillic characters?
see http://mindprod.com/jgloss/unicode.html
You will likely discover the range 0x400 .. 0x4ff covers the basic chars. (there are two other ranges)
So you can write
isCyrillic = ( 0x400 <= c && c <= 0x4ff );
-- Roedy Green
Canadian Mind Products
The Java Glossary http://mindprod.com
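Roedy's range test can be compared against the Character.UnicodeBlock lookup suggested earlier in the thread. The sketch below is an illustration only: the class name CyrillicRange and the method agreesWithUnicodeBlock are hypothetical, and it deliberately covers just the basic range 0x400 .. 0x4ff, since the other two ranges are not named in the post.

```java
// Hypothetical class; names are illustrative, not from the thread.
class CyrillicRange {

    // Roedy's range test for the basic Cyrillic characters only.
    static boolean isCyrillic(char c) {
        return 0x400 <= c && c <= 0x4ff;
    }

    // Checks every char value to see whether the range test and the
    // UnicodeBlock.CYRILLIC lookup give the same answer.
    static boolean agreesWithUnicodeBlock() {
        // An int counter avoids the overflow a char counter would hit at 0xFFFF.
        for (int i = Character.MIN_VALUE; i <= Character.MAX_VALUE; i++) {
            char c = (char) i;
            boolean viaBlock =
                Character.UnicodeBlock.CYRILLIC.equals(Character.UnicodeBlock.of(c));
            if (viaBlock != isCyrillic(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isCyrillic('\u0416'));
        System.out.println(isCyrillic('A'));
        System.out.println(agreesWithUnicodeBlock());
    }
}
```

The range test is cheaper per character, while the UnicodeBlock lookup avoids hard-coding code point boundaries; which one fits better depends on the application.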