Java Forum / General / July 2007

How to identify cyrillic characters in String?

t.javast@gmail.com - 21 Jun 2007 07:05 GMT
Hi everyone,

How can I identify whether my String contains Cyrillic characters?

Thanks in advance :)
Karl Uppiano - 21 Jun 2007 07:34 GMT
> Hi everyone,
>
> How can I identify whether my String contains Cyrillic characters?
>
> Thanks in advance :)

I know it's a little weird, but you might iterate through your string,
testing each character, using something like this:

 boolean isCyrillic(char c) {
     return Character.UnicodeBlock.CYRILLIC.equals(Character.UnicodeBlock.of(c));
 }

Perhaps not the most efficient, but you don't have to recreate and maintain
parts of the Unicode character database this way.
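
Applied to a whole string, the per-character test above could be wrapped as in the following sketch. The class name CyrillicCheck and the helper containsCyrillic are made up for illustration; the loop walks code points rather than chars, on the assumption that the input may contain surrogate pairs.

```java
class CyrillicCheck {
    // Same block test as isCyrillic above, but taking an int code point.
    static boolean isCyrillic(int codePoint) {
        return Character.UnicodeBlock.CYRILLIC.equals(Character.UnicodeBlock.of(codePoint));
    }

    // Hypothetical helper: true if at least one code point of s lies in the CYRILLIC block.
    static boolean containsCyrillic(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isCyrillic(cp)) {
                return true;
            }
            // Advance by 1 or 2 chars, depending on whether cp needed a surrogate pair.
            i += Character.charCount(cp);
        }
        return false;
    }
}
```

Note that this only looks at the block U+0400..U+04FF; Cyrillic letters that live in other blocks would not be detected by it.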
t.javast@gmail.com - 21 Jun 2007 07:40 GMT
Thanks a lot, Karl!
This will help. :)
Andreas Leitgeb - 21 Jun 2007 10:40 GMT
> Thanks a lot, Karl!
> This will help. :)

Excuse me, I'm just curious as to why one would need
this: recognizing only Cyrillic, but not e.g. Chinese, etc.
Does Cyrillic include both Greek and Russian?

If it's about language specifics, would you also
recognize e.g. "Greeklish" (Greek written with Roman
letters, like 'ellhnika')?

Please help me widen my horizon.
Thomas Fritsch - 21 Jun 2007 11:10 GMT
Andreas Leitgeb schrieb:

> Excuse me, I'm just curious as to why one would need
> this: recognizing only Cyrillic, but not e.g. Chinese, etc.
> Does Cyrillic include both Greek and Russian?
No, Cyrillic is just Cyrillic, and has nothing to do with Greek
characters. The Cyrillic characters are '\u0400' to '\u04FF'. The Greek
characters are '\u0370' to '\u03FF'. However, the 'A' in Latin, Greek, and
Cyrillic all look the same, although they are 3 different characters
('\u0041', '\u0391', '\u0410').
It seems you are confusing character ranges (like Cyrillic) with languages
(like Russian, Bulgarian, Serbian) which use these characters.
Maybe the confusion arises because there is only one language in the
world (the Greek language) which uses Greek characters.
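
The point about the three look-alike 'A' characters can be checked with a small sketch like the one below. The class name LookalikeA and the helper blockOf are invented for this illustration; the only library call relied on is Character.UnicodeBlock.of.

```java
class LookalikeA {
    // Hypothetical helper: the Unicode block that contains c.
    static Character.UnicodeBlock blockOf(char c) {
        return Character.UnicodeBlock.of(c);
    }

    public static void main(String[] args) {
        char latinA = '\u0041';
        char greekAlpha = '\u0391';
        char cyrillicA = '\u0410';

        // The glyphs look alike, but the chars compare as different values.
        System.out.println(latinA == greekAlpha);
        System.out.println(greekAlpha == cyrillicA);

        // Each one is reported as belonging to a different block.
        System.out.println(blockOf(latinA));
        System.out.println(blockOf(greekAlpha));
        System.out.println(blockOf(cyrillicA));
    }
}
```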

> If it's about language specifics, would you also
> recognize e.g. "Greeklish" (Greek written with Roman
> letters, like 'ellhnika')?
Recognizing "Greeklish" is a completely different story (probably much
more difficult).

> Please help me widen my horizon.

Thomas

Andreas Leitgeb - 28 Jun 2007 18:58 GMT
>> Excuse me, I'm just curious as to why one would need
>> this: recognizing only Cyrillic, but not e.g. Chinese, etc.
>> Does Cyrillic include both Greek and Russian?
> No, Cyrillic is just Cyrillic, and has nothing to do with Greek
> characters.

I think there is a misunderstanding...   Greek and Russian
character sets are generally referred to as "Cyrillic", and
so are the scripts of some more countries (you named Serbia
yourself) whose people use a script whose "R" looks like a
Latin "P" (*)

>> Please help me widen my horizon.

Now that it's clear that you mean the subset of Unicode chars
used for the Russian language, I'd still, out of pure curiosity,
like to know what difference it makes in your application if
a user types Russian letters as opposed to writing
Chinese, Vietnamese, Xhosa, accented Latin letters or just
plain US Latin   ...   unless, of course, telling me that
would conflict with any non-disclosure agreements...

(*): yes, I'm aware that this is not a language-science-worthy
 definition of Cyrillic scripts :-)
Alan Morgan - 28 Jun 2007 19:32 GMT
>>> Excuse me, I'm just curious as to why one would need
>>> this: recognizing only Cyrillic, but not e.g. Chinese, etc.
[quoted text clipped - 4 lines]
>I think there is a misunderstanding...   Greek and Russian
>character sets are generally referred to as "Cyrillic",

By whom?  I can see the Russian alphabet being referred to as
Greek or Hellenic as it is derived mostly from Greek, but I've
never heard of the Greek alphabet referred to as Cyrillic.  That
would be like referring to the Latin alphabet as Turkic.

Alan
Defendit numerus

Andreas Leitgeb - 28 Jun 2007 20:03 GMT
>>>> Excuse me, I'm just curious as to why one would need
>>>> this: recognizing only Cyrillic, but not e.g. Chinese, etc.
[quoted text clipped - 5 lines]
> By whom?  I can see the Russian alphabet being referred to as
> Greek or Hellenic as it is derived mostly from Greek, ...

So it's two to one in this thread...
I'll do some research on what "Cyrillic" really is.
Maybe I'm wrong, after all.  You know, there are always
things that one believes one knows for sure, and many
years later one might find out it was wrong all the time...
Whether Greek has got anything to do with Cyrillic wasn't
really my point...

Anyway, I'd like to know any reason why one would try to
detect characters of any particular alphabet in a user's input.
What difference in an application's behaviour would such a detection
reasonably trigger?
John W. Kennedy - 28 Jun 2007 20:31 GMT
>>>>> Excuse me, I'm just curious as to why one would need
>>>>> this: recognizing only Cyrillic, but not e.g. Chinese, etc.
[quoted text clipped - 8 lines]
> So it's two to one in this thread...
> I'll do some research on what "Cyrillic" really is.

The Cyrillic alphabet is the alphabet used for Russian, Serbian,
Bulgarian, Ukrainian, Belarusian, and some other Slavic languages,
plus some non-Slavic languages in the traditional area of Russian
hegemony. The Greek alphabet is something different. The Cyrillic
alphabet is named for St. Cyril, who, along with his brother, St.
Methodius, first brought Christianity to the Slavs.
John W. Kennedy
"The first effect of not believing in God is to believe in anything...."
  -- Emile Cammaerts, "The Laughing Prophet"

Andreas Leitgeb - 02 Jul 2007 11:52 GMT
> [ about Cyril and his script ]

Fine, learnt a thing.

Still, I'm eager to learn about possible reasons to determine
the existence of any particular script within a user's input, as
the original poster wanted to do.
paulgor@compuserve.com - 21 Jul 2007 18:33 GMT
1:4 :)

No, the Greek alphabet is never called "Cyrillic".

--
Regards,
Paul Gorodyansky
"Cyrillic (Russian): instructions for Windows and Internet":
  http://RusWin.net
Andreas Leitgeb - 23 Jul 2007 09:08 GMT
> 1:4 :)
> No, the Greek alphabet is never called "Cyrillic".

No, that's only 1:3.5, for the incredible lateness of that posting.
:-p
Roedy Green - 21 Jun 2007 12:41 GMT
>How can I identify whether my String contains Cyrillic characters?

See http://mindprod.com/jgloss/unicode.html

You will likely discover that the range 0x400 .. 0x4ff covers the basic
chars. (There are two other ranges.)

So you can write

isCyrillic = ( 0x400 <= c && c <= 0x4ff );
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
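
For readers who also want to cover Cyrillic letters outside 0x400 .. 0x4ff, here is a hedged sketch of two options. The class and method names are invented for illustration. The first method widens the range test with the Cyrillic Supplement block (U+0500..U+052F) only, which is an assumption about which extra range matters. The second relies on Character.UnicodeScript, which needs Java 7 or later and so was not available when this thread was written.

```java
class CyrillicRanges {
    // Range test from the post, widened with the Cyrillic Supplement block.
    static boolean inCyrillicRanges(int cp) {
        return (0x0400 <= cp && cp <= 0x04FF)
            || (0x0500 <= cp && cp <= 0x052F);
    }

    // Alternative that defers to the JDK's own script data (Java 7 or later).
    static boolean isCyrillicScript(int cp) {
        return Character.UnicodeScript.of(cp) == Character.UnicodeScript.CYRILLIC;
    }
}
```

The script-based test avoids hard-coding ranges, at the cost of depending on whichever Unicode version the running JDK implements.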



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.