Java Forum / General / March 2008

Java Library - to determine whether given text is in English?

anonym - 14 Mar 2008 19:53 GMT
Hi,

I am looking for an available java function or library that takes a
sentence or a text as an input and outputs whether the text is in
English or not.

Thank you.
Roedy Green - 14 Mar 2008 23:12 GMT
> I am looking for an available java function or library that takes a
>sentence or a text as an input and outputs whether the text is in
>English or not.

A simple test would look for some common English words such as "is",
"an", "the". You could cook up a similar list for other languages and
get a best match.
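
A minimal Java sketch of this common-word test follows. The word lists are tiny illustrative assumptions, not a vetted resource, and the class and method names (StopWordGuess, guess, isProbablyEnglish) are hypothetical. Real lists would need to be much longer, and very short inputs will still be unreliable.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class StopWordGuess {
    // Tiny illustrative word lists (assumptions); real lists should be much longer.
    static final Map<String, Set<String>> COMMON = new LinkedHashMap<>();
    static {
        COMMON.put("en", Set.of("is", "an", "the", "and", "of", "to"));
        COMMON.put("de", Set.of("ist", "ein", "der", "die", "und", "das"));
        COMMON.put("fr", Set.of("est", "un", "le", "la", "et", "les"));
    }

    /** Returns the language whose common words occur most often, or "unknown" if none occur. */
    static String guess(String text) {
        // Split on anything that is not a letter, after lower-casing.
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> e : COMMON.entrySet()) {
            int hits = 0;
            for (String w : words) {
                if (e.getValue().contains(w)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = e.getKey();
            }
        }
        return best;
    }

    /** True if English has strictly more common-word hits than the other lists. */
    static boolean isProbablyEnglish(String text) {
        return guess(text).equals("en");
    }
}
```

Calling StopWordGuess.isProbablyEnglish(someText) answers the original yes/no question; guess(someText) gives the best match among the listed languages.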
Signature


Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Arne Vajhøj - 17 Mar 2008 00:15 GMT
>  I am looking for an available java function or library that takes a
> sentence or a text as an input and outputs whether the text is in
> English or not.

You can look at monograph or digraph frequencies and make
a guess based on those.

I did some experiments a long time ago.

See the C snippet below for some ideas.

Arne

=====================================================

   // monograph RIO analysis
   if((f['r']+f['R'])>(f['i']+f['I'])) {
      indicator[DK]++;
      indicator[FR]--;
   }
   if((f['O']+f['o'])>(f['R']+f['r'])) {
      indicator[UK]++;
      indicator[ES]++;
      indicator[DK]--;
   }
   if((f['I']+f['i'])>(f['O']+f['o'])) {
      indicator[DE]++;
      indicator[UK]--;
      indicator[ES]--;
   }
   // characteristic digraph analysis
   if((ff['t'*256+'h']+ff['T'*256+'H']+ff['T'*256+'h'])>0.01*l) {
      indicator[UK]++;
      indicator[DK]--;
      indicator[FR]--;
      indicator[DE]--;
      indicator[ES]--;
   }
   if((ff['c'*256+'h']+ff['C'*256+'H']+ff['C'*256+'h'])>0.01*l) {
      indicator[DE]++;
      indicator[DK]--;
      indicator[FR]--;
      indicator[ES]--;
   }
   if((ff['o'*256+'u']+ff['O'*256+'U']+ff['O'*256+'u'])>0.01*l) {
      indicator[UK]++;
      indicator[FR]++;
      indicator[DE]--;
      indicator[DK]--;
      indicator[ES]--;
   }
   if((ff['n'*256+'t']+ff['N'*256+'T']+ff['N'*256+'t'])>0.01*l) {
      indicator[FR]++;
      indicator[UK]--;
      indicator[DE]--;
      indicator[ES]--;
   }
   if((ff['u'*256+'e']+ff['U'*256+'E']+ff['U'*256+'e'])>0.01*l) {
      indicator[ES]++;
      indicator[DK]--;
      indicator[UK]--;
      indicator[FR]--;
      indicator[DE]--;
   }
   if((ff['l'*256+'a']+ff['L'*256+'A']+ff['L'*256+'a'])>0.01*l) {
      indicator[ES]++;
      indicator[DK]--;
      indicator[FR]--;
      indicator[DE]--;
   }
   // unused characters analysis
   if((f['j']+f['J'])>0.01*l) {
      indicator[DE]--;
   }
   if((f['k']+f['K'])>0.01*l) {
      indicator[DK]++;
      indicator[FR]--;
      indicator[ES]--;
   }
   if((f['w']+f['W'])>0.01*l) {
      indicator[UK]++;
      indicator[DE]++;
      indicator[FR]--;
      indicator[ES]--;
   }
   if((f['y']+f['Y'])>0.01*l) {
      indicator[UK]++;
      indicator[FR]--;
      indicator[DE]--;
   }
   // special characters analysis
   if((f[UCHAR('Æ')]+f[UCHAR('Ø')]+f[UCHAR('Å')]+
       f[UCHAR('æ')]+f[UCHAR('ø')]+f[UCHAR('å')])>0) { // danish
      indicator[DK]++;
      indicator[UK]--;
      indicator[FR]--;
      indicator[DE]--;
      indicator[ES]--;
   }
   if((f[UCHAR('Ä')]+f[UCHAR('Ö')]+f[UCHAR('Ü')]+
       f[UCHAR('ä')]+f[UCHAR('ö')]+f[UCHAR('ü')])>0) { // german umlaut
      indicator[DE]++;
      indicator[DK]--;
      indicator[UK]--;
      indicator[FR]--;
      indicator[ES]--;
   }
   if((f[UCHAR('É')]+f[UCHAR('Í')]+f[UCHAR('Ó')]+
       f[UCHAR('é')]+f[UCHAR('í')]+f[UCHAR('ó')])>0) { // acute accent (French, Spanish)
      indicator[FR]++;
      indicator[ES]++;
      indicator[DK]--;
      indicator[UK]--;
      indicator[DE]--;
   }
   if((f[UCHAR('Ñ')]+f[UCHAR('ñ')])>0) { // spanish n tilde
      indicator[ES]++;
      indicator[DK]--;
      indicator[UK]--;
      indicator[FR]--;
      indicator[DE]--;
   }
   if((f[UCHAR('Ç')]+f[UCHAR('ç')])>0) { // French c cedilla
      indicator[FR]++;
      indicator[DK]--;
      indicator[UK]--;
      indicator[DE]--;
      indicator[ES]--;
   }
   if((f[UCHAR('ß')])>0) { // german double s
      indicator[DE]++;
      indicator[FR]--;
      indicator[DK]--;
      indicator[UK]--;
      indicator[ES]--;
   }
   if((f[UCHAR('À')]+f[UCHAR('È')]+f[UCHAR('Ò')]+
       f[UCHAR('à')]+f[UCHAR('è')]+f[UCHAR('ò')])>0) { // grave accent
      indicator[FR]++;
      indicator[DK]--;
      indicator[UK]--;
      indicator[DE]--;
   }
   if((f[UCHAR('Ê')]+f[UCHAR('Î')]+f[UCHAR('Ô')]+
       f[UCHAR('ê')]+f[UCHAR('î')]+f[UCHAR('ô')])>0) { // circumflex
      indicator[FR]++;
      indicator[DK]--;
      indicator[UK]--;
      indicator[DE]--;
   }
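
The snippet above does not show how its tables are filled. The Java sketch below is one way to build them, under these assumptions: f holds monograph counts indexed by byte value, ff holds digraph counts indexed by first*256+second, l is the text length in bytes, and the text is encoded as ISO-8859-1. The class name GraphCounts and the method hasFrequentTh are hypothetical; only the "th" rule is reproduced.

```java
import java.nio.charset.StandardCharsets;

final class GraphCounts {
    final int[] f = new int[256];         // monograph counts, indexed by byte value
    final int[] ff = new int[256 * 256];  // digraph counts, indexed by first*256 + second
    final int l;                          // assumed meaning: number of bytes in the text

    GraphCounts(String text) {
        // Characters outside ISO-8859-1 are replaced by '?' by getBytes.
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        l = bytes.length;
        for (int i = 0; i < bytes.length; i++) {
            int c = bytes[i] & 0xFF;      // Java bytes are signed; mask to 0..255
            f[c]++;
            if (i + 1 < bytes.length) {
                ff[c * 256 + (bytes[i + 1] & 0xFF)]++;
            }
        }
    }

    /** The "th" rule from the snippet: do th, TH and Th together exceed 1% of the text? */
    boolean hasFrequentTh() {
        int th = ff['t' * 256 + 'h'] + ff['T' * 256 + 'H'] + ff['T' * 256 + 'h'];
        return th > 0.01 * l;
    }
}
```

Each remaining rule in the snippet would become a similar test that adjusts an int[] of per-language indicators, with the highest indicator taken as the guess.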
Roedy Green - 17 Mar 2008 15:51 GMT
>    if((ff['t'*256+'h']+ff['T'*256+'H']+ff['T'*256+'h'])>0.01*l) {

Is this supposed to work with Unicode too, or only with an 8-bit
encoding?  Is l the length of the string in chars?

Arne Vajhøj - 18 Mar 2008 02:35 GMT
>>    if((ff['t'*256+'h']+ff['T'*256+'H']+ff['T'*256+'h'])>0.01*l) {
>
> Is this supposed to work with Unicode too, or only with an 8-bit
> encoding?  is l the length of the string in chars?

Nope. As written it is C/C++. And it is assuming a single
byte character set (ISO-8859-1). But the idea could easily
be extended to Unicode.
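
One possible shape for such a Unicode extension in Java is sketched below. It replaces the fixed 65536-entry array with a map keyed by the two-code-point string, and it folds case before counting, which differs slightly from the C snippet (that one only sums selected case combinations). The class name UnicodeDigraphs is hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class UnicodeDigraphs {
    /** Counts adjacent letter pairs, case-folded, keyed by the two-code-point string. */
    static Map<String, Integer> count(String text) {
        // Work on code points so characters outside the BMP are not split.
        int[] cps = text.toLowerCase(Locale.ROOT).codePoints().toArray();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < cps.length; i++) {
            if (Character.isLetter(cps[i]) && Character.isLetter(cps[i + 1])) {
                String key = new String(cps, i, 2);
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

A rule such as the "th" test then becomes counts.getOrDefault("th", 0) compared against 1% of the text length.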

Arne
Roedy Green - 17 Mar 2008 15:52 GMT
>    if((f[UCHAR('Ä')]+f[UCHAR('Ö')]+f[UCHAR('Ü')]+
>        f[UCHAR('ä')]+f[UCHAR('ö')]+f[UCHAR('ü')])>0) { // german umlaut

what does your UCHAR function do?

Arne Vajhøj - 18 Mar 2008 02:36 GMT
>>    if((f[UCHAR('Ä')]+f[UCHAR('Ö')]+f[UCHAR('Ü')]+
>>        f[UCHAR('ä')]+f[UCHAR('ö')]+f[UCHAR('ü')])>0) { // german umlaut
>
> what does your UCHAR function do?

It is a typedef for unsigned char.

Signed chars are a curse.
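
The same pitfall exists in Java, where byte is always signed: a byte such as 0xE6 ('æ' in ISO-8859-1) reads as -26 and cannot index a 256-entry table directly. A minimal sketch of the usual workaround follows; the names UnsignedIndex, uchar and histogram are hypothetical and only mirror the role of UCHAR in the snippet.

```java
final class UnsignedIndex {
    /** Converts a signed Java byte to its unsigned value 0..255, usable as a table index. */
    static int uchar(byte b) {
        return b & 0xFF;
    }

    /** Builds a 256-entry frequency table from raw bytes. */
    static int[] histogram(byte[] data) {
        int[] f = new int[256];
        for (byte b : data) {
            f[uchar(b)]++;  // f[b] would throw for bytes >= 0x80, since b is negative
        }
        return f;
    }
}
```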

Arne
Roger Lindsjö - 17 Mar 2008 19:34 GMT
>>  I am looking for an available java function or library that takes a
>> sentence or a text as an input and outputs whether the text is in
>> English or not.
>
> You can look at monograph or digraph frequencies and make
> a guess based on those.

For English, see here:
http://www.cs.chalmers.se/Cs/Grundutb/Kurser/krypto/en_stat.html or if
you have a large sample of text you can build your own tables.

Then build a table of the text you want to test and match it to the
"best" language using a chi-square test, for example. I used something
similar in an exercise many years ago for finding the probable language in a
cryptography class.

The test gets more accurate if you have lots of text. Very short texts
cannot be tested reliably with these simple tests.
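
A minimal Java sketch of that chi-square matching follows. It assumes the reference tables are built from sample texts supplied by the caller (one per language), counts only the letters a to z, and uses add-one smoothing so that a letter missing from a reference sample does not cause a division by zero. The class and method names are hypothetical.

```java
final class ChiSquareMatch {
    /** Counts the letters a..z in the text, ignoring case and all other characters. */
    static int[] letterCounts(String text) {
        int[] counts = new int[26];
        for (int i = 0; i < text.length(); i++) {
            char c = Character.toLowerCase(text.charAt(i));
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
            }
        }
        return counts;
    }

    /** Pearson chi-square of observed counts against a reference profile; lower is closer. */
    static double chiSquare(int[] observed, int[] reference) {
        long obsTotal = 0;
        long refTotal = 0;
        for (int i = 0; i < 26; i++) {
            obsTotal += observed[i];
            refTotal += reference[i];
        }
        if (obsTotal == 0) {
            throw new IllegalArgumentException("text contains no letters a-z");
        }
        double chi = 0.0;
        for (int i = 0; i < 26; i++) {
            // Expected count under the reference, with add-one smoothing.
            double expected = obsTotal * (reference[i] + 1.0) / (refTotal + 26.0);
            double diff = observed[i] - expected;
            chi += diff * diff / expected;
        }
        return chi;
    }

    /** Returns the index of the reference sample with the lowest chi-square value. */
    static int bestMatch(String text, String[] referenceSamples) {
        int[] observed = letterCounts(text);
        int best = -1;
        double bestChi = Double.POSITIVE_INFINITY;
        for (int i = 0; i < referenceSamples.length; i++) {
            double chi = chiSquare(observed, letterCounts(referenceSamples[i]));
            if (chi < bestChi) {
                bestChi = chi;
                best = i;
            }
        }
        return best;
    }
}
```

With one large sample per language in referenceSamples, bestMatch(text, referenceSamples) returns the index of the closest profile. The same scheme could be applied to digraph tables instead of single letters.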

"I like my dog" is made of just Swedish words, although the meaning in
Swedish is gibberish.

Signature

Roger Lindsjö

Jeff Higgins - 17 Mar 2008 21:31 GMT
>>>  I am looking for an available java function or library that takes a
>>> sentence or a text as an input and outputs whether the text is in
[quoted text clipped - 6 lines]
> http://www.cs.chalmers.se/Cs/Grundutb/Kurser/krypto/en_stat.html or if you
> have a large sample of text you can build your own tables.

Thanks to the above posters for the interesting ideas.

I would like to find a link to Google Corporation's similar
list taken from a sample of 1.252 X 10^100 email and usenet spam posts.

