Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / May 2006

Tip: Looking for answers? Try searching our database.

Function for removing Accents?

Thread view: 
Luc The Perverse - 16 May 2006 13:09 GMT
Right now I have a crude hard coded method using a series of replaceAll's
for removing accents and converting them to their approximate non accented
equivalents.

Included are the Latin AE combo, the German double S etc.

I know that Java understands these characters because when I make a string
lowercase it will convert the capital AE to a lowercase ae.

All I could find was a reference to an obscure unimplemented API.   I would
go for a publicly available function though - anything has got to be better
than the way I am doing it.  (If worse comes to worse I will copy my ANSI
tables and make some "range" conversions)

Signature

LTP

:)
Chris Uppal - 16 May 2006 13:36 GMT
> Right now I have a crude hard coded method using a series of replaceAll's
> for removing accents and converting them to their approximate non accented
> equivalents.

I doubt if that is either meaningful or possible in general.  It is certainly
not easy.

If you try it at all, and are not satisfied with a handful of hardwired
mappings, then you'll probably have to get deeply into Unicode.  See:
   http://www.unicode.org/reports/tr15/index.html
for one of the Unicode technical reports which discusses decomposition of
characters into base characters plus various kinds of diacritical marks.  You
could presumably filter out characters representing diacritical marks leaving
the character which was qualified by the marks in place.

   -- chris
Luc The Perverse - 17 May 2006 00:14 GMT
>> Right now I have a crude hard coded method using a series of replaceAll's
>> for removing accents and converting them to their approximate non
[quoted text clipped - 4 lines]
> certainly
> not easy.

My wife does not know to push ALT 0 2 4 4 for an ñ, so she would have a hard
time searching for Piña Coladas if she wanted to hear that Garth Brooks'
song.

> If you try it at all, and are not satisfied with a handful of hardwired
> mappings, then you'll probably have to get deeply into Unicode.  See:
[quoted text clipped - 5 lines]
> leaving
> the character which was qualified by the marks in place.

AH!   I don't want to "learn" unicode.  Especially after looking at that
link!

Perhaps I should have explained in more detail the scope of my question.

--
LTP

:)
Chris Uppal - 17 May 2006 10:08 GMT
> My wife does not know to push ALT 0 2 4 4 for an ñ, so she would have a
> hard time searching for Piña Coladas if she wanted to hear that Garth
> Brooks' song.

In that case a little properties file containing mappings should do the job.
Add new mappings as and when the need arises.  You could allow multiple
candidate replacements, so that your aenima example (which I can't type in to
this newsreader correctly) could be matched by any of
   the correct version
   aenima
   anima
   enima

> I don't want to "learn" unicode.

Odd that you should say that; quite a lot of programmers seem to feel the
same...

   -- chris
bugbear - 16 May 2006 13:40 GMT
> Right now I have a crude hard coded method using a series of replaceAll's
> for removing accents and converting them to their approximate non accented
[quoted text clipped - 9 lines]
> than the way I am doing it.  (If worse comes to worse I will copy my ANSI
> tables and make some "range" conversions)

http://glaforge.free.fr/weblog/index.php?itemid=115

looks informative.

http://www.alphaworks.ibm.com/tech/unicodenormalizer

looks helpful

  BugBear
Mickey Segal - 16 May 2006 13:42 GMT
> Right now I have a crude hard coded method using a series of replaceAll's
> for removing accents and converting them to their approximate non accented
> equivalents.

When I brought up this issue a few months ago the consensus was that the
replacing approach was the way to go.  I've chosen 37 commonly used
characters to replace, but I'm sure on could get more fancy.
Thomas Weidenfeller - 16 May 2006 14:15 GMT
> Right now I have a crude hard coded method using a series of replaceAll's
> for removing accents and converting them to their approximate non accented
> equivalents.

There is no such thing as "non accented equivalents" for accented
characters in general. Accents, diaeresis, etc. on a character have a
meaning, they are not there for decorative reasons. A character without
its "decoration" is usually a complete different character than the
original.

In short, removing such information is a very bad idea.

> All I could find was a reference to an obscure unimplemented API.   I would
> go for a publicly available function though - anything has got to be better
> than the way I am doing it.  (If worse comes to worse I will copy my ANSI
> tables and make some "range" conversions)

If preserving the meaning of your input text data is not desirable, or,
in other words, you badly want to butcher your text, annoy native
speakers and apply an irreversible algorithm, then you could try the
following:

Get the Unicode Character Database (UCD), and build a particular
lookup-table from the information in the UCD:

Parse the UCD data to extract all characters which have explicit
decomposition information. Analyze the decomposition information (search
for the accents, diaeresis marks, etc. in the information). If you find
one, remove it from the decomposition map of the particular character.
If a single character remains in the map, consider that your result,
otherwise apply more decomposition.

Use that information to build a lookup table "original char -> butchered
char". Code that table into some Java class and provide a method to look
up characters in the map.

/Thomas
Signature

The comp.lang.java.gui FAQ:
ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq
http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/

Chris Uppal - 16 May 2006 15:45 GMT
> Get the Unicode Character Database (UCD), and build a particular
> lookup-table from the information in the UCD:
[quoted text clipped - 5 lines]
> If a single character remains in the map, consider that your result,
> otherwise apply more decomposition.

You could probably speed up this process considerably by using a pre-existing
Unicode package such as ICU:

   http://icu.sourceforge.net/

That would only save code, though.  You'd still have the problem of learning
what all the data means before you start to manipulate it.

Incidentally, why is ICU never mentioned around here ?  It looks like a very
solid and complete package -- although I admit I haven't actually /used/ it yet
(planning to do so fairly soon).

   -- chris
Thomas Weidenfeller - 16 May 2006 16:16 GMT
> You could probably speed up this process considerably by using a pre-existing
> Unicode package such as ICU:

I am not sure :-) My understanding of ICU after checking the
documentation is that it doesn't do the destructive thing the OP might
want to do. It looks more as if the authors of ICU tried very hard to
get every aspect of Unicode right. Mapping an accented character to a
single non-accented  "equivalent" is certainly not right in the scope of
Unicode, and also not in the scope of non-ascii languages.

The effort to invest in a solution also depends on how good the solution
has to be. Since the original text is anyhow supposed to be butchered, I
don't see a reason for 100% accuracy.

So, scripting the parsing of the UCD for finding the interesting values
should not take that much time. I would guess less than an hour. That
should include scripting of checking the decomposition values for these
"bad" accents (probably code points starting at 0x300 up to some value I
forgot). The result should be a map of a bunch of characters.

Some more scripting to get that output into a Java data structure, add a
lookup method, compile, and that's it.

> Incidentally, why is ICU never mentioned around here ?

Probably because people don't know about it (I didn't). And probably
because it solves problems not many people have each day.

/Thomas
Signature

The comp.lang.java.gui FAQ:
ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq
http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/

Luc The Perverse - 17 May 2006 00:18 GMT
>> You could probably speed up this process considerably by using a
>> pre-existing
[quoted text clipped - 6 lines]
> "equivalent" is certainly not right in the scope of Unicode, and also not
> in the scope of non-ascii languages.

I did say approximate representation.

The thing that annoys me the most is the song by Tool named ænima.

> The effort to invest in a solution also depends on how good the solution
> has to be. Since the original text is anyhow supposed to be butchered, I
> don't see a reason for 100% accuracy.

To be honest the solution that I have now works just fine.  I have hardcoded
ReplaceAlls for every character appearing in my file name scope.

>> Incidentally, why is ICU never mentioned around here ?
>
> Probably because people don't know about it (I didn't). And probably
> because it solves problems not many people have each day.

You are not the only one that had not heard about it.

Unicode is such a beautiful thing - I wonder why it takes so long to catch
on?

--
LTP

:)
Domagoj Klepac - 17 May 2006 10:52 GMT
>Unicode is such a beautiful thing - I wonder why it takes so long to catch
>on?

Because internationalization was always expensive, and the IT industry
was based in the Western countries, and it had no incentive to adapt
to the new, small markets. For example, I would guess that there is no
Esperanto Windows version. Or Esperanto Office. Not enough people
would buy it. And if even mighty Microsoft can't afford to do it,
other software vendors certainly can't.

Think of it - how many programs did you write, which export all GUI
text, and make it easy to translate it? And which are designed so that
both German translation and Arabic translation fit into design? And
mind you, German translation might be much longer than English text -
Germans are known for their excellent synchronization of foreign TV
shows, since German subtitles would take half of screen and are simply
not an option. And Arabic is written from left to right.

Though things regarding internationalization are changing rapidly
since the India and China had become great markets. Now most of the
Western-centered companies are struggling to get into Asia. And now
they have huge disadvantage regarding internationalization compared to
the native companies. What goes around comes around. :)

But I digress - what I wanted to say is that Java, with its native
Unicode support, has probably done more for Unicode adaptation than
all other languages combined. :) But Java is still new, and I suspect
that most of the world's codebase is in C/C++, where you're not
"forced" to use Unicode by default. Which is probably why Unicode
takes so long to catch on.

               Domchi

Signature

Ouroboros ltd. - http://www.ouroboros.hr 
Antispam: to reply, remove extra monkey from reply-to address.

Alex Buell - 17 May 2006 11:02 GMT
> Because internationalization was always expensive, and the IT industry
> was based in the Western countries, and it had no incentive to adapt
> to the new, small markets. For example, I would guess that there is no
> Esperanto Windows version. Or Esperanto Office. Not enough people
> would buy it. And if even mighty Microsoft can't afford to do it,
> other software vendors certainly can't.

Microsoft can afford to. They just aren't interested.

Signature

http://www.munted.org.uk

Take a nap, it saves lives.

Luc The Perverse - 17 May 2006 21:31 GMT
>> Because internationalization was always expensive, and the IT industry
>> was based in the Western countries, and it had no incentive to adapt
[quoted text clipped - 4 lines]
>
> Microsoft can afford to. They just aren't interested.

Microsoft is too "closed room development".  If they had allowed people to
make their own translations and then paid someone a small fee to "check it"
people would have jumped at the opportunity.

But they are not completely immune to suggestions.

I made a suggestion for a modification to the MSDN library which literally
would have turned a 5 hour job into a 5 minute job.  I didn't know what I
was looking for (a matter of semantics), and a simple "See Also" link at the
bottom would have saved me much headache.   About 17 months after I
"emailed" (it was a web form) them I received an email back saying that they
had accepted my suggestion and it would be included in the next version of
MSDN.

--
LTP

:)
Roedy Green - 22 May 2006 06:32 GMT
On Wed, 17 May 2006 14:31:08 -0600, "Luc The Perverse"
<sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote, quoted or indirectly
quoted someone who said :

> About 17 months after I
>"emailed" (it was a web form) them I received an email back saying that they
>had accepted my suggestion and it would be included in the next version of
>MSDN.
I am unworthy.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Luc The Perverse - 22 May 2006 06:59 GMT
> On Wed, 17 May 2006 14:31:08 -0600, "Luc The Perverse"
> <sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote, quoted or indirectly
[quoted text clipped - 6 lines]
>>MSDN.
> I am unworthy.

. . .

I can't tell - are you mocking me?

--
LTP

:)
Roedy Green - 23 May 2006 04:45 GMT
On Sun, 21 May 2006 23:59:11 -0600, "Luc The Perverse"
<sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote, quoted or indirectly
quoted someone who said :

>> I am unworthy.
>
>. . .
>
>I can't tell - are you mocking me?

I am making a reference to pop culture.  It very unusual to get a
major company to actually implement an outside idea.  It is quite a
feather in your cap.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Luc The Perverse - 23 May 2006 05:05 GMT
> On Sun, 21 May 2006 23:59:11 -0600, "Luc The Perverse"
> <sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote, quoted or indirectly
[quoted text clipped - 9 lines]
> major company to actually implement an outside idea.  It is quite a
> feather in your cap.

Well thank you.

Now if I could just get Taco Time to bring back (permanently) the Smokey
Southwest Chicken Burrito - the finest fast food item that has ever existed.

--
LTP

:)
Mickey Segal - 23 May 2006 16:07 GMT
> I am making a reference to pop culture.  It very unusual to get a
> major company to actually implement an outside idea.  It is quite a
> feather in your cap.

In my experience, Microsoft has implemented things I've suggested a
surprising number of times.  In some cases I know it was not just an obvious
idea that lots of other people had suggested already.

I never figured out how much of this success was from was asking for the
right things, reaching the right people or the similarity of my name to that
of a Seattle talk show host.
Roedy Green - 24 May 2006 00:49 GMT
On Tue, 23 May 2006 11:07:51 -0400, "Mickey Segal"
<not_monitored@example.com> wrote, quoted or indirectly quoted someone
who said :

>In my experience, Microsoft has implemented things I've suggested a
>surprising number of times.  In some cases I know it was not just an obvious
>idea that lots of other people had suggested already.

I wrote Walt Disney circa 1975 to explain how you could use computers
to automate the production of cartoons (an extremely labour intensive
process). They wrote back an unusually rude letter saying they weren't
interested in any ideas except those generated by Disney employees.

Around the same time, I wrote Polaroid a letter about how you might
incorporate image enhancement eventually in a home camera.  They were
much more polite. They sent back an inch thick contract to protect
them from me later asking them for money.  It was such a scary thing I
decided to let the matter drop.

Sun wrote me and asked me to serve on some sort of advisory board.  I
simply have not had the energy to deal with it. That would probably
give me a lot more leverage than simply posting ideas.

Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Luc The Perverse - 24 May 2006 00:54 GMT
>>In my experience, Microsoft has implemented things I've suggested a
>>surprising number of times.  In some cases I know it was not just an
[quoted text clipped - 5 lines]
> process). They wrote back an unusually rude letter saying they weren't
> interested in any ideas except those generated by Disney employees.

OMG.  Well they sure showed you.  See how the cartoon industry is thriving
without computers.

> Around the same time, I wrote Polaroid a letter about how you might
> incorporate image enhancement eventually in a home camera.  They were
> much more polite. They sent back an inch thick contract to protect
> them from me later asking them for money.  It was such a scary thing I
> decided to let the matter drop.

LOL

> Sun wrote me and asked me to serve on some sort of advisory board.  I
> simply have not had the energy to deal with it. That would probably
> give me a lot more leverage than simply posting ideas.

You should accept that one - assuming it doesn't require something which you
are physically incapable of.

--
LTP

:)
Mickey Segal - 24 May 2006 02:02 GMT
> Around the same time, I wrote Polaroid a letter about how you might
> incorporate image enhancement eventually in a home camera.  They were
> much more polite. They sent back an inch thick contract to protect
> them from me later asking them for money.  It was such a scary thing I
> decided to let the matter drop.

It may be no coincidence that Polaroid is gone.  There have, though, been
cases in which people who made suggestions have sued companies for a share
and won, so their concerns were not totally irrational, but there are less
bureaucratic ways of dealing with it.

> Sun wrote me and asked me to serve on some sort of advisory board.  I
> simply have not had the energy to deal with it. That would probably
> give me a lot more leverage than simply posting ideas.

I've had good experience with Sun too, but that may have something to do
with the seniority of the person I've contacted.
Dale King - 25 May 2006 15:22 GMT
>> Around the same time, I wrote Polaroid a letter about how you might
>> incorporate image enhancement eventually in a home camera.  They were
[quoted text clipped - 6 lines]
> and won, so their concerns were not totally irrational, but there are less
> bureaucratic ways of dealing with it.

Apple's policy for dealing with it is to reject any outside suggestions,
which unfortunately led to some negative publicity:

http://www.macnn.com/articles/06/04/17/apple.makes.girl.cry

Signature

 Dale King

Domagoj Klepac - 24 May 2006 10:21 GMT
>>In my experience, Microsoft has implemented things I've suggested a
>>surprising number of times.  In some cases I know it was not just an obvious
[quoted text clipped - 4 lines]
>process). They wrote back an unusually rude letter saying they weren't
>interested in any ideas except those generated by Disney employees.

I think that at that time Disney had a official view that cartoons
need to have a heart, which no machine can supply. They were also
employing a lot of animators and artists, and feared that with the
advent of computers, they would have to let most of them go. You
probably struck a nerve. :)

               Domchi

Signature

Ouroboros ltd. - http://www.ouroboros.hr 
Antispam: to reply, remove extra monkey from reply-to address.

Mickey Segal - 16 May 2006 16:40 GMT
> There is no such thing as "non accented equivalents" for accented
> characters in general. Accents, diaeresis, etc. on a character have a
[quoted text clipped - 7 lines]
> other words, you badly want to butcher your text, annoy native speakers
> and apply an irreversible algorithm, then you could try the following:

There are times in which it makes sense to remove accented characters, for
example when carrying out a search.  In our software we store names of
diseases in their fully accented forms.  However, users of our software may
be familiar with the name of the disease as rendered with unaccented
characters, or they may not know how to type accented characters.  We carry
out a user's search by removing accents from the search text and stripping
accents from the disease names for the purposes of the search.  We do not
store the stripped-down strings.
Luc The Perverse - 17 May 2006 00:30 GMT
>> There is no such thing as "non accented equivalents" for accented
>> characters in general. Accents, diaeresis, etc. on a character have a
[quoted text clipped - 17 lines]
> stripping accents from the disease names for the purposes of the search.
> We do not store the stripped-down strings.

Thomas seemed to think I was removing accents from a document.

The easiest solution would have been to rename all the files/titles of the
media so that I didn't have to deal with accented characters at all.    But
to me ça is not ca, muñeca is not muneca.   So I would have to agree with
him.

My desire to get a "better solution" is just to make my functions more
future ready (I hesitate to say future proof)

--
LTP

:)
Domagoj Klepac - 17 May 2006 09:59 GMT
>> Right now I have a crude hard coded method using a series of replaceAll's
>> for removing accents and converting them to their approximate non accented
[quoted text clipped - 7 lines]
>
>In short, removing such information is a very bad idea.

It depends, really.

We in non-English parts of the world, whose alphabet has more than 25
letters, have lived with that since the PC came along (Mac, on the
other hand, always had almost transparent internationalization
support). This has led to a wide adaptation of "non-accented
equivalents", in specific areas - for example, in Usenet posts, most
people don't use non-accented characters, although nowdays almost all
Usenet readers support it. But even the Windows XP might not be able
to access the file if the file name (or folder name) contains
non-English characters.

Mostly, those equivalents are simply the same character with stripped
accent. Sometimes, accented character maps to several ASCII
characters. For example, Croatian "ž" maps to "z", but "Ð" maps to
"Dj". Or, in German, "ß" maps to "ss". But all vowels, which make most
of the accented characters, (I think) can be simply stripped to their
non-accented counterparts.

Google does this, BTW. It searches for both accented and non-accented
version of the word you entered. That's a good example of a case where
removing accents is good.

               Domchi

Signature

Ouroboros ltd. - http://www.ouroboros.hr 
Antispam: to reply, remove extra monkey from reply-to address.

Patricia Shanahan - 17 May 2006 14:40 GMT
...
> Mostly, those equivalents are simply the same character with stripped
> accent. Sometimes, accented character maps to several ASCII
> characters. For example, Croatian "ž" maps to "z", but "Ð" maps to
> "Dj". Or, in German, "ß" maps to "ss". But all vowels, which make most
> of the accented characters, (I think) can be simply stripped to their
> non-accented counterparts.

What about o-umlaut and oe? As in "Goedel".

Google can deal with this - I tried a search for "Goedel" and got a
page,
http://www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Godel.html, that
spells the name with o-umlaut.

Patricia
Jaakko Kangasharju - 17 May 2006 15:54 GMT
> What about o-umlaut and oe? As in "Goedel".
>
> Google can deal with this - I tried a search for "Goedel" and got a
> page,
> http://www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Godel.html, that
> spells the name with o-umlaut.

Interesting: If you search for "Goedel", you get "Goedel" and "Gödel"
as matches, but if you search for "Gödel", the matches are "Gödel" and
"Godel".

Signature

Jaakko Kangasharju, Helsinki Institute for Information Technology
I + NT = Problem

Roedy Green - 21 May 2006 20:48 GMT
>What about o-umlaut and oe? As in "Goedel".
>
>Google can deal with this - I tried a search for "Goedel" and got a
>page,
>http://www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Godel.html, that
>spells the name with o-umlaut.

I have noticed that too, also finding variant spellings.  I have not
yet experimented to see if it is Google collapsing or the existence of
multiple variants buried in the text, perhaps in the keywords.

On most typos though Google asks if I really meant something else, but
on a popular "variant" spelling it does not.

I suspect I would have better luck looking up "seperater" that
"separator" even though it is the wrong spelling.

Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Hendrik Maryns - 16 May 2006 16:23 GMT
Luc The Perverse schreef:
> Right now I have a crude hard coded method using a series of replaceAll's
> for removing accents and converting them to their approximate non accented
> equivalents.
>
> Included are the Latin AE combo, the German double S etc.

I think this is impossible, because the conversion is very language
dependent: if you want to remove the ¨ from a German ä, it should become
ae, whereas in the more seldom case you encounter an ä in Dutch, the
best equivalent would be a.  Unless you just want to remove accents,
without trying to keep meaning, but then, why not just remove the
accented letters altogether?

H.
- --
Hendrik Maryns

==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
Luc The Perverse - 17 May 2006 00:07 GMT
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
[quoted text clipped - 13 lines]
> without trying to keep meaning, but then, why not just remove the
> accented letters altogether?

All I want to design is an accent insensitive search dictionary.   No
meaning needs to be preserved because the user will not see any converted
text.

I do not wish to remove the accent altogether, because I want the user to be
able to search with or without the accent.

For instance if I want to search for rückenwind, it will take me at least
several additional seconds to experiment with the ALT keys to find the right
character combination for ü.   (Although learning the character codes is
easier than attempting to learn alternate German spellings.)   It doesn't
mean I don't know that there is an accent there, it is just easier to type
ruckenwind.

And if you cringe at that, you might faint to see the approximate
transliterations that I have used to name my Russian MP3's!

--
LTP

:)
alexandre_paterson@yahoo.fr - 17 May 2006 03:35 GMT
...
> For instance if I want to search for rückenwind, it will take me at least
> several additional seconds to experiment with the ALT keys to find the right
[quoted text clipped - 5 lines]
> And if you cringe at that, you might faint to see the approximate
> transliterations that I have used to name my Russian MP3's!

You decided to go for the "accent-removing" way though there
are many different ways to achieve the functionality you're after.

To me this looks very similar to a spelling algorithm (even though
you're using several languages at once).

For every search word entered you could:
-check exact matches
-check "one typo" matches (eg "Pixes" instead of "Pixies")
-check "two typos" matches (eg "Spellshaker" instead of "Spellchecker")
-check all "sound-alike" words (using eg Soundex or Double-Metaphone)
-rank all the propositions found using these various techniques by
their "closeness" to the search term entered (using Levenhstein's
edit distance, implemented using a DP algorithm).

This should allow to enter really badly written band names/song
names and still find back the song(s) you're after.

For example, my home-made spellchecker's first proposition for
"paulitiquale" ("political" spelled in a strange french-phonetical
way) is, surely enough, "political" (note that Un*x' aspell/ispell
works the same way).

Now of course this is lots of work for a simple functionality, but
you could still use something similar but much much simpler
(take could be implemented very easily):

- for every correct name you have in your database, you create
a hashmap with all vowels removed

Pixies -> Pxs
Pink Floyd -> Pnk Fld
rückenwind -> rckwnd

etc.

Then, when someone enters a search term, *if you don't find
an exact match*, you try to remove every vowels
from the entered search terms and sees if this corresponds
to something in your hashmap.

You can get a little more fancy by checking if the entry (or
entries) in the hashmap really correspond to something close
to the entered search term by calculating the "edit distance"
between the two strings.

It really looks like what a spellchecker would do: the whole
point is not using a single technique, but a variety of
techniques.  Each one augmenting the probability that
the search gives back a meaningfull result.

FWIW,

Alex

P.S: this post directly edited in groups.google.com, without
bothering to copy&paste in a spellchecker ;)
Hendrik Maryns - 17 May 2006 09:08 GMT
alexandre_paterson@yahoo.fr schreef:
> ...
>> For instance if I want to search for rückenwind, it will take me at least
[quoted text clipped - 21 lines]
> their "closeness" to the search term entered (using Levenhstein's
> edit distance, implemented using a DP algorithm).

There surely are packages that do ‘fuzzy’ string matching (when not in
Java, then certainly in Perl (CPAN)).

> This should allow to enter really badly written band names/song
> names and still find back the song(s) you're after.
[quoted text clipped - 16 lines]
>
> etc.

Nce idea, but in the end you end up removing all letters: how about ñ
(Spanish), ý, ð, þ (Icelandic), ç (French), ß (German), ś (Czech?), and
those with a v on top of them, I can’t find them either...

And what to do if you want to look for this song: second on
http://en.wikipedia.org/wiki/Windowlicker#Track_listing ??  :-p

H.
- --
Hendrik Maryns

==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
alexandre_paterson@yahoo.fr - 17 May 2006 13:27 GMT
Hi Hendrik!

...

> > Pixies -> Pxs
> > Pink Floyd -> Pnk Fld
[quoted text clipped - 5 lines]
> (Spanish), ý, ð, þ (Icelandic), ç (French), ß (German), s (Czech?), and
> those with a v on top of them, I can't find them either...

I know... Which is why I said that the idea of a spellchecker is not
to use a single technique but a variety of techniques.

You've got a "result List<SortableString>", each technique adds
zero or more "propositions" to this List.

- You first try all correct matches
- You then try all "one letter typos" (one wrong letter, one letter
  not entered, one letter typed too much)
- You then try all "two letters typos" (still easily doable)
- You then try all close "no vowels" words
- etc.

Then you rank all those propositions according to their "edit
distance" to correct band names/song names.

It's not computationally intensive: at billions operations/second
you can get "creative". (it can be memory consuming though if
you've got hundreds of thousands of songs, then you need to
bypass Java's object [HashMap and String just won't cut it] and
use your own data structure. I know, for I got to 22 bits per
word on a spellchecker able to deal with dictionaries handling
hundreds of thousands of words -- try to do that with Jazzy ;)

And there's no "bad technique" either.  Each trick simply can
lead to propositions.

Imagine a really bad technique adds "gigolo" to the list of
proposition when the user entered "gogle" (looking for "google"),
well... The "one letter typo" check will have found "google".
So propositions will contains ?????, ?????, gigolo, ?????, google, ????

Then they get sorted by edit distance to the user's entered search
string:  google is close to gogle (edit distance of 1) while gigolo is
not that close to gogle.  It doesn't mean that "gigolo" isn't proposed
but simply that "google" is proposed first.

I'll re-write what I wrote in the previous post: the whole point is
not using a single technique, but a variety of techniques.  Each one
augmenting the probability that the search gives back a meaningfull
result.

> And what to do if you want to look for this song: second on
> http://en.wikipedia.org/wiki/Windowlicker#Track_listing ??  :-p

Ah, Aphex Twin :)

Well, for this one I admit I'm not so sure... I'd add a "special case"
and I would have the only spellchecker/search engine in the world
able to deal with that case!

;)
Hendrik Maryns - 17 May 2006 17:39 GMT
alexandre_paterson@yahoo.fr schreef:
> Hi Hendrik!
>
[quoted text clipped - 20 lines]
> - You then try all close "no vowels" words
> - etc.

<snip>

I agree that you’re on the right track here...

>> And what to do if you want to look for this song: second on
>> http://en.wikipedia.org/wiki/Windowlicker#Track_listing ??  :-p
[quoted text clipped - 4 lines]
> and I would have the only spellchecker/search engine in the world
> able to deal with that case!

It is commonly known as (A complex mathematical equation), with or
without brackets, or simply [mathematical equation].  The whole song is
a joke, though, it is just a picture of his face converted with some
pic-to-audio program.  Nevertheless, I like it.

H.
- --
Hendrik Maryns

==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
Real Gagnon - 16 May 2006 22:55 GMT
> All I could find was a reference to an obscure unimplemented API.   I
> would go for a publicly available function though - anything has got
> to be better than the way I am doing it.  (If worse comes to worse I
> will copy my ANSI tables and make some "range" conversions)

You can go for the "obscure API" (Sun or IBM), do a bunch of replaceAll()
or use an Array lookup.

An example of each can be found at
http://www.rgagnon.com/javadetails/java-0456.html

Bye.
Signature

Real Gagnon  from  Quebec, Canada
* Looking for Java or PB code examples ? Visit Real's How-to  
* http://www.rgagnon.com/howto.html

Luc The Perverse - 17 May 2006 00:36 GMT
>> All I could find was a reference to an obscure unimplemented API.   I
>> would go for a publicly available function though - anything has got
[quoted text clipped - 6 lines]
> An example of each can be found at
> http://www.rgagnon.com/javadetails/java-0456.html

Awesome!  That is exactly what I was looking for.

Now I can search for "Belanger" next time I want to listen to "La
Parapluie"! (But to be honest, é is one of the truly natural key sequences
that I have memorized.)

--
LTP

:)
Roedy Green - 21 May 2006 19:28 GMT
On Tue, 16 May 2006 06:09:19 -0600, "Luc The Perverse"
<sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote, quoted or indirectly
quoted someone who said :

>Right now I have a crude hard coded method using a series of replaceAll's
>for removing accents and converting them to their approximate non accented
[quoted text clipped - 4 lines]
>I know that Java understands these characters because when I make a string
>lowercase it will convert the capital AE to a lowercase ae.

Boyer Moore has a version that searches for multiple strings at once.

You could do it with a translate table of bytes. You index by char and
get a classifying byte.. You use the byte to index a delegate method
or switch method to deal with the class.  

e.g.
0 = leave alone, append char to StringBuilder.

1 = check if next letter is e or o and convert to ligature ae or ao,
arrange so next letter is ignored.

2 = check if next letter is s, and convert to Eszett

Now you just loop through the string once, classifying each char and
calling the corresponding method,

Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Dave Glasser - 24 May 2006 04:46 GMT
"Luc The Perverse" <sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote on
Tue, 16 May 2006 06:09:19 -0600 in comp.lang.java.programmer:

>Right now I have a crude hard coded method using a series of replaceAll's
>for removing accents and converting them to their approximate non accented
[quoted text clipped - 9 lines]
>than the way I am doing it.  (If worse comes to worse I will copy my ANSI
>tables and make some "range" conversions)

I realize I'm a latecomer to this thread, but has anyone suggested
using a java.text.Collator? You could probably use one to build your
lookup table (that maps accented characters to their non-accented
equivalents) one time and store it in memory. At least that way it
would deal with different character sets and locales without changing
the code. I don't know how you would handle the combination characters
with a Collator, though, without knowing in advance which ones a
character set contains.

Signature

Check out QueryForm, a free, open source, Java/Swing-based
front end for relational databases.

http://qform.sourceforge.net

If you're a musician, check out RPitch Relative Pitch
Ear Training Software.

http://rpitch.sourceforge.net

Luc The Perverse - 24 May 2006 06:23 GMT
> "Luc The Perverse" <sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote on
> Tue, 16 May 2006 06:09:19 -0600 in comp.lang.java.programmer:
[quoted text clipped - 23 lines]
> with a Collator, though, without knowing in advance which ones a
> character set contains.

I looked it up on the Java glossary and while I don't claim to fully
understand it, it looks a little complicated for my needs.

--
LTP

:)


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.