Java Forum / General / May 2006
Function for removing Accents?
Luc The Perverse - 16 May 2006 13:09 GMT Right now I have a crude hard coded method using a series of replaceAll's for removing accents and converting them to their approximate non accented equivalents.
Included are the Latin AE combo, the German double S etc.
I know that Java understands these characters because when I make a string lowercase it will convert the capital AE to a lowercase ae.
All I could find was a reference to an obscure unimplemented API. I would go for a publicly available function though - anything has got to be better than the way I am doing it. (If worse comes to worse I will copy my ANSI tables and make some "range" conversions)
 Signature LTP
:) Chris Uppal - 16 May 2006 13:36 GMT > Right now I have a crude hard coded method using a series of replaceAll's > for removing accents and converting them to their approximate non accented > equivalents. I doubt if that is either meaningful or possible in general. It is certainly not easy.
If you try it at all, and are not satisfied with a handful of hardwired mappings, then you'll probably have to get deeply into Unicode. See: http://www.unicode.org/reports/tr15/index.html for one of the Unicode technical reports which discusses decomposition of characters into base characters plus various kinds of diacritical marks. You could presumably filter out characters representing diacritical marks leaving the character which was qualified by the marks in place.
-- chris
Luc The Perverse - 17 May 2006 00:14 GMT >> Right now I have a crude hard coded method using a series of replaceAll's >> for removing accents and converting them to their approximate non [quoted text clipped - 4 lines] > certainly > not easy. My wife does not know to push ALT 0 2 4 4 for an ñ, so she would have a hard time searching for Piña Coladas if she wanted to hear that Garth Brooks' song.
> If you try it at all, and are not satisfied with a handful of hardwired > mappings, then you'll probably have to get deeply into Unicode. See: [quoted text clipped - 5 lines] > leaving > the character which was qualified by the marks in place. AH! I don't want to "learn" unicode. Especially after looking at that link!
Perhaps I should have explained in more detail the scope of my question.
-- LTP
:) Chris Uppal - 17 May 2006 10:08 GMT > My wife does not know to push ALT 0 2 4 4 for an ñ, so she would have a > hard time searching for Piña Coladas if she wanted to hear that Garth > Brooks' song. In that case a little properties file containing mappings should do the job. Add new mappings as and when the need arises. You could allow multiple candidate replacements, so that your aenima example (which I can't type in to this newsreader correctly) could be matched by any of the correct version aenima anima enima
> I don't want to "learn" unicode. Odd that you should say that; quite a lot of programmers seem to feel the same...
-- chris
bugbear - 16 May 2006 13:40 GMT > Right now I have a crude hard coded method using a series of replaceAll's > for removing accents and converting them to their approximate non accented [quoted text clipped - 9 lines] > than the way I am doing it. (If worse comes to worse I will copy my ANSI > tables and make some "range" conversions) http://glaforge.free.fr/weblog/index.php?itemid=115
looks informative.
http://www.alphaworks.ibm.com/tech/unicodenormalizer
looks helpful
BugBear
Mickey Segal - 16 May 2006 13:42 GMT > Right now I have a crude hard coded method using a series of replaceAll's > for removing accents and converting them to their approximate non accented > equivalents. When I brought up this issue a few months ago the consensus was that the replacing approach was the way to go. I've chosen 37 commonly used characters to replace, but I'm sure on could get more fancy.
Thomas Weidenfeller - 16 May 2006 14:15 GMT > Right now I have a crude hard coded method using a series of replaceAll's > for removing accents and converting them to their approximate non accented > equivalents. There is no such thing as "non accented equivalents" for accented characters in general. Accents, diaeresis, etc. on a character have a meaning, they are not there for decorative reasons. A character without its "decoration" is usually a complete different character than the original.
In short, removing such information is a very bad idea.
> All I could find was a reference to an obscure unimplemented API. I would > go for a publicly available function though - anything has got to be better > than the way I am doing it. (If worse comes to worse I will copy my ANSI > tables and make some "range" conversions) If preserving the meaning of your input text data is not desirable, or, in other words, you badly want to butcher your text, annoy native speakers and apply an irreversible algorithm, then you could try the following:
Get the Unicode Character Database (UCD), and build a particular lookup-table from the information in the UCD:
Parse the UCD data to extract all characters which have explicit decomposition information. Analyze the decomposition information (search for the accents, diaeresis marks, etc. in the information). If you find one, remove it from the decomposition map of the particular character. If a single character remains in the map, consider that your result, otherwise apply more decomposition.
Use that information to build a lookup table "original char -> butchered char". Code that table into some Java class and provide a method to look up characters in the map.
/Thomas
 Signature The comp.lang.java.gui FAQ: ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/
Chris Uppal - 16 May 2006 15:45 GMT > Get the Unicode Character Database (UCD), and build a particular > lookup-table from the information in the UCD: [quoted text clipped - 5 lines] > If a single character remains in the map, consider that your result, > otherwise apply more decomposition. You could probably speed up this process considerably by using a pre-existing Unicode package such as ICU:
http://icu.sourceforge.net/
That would only save code, though. You'd still have the problem of learning what all the data means before you start to manipulate it.
Incidentally, why is ICU never mentioned around here ? It looks like a very solid and complete package -- although I admit I haven't actually /used/ it yet (planning to do so fairly soon).
-- chris
Thomas Weidenfeller - 16 May 2006 16:16 GMT > You could probably speed up this process considerably by using a pre-existing > Unicode package such as ICU: I am not sure :-) My understanding of ICU after checking the documentation is that it doesn't do the destructive thing the OP might want to do. It looks more as if the authors of ICU tried very hard to get every aspect of Unicode right. Mapping an accented character to a single non-accented "equivalent" is certainly not right in the scope of Unicode, and also not in the scope of non-ascii languages.
The effort to invest in a solution also depends on how good the solution has to be. Since the original text is anyhow supposed to be butchered, I don't see a reason for 100% accuracy.
So, scripting the parsing of the UCD for finding the interesting values should not take that much time. I would guess less than an hour. That should include scripting of checking the decomposition values for these "bad" accents (probably code points starting at 0x300 up to some value I forgot). The result should be a map of a bunch of characters.
Some more scripting to get that output into a Java data structure, add a lookup method, compile, and that's it.
> Incidentally, why is ICU never mentioned around here ? Probably because people don't know about it (I didn't). And probably because it solves problems not many people have each day.
/Thomas
 Signature The comp.lang.java.gui FAQ: ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/
Luc The Perverse - 17 May 2006 00:18 GMT >> You could probably speed up this process considerably by using a >> pre-existing [quoted text clipped - 6 lines] > "equivalent" is certainly not right in the scope of Unicode, and also not > in the scope of non-ascii languages. I did say approximate representation.
The thing that annoys me the most is the song by Tool named ænima.
> The effort to invest in a solution also depends on how good the solution > has to be. Since the original text is anyhow supposed to be butchered, I > don't see a reason for 100% accuracy. To be honest the solution that I have now works just fine. I have hardcoded ReplaceAlls for every character appearing in my file name scope.
>> Incidentally, why is ICU never mentioned around here ? > > Probably because people don't know about it (I didn't). And probably > because it solves problems not many people have each day. You are not the only one that had not heard about it.
Unicode is such a beautiful thing - I wonder why it takes so long to catch on?
-- LTP
:) Domagoj Klepac - 17 May 2006 10:52 GMT >Unicode is such a beautiful thing - I wonder why it takes so long to catch >on? Because internationalization was always expensive, and the IT industry was based in the Western countries, and it had no incentive to adapt to the new, small markets. For example, I would guess that there is no Esperanto Windows version. Or Esperanto Office. Not enough people would buy it. And if even mighty Microsoft can't afford to do it, other software vendors certainly can't.
Think of it - how many programs did you write, which export all GUI text, and make it easy to translate it? And which are designed so that both German translation and Arabic translation fit into design? And mind you, German translation might be much longer than English text - Germans are known for their excellent synchronization of foreign TV shows, since German subtitles would take half of screen and are simply not an option. And Arabic is written from left to right.
Though things regarding internationalization are changing rapidly since the India and China had become great markets. Now most of the Western-centered companies are struggling to get into Asia. And now they have huge disadvantage regarding internationalization compared to the native companies. What goes around comes around. :)
But I digress - what I wanted to say is that Java, with its native Unicode support, has probably done more for Unicode adaptation than all other languages combined. :) But Java is still new, and I suspect that most of the world's codebase is in C/C++, where you're not "forced" to use Unicode by default. Which is probably why Unicode takes so long to catch on.
Domchi
 Signature Ouroboros ltd. - http://www.ouroboros.hr Antispam: to reply, remove extra monkey from reply-to address.
Alex Buell - 17 May 2006 11:02 GMT > Because internationalization was always expensive, and the IT industry > was based in the Western countries, and it had no incentive to adapt > to the new, small markets. For example, I would guess that there is no > Esperanto Windows version. Or Esperanto Office. Not enough people > would buy it. And if even mighty Microsoft can't afford to do it, > other software vendors certainly can't. Microsoft can afford to. They just aren't interested.
 Signature http://www.munted.org.uk
Take a nap, it saves lives.
Luc The Perverse - 17 May 2006 21:31 GMT >> Because internationalization was always expensive, and the IT industry >> was based in the Western countries, and it had no incentive to adapt [quoted text clipped - 4 lines] > > Microsoft can afford to. They just aren't interested. Microsoft is too "closed room development". If they had allowed people to make their own translations and then paid someone a small fee to "check it" people would have jumped at the opportunity.
But they are not completely immune to suggestions.
I made a suggestion for a modification to the MSDN library which literally would have turned a 5 hour job into a 5 minute job. I didn't know what I was looking for (a matter of semantics), and a simple "See Also" link at the bottom would have saved me much headache. About 17 months after I "emailed" (it was a web form) them I received an email back saying that they had accepted my suggestion and it would be included in the next version of MSDN.
-- LTP
:) Roedy Green - 22 May 2006 06:32 GMT On Wed, 17 May 2006 14:31:08 -0600, "Luc The Perverse" <sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote, quoted or indirectly quoted someone who said :
> About 17 months after I >"emailed" (it was a web form) them I received an email back saying that they >had accepted my suggestion and it would be included in the next version of >MSDN. I am unworthy.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Luc The Perverse - 22 May 2006 06:59 GMT > On Wed, 17 May 2006 14:31:08 -0600, "Luc The Perverse" > <sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote, quoted or indirectly [quoted text clipped - 6 lines] >>MSDN. > I am unworthy. . . .
I can't tell - are you mocking me?
-- LTP
:) Roedy Green - 23 May 2006 04:45 GMT On Sun, 21 May 2006 23:59:11 -0600, "Luc The Perverse" <sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote, quoted or indirectly quoted someone who said :
>> I am unworthy. > >. . . > >I can't tell - are you mocking me? I am making a reference to pop culture. It very unusual to get a major company to actually implement an outside idea. It is quite a feather in your cap.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Luc The Perverse - 23 May 2006 05:05 GMT > On Sun, 21 May 2006 23:59:11 -0600, "Luc The Perverse" > <sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote, quoted or indirectly [quoted text clipped - 9 lines] > major company to actually implement an outside idea. It is quite a > feather in your cap. Well thank you.
Now if I could just get Taco Time to bring back (permanently) the Smokey Southwest Chicken Burrito - the finest fast food item that has ever existed.
-- LTP
:) Mickey Segal - 23 May 2006 16:07 GMT > I am making a reference to pop culture. It very unusual to get a > major company to actually implement an outside idea. It is quite a > feather in your cap. In my experience, Microsoft has implemented things I've suggested a surprising number of times. In some cases I know it was not just an obvious idea that lots of other people had suggested already.
I never figured out how much of this success was from was asking for the right things, reaching the right people or the similarity of my name to that of a Seattle talk show host.
Roedy Green - 24 May 2006 00:49 GMT On Tue, 23 May 2006 11:07:51 -0400, "Mickey Segal" <not_monitored@example.com> wrote, quoted or indirectly quoted someone who said :
>In my experience, Microsoft has implemented things I've suggested a >surprising number of times. In some cases I know it was not just an obvious >idea that lots of other people had suggested already. I wrote Walt Disney circa 1975 to explain how you could use computers to automate the production of cartoons (an extremely labour intensive process). They wrote back an unusually rude letter saying they weren't interested in any ideas except those generated by Disney employees.
Around the same time, I wrote Polaroid a letter about how you might incorporate image enhancement eventually in a home camera. They were much more polite. They sent back an inch thick contract to protect them from me later asking them for money. It was such a scary thing I decided to let the matter drop.
Sun wrote me and asked me to serve on some sort of advisory board. I simply have not had the energy to deal with it. That would probably give me a lot more leverage than simply posting ideas.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Luc The Perverse - 24 May 2006 00:54 GMT >>In my experience, Microsoft has implemented things I've suggested a >>surprising number of times. In some cases I know it was not just an [quoted text clipped - 5 lines] > process). They wrote back an unusually rude letter saying they weren't > interested in any ideas except those generated by Disney employees. OMG. Well they sure showed you. See how the cartoon industry is thriving without computers.
> Around the same time, I wrote Polaroid a letter about how you might > incorporate image enhancement eventually in a home camera. They were > much more polite. They sent back an inch thick contract to protect > them from me later asking them for money. It was such a scary thing I > decided to let the matter drop. LOL
> Sun wrote me and asked me to serve on some sort of advisory board. I > simply have not had the energy to deal with it. That would probably > give me a lot more leverage than simply posting ideas. You should accept that one - assuming it doesn't require something which you are physically incapable of.
-- LTP
:) Mickey Segal - 24 May 2006 02:02 GMT > Around the same time, I wrote Polaroid a letter about how you might > incorporate image enhancement eventually in a home camera. They were > much more polite. They sent back an inch thick contract to protect > them from me later asking them for money. It was such a scary thing I > decided to let the matter drop. It may be no coincidence that Polaroid is gone. There have, though, been cases in which people who made suggestions have sued companies for a share and won, so their concerns were not totally irrational, but there are less bureaucratic ways of dealing with it.
> Sun wrote me and asked me to serve on some sort of advisory board. I > simply have not had the energy to deal with it. That would probably > give me a lot more leverage than simply posting ideas. I've had good experience with Sun too, but that may have something to do with the seniority of the person I've contacted.
Dale King - 25 May 2006 15:22 GMT >> Around the same time, I wrote Polaroid a letter about how you might >> incorporate image enhancement eventually in a home camera. They were [quoted text clipped - 6 lines] > and won, so their concerns were not totally irrational, but there are less > bureaucratic ways of dealing with it. Apple's policy for dealing with it is to reject any outside suggestions, which unfortunately led to some negative publicity:
http://www.macnn.com/articles/06/04/17/apple.makes.girl.cry
 Signature Dale King
Domagoj Klepac - 24 May 2006 10:21 GMT >>In my experience, Microsoft has implemented things I've suggested a >>surprising number of times. In some cases I know it was not just an obvious [quoted text clipped - 4 lines] >process). They wrote back an unusually rude letter saying they weren't >interested in any ideas except those generated by Disney employees. I think that at that time Disney had a official view that cartoons need to have a heart, which no machine can supply. They were also employing a lot of animators and artists, and feared that with the advent of computers, they would have to let most of them go. You probably struck a nerve. :)
Domchi
 Signature Ouroboros ltd. - http://www.ouroboros.hr Antispam: to reply, remove extra monkey from reply-to address.
Mickey Segal - 16 May 2006 16:40 GMT > There is no such thing as "non accented equivalents" for accented > characters in general. Accents, diaeresis, etc. on a character have a [quoted text clipped - 7 lines] > other words, you badly want to butcher your text, annoy native speakers > and apply an irreversible algorithm, then you could try the following: There are times in which it makes sense to remove accented characters, for example when carrying out a search. In our software we store names of diseases in their fully accented forms. However, users of our software may be familiar with the name of the disease as rendered with unaccented characters, or they may not know how to type accented characters. We carry out a user's search by removing accents from the search text and stripping accents from the disease names for the purposes of the search. We do not store the stripped-down strings.
Luc The Perverse - 17 May 2006 00:30 GMT >> There is no such thing as "non accented equivalents" for accented >> characters in general. Accents, diaeresis, etc. on a character have a [quoted text clipped - 17 lines] > stripping accents from the disease names for the purposes of the search. > We do not store the stripped-down strings. Thomas seemed to think I was removing accents from a document.
The easiest solution would have been to rename all the files/titles of the media so that I didn't have to deal with accented characters at all. But to me ça is not ca, muñeca is not muneca. So I would have to agree with him.
My desire to get a "better solution" is just to make my functions more future ready (I hesitate to say future proof)
-- LTP
:) Domagoj Klepac - 17 May 2006 09:59 GMT >> Right now I have a crude hard coded method using a series of replaceAll's >> for removing accents and converting them to their approximate non accented [quoted text clipped - 7 lines] > >In short, removing such information is a very bad idea. It depends, really.
We in non-English parts of the world, whose alphabet has more than 25 letters, have lived with that since the PC came along (Mac, on the other hand, always had almost transparent internationalization support). This has led to a wide adaptation of "non-accented equivalents", in specific areas - for example, in Usenet posts, most people don't use non-accented characters, although nowdays almost all Usenet readers support it. But even the Windows XP might not be able to access the file if the file name (or folder name) contains non-English characters.
Mostly, those equivalents are simply the same character with stripped accent. Sometimes, accented character maps to several ASCII characters. For example, Croatian "" maps to "z", but "Ð" maps to "Dj". Or, in German, "ß" maps to "ss". But all vowels, which make most of the accented characters, (I think) can be simply stripped to their non-accented counterparts.
Google does this, BTW. It searches for both accented and non-accented version of the word you entered. That's a good example of a case where removing accents is good.
Domchi
 Signature Ouroboros ltd. - http://www.ouroboros.hr Antispam: to reply, remove extra monkey from reply-to address.
Patricia Shanahan - 17 May 2006 14:40 GMT ...
> Mostly, those equivalents are simply the same character with stripped > accent. Sometimes, accented character maps to several ASCII > characters. For example, Croatian "ž" maps to "z", but "Ð" maps to > "Dj". Or, in German, "ß" maps to "ss". But all vowels, which make most > of the accented characters, (I think) can be simply stripped to their > non-accented counterparts. What about o-umlaut and oe? As in "Goedel".
Google can deal with this - I tried a search for "Goedel" and got a page, http://www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Godel.html, that spells the name with o-umlaut.
Patricia
Jaakko Kangasharju - 17 May 2006 15:54 GMT > What about o-umlaut and oe? As in "Goedel". > > Google can deal with this - I tried a search for "Goedel" and got a > page, > http://www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Godel.html, that > spells the name with o-umlaut. Interesting: If you search for "Goedel", you get "Goedel" and "Gödel" as matches, but if you search for "Gödel", the matches are "Gödel" and "Godel".
 Signature Jaakko Kangasharju, Helsinki Institute for Information Technology I + NT = Problem
Roedy Green - 21 May 2006 20:48 GMT >What about o-umlaut and oe? As in "Goedel". > >Google can deal with this - I tried a search for "Goedel" and got a >page, >http://www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Godel.html, that >spells the name with o-umlaut. I have noticed that too, also finding variant spellings. I have not yet experimented to see if it is Google collapsing or the existence of multiple variants buried in the text, perhaps in the keywords.
On most typos though Google asks if I really meant something else, but on a popular "variant" spelling it does not.
I suspect I would have better luck looking up "seperater" that "separator" even though it is the wrong spelling.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Hendrik Maryns - 16 May 2006 16:23 GMT Luc The Perverse schreef:
> Right now I have a crude hard coded method using a series of replaceAll's > for removing accents and converting them to their approximate non accented > equivalents. > > Included are the Latin AE combo, the German double S etc. I think this is impossible, because the conversion is very language dependent: if you want to remove the ¨ from a German ä, it should become ae, whereas in the more seldom case you encounter an ä in Dutch, the best equivalent would be a. Unless you just want to remove accents, without trying to keep meaning, but then, why not just remove the accented letters altogether?
H. - -- Hendrik Maryns
================== http://aouw.org Ask smart questions, get good answers: http://www.catb.org/~esr/faqs/smart-questions.html
Luc The Perverse - 17 May 2006 00:07 GMT > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 [quoted text clipped - 13 lines] > without trying to keep meaning, but then, why not just remove the > accented letters altogether? All I want to design is an accent insensitive search dictionary. No meaning needs to be preserved because the user will not see any converted text.
I do not wish to remove the accent altogether, because I want the user to be able to search with or without the accent.
For instance if I want to search for rückenwind, it will take me at least several additional seconds to experiment with the ALT keys to find the right character combination for ü. (Although learning the character codes is easier than attempting to learn alternate German spellings.) It doesn't mean I don't know that there is an accent there, it is just easier to type ruckenwind.
And if you cringe at that, you might faint to see the approximate transliterations that I have used to name my Russian MP3's!
-- LTP
:) alexandre_paterson@yahoo.fr - 17 May 2006 03:35 GMT ...
> For instance if I want to search for rückenwind, it will take me at least > several additional seconds to experiment with the ALT keys to find the right [quoted text clipped - 5 lines] > And if you cringe at that, you might faint to see the approximate > transliterations that I have used to name my Russian MP3's! You decided to go for the "accent-removing" way though there are many different ways to achieve the functionality you're after.
To me this looks very similar to a spelling algorithm (even though you're using several languages at once).
For every search word entered you could: -check exact matches -check "one typo" matches (eg "Pixes" instead of "Pixies") -check "two typos" matches (eg "Spellshaker" instead of "Spellchecker") -check all "sound-alike" words (using eg Soundex or Double-Metaphone) -rank all the propositions found using these various techniques by their "closeness" to the search term entered (using Levenhstein's edit distance, implemented using a DP algorithm).
This should allow to enter really badly written band names/song names and still find back the song(s) you're after.
For example, my home-made spellchecker's first proposition for "paulitiquale" ("political" spelled in a strange french-phonetical way) is, surely enough, "political" (note that Un*x' aspell/ispell works the same way).
Now of course this is lots of work for a simple functionality, but you could still use something similar but much much simpler (take could be implemented very easily):
- for every correct name you have in your database, you create a hashmap with all vowels removed
Pixies -> Pxs Pink Floyd -> Pnk Fld rückenwind -> rckwnd
etc.
Then, when someone enters a search term, *if you don't find an exact match*, you try to remove every vowels from the entered search terms and sees if this corresponds to something in your hashmap.
You can get a little more fancy by checking if the entry (or entries) in the hashmap really correspond to something close to the entered search term by calculating the "edit distance" between the two strings.
It really looks like what a spellchecker would do: the whole point is not using a single technique, but a variety of techniques. Each one augmenting the probability that the search gives back a meaningfull result.
FWIW,
Alex
P.S: this post directly edited in groups.google.com, without bothering to copy&paste in a spellchecker ;)
Hendrik Maryns - 17 May 2006 09:08 GMT alexandre_paterson@yahoo.fr schreef:
> ... >> For instance if I want to search for rückenwind, it will take me at least [quoted text clipped - 21 lines] > their "closeness" to the search term entered (using Levenhstein's > edit distance, implemented using a DP algorithm). There surely are packages that do ‘fuzzy’ string matching (when not in Java, then certainly in Perl (CPAN)).
> This should allow to enter really badly written band names/song > names and still find back the song(s) you're after. [quoted text clipped - 16 lines] > > etc. Nce idea, but in the end you end up removing all letters: how about ñ (Spanish), ý, ð, þ (Icelandic), ç (French), ß (German), ś (Czech?), and those with a v on top of them, I can’t find them either...
And what to do if you want to look for this song: second on http://en.wikipedia.org/wiki/Windowlicker#Track_listing ?? :-p
H. - -- Hendrik Maryns
================== http://aouw.org Ask smart questions, get good answers: http://www.catb.org/~esr/faqs/smart-questions.html
alexandre_paterson@yahoo.fr - 17 May 2006 13:27 GMT Hi Hendrik!
...
> > Pixies -> Pxs > > Pink Floyd -> Pnk Fld [quoted text clipped - 5 lines] > (Spanish), ý, ð, þ (Icelandic), ç (French), ß (German), s (Czech?), and > those with a v on top of them, I can't find them either... I know... Which is why I said that the idea of a spellchecker is not to use a single technique but a variety of techniques.
You've got a "result List<SortableString>", each technique adds zero or more "propositions" to this List.
- You first try all correct matches - You then try all "one letter typos" (one wrong letter, one letter not entered, one letter typed too much) - You then try all "two letters typos" (still easily doable) - You then try all close "no vowels" words - etc.
Then you rank all those propositions according to their "edit distance" to correct band names/song names.
It's not computationally intensive: at billions operations/second you can get "creative". (it can be memory consuming though if you've got hundreds of thousands of songs, then you need to bypass Java's object [HashMap and String just won't cut it] and use your own data structure. I know, for I got to 22 bits per word on a spellchecker able to deal with dictionaries handling hundreds of thousands of words -- try to do that with Jazzy ;)
And there's no "bad technique" either. Each trick simply can lead to propositions.
Imagine a really bad technique adds "gigolo" to the list of proposition when the user entered "gogle" (looking for "google"), well... The "one letter typo" check will have found "google". So propositions will contains ?????, ?????, gigolo, ?????, google, ????
Then they get sorted by edit distance to the user's entered search string: google is close to gogle (edit distance of 1) while gigolo is not that close to gogle. It doesn't mean that "gigolo" isn't proposed but simply that "google" is proposed first.
I'll re-write what I wrote in the previous post: the whole point is not using a single technique, but a variety of techniques. Each one augmenting the probability that the search gives back a meaningfull result.
> And what to do if you want to look for this song: second on > http://en.wikipedia.org/wiki/Windowlicker#Track_listing ?? :-p Ah, Aphex Twin :)
Well, for this one I admit I'm not so sure... I'd add a "special case" and I would have the only spellchecker/search engine in the world able to deal with that case!
;)
Hendrik Maryns - 17 May 2006 17:39 GMT alexandre_paterson@yahoo.fr schreef:
> Hi Hendrik! > [quoted text clipped - 20 lines] > - You then try all close "no vowels" words > - etc. <snip>
I agree that you’re on the right track here...
>> And what to do if you want to look for this song: second on >> http://en.wikipedia.org/wiki/Windowlicker#Track_listing ?? :-p [quoted text clipped - 4 lines] > and I would have the only spellchecker/search engine in the world > able to deal with that case! It is commonly known as (A complex mathematical equation), with or without brackets, or simply [mathematical equation]. The whole song is a joke, though, it is just a picture of his face converted with some pic-to-audio program. Nevertheless, I like it.
H. - -- Hendrik Maryns
================== http://aouw.org Ask smart questions, get good answers: http://www.catb.org/~esr/faqs/smart-questions.html
Real Gagnon - 16 May 2006 22:55 GMT > All I could find was a reference to an obscure unimplemented API. I > would go for a publicly available function though - anything has got > to be better than the way I am doing it. (If worse comes to worse I > will copy my ANSI tables and make some "range" conversions) You can go for the "obscure API" (Sun or IBM), do a bunch of replaceAll() or use an Array lookup.
An example of each can be found at http://www.rgagnon.com/javadetails/java-0456.html
Bye.
 Signature Real Gagnon from Quebec, Canada * Looking for Java or PB code examples ? Visit Real's How-to * http://www.rgagnon.com/howto.html
Luc The Perverse - 17 May 2006 00:36 GMT >> All I could find was a reference to an obscure unimplemented API. I >> would go for a publicly available function though - anything has got [quoted text clipped - 6 lines] > An example of each can be found at > http://www.rgagnon.com/javadetails/java-0456.html Awesome! That is exactly what I was looking for.
Now I can search for "Belanger" next time I want to listen to "La Parapluie"! (But to be honest, é is one of the truly natural key sequences that I have memorized.)
-- LTP
:) Roedy Green - 21 May 2006 19:28 GMT On Tue, 16 May 2006 06:09:19 -0600, "Luc The Perverse" <sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote, quoted or indirectly quoted someone who said :
>Right now I have a crude hard coded method using a series of replaceAll's >for removing accents and converting them to their approximate non accented [quoted text clipped - 4 lines] >I know that Java understands these characters because when I make a string >lowercase it will convert the capital AE to a lowercase ae. Boyer Moore has a version that searches for multiple strings at once.
You could do it with a translate table of bytes. You index by char and get a classifying byte.. You use the byte to index a delegate method or switch method to deal with the class.
e.g. 0 = leave alone, append char to StringBuilder.
1 = check if next letter is e or o and convert to ligature ae or ao, arrange so next letter is ignored.
2 = check if next letter is s, and convert to Eszett
Now you just loop through the string once, classifying each char and calling the corresponding method,
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Dave Glasser - 24 May 2006 04:46 GMT "Luc The Perverse" <sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote on Tue, 16 May 2006 06:09:19 -0600 in comp.lang.java.programmer:
>Right now I have a crude hard coded method using a series of replaceAll's >for removing accents and converting them to their approximate non accented [quoted text clipped - 9 lines] >than the way I am doing it. (If worse comes to worse I will copy my ANSI >tables and make some "range" conversions) I realize I'm a latecomer to this thread, but has anyone suggested using a java.text.Collator? You could probably use one to build your lookup table (that maps accented characters to their non-accented equivalents) one time and store it in memory. At least that way it would deal with different character sets and locales without changing the code. I don't know how you would handle the combination characters with a Collator, though, without knowing in advance which ones a character set contains.
 Signature Check out QueryForm, a free, open source, Java/Swing-based front end for relational databases.
http://qform.sourceforge.net
If you're a musician, check out RPitch Relative Pitch Ear Training Software.
http://rpitch.sourceforge.net
Luc The Perverse - 24 May 2006 06:23 GMT > "Luc The Perverse" <sll_noSpamlicious_z_XXX_m@cc.usu.edu> wrote on > Tue, 16 May 2006 06:09:19 -0600 in comp.lang.java.programmer: [quoted text clipped - 23 lines] > with a Collator, though, without knowing in advance which ones a > character set contains. I looked it up on the Java glossary and while I don't claim to fully understand it, it looks a little complicated for my needs.
-- LTP
:)
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|