Does Java have a method to take a string with accented characters and
convert it to unaccented characters? I want to search a big string for a
test string, ignoring accents on characters.
Ignoring case in the same way is simple:
    String actualTestString = testString.toLowerCase();
    String actualBigString = bigString.toLowerCase();
    if (actualBigString.lastIndexOf(actualTestString) >= 0)
    {
        // do stuff
    }
In the Collator class I see a way of checking if two strings are equivalent,
disregarding both case and accents:
    Collator c = Collator.getInstance();
    c.setStrength(Collator.PRIMARY); // ignore both case and accents
    if (c.compare(oneString, otherString) == 0)
    {
        // do stuff
    }
However, I don't see a way of reducing the accented string to a simpler
string so I could search in a bigger string using a "toUnaccentedForm"
method instead of the toLowerCase method in the code above.
Is there a built-in method like "toUnaccentedForm" or some other approach
simpler than writing one's own version of lastIndexOf to ignore accents?
Oliver Wong - 15 Dec 2005 22:58 GMT
> Does Java have a method to take a string with accented characters and
> convert it to unaccented characters? I want to search a big string for a
[quoted text clipped - 25 lines]
> Is there a built-in method like "toUnaccentedForm" or some other approach
> simpler than writing one's own version of lastIndexOf to ignore accents?
AFAIK, there is no built-in "toUnaccentedForm()". What you can do that
might be less painful than implementing your own lastIndexOf() is to build a
Map of characters that goes from the accented version to the unaccented
version, then transform your string using that map, and THEN do the
comparison.
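A minimal sketch of the Map approach described above. The class name AccentMap, the method names (toUnaccentedForm is borrowed from the question's wording), and the handful of characters mapped are illustrative assumptions, not a complete table.

```java
import java.util.HashMap;
import java.util.Map;

class AccentMap {
    // Hypothetical, deliberately incomplete table: accented char -> plain char.
    private static final Map<Character, Character> UNACCENTED = new HashMap<>();
    static {
        UNACCENTED.put('\u00e0', 'a'); // a with grave
        UNACCENTED.put('\u00e1', 'a'); // a with acute
        UNACCENTED.put('\u00e2', 'a'); // a with circumflex
        UNACCENTED.put('\u00e4', 'a'); // a with diaeresis
        UNACCENTED.put('\u00e7', 'c'); // c with cedilla
        UNACCENTED.put('\u00e8', 'e'); // e with grave
        UNACCENTED.put('\u00e9', 'e'); // e with acute
        UNACCENTED.put('\u00ea', 'e'); // e with circumflex
        UNACCENTED.put('\u00f1', 'n'); // n with tilde
        UNACCENTED.put('\u00f6', 'o'); // o with diaeresis
        UNACCENTED.put('\u00fc', 'u'); // u with diaeresis
    }

    // Replaces every char found in the table; all other chars pass through.
    static String toUnaccentedForm(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            Character plain = UNACCENTED.get(c);
            sb.append(plain != null ? plain.charValue() : c);
        }
        return sb.toString();
    }

    // Same shape as the toLowerCase() search in the question, with the
    // accent mapping applied after lower-casing so one table covers both cases.
    static int lastIndexOfIgnoringCaseAndAccents(String bigString, String testString) {
        String actualBigString = toUnaccentedForm(bigString.toLowerCase());
        String actualTestString = toUnaccentedForm(testString.toLowerCase());
        return actualBigString.lastIndexOf(actualTestString);
    }

    public static void main(String[] args) {
        // "Le Cafe du coin" with an acute accent on the e of "Cafe"
        System.out.println(lastIndexOfIgnoringCaseAndAccents("Le Caf\u00e9 du coin", "cafe")); // prints 3
    }
}
```

Because each char maps to exactly one char, the index returned lines up with the original string, as long as lower-casing does not change the string's length.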
- Oliver
Mickey Segal - 16 Dec 2005 00:22 GMT
> AFAIK, there is no built-in "toUnaccentedForm()". What you can do that
> might be less painful than implementing your own lastIndexOf() is to build
> a Map of characters that goes from the accented version to the unaccented
> version, then transform your string using that map, and THEN do the
> comparison.
I came to the same conclusion, mapping the 10 non-standard lower-case
characters likely to come up in our database. Since I was also using
toLowerCase, this also covered the upper-case forms.
I also fiddled around with writing my own equivalent of lastIndexOf() using
CollationElementIterator after finding an example at
http://icu.sourceforge.net/docs/papers/efficient_text_searching_in_java.html.
However, in the real world that approach turned out to be painfully slow
when searching 1000 strings. In contrast, the approach of mapping 10
characters was very fast because the characters are very rare in our
database so the handling of accented characters did not slow down the
program much.
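For comparison, here is a rough sketch of a collation-based search. It is not the CollationElementIterator code from the linked paper; it is a simpler stand-in that slides a window over the big string and compares each window with a PRIMARY-strength Collator, as in the question. The class and method names are hypothetical, Locale.ENGLISH is an arbitrary choice, and the sketch assumes precomposed accented characters, so that a match has the same length as the test string.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorSearch {
    // Returns the last index at which a window of bigString collates as
    // equal to testString at PRIMARY strength (case and accents ignored),
    // or -1 if there is no such window.
    static int lastIndexOfIgnoringAccents(String bigString, String testString) {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        c.setStrength(Collator.PRIMARY);
        int n = testString.length();
        for (int start = bigString.length() - n; start >= 0; start--) {
            // Assumption: the match occupies exactly n chars in bigString.
            String window = bigString.substring(start, start + n);
            if (c.compare(window, testString) == 0) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(lastIndexOfIgnoringAccents("xxcaf\u00e9yy", "CAFE"));
    }
}
```

Every window allocates a substring and is collated from scratch, so the work grows with the product of the two string lengths. That may hint at why collation-based searching can end up far slower than mapping a few characters and calling lastIndexOf.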
Roedy Green - 16 Dec 2005 03:34 GMT
On Thu, 15 Dec 2005 14:57:53 -0500, "Mickey Segal"
<not_monitored@example.com> wrote, quoted or indirectly quoted someone
who said :
>Does Java have a method to take a string with accented characters and
>convert it to unaccented characters? I want to search a big string for a
>test string, ignoring accents on characters.
There is one in Abundance, but I don't think I have seen one in Java.
The way you implement it is with a translate table. You index by
accented char to get unaccented. You might just implement it for low
numbered chars.
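A minimal sketch of such a translate table, limited to chars below 256 (the Latin-1 range). The class name, the helper names, and the choice of which characters to fold are assumptions made for illustration; characters such as the ligatures and sharp s are left alone because they have no single-char replacement.

```java
class TranslateTable {
    // One slot per char below 256; each slot starts out mapping to itself.
    private static final char[] TABLE = new char[256];
    static {
        for (int i = 0; i < TABLE.length; i++) {
            TABLE[i] = (char) i;
        }
        fill("\u00c0\u00c1\u00c2\u00c3\u00c4\u00c5", 'A');
        fill("\u00e0\u00e1\u00e2\u00e3\u00e4\u00e5", 'a');
        fill("\u00c7", 'C');
        fill("\u00e7", 'c');
        fill("\u00c8\u00c9\u00ca\u00cb", 'E');
        fill("\u00e8\u00e9\u00ea\u00eb", 'e');
        fill("\u00cc\u00cd\u00ce\u00cf", 'I');
        fill("\u00ec\u00ed\u00ee\u00ef", 'i');
        fill("\u00d1", 'N');
        fill("\u00f1", 'n');
        fill("\u00d2\u00d3\u00d4\u00d5\u00d6", 'O');
        fill("\u00f2\u00f3\u00f4\u00f5\u00f6", 'o');
        fill("\u00d9\u00da\u00db\u00dc", 'U');
        fill("\u00f9\u00fa\u00fb\u00fc", 'u');
        fill("\u00dd", 'Y');
        fill("\u00fd\u00ff", 'y');
    }

    // Points every char in accented at the same unaccented replacement.
    private static void fill(String accented, char plain) {
        for (int i = 0; i < accented.length(); i++) {
            TABLE[accented.charAt(i)] = plain;
        }
    }

    // Indexes the table by char; chars outside the table are left unchanged.
    static String strip(String s) {
        char[] out = s.toCharArray();
        for (int i = 0; i < out.length; i++) {
            if (out[i] < TABLE.length) {
                out[i] = TABLE[out[i]];
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(strip("gar\u00e7on")); // prints garcon
    }
}
```

An array lookup avoids boxing and hashing, at the cost of covering only the low-numbered chars the table was built for.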
