Java Forum / General / February 2007
How to check variables for uniqueness ?
krislioe@gmail.com - 21 Dec 2006 05:03 GMT
Hi all,
I have eight variables: var1, var2, ... var8, all of type String. How can I check that each variable has a unique value?
Thank you for your help, xtanto
Andrew Thompson - 21 Dec 2006 05:08 GMT krisl...@gmail.com wrote: ...
> I have eight variables : var1, var2... var 8. All types String. > How to check that each variables has unique values ? One way would be to create a Map, iterate the var's and if not present in the map, add the value as a key, else return false.
Andrew T.
Patricia Shanahan - 21 Dec 2006 06:18 GMT > krisl...@gmail.com wrote: > ... [quoted text clipped - 6 lines] > > Andrew T. Any particular reason for Map, rather than Set?
Note that the result of a Set add call is true if, and only if, the value is not already in the Set.
Patricia
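A minimal sketch of the Set-based check described above, relying only on the documented return value of Set.add. The class and method names (UniqueCheck, allUnique) are made up for illustration and do not come from the thread.

```java
import java.util.HashSet;
import java.util.Set;

class UniqueCheck {
    // Returns true only if no two arguments are equal according to equals().
    static boolean allUnique(String... values) {
        Set<String> seen = new HashSet<String>();
        for (String v : values) {
            // add() returns false when an equal element is already in the set
            if (!seen.add(v)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allUnique("a", "b", "c"));
        System.out.println(allUnique("a", "b", "a"));
    }
}
```

Stopping at the first false result avoids adding the remaining values, which is a small saving for eight strings but keeps the intent obvious.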
Andrew Thompson - 21 Dec 2006 06:25 GMT > > krisl...@gmail.com wrote: > > ... [quoted text clipped - 4 lines] > > var's and if not present in the map, add the value > > as a key, else return false. ...
> Any particular reason for Map, rather than Set? You mean besides, 'lack of enough consultation of the relevant docs.'? ;-)
> Note that the result of a Set add call is true if, and only if, the > value is not already in the Set. A Set sounds the go - it is just right for this task.
Andrew T.
John Ersatznom - 21 Dec 2006 06:33 GMT
>>>krisl...@gmail.com wrote:
>>>... [quoted text clipped - 17 lines]
>
> A Set sounds the go - it is just right for this task.
HashSet<String> foo = new HashSet<String>();
foo.add(var1);
foo.add(var2);
foo.add(var3);
foo.add(var4);
foo.add(var5);
foo.add(var6);
foo.add(var7);
foo.add(var8);
if (foo.size() < 8)
    duplicateExists();
else
    duplicateDoesNotExist();
If you actually need to identify the specific duplicate pairs, you need to compare them one by one -- 1 with all the others, 2 with all the higher-numbered ones, and so on up to 7 and 8, using equals().
If you want case insensitivity, use e.g.
foo.add(var3.toLowerCase());
or equalsIgnoreCase().
Patricia Shanahan - 21 Dec 2006 11:30 GMT
>>>> krisl...@gmail.com wrote:
>>>> ... [quoted text clipped - 35 lines]
> to compare them one by one -- 1 with all the others, 2 with all the
> higher-numbered ones, and so on up to 7 and 8, using equals().
To save repetitious writing, I'm going to assume the strings are in an array. The equivalent of your code would be:
HashSet<String> foo = new HashSet<String>();
for (String v : vars) {
    foo.add(v);
}
if (foo.size() < vars.length)
    duplicateExists();
else
    duplicateDoesNotExist();
You can simplify finding specific duplicates by checking the foo.add results:
HashSet<String> foo = new HashSet<String>();
for (int i = 0; i < vars.length; i++) {
    if (!foo.add(vars[i])) {
        // vars[i] was already in the set, so an earlier index holds an equal string
        for (int j = 0; j < i; j++) {
            if (vars[i].equals(vars[j])) {
                reportDuplicate(i, j);
            }
        }
    }
}
A true result from foo.add means the string was actually added to the set, so it has no duplicate with a lower index.
Patricia
Ed Kirwan - 21 Dec 2006 12:25 GMT
> You can simplify finding specific duplicates by checking the foo.add
> results:
[quoted text clipped - 14 lines]
>
> Patricia
Perhaps using a List would obviate the need for the nested loop?
List list = new ArrayList();
for (int i = 0, n = vars.length; i < n; i++) {
    // indexOf returns the position of the first equal element, or -1
    int duplicateIndex = list.indexOf(vars[i]);
    if (duplicateIndex != -1) {
        reportDuplicate(i, duplicateIndex);
    } else {
        list.add(vars[i]);
    }
}
.ed
 Signature www.EdmundKirwan.com - Home of The Fractal Class Composition.
Download Fractality, free Java code analyzer: www.EdmundKirwan.com/servlet/fractal/frac-page130.html
Remon van Vliet - 21 Dec 2006 13:28 GMT
> Perhaps using a List would obviate the need for the nest loop?
> [quoted text clipped - 9 lines]
>
> .ed
The nested loop is only needed to allow reporting of a specific duplicate pair. I cannot think of many practical examples where that is required rather than simply reporting that the element to be added is a duplicate. If it is required then I'd say you're right, using a List does result in slightly more readable code.
That said, if the collection must not contain duplicate elements then at least from a design and correctness perspective you should use a Set. I'd personally do so even if that decision would result in a few extra lines of code here and there.
Remon
Hemal Pandya - 22 Dec 2006 05:45 GMT
[...]
> Perhaps using a List would obviate the need for the nest loop?
It will, but it will be a lot more expensive. You can use a Map<String,Integer> to both avoid the nested loop and report indexes. Yes, it will take more memory.
[....]
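The Map<String,Integer> idea can be sketched roughly as follows. This is one possible reading of the suggestion, not code from the thread: the class name DuplicateIndexes and the method findDuplicates are hypothetical, and each duplicate is paired only with the first index at which its value appeared, rather than with every earlier match.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DuplicateIndexes {
    // Returns {laterIndex, firstIndex} pairs for every value that was seen before.
    static List<int[]> findDuplicates(String[] vars) {
        Map<String, Integer> firstSeenAt = new HashMap<String, Integer>();
        List<int[]> pairs = new ArrayList<int[]>();
        for (int i = 0; i < vars.length; i++) {
            Integer first = firstSeenAt.get(vars[i]);
            if (first == null) {
                firstSeenAt.put(vars[i], i);         // remember where the value first appeared
            } else {
                pairs.add(new int[] { i, first });   // single hash lookup, no inner loop
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String[] vars = { "a", "b", "a", "c", "b", "a" };
        for (int[] p : findDuplicates(vars)) {
            System.out.println(p[0] + " duplicates " + p[1]);
        }
    }
}
```

The extra memory mentioned above is the boxed Integer stored per distinct string.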
Patricia Shanahan - 22 Dec 2006 06:03 GMT
Hemal Pandya wrote:
> [...]
>> Perhaps using a List would obviate the need for the nest loop?
Note that I did NOT write that.
> It will, but will be a lot more expensive. Use can use a
> Map<String,Integer> to both avoid nested loop and report indexes. Yes,
> it will take more memory.
>
> [....]
Hemal Pandya - 22 Dec 2006 06:46 GMT
[....]
> Note that I did NOT write that. No, you did not. Your lines would have had one more '>' at the beginning-of-line. I apologize if I caused confusion.
Ed - 30 Dec 2006 16:15 GMT Hemal Pandya skrev:
> [...] > > Perhaps using a List would obviate the need for the nest loop? > > It will, but will be a lot more expensive. > [....] Thanks for that tip, Hemal. I had no idea that Set-implementations were so much more efficient (in this case) than List-implementations. The output from the (no-doubt indent-mashed) code below gives:
522393 duplicated words. Using java.util.HashSet, time = 678ms.
522393 duplicated words. Using java.util.TreeSet, time = 1812ms.
522393 duplicated words. Using java.util.ArrayList, time = 157724ms.
522393 duplicated words. Using java.util.LinkedList, time = 251739ms.
import java.util.*;
import java.io.*;

class Test {
    private static String TEXT_BOOK_NAME = "war-and-peace.txt";

    public static void main(String[] args) {
        try {
            String text = readText(); // Read text into RAM
            countDuplicateWords(text, new HashSet());
            countDuplicateWords(text, new TreeSet());
            countDuplicateWords(text, new ArrayList());
            countDuplicateWords(text, new LinkedList());
        } catch (Throwable t) {
            System.out.println(t.toString());
        }
    }

    private static String readText() throws Throwable {
        BufferedReader reader =
            new BufferedReader(new FileReader(TEXT_BOOK_NAME));
        String line = null;
        StringBuffer text = new StringBuffer();
        while ((line = reader.readLine()) != null) {
            text.append(line + " ");
        }
        return text.toString();
    }

    private static void countDuplicateWords(String text,
                                            Collection listOfWords) {
        int numDuplicatedWords = 0;
        long startTime = System.currentTimeMillis();
        for (StringTokenizer i = new StringTokenizer(text);
             i.hasMoreElements();) {
            String word = i.nextToken();
            if (listOfWords.contains(word)) {
                numDuplicatedWords++;
            } else {
                listOfWords.add(word);
            }
        }
        long endTime = System.currentTimeMillis();
        System.out.println(numDuplicatedWords + " duplicated words. " +
                           "Using " + listOfWords.getClass().getName() +
                           ", time = " + (endTime - startTime) + "ms.");
    }
}
.ed
Lew - 30 Dec 2006 18:10 GMT > Hemal Pandya skrev: > [quoted text clipped - 60 lines] > } > } (Please do not embed TAB characters in newsgroup postings.)
You could use a HashMap if you wanted to know how many times each word occurred:
Map< String, Integer > concordance = new HashMap< String, Integer > ();
for ( StringTokenizer tok = new StringTokenizer(text); tok.hasMoreElements(); )
{
    String word = tok.nextToken();
    Integer kt = concordance.get( word );
    if ( kt == null )
    {
        // first occurrence counts as 1, so a value above 1 means the word was repeated
        concordance.put( word, Integer.valueOf( 1 ));
    }
    else
    {
        concordance.put( word, Integer.valueOf( kt.intValue() + 1 ));
    }
}
then get total dupes by analyzing the concordance:
int totalDupes = 0;
for ( Map.Entry< String, Integer > entry : concordance.entrySet() )
{
    if ( entry.getValue().intValue() > 1 )
    {
        ++totalDupes;
    }
}
- Lew
Ed - 30 Dec 2006 22:32 GMT Lew skrev:
> (Please do not embed TAB characters in newsgroup postings.) > > You could use a HashMap if you wanted to know how many times each word occurred: snip
> - Lew Indeed.
And in case anyone's interested, here are the times for HashMap. Looks like Map is in the league of Set, and not the slow-moving List. (These times are longer than the previous times because of current CPU loading; relativity is the key.)
522393 duplicated words. Using java.util.HashSet, time = 789ms.
522393 duplicated words. Using java.util.TreeSet, time = 2168ms.
522393 duplicated words. Using Map , time = 1180ms.
522393 duplicated words. Using java.util.ArrayList, time = 183795ms.
522393 duplicated words. Using java.util.LinkedList, time = 274781ms.
Apologies to Patricia: I see I mis-attributed her post, yet again. And Lew, I've now become fast friends with Linux's expand(). Let's see whether I purged those nasty TABs:
import java.util.*;
import java.io.*;

class Test {
    private static String TEXT_BOOK_NAME = "war-and-peace.txt";

    public static void main(String[] args) {
        try {
            String text = readText(); // Read text into RAM
            countDuplicateWords(text, new HashSet());
            countDuplicateWords(text, new TreeSet());
            countDuplicateWordsMap(text);
            countDuplicateWords(text, new ArrayList());
            countDuplicateWords(text, new LinkedList());
        } catch (Throwable t) {
            System.out.println(t.toString());
        }
    }

    private static String readText() throws Throwable {
        BufferedReader reader =
            new BufferedReader(new FileReader(TEXT_BOOK_NAME));
        String line = null;
        StringBuffer text = new StringBuffer();
        while ((line = reader.readLine()) != null) {
            text.append(line + " ");
        }
        return text.toString();
    }

    private static void countDuplicateWords(String text,
                                            Collection listOfWords) {
        int numDuplicatedWords = 0;
        long startTime = System.currentTimeMillis();
        for (StringTokenizer i = new StringTokenizer(text);
             i.hasMoreElements();) {
            String word = i.nextToken();
            if (listOfWords.contains(word)) {
                numDuplicatedWords++;
            } else {
                listOfWords.add(word);
            }
        }
        long endTime = System.currentTimeMillis();
        System.out.println(numDuplicatedWords + " duplicated words. " +
                           "Using " + listOfWords.getClass().getName() +
                           ", time = " + (endTime - startTime) + "ms.");
    }

    private static void countDuplicateWordsMap(String text) {
        int numDuplicatedWords = 0;
        Map wordsToFrequency = new HashMap();
        long startTime = System.currentTimeMillis();
        for (StringTokenizer i = new StringTokenizer(text);
             i.hasMoreElements();) {
            String word = i.nextToken();
            Integer frequency = (Integer)wordsToFrequency.get(word);
            if (frequency == null) {
                wordsToFrequency.put(word, new Integer(0));
            } else {
                int value = frequency.intValue();
                wordsToFrequency.put(word, new Integer(value + 1));
                numDuplicatedWords++;
            }
        }
        long endTime = System.currentTimeMillis();
        System.out.println(numDuplicatedWords + " duplicated words. " +
                           "Using Map " +
                           ", time = " + (endTime - startTime) + "ms.");
    }
}
.ed
Lew - 31 Dec 2006 13:55 GMT > And in case anyone's interested, here are the times for HashMap. Looks > like Map is in the league of Set, and not the slow-moving List. (These [quoted text clipped - 6 lines] > 522393 duplicated words. Using java.util.ArrayList, time = 183795ms. > 522393 duplicated words. Using java.util.LinkedList, time = 274781ms. These times are extremely interesting.
I speculate that the greater part of the difference between HashMap and HashSet would be the second loop through the Map. Note that though the Map was slightly slower than the Set, it delivers more information. With the Set you only knew how many words were duplicated; with the Map you can also figure out which words were, and how many times each one occurred.
You could, for example, use the Map to deliver the words in order of frequency, given the right comparator over the entry set.
- Lew
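As a rough illustration of ordering words by frequency with a comparator over the entry set, the sketch below copies the entries into a list and sorts it. The class name FrequencyOrder, the method byFrequency and the alphabetical tie-break are assumptions added for the example; the thread does not specify them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FrequencyOrder {
    // Returns the map's entries sorted by descending count, ties broken alphabetically.
    static List<Map.Entry<String, Integer>> byFrequency(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                int byCount = b.getValue().compareTo(a.getValue()); // higher counts first
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            }
        });
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        counts.put("the", 3);
        counts.put("and", 2);
        counts.put("cat", 1);
        for (Map.Entry<String, Integer> e : byFrequency(counts)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

The sort works on a copy, so the map itself is left untouched; the entries should not be used after the map is modified.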
John Ersatznom - 04 Jan 2007 11:08 GMT >> And in case anyone's interested, here are the times for HashMap. Looks >> like Map is in the league of Set, and not the slow-moving List. (These [quoted text clipped - 17 lines] > You could, for example, use the Map to deliver the words in order of > frequency, given the right comparator over the entry set. A lot of the Map slowness is probably the churn of Integer objects created. Using an int[1] as a "mutable Integer" would work far better (although mutable objects in collections is normally bad, mutable values in a map isn't generally a problem, so long as you don't have mutable keys).
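A sketch of the int[1] "mutable Integer" idea, assuming whitespace-separated words as in the earlier benchmark. The names MutableCounts and countWords are hypothetical, and no timing claim is made here: whether this actually beats boxed Integers would have to be measured.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

class MutableCounts {
    // Counts occurrences of each word; every map value is a one-element int array.
    static Map<String, int[]> countWords(String text) {
        Map<String, int[]> counts = new HashMap<String, int[]>();
        StringTokenizer tok = new StringTokenizer(text);
        while (tok.hasMoreTokens()) {
            String word = tok.nextToken();
            int[] cell = counts.get(word);
            if (cell == null) {
                counts.put(word, new int[] { 1 }); // one allocation per distinct word
            } else {
                cell[0]++;                         // update in place, no new object, no second put
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, int[]> counts = countWords("a b a a");
        System.out.println(counts.get("a")[0]);
        System.out.println(counts.get("b")[0]);
    }
}
```

Only the values are mutated, never the String keys, so the map's hashing is unaffected.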
On the subject of tabs, my copy of Thunderbird seems to be quietly converting tabs into spaces, though I can't find the setting for it. Posts apparently originally containing tabs (e.g. Ed's earlier) have spaces when I view them, and my own posts written with tabs don't make you complain. :) The curious thing is that incoming posts seem to have tab->4 spaces and the editor shows tabs as 8 spaces, but they become 4 in the actual sent posting...and none of the options in Thunderbird say anything about conversion of tabs at all, either to set their displayed width or to actually change tabs to certain numbers of spaces. Hrm. The "online help" doesn't open a help window, but rather hijacks my open Firefox window, and the search there is useless on this topic too...
Oliver Wong - 21 Dec 2006 21:09 GMT > If you want case insensitivity, use e.g. > > foo.add(var3.toLowerCase()); This might not actually work, because of the fickleness of certain human languages.
> or equalsIgnoreCase(). Yeah, I'd essentially wrap the String in a custom class which overrides equals to call equalsIgnoreCase, and give that to the Set.
- Oliver
John Ersatznom - 22 Dec 2006 09:37 GMT >>If you want case insensitivity, use e.g. >> >>foo.add(var3.toLowerCase()); > > This might not actually work, because of the fickleness of certain human > languages. ?
> Yeah, I'd essentially wrap the String in a custom class which overrides > equals to call equalsIgnoreCase, and give that to the Set. What is obviously missing from java.util is an Equalizer:
public interface Equalizer<T> {
    public boolean areEqual (T foo, T bar);
    public int getHash (T foo); // int, to match what Object.hashCode() returns
}
and the ability to pass these to collection constructors to use, the way those that use order comparison can already be handed a custom comparator.
Problems caused by comparators not consistent with an object's equals method could be avoided by supplying an Equalizer that is consistent with the comparator, as well as it obviating the need you perceive to wrap the String class. (Either way, by the way, you need to replace hashCode() with a case-insensitive version too, or you'll have strings that compare equal and have different hash codes, at least potentially. That at least can't happen if you use add(var.toFooCase()) or similar.)
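For the sorted collections that already accept a comparator, one way to get a case-insensitive uniqueness check without a wrapper class is a TreeSet built with String.CASE_INSENSITIVE_ORDER. This is only a sketch: the names CaseInsensitiveUnique and allUniqueIgnoringCase are invented for the example, and that comparator is documented as not taking locale into account, so it shares the limits of equalsIgnoreCase.

```java
import java.util.Set;
import java.util.TreeSet;

class CaseInsensitiveUnique {
    // Returns true only if no two arguments are equal when case is ignored.
    static boolean allUniqueIgnoringCase(String... values) {
        // TreeSet decides equality with the comparator, so hashCode() is never consulted
        Set<String> seen = new TreeSet<String>(String.CASE_INSENSITIVE_ORDER);
        for (String v : values) {
            if (!seen.add(v)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allUniqueIgnoringCase("Foo", "bar"));
        System.out.println(allUniqueIgnoringCase("Foo", "FOO"));
    }
}
```

A null argument would make the comparator throw a NullPointerException, so callers would need to filter nulls first.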
Oliver Wong - 22 Dec 2006 15:33 GMT >>>If you want case insensitivity, use e.g. >>> [quoted text clipped - 4 lines] > > ? I'm not a linguist, so this may be linguistically incorrect, but it illustrates the type of problems you can run into:
assert locale is German; //pseudocode
assert "BEISSEN".toLowerCase().equals("beissen");
assert "BEISSEN".toLowerCase().equals("beißen");
- Oliver
John Ersatznom - 23 Dec 2006 13:14 GMT >>>>If you want case insensitivity, use e.g. >>>> [quoted text clipped - 11 lines] > assert "BEISSEN".toLowerCase().equals("beissen"); > assert "BEISSEN".toLowerCase().equals("beißen"); Yeah, and assert "Color".toLowerCase().equals("Colour".toLowerCase()). Whenever there's multiple legitimate spellings for the same word, there's going to be trouble if you try to make the computer "smart enough" to treat them as equal.
Mind you, there ARE lexicographical "distance" measures that are useful for "fuzzy-matching", such as spell-checker "suggestions" use. (Google now suggests an alternate if it thinks you've misspelled a query term, for example.) But you can't use those as an equality test, since they don't define an equivalence relation -- they aren't transitive, since you can have a.isCloseTo(b), a.isCloseTo(c), and !b.isCloseTo(c) (e.g. where the distance is 1 from c to a, 1 from a to b, and 2 from c to b, and 1 is the threshold). Even a threshold of 1 is too high if the result is not only to equate "color" with "colour" but also with "colon". :)
Best to treat distinct spellings as distinct, and perhaps use a fuzzy-match "suggested alternative" if users enter a query with no results, e.g. if a search for "beissen" comes up empty.
Of course, if you really want to drive yourself mad, try to program the computer to identify when two different input strings identify the same thing in general. Good luck having it compare e.g. "Carrie-Anne Moss" and "Lead actress in The Matrix" as equal. Sure, go ahead, you'll even solve the NLP while you're at it so you should become rich and famous. If you succeed. :)
Of course, all this arose in the context of "foo.equalsIgnoreCase(bar)" vs. "foo.toLowerCase().equals(bar.toLowerCase())". Those *should* be equal; both should be transforming words into a canonical representation. Or else there should be another toFoo() method that returns a canonical representation that compares equal for words that compare equalsIgnoreCase, because the usefulness of having such a representation to use as a key in a hashmap is obvious.
Oliver Wong - 27 Dec 2006 17:39 GMT
>>>>>If you want case insensitivity, use e.g.
>>>>>
[quoted text clipped - 13 lines]
>
> Yeah, and assert "Color".toLowerCase().equals("Colour".toLowerCase()).
{
    String originalA = "color";
    a = originalA; // "color"
    a = a.toUppercase(); // "COLOR"
    a = a.toLowercase(); // "color"
    assert a.equals(originalA);
}
{
    String originalA = "beißen";
    a = originalA; // "beißen"
    a = a.toUppercase(); // "BEISSEN"
    a = a.toLowercase(); // "beissen"
    assert a.equals(originalA);
}
- Oliver
John Ersatznom - 29 Dec 2006 15:05 GMT >>>assert locale is German; //pseudcode >>>assert "BEISSEN".toLowerCase().equals("beissen"); [quoted text clipped - 9 lines] > assert a.equals(originalA); > } I don't see "colour" (with a U) in there anywhere, Oliver.
Oliver Wong - 29 Dec 2006 15:28 GMT >>>>assert locale is German; //pseudcode >>>>assert "BEISSEN".toLowerCase().equals("beissen"); [quoted text clipped - 11 lines] > > I don't see "colour" (with a U) in there anywhere, Oliver. You weren't intended to.
- Oliver
John Ersatznom - 04 Jan 2007 11:11 GMT
>>>>>assert locale is German; //pseudcode
>>>>>assert "BEISSEN".toLowerCase().equals("beissen");
[quoted text clipped - 13 lines]
>
> You weren't intended to.
Then you're missing the point entirely. "COLOR" and "colour" differ only by capitalization while "beissen" and "beißen" differ by spelling in a manner similar to "color" vs. "colour". Alternate spellings of the same word can't in general be identified as identical by a computer -- not without a trip through a spellchecking dictionary or the like, anyway. I think you may be expecting too much of Java's humble string classes. Perhaps Collator is smart enough for you?
Andrew Thompson - 04 Jan 2007 11:25 GMT ...
> Then you're missing the point entirely. "COLOR" and "colour" differ only > by capitalization .. As well as the 'u' in the second word. And from my vague recollections of this thread (that I am not prepared to review at this instant) - a misunderstanding between the spelling observed, might actually explain this (sub) thread..(?)
Andrew T.
John Ersatznom - 05 Jan 2007 21:59 GMT > ... > >>Then you're missing the point entirely. "COLOR" and "colour" differ only >>by capitalization .. > > As well as the 'u' in the second word. That wasn't supposed to be there, though the later "color" vs "colour" (all lower case) is correct. :P I trust my meaning is still easy to glean.
Oliver Wong - 04 Jan 2007 21:26 GMT
>>>>>>assert locale is German; //pseudcode
>>>>>>assert "BEISSEN".toLowerCase().equals("beissen");
[quoted text clipped - 15 lines]
>
> Then you're missing the point entirely.
Must be, because I was under the impression I was making a point to you, as opposed to the other way around. I thought you were curious as to how manually doing case-insensitive conversions could fail, as opposed to using the built-in equalsIgnoreCase().
> "COLOR" and "colour" differ only by capitalization while "beissen" and > "beißen" differ by spelling in a manner similar to "color" vs. "colour". I disagree.
> Alternate spellings of the same word can't in general be idenfitied as > identical by a computer -- not without a trip through a spellchecking > dictionary or the like, anyway. I think you may be expecting too much of > Java's humble string classes. Perhaps Collator is smart enough for you? You should take the code I posted and put it in your favorite IDE, fix the compile errors (apparently, it's toLowerCase, not toLowercase), and run it. You might find the results enlightening. If those results surprise you, add a few System.out.println(a) to see what's going on.
- Oliver
John Ersatznom - 06 Jan 2007 12:36 GMT >>>>I don't see "colour" (with a U) in there anywhere, Oliver. >>> [quoted text clipped - 6 lines] > manually doing case-insensitive conversions could fail, as opposed to using > the build in equalsIgnoreCase(). Both will fail when you want words spelled differently to compare equal, though Collator may have more smarts in that area.
>>"COLOR" and "colour" differ only by capitalization while "beissen" and >>"beißen" differ by spelling in a manner similar to "color" vs. "colour". > > I disagree. On what basis? The typo I made? It was meant to say "COLOR" and "color" differ only by capitalization while "beissen" and "beißen" differ by spelling in a manner similar to "color" vs. "colour".
In fact the analogy goes so far as for the number of letters in the latter two examples to differ by one in both cases, and for a two letter region in one to correspond to a single letter at the same place in the other in particular. And (presumably -- I don't know the German word(s)) they are in both cases variant spellings of a different word -- differing in more than just capitalization, but used interchangeably or as regional variants rather than having distinct meanings.
> You should take the code I posted and put it in your favorite IDE, fix > the compile errors (apparently, it's toLowerCase, not toLowercase), and run > it. It would have been nice if Sun had been consistent about their own capitalization. There's also Character.isWhitespace (in the same class! Note lowercase s) and System.arraycopy (note lowercase c), at minimum.
:P Maybe they need to implement an isCamelCase method (note second capital C)... :)
In any event, I suppose the real lesson here is that String (and friends) get you primitive ordering and comparisons, perhaps somewhat Anglocentric, and you need to use Collator and relatives for serious language-and-locale-sensitive comparisons. I don't know the extent to which even the latter will cope with variant spellings, mind you. There is also a where-do-you-draw-the-line issue -- from case to slight variations in the actual sequence of letters used on to more overt differences, as between "huge" and "giant" -- when should those be considered synonyms, and when different? -- and on until if you broaden your requirements enough solving the NLP seems to be a required component of any conforming implementation. :) Language has a fuzziness in it in actual human usage that computers have trouble with. It's curiously not unlike the problems that arose elsewhere here today with float and double comparisons. You can't rely usefully on == for the most part, and using Math.abs(x - y) < someThreshold gives an "equality" test that's more meaningful in some ways but is not transitive any more. Eventually linguistic equality loses transitivity too -- you can play all kinds of games of picking close synonyms of the previous word to grow a chain that can end in a fairly good approximation to an antonym for your starting word, in most any language, using either phonemic proximity or lexical proximity, and get different results with each besides.
The real upshot is simply "computers, at present, don't have the ability to really model things in linguistics". But they know about abstract sequences of discrete, wholly-distinct characters that happen to stand for graphical squiggles meaningful to humans.
Play to their strengths -- the computers' *and* the humans'. :)
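For readers who want to try the Collator route mentioned above, here is a small sketch using java.text.Collator at PRIMARY strength, which ignores case differences. The class and method names (CollatorDemo, sameAtPrimaryStrength) and the choice of Locale.ENGLISH are assumptions for the example. How a given locale's rules treat "ß" versus "ss" is deliberately not asserted here; that is worth checking on your own JDK.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorDemo {
    // Compares two strings with case (and accent) differences ignored.
    static boolean sameAtPrimaryStrength(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY); // only base-letter differences count
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(sameAtPrimaryStrength("COLOR", "color"));
        System.out.println(sameAtPrimaryStrength("color", "colour"));
    }
}
```

Even at its loosest strength the collator still treats "color" and "colour" as different, which matches the point that variant spellings are outside what these classes try to solve.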
Oliver Wong - 08 Jan 2007 19:17 GMT >> I thought you were curious as to how manually doing case-insensitive >> conversions could fail, as opposed to using the build in >> equalsIgnoreCase(). > > Both will fail when you want words spelled differently to compare equal, > though Collator may have more smarts in that area. I don't know how you define "fail" or "not fail" in this context, but the point that I'm trying to make is that the two methods do not give the same results and are thus not equivalent. Try running the example I provided earlier, or try this example:
{
    System.out.println("beißen".equalsIgnoreCase("BEISSEN"));
    System.out.println("beißen".toUpperCase().equals("BEISSEN"));
}
>>>"COLOR" and "colour" differ only by capitalization while "beissen" and
>>>"beißen" differ by spelling in a manner similar to "color" vs. "colour".
>>
>> I disagree.
>
> On what basis?
Replace "beißen" with "colour" and "BEISSEN" with "COLOR", and you will get different results, thus showing that the difference between "COLOR" and "colour" is not of the same nature as that between "beißen" and "BEISSEN".
- Oliver
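The two comparisons under discussion can be put side by side in a self-contained class. This is a sketch, not code from the thread: the names CaseDemo, viaIgnoreCase and viaUpperCase are made up, "ß" is written as the escape \u00df so the source file's encoding does not matter, and Locale.GERMAN is passed explicitly instead of relying on the default locale.

```java
import java.util.Locale;

class CaseDemo {
    // Character-by-character comparison; strings of different length never match.
    static boolean viaIgnoreCase(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    // Full-string case conversion; may change the length ("\u00df" becomes "SS").
    static boolean viaUpperCase(String a, String b) {
        return a.toUpperCase(Locale.GERMAN).equals(b.toUpperCase(Locale.GERMAN));
    }

    public static void main(String[] args) {
        String lower = "bei\u00dfen";
        System.out.println(lower.toUpperCase(Locale.GERMAN));   // BEISSEN
        System.out.println(viaIgnoreCase(lower, "BEISSEN"));    // false
        System.out.println(viaUpperCase(lower, "BEISSEN"));     // true
        System.out.println(viaIgnoreCase("colour", "COLOR"));   // false
        System.out.println(viaUpperCase("colour", "COLOR"));    // false
    }
}
```

The "colour"/"COLOR" pair fails both tests, while the "ß" pair passes one and fails the other, which is the asymmetry the post points at.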
John Ersatznom - 08 Jan 2007 23:23 GMT >>>I thought you were curious as to how manually doing case-insensitive >>>conversions could fail, as opposed to using the build in [quoted text clipped - 24 lines] > and "colour" is not of the same nature as that between "beißen" and > "BEISSEN". This may show that it is "not of the same nature" as defined by certain Java library functions, but I don't see how this is really meaningful to people, except insofar as it "means" that the standard library has a bug or at least a wart or misfeature of some kind. The "equalsIgnoreCase" method should ignore case, but not spelling. It shouldn't consider "color" equal to "colour" and it shouldn't consider "beißen" equal to "beissen" either. Why? Because those pairs differ by spelling and not just capitalization!
Lew - 09 Jan 2007 04:46 GMT Oliver Wong wrote:
>> Try this example: >> >> { >> System.out.println("beißen".equalsIgnoreCase("BEISSEN")); >> System.out.println("beißen".toUpperCase().equals("BEISSEN")); >> }
> ... The "equalsIgnoreCase" > method should ignore case, but not spelling. It shouldn't consider > ... "beißen" equal to "beissen" either. > Why? Because those pairs differ by spelling and not > just capitalization! That is how equalsIgnoreCase() works:
"beißen".equalsIgnoreCase("BEISSEN"): false
- Lew
John Ersatznom - 15 Jan 2007 14:11 GMT > That is how equalsIgnoreCase() works: > > "beißen".equalsIgnoreCase("BEISSEN"): false Well, then, either Wong is completely nuts, or we're using different JDK versions (1.6 here), or (seems least likely) toUpperCase actually alters the spelling of some words(!) rather than just changing a-z to A-Z (likewise accented equivalents) while leaving the rest alone.
Lew - 15 Jan 2007 15:19 GMT Lew wrote:
>> That is how equalsIgnoreCase() works: >> >> "beißen".equalsIgnoreCase("BEISSEN"): false
> Well, then, either Wong is completely nuts, The result agrees with Oliver's assertion.
> or we're using different JDK > versions (1.6 here), or (seems least likely) toUpperCase actually alters > the spelling of some words(!) rather than just changing a-z to A-Z > (likewise accented equivalents) while leaving the rest alone. AFAIK, toUpperCase() follows the socially-determined locale rules. What is the upper case of "beißen" in German? (E.g., what would a German newspaper do?)
- Lew
John Ersatznom - 16 Jan 2007 16:30 GMT > AFAIK, toUpperCase() follows the socially-determined locale rules. What > is the upper case of "beißen" in German? (E.g., what would a German > newspaper do?) Well, it certainly shouldn't actually use a different spelling. Would an American newspaper use "color" in article text but "COLOUR" in headlines? :)
Regardless, even if toUpperCase makes changes other than to case, even altering the number of symbols, what does toLowerCase do, and why isn't equalsIgnoreCase consistent with them? It should consider any two strings equal whose toUpperCase()s are equal as decided by equals() or whose toLowerCase()s are equal likewise, and extend this transitively as necessary. Otherwise, equalsIgnoreCase is really equalsIgnoreFoo and toLowerCase and toUpperCase are toLowerBar and toUpperBar -- the word "case" is not talking about the same thing in the one as it is in the other, and in at least one it isn't even talking about "case" at all, as that term is commonly understood. The methods should then be renamed to make it clear what they are really talking about -- at least the ones that aren't really talking about "case". In this case (no pun intended), that set apparently includes toUpperCase, which seems to make other transformations than capital letter substitution, and should maybe be named toAllCapsTitle, with a more logically-behaving toUpperCase also made available.
Ian Wilson - 16 Jan 2007 16:56 GMT
>> AFAIK, toUpperCase() follows the socially-determined locale rules.
>> What is the upper case of "beißen" in German? (E.g., what would a
>> German newspaper do?)
>
> Well, it certainly shouldn't actually use a different spelling.
AIUI, it has to, since there is no uppercase version of the lowercase ß ligature. The uppercase equivalent of the ß ligature "character" is the two characters SS.
> Would an American newspaper use "color" in article text but "COLOUR" > in headlines? :) They might use the 6 character a\uFB04uent in article text but the 8 character AFFLUENT in headlines.
Stefan Ram - 16 Jan 2007 18:15 GMT >AIUI, It has to since there is not un uppercase version of the lowercase >ß ligature. The uppercase equivalent of the ß ligature "character" is >the two characters SS. Yes.
There used to be another rule, requesting to use »SZ« instead, when »SS« would be ambiguous. For example,
»Das Rechnen mit Massen beherrschen« »DAS RECHNEN MIT MASSEN BEHERRSCHEN«
»TO BE PROFICIENT IN CALCULATIONS WITH INVOLVING MASSES«
»Das Rechnen mit Maßen beherrschen« »DAS RECHNEN MIT MASZEN BEHERRSCHEN«
»TO BE PROFICIENT IN CALCULATIONS WITH INVOLVING MEASURES«
The »amtliche Regelung« for the language to be used in public Schools now is specifying that the uppercase spelling of »ß« always is »SS«. But according to polls only 19 % of the population use this regulation - most of them should be teachers or pupils. Everyone out of school or contracts is free to choose the regulation he wants to adhere to. Therefore an unknown part of the population might use the SZ-rule, although it would be deemed wrong in a public school.
There are also some official regulations regarding telegraphy of the administration which demand to use »SZ« when »SS« might introduce ambiguity as of now. (According to a recent Usenet post, which I can not find now.)
A Usenet post from 1997 claims that »sz« is always used for »ß« in certain messages of the »Bundeswehr« (German Federal Armed Forces) and by news agencies. Another usenet posting claims that this spelling is to be used for labels in technical drawings of a certain company. So it still seems to be used when avoiding ambiguity matters.
Sometimes, the letter »B« is used, because it vaguely looks like »ß«. This is considered wrong, but for fun some people even use it in pronunciation, e.g., speaking of a »StraBe« (from »Straße« - »street«).
Stefan Ram - 16 Jan 2007 18:21 GMT >»TO BE PROFICIENT IN CALCULATIONS WITH INVOLVING MASSES« .replace( "WITH ", "" )
>Schools now is specifying that the uppercase spelling of »ß« .replace( "Sc", "sc" )
Sorry, I /have/ proof-read my post. But I only spot errors after posting it. (I do not dare to think of the error I still have not found or I made within this new post.)
Chris Uppal - 16 Jan 2007 20:06 GMT > There used to be another rule, requesting to use »SZ« instead, > when »SS« would be ambigous. For example, [...] Interesting.
And takes the complexity of case-mapping into an entirely different -- word and meaning sensitive -- direction.
It would be "nice" if we had similar rules in English. (Not too long ago our government was trying to introduce VAT on books, and there was a popular campaign opposed to it. The local branch of one bookshop had large "Don't Tax Reading" posters up everywhere, and I rather enjoyed the ambiguity since "Reading" (pronounced red-ing) was the name of the town....)
-- chris
John Ersatznom - 18 Jan 2007 22:31 GMT > They might use the 6 character a\uFB04uent in article text but the 8 > character AFFLUENT in headlines. ??
Encoding not apparently supported at my end, sorry.
Regardless, the correct way to go about doing things is to have the usual string representation (e.g. "affluent") under the hood, however it's actually rendered. Representation and presentation are *supposed* to be kept separate -- that's why we invented things like CSS, also.
Lew - 18 Jan 2007 23:37 GMT >> They might use the 6 character a\uFB04uent in article text but the 8 >> character AFFLUENT in headlines. [quoted text clipped - 7 lines] > it's actually rendered. Representation and presentation are *supposed* > to be kept separate -- that's why we invented things like CSS, also. The usual String representation of a word spelled with a ligature character will be with the ligature character in the spelling, not with the equivalent double-character pair. As has been stated multiple times in this thread, Java Strings have no native construct for "word", nor for "correct spelling", nor for what you or I think should happen. Rather, they have an adaptation of what the Unicode folks determined should happen.
The reasoning presented in this thread has convinced me that the shortcoming is not in toUpperCase() or toLowerCase() but in equalsIgnoreCase(), and not in a language's own practice of how to case-convert ligatures, as "ß" to "SS" or "SZ", but in the use of UTF-16 encoding internally.
While I agree with your statement that "[r]epresentation and presentation are *supposed* to be kept separate", it is clear that the representation of "ß" should be a character for "ß", and not for "ss" nor "sz". In this domain of discourse, the character represented as "ß" may be presented as "ß", or as "\u00DF", or "?", as the system warrants. The upper-case transformation of "ß" is represented by "SS". That's a fact in German, it's a fact in Unicode, and it's a fact in Java.
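To make the representation/presentation split concrete, here is a minimal sketch (not taken from any post in this thread) that encodes the single character U+00DF in three charsets. The class name SharpSBytes and the helper hex are made up for illustration; the charsets are the standard ones from java.nio.charset.StandardCharsets, which needs Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SharpSBytes {
    // Hypothetical helper: hex dump of a string's bytes in the given charset.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sharpS = "\u00DF"; // one char in the String, however it is shown
        System.out.println(hex(sharpS, StandardCharsets.ISO_8859_1)); // DF
        System.out.println(hex(sharpS, StandardCharsets.UTF_8));      // C39F
        // US-ASCII cannot encode U+00DF, so getBytes substitutes '?' (0x3F).
        System.out.println(hex(sharpS, StandardCharsets.US_ASCII));   // 3F
    }
}
```

The String itself holds the same character in all three cases; only the encoded presentation differs.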
So, in fact, what you describe as "the correct way to go about doing things" is, in fact, what is actually in reality happening. The "usual", in fact, the *correct* (within the limits of UTF-16) representation is what's "under the hood, however it's actually rendered". In fact.
- Lew
Lew - 18 Jan 2007 23:42 GMT >> They might use the 6 character a\uFB04uent in article text but the 8 >> character AFFLUENT in headlines. > > ??
> Encoding not apparently supported at my end, sorry. Ian's quoted snippet uses all 7-bit characters, so it is likely not an encoding issue on your end.
Ian's point was that the ligature character '\uFB04' would be upper-cased to "FF" even in the non-computer, social context of newspaper headlines. What Java does is an echo of that.
- Lew
Lew - 18 Jan 2007 23:43 GMT > Ian's point was that the ligature character '\uFB04' would be > upper-cased to "FF" even in the non-computer, social context of > newspaper headlines. What Java does is an echo of that. > > - Lew Er, "FFL".
Ian Wilson - 19 Jan 2007 10:17 GMT >> [American newspapers] might use the 6 character a\uFB04uent in article text but the 8 >> character AFFLUENT in headlines. > > ?? > > Encoding not apparently supported at my end, sorry. Apparently you are wrong.
The quoted text above is all encoded in ASCII. It wasn't intended to present an ffl ligature on your screen. It contains an ASCII representation of the Unicode code-point for an ffl ligature, in a form (\uXXXX) that should be familiar to readers of Java newsgroups.
My message had this encoding: Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Let's look at your headers ... Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Your headers also indicate you're using Thunderbird, as am I.
ISO-8859-1 is a superset of ASCII, so has no problems with the ASCII text of my message.
Your newsreader only needs to be ASCII compatible to display the six ASCII characters backslash u F B zero four.
The encodings I used ARE supported at your end, either in Thunderbird or in Java.
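For readers who want to see the headline example run, the following is a small sketch, assuming a current JDK. The class name LigatureUpper and the helper upper are illustrative only. Locale.ROOT is passed explicitly so the result does not depend on the machine's default locale.

```java
import java.util.Locale;

class LigatureUpper {
    // Hypothetical helper: locale-neutral upper-casing.
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String word = "a\uFB04uent";              // 'a', U+FB04 (ffl ligature), "uent"
        System.out.println(word.length());         // 6
        System.out.println(upper(word));           // AFFLUENT
        System.out.println(upper(word).length());  // 8
    }
}
```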
Chris Uppal - 15 Jan 2007 18:48 GMT > > That is how equalsIgnoreCase() works: > > > > "beißen".equalsIgnoreCase("BEISSEN"): false > > Well, then, either Wong is completely nuts, or we're using different JDK > versions (1.6 here), You mean you've tried this and found that your version gives different results ? I find that hard to believe unless it's a side effect of attempting to use non-ASCII characters in the input to javac. Try being explicit about using the Unicode character (well, UTF16 value).
public class Test {
    public static void main(String[] args) {
        System.out.println("bei\u00DFen -> " + "bei\u00DFen".toUpperCase());
        System.out.println("BEISSEN".equalsIgnoreCase("bei\u00DFen"));
        System.out.println("BEISSEN".equals("bei\u00DFen".toUpperCase()));

        // or equivalently, but using octal string escapes
        System.out.println("bei\337en -> " + "bei\337en".toUpperCase());
        System.out.println("BEISSEN".equalsIgnoreCase("bei\337en"));
        System.out.println("BEISSEN".equals("bei\337en".toUpperCase()));
    }
}
(Tested on 1.4.2, 1.5.0, and 1.6.0)
> or (seems least likely) toUpperCase actually alters > the spelling of some words(!) rather than just changing a-z to A-Z > (likewise accented equivalents) while leaving the rest alone. That sounds as if you /haven't/ actually tried it. (Nor read the documentation for String.toUpperCase() which expounds on this subject).
String.toUpperCase() does /not/ change the spelling of words (how could it, it doesn't know anything about words ?). What it does follow are the correct (insofar as the Unicode spec is correct) rules for mapping lowercase to uppercase. It produces the /same/ word with the /same/ spelling[*], but (naturally) a different representation. In this case the number of visually separable glyphs changes because the U+00DF character (LATIN SMALL LETTER SHARP S) is a ligature of two logical characters, long s and short s (U+017F and U+0073); there is no upper case ligature for that combination (compare fi and FI in English typography), so the correct uppercase version of those (logical) characters is the sequence SS. (At least that's the theory the Unicode people seem to be operating on -- they know more about it than me so I'm willing to believe them).
It is simply erroneous to expect String.toUpperCase() to map characters one-to-one in the way that English case mapping works. It can't, it isn't supposed to, and it doesn't...
String.equalsIgnoreCase(), on the other hand, is badly broken in that it does /not/ follow those rules. Or, since its behaviour is clearly documented, perhaps "broken" is too strong a term -- "badly misleading" might be preferred.
-- chris
[*] Arguably the concept "same spelling" is flawed in the context of Unicode case mapping.
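One way to get a comparison that agrees with toUpperCase() is sketched below. The class FullCaseEquals and its methods fold and equalsFullCase are hypothetical names, not JDK API, and upper-casing then lower-casing is only an approximation of Unicode case folding, so treat this as an illustration of the mismatch rather than a complete fix.

```java
import java.util.Locale;

class FullCaseEquals {
    // Assumption: upper-casing then lower-casing with Locale.ROOT is "good enough"
    // as a case-insensitive key; it is not the full Unicode case-folding algorithm.
    static String fold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    static boolean equalsFullCase(String a, String b) {
        return fold(a).equals(fold(b));
    }

    public static void main(String[] args) {
        String lower = "bei\u00DFen";
        // equalsIgnoreCase compares char by char, so the lengths already rule it out.
        System.out.println(lower.equalsIgnoreCase("BEISSEN")); // false
        // Comparing the fully case-mapped forms treats the two as equal.
        System.out.println(equalsFullCase(lower, "BEISSEN"));  // true
    }
}
```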
John Ersatznom - 16 Jan 2007 16:46 GMT > String.toUpperCase() does /not/ change the spelling of words (how could it, it > doesn't know anything about words ?). What it does follow are the correct [quoted text clipped - 8 lines] > seem to be operating on -- they know more about it than me so I'm willing to > believe them). This seems to be excessively technical when the matter under discussion is simply capitalizing strings. In any event, equalsIgnoreCase should collapse these "ligatures" of yours as well. Also, I don't notice "fi" and "FI" producing strange behavior myself -- even if the letters are often run together so the 'i' hasn't got a separate dot *when typeset*, this doesn't affect the representation of a string in a computer, only the visually displayed output (and then usually only when serious typesetting software is used). Likewise, it makes sense to represent any other logical sequence of characters in a sensible way under the hood, regardless of any rendering fanciness that is done when presenting them to the user.
> It is simply erroneous to expect String.toUpperCase() to map characters > one-to-one in the way that English case mapping works. I can't, it isn't > supposed to, and it doesn't... No, it is not erroneous to expect a method to do exactly and only what its name implies. It is erroneous, of course, to give a method a name that is misleading. If toUpperCase needs a lengthy documentation block explaining why its behavior is surprising, then it's a sure bet that it should not have been named that, since it's apparently really toUpperCaseAndDoesSomeExtraStuffToo.
> String.equalsIgnoreCase(), on the other hand, is badly broken in that it does > /not/ follow those rules. So you at least agree with me that it should be consistent with toUpperCase (and toLowerCase) -- all strings should have a single canonical toUpperCase, a single canonical toLowerCase, both should define equivalence classes on the mixed-case input strings, these should be the SAME equivalence class, and equalsIgnoreCase should implement and embody the corresponding equivalence relation.
> Or, since it's behaviour is clearly documented, > perhaps "broken" is too strong a term -- "badly misleading" might be preferred. It sounds like toUpperCase has a "badly misleading" name since it (supposedly) does transformations that go well beyond what is normally meant by everyday blokes by "to upper case", and the method name is supposed to be a reasonably meaningful capsule summary for everyday blokes of what the method does. If a method is supposed to do behavior that's surprising for any English speaker but not for a German speaker, maybe it should have a German rather than an English name? :) If it's supposed to do locale-dependent stuff, then it should have a version that accepts a Locale object. The version that doesn't shouldn't surprise English speakers; the version that does shouldn't surprise anyone familiar with its locale-specific behavior for the locale actually used. Having locale-dependent behavior invoked randomly without explicit use of Locale objects, and which furthermore doesn't use the system locale, is by itself a sign of a questionable design as well as a sure source of bugs and problems.
I've even encountered somewhere a notion that aString.length() is not even accurate in current Java versions if a string contains obscure characters. It suggests aString.<something using the obscure term "code point", apparently just Unicode-geek for "character"> as its replacement, while of course there's a ton of legacy code using length(). I don't suppose it occurred to them that the new fancy-whosit should have been a replacement length() implementation instead of some new name that doesn't suggest anything to do with the length of a string to someone who doesn't care about all the Unicode bells and whistles and just wants to process strings while remaining agnostic about what they are ultimately used for or contain? Those users will gravitate to length() (plus all that legacy code), not caring about the actual storage length of the internal representation but the length in characters of their data as a general rule. So there should be a length() method that returns the true length of the string, and if necessary a getSize() method that returns the representation's size in bytes or whatever in case someone needs such low level data. (If they persist strings as UTF-8 in a text format file that is parsed, or use serialization, then they don't.)
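The method being alluded to is String.codePointCount(int, int), available since Java 5. A short sketch of the difference follows; the class name CodePointLength and the helper codePoints are made up for the example, and the sample character U+1D50A was chosen only because it lies outside the Basic Multilingual Plane.

```java
class CodePointLength {
    // Hypothetical helper: number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a', then U+1D50A written as a surrogate pair, then 'b'
        String s = "a\uD835\uDD0Ab";
        System.out.println(s.length());     // 4 UTF-16 code units
        System.out.println(codePoints(s));  // 3 code points
    }
}
```

length() counts UTF-16 code units, which is what indexing methods such as charAt() work in, so the two numbers answer different questions.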
> [*] Arguably the concept "same spelling" is flawed in the context of Unicode > case mapping. A concept like "same spelling" can't be flawed. It's generally accepted that "color" and "colour" are the same word, but have different spellings, right? While "two" and "too" are different words spelled differently that sound the same, "tomato" and "tomato" are the same word spelled the same but pronounced differently, and "ant" (the bug) and "ant" (the build tool) are different words both spelled and pronounced the same.
Oliver Wong - 16 Jan 2007 17:45 GMT > This seems to be excessively technical when the matter under discussion is > simply capitalizing strings. The above sentence, as perceived by a linguist, is probably akin to the statement "This seems to be excessively technical, when the matter under discussion is simply not putting bugs into our software in the first place." stated by a pointy-haired boss, as perceived by a typical programmer.
[...]
>> It is simply erroneous to expect String.toUpperCase() to map characters >> one-to-one in the way that English case mapping works. I can't, it isn't >> supposed to, and it doesn't... > > No, it is not erroneous to expect a method to do exactly and only what its > name implies. Note that the name of the method is not "String.mapCharactersOneToOneInTheWayThatEnglishCaseMappingWorks()" but rather "String.toUpperCase()". Perhaps due to your limited exposure to languages (e.g. only English), you are unable to conceive of scenarios where converting a text from lower case to uppercase might not work in the same way that it does in English? That is why I gave an example to you, and repeatedly asked you not to simply take my word for it, but to run it yourself, to see what the results were.
> It is erroneous, of course, to give a method a name that is misleading. If > toUpperCase needs a lengthy documentation block explaining why its > behavior is surprising, then it's a sure bet that it should not have been > named that, since it's apparently really > toUpperCaseAndDoesSomeExtraStuffToo. I believe that having "ß".toUpperCase() yield "SS" is surprising only to those who are unfamiliar with the ß character. Probably to most German speakers, this behaviour is very non-surprising, and in fact, expected.
[...]
> It sounds like toUpperCase has a "badly misleading" name since it > (supposedly) does transformations that go well beyond what is normally > meant by everyday blokes by "to upper case", and the method name is > supposed to be a reasonably meaningful capsule summary for everyday blokes > of what the method does. I think "everyday blokes" are unqualified to have any expectations of what the concepts of upper case and lower case mean in an international setting. These blokes may have a good idea of what these concepts mean in their particular language, but unless they are linguists, they probably have no idea what these concepts might mean in other languages. Such blokes are probably unqualified to request that the linguists and the unicode consortium redefine their concept of uppercase and lowercase to suit said blokes.
Similarly, an everyday bloke might be surprised about the output of the following Java program:
<code>
public class Test {
    public static void main(String[] args) {
        System.out.println(0.1 + 0.2 == 0.3);
    }
}
</code>
But unless said bloke studied numerical computing, or at the very least, has an understanding of the binary representation system for numbers, said bloke is probably unqualified to request that the computer scientists and IEEE redefine floating point computation to suit said blokes.
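For anyone who has not run the program above: it prints false, because neither 0.1 nor 0.2 is exactly representable as a binary double. The sketch below shows the actual sum and one tolerance-based comparison; the class FloatCompare, the helper closeEnough and the tolerance 1e-9 are illustrative choices, not a general recommendation.

```java
class FloatCompare {
    // Hypothetical helper: absolute-tolerance comparison.
    static boolean closeEnough(double a, double b, double tolerance) {
        return Math.abs(a - b) < tolerance;
    }

    public static void main(String[] args) {
        double sum = 0.1 + 0.2;
        System.out.println(sum);                          // 0.30000000000000004
        System.out.println(sum == 0.3);                   // false
        System.out.println(closeEnough(sum, 0.3, 1e-9));  // true
    }
}
```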
> If a method is supposed to do behavior that's surprising for any English > speaker but not for a German speaker, maybe it should have a German rather > than an English name? :) I claim that there exists at least one English speaker for which its behaviour is not surprising (me).
> If it's supposed to do locale-dependent stuff, then it should have a > version that accepts a Locale object. It does. See the JavaDocs.
> The version that doesn't shouldn't surprise English speakers; It doesn't surprise me.
Are you basically saying that it should not surprise ANY English speaker? What if I had a cousin, "Surprised Sally" we call her, who is surprised at everything. And she's an English speaker. No matter what the implementation of toUpperCase is, it would surprise her.
Or are you basically saying that it should not surprise *you*? If so, then maybe you should apply for a position on the unicode consortium, so that when the next version of Unicode comes out (6.0?), perhaps you will have exerted enough influence on the standard such that toUpperCase will no longer surprise you.
> the version that does shouldn't surprise anyone familiar with its > locale-specific behavior for the locale actually used. Having > locale-dependent behavior invoked randomly without explicit use of Locale > objects, and which furthermore doesn't use the system locale, is by itself > a sign of a questionable design as well as a sure source of bugs and > problems. What locale were you using, and what did you expect the uppercase form of "ß" to be in that locale?
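On the locale question, the JDK documentation states that the no-argument toUpperCase() uses the default locale, and that toUpperCase(Locale) lets the caller pick one. The sketch below uses Turkish only as a well-known case where the rules differ; the class name LocaleUpper and the helper upper are illustrative.

```java
import java.util.Locale;

class LocaleUpper {
    // Hypothetical helper: upper-case under an explicitly chosen locale.
    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println(upper("title", Locale.ROOT)); // TITLE
        // In Turkish, lowercase i maps to dotted capital I (U+0130).
        System.out.println(upper("title", turkish).equals("T\u0130TLE")); // true
    }
}
```

Code that upper-cases identifiers or protocol keywords usually wants Locale.ROOT for exactly this reason.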
[...]
>> [*] Arguably the concept "same spelling" is flawed in the context of >> Unicode [quoted text clipped - 3 lines] > that "color" and "colour" are the same word, but have different spellings, > right? You'll have to define the terms "spelling" and "word" outside of the context of any one particular language (e.g. you can't assume only the Latin alphabet) before I can agree or disagree with your claim.
> While "two" and "too" are different words spelled differently that sound > the same, "tomato" and "tomato" are the same word spelled the same but > pronounced differently Ditto.
> and "ant" (the bug) and "ant" (the build tool) are different words both > spelled and pronounced the same. Could we possibly get a bigger hint? =P
- Oliver
Martin Gregorie - 16 Jan 2007 22:03 GMT > Similarly, an everyday bloke might be surprised about the output of the > following Java program: [quoted text clipped - 11 lines] > bloke is probably unqualified to request that the computer scientists and > IEEE redefine floating point computation to suit said blokes. An ordinary bloke might be surprised but any programmer who, in the last 40 years, would test equality that way rather than this:
if (Math.abs((0.1 + 0.2) - 0.3) < 0.05) {
    // 0.05 is an arbitrary constant: its value depends
    // on the value of the least significant digit in the
    // numbers being compared. It should be half the value
    // of the LSD.
    System.out.println("Equal");
}
doesn't know his trade. This isn't some numeric esotericism: it is basic knowledge about the representation of real numbers and is absolutely required of anybody handling real number computation.
Using a simple equality is every bit as inexcusable as using floats or doubles to hold monetary values. Both mistakes result from the same misunderstanding.
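For the monetary case, the usual alternative is java.math.BigDecimal built from decimal strings. A minimal sketch, with the class name MoneySum and the helper add invented for the example:

```java
import java.math.BigDecimal;

class MoneySum {
    // Hypothetical helper: exact decimal addition of two amounts given as text.
    static BigDecimal add(String a, String b) {
        return new BigDecimal(a).add(new BigDecimal(b));
    }

    public static void main(String[] args) {
        BigDecimal total = add("0.10", "0.20");
        System.out.println(total);                                      // 0.30
        // compareTo ignores scale, so 0.30 and 0.3 compare as equal.
        System.out.println(total.compareTo(new BigDecimal("0.3")) == 0); // true
    }
}
```

Constructing from strings matters here: new BigDecimal(0.1) would capture the binary approximation rather than the decimal value.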
 Signature martin@ | Martin Gregorie gregorie. | Essex, UK org |
John Ersatznom - 18 Jan 2007 22:43 GMT > That is why I gave an example to you, and > repeatedly ask you not to simply take my word for it, and run it yourself, > to see what the results were. What you have not done is explain why you attacked one of my posts earlier in the thread. That is what started this whole sideline, which is irrelevant to the OP's problem.
> I believe that having "ß".toUpperCase() yield "SS" is surprising only to > those who are unfamiliar with the ß character. Probably to most German > speakers, this behaviour is very non-surprising, and in fact, expected. What is surprising (and violates the Principle of Least Surprise) is the following:
x.toFooCase().equals(y.toFooCase()) != x.equalsIgnoreCase(y)
x.toFooCase().length() != x.length()
for some choices of x, y, and Foo.
You may argue that it is equalsIgnoreCase that is broken, but that still doesn't resolve the issue that strings might *change length* unexpectedly as well.
> I think "everyday blokes" are unqualified to have any expectations of > have the concepts of upper case and lower case mean in an international [quoted text clipped - 20 lines] > bloke is probably unqualified to request that the computer scientists and > IEEE redefine floating point computation to suit said blokes. I don't think this is relevant here. Someone familiar with FP math won't be surprised by the behavior of the above. But a programmer using toUpperCase on strings to key a hash table for case-insensitive lookup is going to be surprised if they do weird things like change length, compare equal for strings that aren't equalsIgnoreCase(), and the like. Remember, most programmers a) are English speaking and b) have backgrounds in various programming languages, often including ones with ASCII string classes and case-transforming methods that behave in the "usual" way -- that is, each output letter corresponds to 1 input letter under a fairly basic transformation rule.
Principle of Least Surprise is being violated.
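The hash-table scenario can be tried directly. The sketch below contrasts a HashMap keyed by toUpperCase(Locale.ROOT) with a TreeMap ordered by String.CASE_INSENSITIVE_ORDER, which compares char by char in the same spirit as equalsIgnoreCase(). The class name CaseInsensitiveKeys, the helper key and the sample words are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class CaseInsensitiveKeys {
    // Hypothetical helper: normalise a lookup key with full case mapping.
    static String key(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Map<String, Integer> byUpper = new HashMap<>();
        byUpper.put(key("stra\u00DFe"), 1);
        System.out.println(byUpper.containsKey(key("STRASSE")));   // true

        Map<String, Integer> byOrder = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        byOrder.put("stra\u00DFe", 1);
        System.out.println(byOrder.containsKey("STRASSE"));        // false
        System.out.println(byOrder.containsKey("STRA\u00DFE"));    // true
    }
}
```

Either structure is internally consistent; the surprise only appears when the two notions of "same ignoring case" are mixed in one program.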
> I claim that there exists at least one English speaker for which its > behaviour is not surprisingly (me). Yes, but you're weird, and apparently multilingual rather than *unilingual English*.
>>If it's supposed to do locale-dependent stuff, then it should have a >>version that accepts a Locale object. > > It does. See the JavaDocs. In which case the version that doesn't shouldn't behave in a surprising way, unless your system default locale is surprising, and of course THAT shouldn't happen.
>>A concept like "same spelling" can't be flawed. It's generally accepted >>that "color" and "colour" are the same word, but have different spellings, [quoted text clipped - 3 lines] > context of any one particular language (e.g. you can't assume only the Latin > alphabet) before I can agree or disagree with your claim. It suffices to mention the axiom that words with different numbers of letters are spelled differently. So if x.length() != y.length() (excuse me, codePointCount :P) then x and y are spelled differently.
Or are you now going to claim that the same spelling can have different lengths? (Encodings such as zipping the text up, UTF-8 etc. don't count.)
Lew - 18 Jan 2007 23:47 GMT > What is surprising (and violates the Principle of Least Surprise) is the > following: The documented behavior of the Java API methods String.toUpperCase() and String.toLowerCase() is completely unsurprising, at least to a practitioner of the Java art. Arguing that it should differ from what it is will yield no sweet fruit. It is what it's supposed to be.
- Lew
John W. Kennedy - 19 Jan 2007 04:52 GMT > I don't think this is relevant here. Someone familiar with FP math won't > be surprised by the behavior of the above. But a programmer using [quoted text clipped - 6 lines] > "usual" way -- that is, each output letter corresponds to 1 input letter > under a fairly basic transformation rule. Ineducable.
*PLONK*
 Signature John W. Kennedy "The blind rulers of Logres Nourished the land on a fallacy of rational virtue." -- Charles Williams. "Taliessin through Logres: Prelude"
Oliver Wong - 19 Jan 2007 17:22 GMT >> That is why I gave an example to you, and repeatedly ask you not to >> simply take my word for it, and run it yourself, to see what the results [quoted text clipped - 3 lines] > in the thread. That is what started this whole sideline, which is > irrelevant to the OP's problem. I fear I'm going to open up a whole can of twisty little worms with this one, but... Can you cite what it is I said that you consider to be an "attack"?
>> I believe that having "ß".toUpperCase() yield "SS" is surprising only >> to those who are unfamiliar with the ß character. Probably to most German [quoted text clipped - 7 lines] > > for some choices of x, y, and Foo. If you are not surprised by the fact that "ß".toUpperCase() yield "SS", then you should not be surprised that there exists some values for x such that x.toUpperCase().length() != x.length().
[Snip "everyday blokes" argument]
> I don't think this is relevant here. The relevancy is thus: You claim that the behaviour of toUpperCase should change because it's surprising to every day blokes. I am arguing that this is not a valid reason for changing the behaviour of toUpperCase, because every day blokes, not being linguists, are unqualified to make linguistic rules that may have widespread implication for languages other than their own.
[...]
> Remember, most programmers a) are English speaking and b) have backgrounds > in various programming languages, often including ones with ASCII string > classes and case-transforming methods that behave in the "usual" way -- > that is, each output letter corresponds to 1 input letter under a fairly > basic transformation rule. Are you sure about these assertions? Do you not think that there might be more Chinese/Japanese programmers than English programmers, given the huge population of Asia as compared to the western countries, and the recent economic growth spurt in Asia? And what about India?
>> I claim that there exists at least one English speaker for which its >> behaviour is not surprisingly (me). > > Yes, but you're weird, and apparently multilingual rather than *unilingual > English*. I claim I am not the only programmer in the world who is unilingual English.
[...]
>>>A concept like "same spelling" can't be flawed. It's generally accepted >>>that "color" and "colour" are the same word, but have different [quoted text clipped - 6 lines] > It suffices to mention the axiom that words with different numbers of > letters are spelled differently. Two issues:
(1) Your axiom fails to satisfy my requirement that your definition must be outside the context of any one particular language. Chinese characters, for example, are not composed of letters, and so speaking about "number of letters in a word" is meaningless there.
(2) That wasn't what I was reluctant to agree with anyway. I am not arguing against the idea that "color" and "colour" are spelt differently. However, I *AM* arguing against the idea that "color" and "colour" are the same word (depending on your definition of "word" which I am awaiting), and I am arguing against the idea that "a concept like 'same spelling' can't be flawed" (depending on your definition of spelling, which I am awaiting).
Recall that there exist languages where words are not written using letters. So any definition of "spelling" which depends on "letters" is inherently flawed.
- Oliver
Mark Thornton - 19 Jan 2007 19:30 GMT > What is surprising (and violates the Principle of Least Surprise) is the > following: [quoted text clipped - 3 lines] > > for some choices of x, y, and Foo. The trouble is that some (human) languages are evidently surprising to those not aware of them. Java can't change the fact that German and Georgian exist, nor can it change how these languages behave. For me, to not uppercase ß as SS would be surprising. (Although English is my native tongue, I did learn German at school some 30 years ago.)
> x.toFooCase().equals(y.toFooCase()) != x.equalsIgnoreCase(y) I believe this problem arises because some languages effectively have more than two cases. An identity that seems obvious in a two case world, ceases to be meaningful in a more complex situation.
Mark Thornton
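Java exposes a third case directly through Character.toTitleCase(). As an illustration (the digraph U+01C4 to U+01C6 is my choice of example, not one named in the post above, and the class ThreeCases and helper hex are made up), a single letter can have distinct lower, upper and title forms:

```java
class ThreeCases {
    // Hypothetical helper: print a char as a U+XXXX code point label.
    static String hex(char c) {
        return String.format("U+%04X", (int) c);
    }

    public static void main(String[] args) {
        char lower = '\u01C6'; // LATIN SMALL LETTER DZ WITH CARON
        System.out.println(hex(Character.toUpperCase(lower)));    // U+01C4
        System.out.println(hex(Character.toTitleCase(lower)));    // U+01C5
        System.out.println(hex(Character.toLowerCase('\u01C5'))); // U+01C6
    }
}
```

With three forms in play, "upper-case both sides and compare" and "lower-case both sides and compare" are no longer obviously the same test.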
John W. Kennedy - 16 Jan 2007 18:56 GMT > This seems to be excessively technical when the matter under discussion > is simply capitalizing strings. In any event, equalsIgnoreCase should > collapse these "ligatures" of yours as well. Also, I don't notice "fi" > and "FI" producing strange behavior myself -- even if the letters are > often run together so the 'i' hasn't got a separate dot *when typeset*, > this doesn't affect the representation of a string in a computer, It does if Unicode U+FB01 is used.
Look, you are /way/ out of your depth on this. All you're doing is making repeated assertions about the way things "ought to" work, when in plain fact they don't work that way, and aren't supposed to. Please either get a book about Unicode and read it through, or else drop the subject.
public class FI {
    public static void main(String[] args) {
        System.out.println("\uFB01".toUpperCase()); // Result: "FI"
    }
}
 Signature John W. Kennedy "The blind rulers of Logres Nourished the land on a fallacy of rational virtue." -- Charles Williams. "Taliessin through Logres: Prelude"
John Ersatznom - 18 Jan 2007 22:47 GMT >> This seems to be excessively technical when the matter under >> discussion is simply capitalizing strings. In any event, [quoted text clipped - 7 lines] > > Look, you are /way/ out of your depth on this. Maybe so, but I was *dragged down* by people piling onto my earlier, innocuous posting. What do you want me to do, simply concede and let you win? Why was I attacked to begin with?
I checked the history of this thread again and saw that it started with a post by one Oliver Wong. I then googled this bloke, and found in this same newsgroup a thread of around 500 articles half of them authored by him. I get the impression he's an extremely argumentative, arrogant and condescending man whose primary mission in life is to find postings in this newsgroup and attack them accusing the author of making mistakes if he finds anything in them that differs in the slightest from his personal beliefs.
That is not a useful way to discuss things, and serves only to put various people on the defensive and start long argumentative threads apropos of nothing. He really should cut it out, and I think I may just go and killfile him now, along with this thread and any others that he has polluted with his incessant pedantry and unsolicited criticism.
Chris Uppal - 19 Jan 2007 02:20 GMT > I checked the history of this thread again and saw that it started with > a post by one Oliver Wong. I then googled this bloke, and found in this [quoted text clipped - 4 lines] > he finds anything in them that differs in the slightest from his > personal beliefs. Look, we all know who you are.
I, personally, was willing to assume that your new nom de plume reflected a desire on your part to start afresh here, without the baggage of your previous (occasionally atrocious) behaviour. I have been, despite slight misgivings, happy to interact with "John Ersatznom" as if he were a brand new member of this community.
I would /still/ be willing to act on that assumption, even if you want to provoke acrimonious dispute (though somehow I doubt if you'd find it easy to persuade Oliver to join in), but this kind of glove-puppetry is just sickening. The point is not the slur against Oliver (although I respect him, and don't want to see him slagged off, I respect him enough to think that he can look after himself in these matters) but the above quoted paragraph is an insult to every reader's intelligence.
How, or even whether, other people choose to react is their affair, but you have passed the bounds of /my/ tolerance.
-- chris
Lew - 19 Jan 2007 13:49 GMT John Ersatznom wrote:
>> I checked the history of this thread again and saw that it started with >> a post by one Oliver Wong. I then googled this bloke, and found in this [quoted text clipped - 4 lines] >> he finds anything in them that differs in the slightest from his >> personal beliefs.
> Look, we all know who you are. I *thought* so!
- Lew
Chris Uppal - 16 Jan 2007 19:50 GMT [me:]
> > String.toUpperCase() does /not/ change the spelling of words (how could > > it, it doesn't know anything about words ?). What it does follow are [quoted text clipped - 12 lines] > This seems to be excessively technical when the matter under discussion > is simply capitalizing strings. 'fraid not. Case mapping is /NOT SIMPLE/, it never has been simple, and never will be. The fact that case mapping in English /is/ simple is neither here nor there. That fact has misled many English-speaking programmers into making invalid assumptions about the complexity of case mapping (and other orthographical operations), and in the process either creating software which is inherently broken (in implementation or API design) or which is restricted to English text. One example of that unfortunate process is String.equalsIgnoreCase() -- which would be better named something like equalsWhileIgnoringCaseAccordingToTheRulesOfEnglish(), except that it doesn't actually implement the contract implied by that name /either/. In fact there is no sensible name for what String.equalsIgnoreCase() does.
> Also, I don't notice "fi" > and "FI" producing strange behavior myself -- even if the letters are > often run together so the 'i' hasn't got a separate dot *when typeset*, > this doesn't affect the representation of a string in a computer, only > the visually displayed output (and then usually only when serious > typesetting software is used) That is a fair criticism of the Unicode position. It may even be correct (I don't know). The Unicode position is that it ignores ligatures (as a purely display issue), /except/ where ligature characters are needed in order to support round-tripping with other existing character sets. In this case U+00DF /is/ needed for that purpose (and may also be well established as a regularly used "character" even outside typographically advanced contexts -- I don't know).
The fact is that there are rules to follow. If those rules strike you as unnecessarily complicated, then that is your problem, not anyone else's (but you are certainly not alone). But even if you do dislike the rules, do you also want to write buggy software ? If you do write buggy software (in this respect) then, again, you are certainly not alone -- but that doesn't make it right.
> > It is simply erroneous to expect String.toUpperCase() to map characters > > one-to-one in the way that English case mapping works. It can't, it > > isn't supposed to, and it doesn't... > > No, it is not erroneous to expect a method to do exactly and only what > its name implies. But it /does/ do exactly what its name implies. Only if you have an incomplete idea of what case-mapping involves would you fail to understand the name and its implications.
> > String.equalsIgnoreCase(), on the other hand, is badly broken in that > > it does /not/ follow those rules. [quoted text clipped - 5 lines] > be the SAME equivalence class, and equalsIgnoreCase should implement and > embody the corresponding equivalence relation. But where does the "should" come from ? You can set up that kind of structure for English, no problem, but it doesn't generalise to other languages. No matter how much you may /want/ it to, it simply doesn't...
> The version that doesn't shouldn't > surprise English speakers; the version that does shouldn't surprise > anyone familiar with its locale-specific behavior for the locale > actually used. But there is /nothing/ about Java which implies that instances of java.lang.String hold English text. Indeed there is everything to suggest otherwise (why use Unicode at all, for instance).
Once you add in Locales then you get /another/ layer of complexity, in that the case mapping may be Locale-dependent /as well/ as not fitting with the preconceptions of English (only) speakers.
> Having locale-dependent behavior invoked randomly without > explicit use of Locale objects, and which furthermore doesn't use the > system locale, is by itself a sign of a questionable design as well as a > sure source of bugs and problems. There's a good deal to be said for the idea that Locale-dependent operations should either take an explicit Locale as a parameter, or should use a single, /invariant/, default Locale (not installation dependent). Just as a great deal of bother would be saved if String<->byte[] conversions didn't use an implicit, and installation-dependent, character encoding. But even if the Java class library was in that ideal state, case mapping would not be simple and would not conform to the expectations of some English speaking programmers.
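As a rough illustration of Locale-dependent case mapping, here is a sketch that passes the Locale explicitly instead of relying on the installation default. Turkish is my choice of example, since it distinguishes dotted and dotless i; the class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical demo class; the names are illustrative only.
class LocaleCaseDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // The caller states which rules apply, rather than inheriting them
    // from whatever default Locale the installation happens to have.
    static String upperIn(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        // Locale-neutral rules: 'i' maps to 'I'.
        System.out.println(upperIn("title", Locale.ROOT));
        // Turkish rules: 'i' maps to U+0130 (capital I with dot above).
        System.out.println(upperIn("title", TURKISH));
        // No-argument form: the result depends on Locale.getDefault().
        System.out.println("title".toUpperCase());
    }
}
```

The no-argument toUpperCase() is the installation-dependent case: the same program can produce different strings on different machines.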
There are two problems here. One is that too many programmers expect complex things to be more simple than they are (which is odd when you consider how eager programmers and designers often are to make simple things complex). The other is that we are using legacy libraries which in parts were designed by programmers who were still holding on to that forlorn hope. The use of default Locales is one example of that. String.equalsIgnoreCase() is another, and far worse, example.
> I've even encountered somewhere a notion that aString.length() is not > even accurate in current Java versions if a string contains obscure > characters. It depends on what you mean. String.length() returns, correctly, the number of Java "char"s in the String. No problem there. What /is/ a problem is that that is not the same as the number of characters in the Unicode text. That's a problem caused by the mis-specification of Java's chars to be 16-bit quantities. It is highly unfortunate, but there is very little that can be done about it now. It means that correct programming is more difficult than it looks, and also more difficult than it /should/ be. There is nothing in the problem space that makes this difficult (well, actually there is, but we'll pretend there isn't for now[*]), it's not an /inherently/ complex problem, but historical mistakes in Java's design mean that the API mostly works in terms of UTF-16 encoding (sequences of 16-bit values) rather than in terms of real Unicode characters.
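A small sketch of the difference between the number of chars and the number of Unicode characters. The test character U+1D11E (musical G clef) is my choice, picked because it lies outside the 16-bit range; the class and method names are hypothetical.

```java
// Hypothetical demo class; the names are illustrative only.
class CodePointLengthDemo {
    // U+1D11E needs two Java chars (a surrogate pair) in UTF-16.
    static final String CLEF = new String(Character.toChars(0x1D11E));

    // Number of 16-bit chars, i.e. UTF-16 code units.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String text = "a" + CLEF;
        System.out.println(charCount(text));      // 3
        System.out.println(codePointCount(text)); // 2
    }
}
```

For text restricted to the 16-bit range the two counts agree, which is why most legacy code using length() keeps working.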
> It suggests aString.<something using the obscure term "code > point", apparently just Unicode-geek for "character"> as its > replacement, while of course there's a ton of legacy code using > length(). For the most part, such code will remain correct. One way to think of it is that instances of java.lang.String do not, despite the name, directly represent Unicode strings (sequences of Unicode characters), but are UTF-16. I.e. only the name of the class is wrong. Most operations on UTF-16 data "do the right thing" for the Unicode information they represent. For instance concatenating two UTF-16 sequences. It's only operations which mess around taking strings apart[**] which are likely to do something invalid unexpectedly, and even there they quite often work correctly.
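One possible way to take a string apart without cutting a surrogate pair in half, sketched with String.codePointAt() and Character.charCount(). The helper name splitByCodePoint is made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the names are illustrative only.
class CodePointSplitDemo {
    // Splits a string into one substring per code point, so a surrogate
    // pair always stays together in the same piece.
    static List<String> splitByCodePoint(String s) {
        List<String> parts = new ArrayList<String>();
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            int width = Character.charCount(codePoint); // 1 or 2 chars
            parts.add(s.substring(i, i + width));
            i += width;
        }
        return parts;
    }

    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E));
        String text = "a" + clef + "b";
        System.out.println(splitByCodePoint(text).size()); // 3
        // A naive cut at a char index can end on half a pair:
        String broken = text.substring(0, 2);
        System.out.println(Character.isHighSurrogate(broken.charAt(1))); // true
    }
}
```

Concatenation needs no such care; it is only index-based cutting that can land inside a pair.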
The situation is unfortunate, but it's not really fatal. If any programmer is capable of understanding the difference between a sequence of characters and a sequence of bytes in some encoding, in the first place (necessary to do textual IO in Java at all), then adjusting to the deficiencies of the String class should not be overwhelmingly difficult.
There are issues to understand, and knowledge to be acquired; that's all...
> I don't suppose it occurred to them that the new fancy-whosit > should have been a replacement length() implementation instead of some > new name that doesn't suggest anything to do with the length of a string > to someone who doesn't care about all the Unicode bells and whistles and > just wants to process strings while remaining agnostic about what they > are ultimately used for or contain? I think they did the best they could. A better (but impossible in practice) solution would have been to redefine "char" to be a >=24 bit quantity (I'd have chosen 32-bit signed, myself), and redefine String to contain the new "char"s. It would have been nice to refactor String to separate the physical (internal) representation of the data from the logical character-based API. Unfortunately, that would have been impossible unless they made the change /very/ early -- and they missed the short window of opportunity for that. The scheme they came up with, effectively redefining what "String" and "char" mean, is probably the best possible solution. It doesn't break existing code -- in the sense that what worked before continues to work -- all that has changed is the interpretation of that code.
Code which /looks/ as if it will cope with all meaningful inputs does not (but then, it never would have done). Not a satisfactory position, but the best we are going to get.
There are issues to understand, and knowledge to be acquired; that's all...
-- chris
[*] The "length" of a Unicode string is somewhat problematical since some characters qualify others (diacritical marks etc), and some "characters" are not even characters at all. These issues are probably better thought of as technical problems caused by the (unavoidable) compromises in Unicode's design than something inherent to the problem space, but they are still issues for creators of text-aware applications (few Java applications /are/ text-aware to that degree).
[**] I should note that taking sequences of logical Unicode characters apart is also non-trivial, quite independently of Java's representational deficiencies, and may not fit with English speaking programmers' preconceptions. However, that's a different kett
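To illustrate footnote [*], here is a sketch in which even the code point count differs from what a reader would call the length. It assumes 'e' followed by a combining acute accent (U+0301) as the example, and uses java.text.Normalizer and java.text.BreakIterator; the class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical demo class; the names are illustrative only.
class CombiningMarkDemo {
    // 'e' followed by U+0301 (combining acute accent): two code points
    // that display as a single accented letter.
    static final String DECOMPOSED = "e\u0301";

    // Counts user-perceived characters using character boundaries.
    static int graphemeCount(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(DECOMPOSED.codePointCount(0, DECOMPOSED.length())); // 2
        System.out.println(graphemeCount(DECOMPOSED));                         // 1
        // NFC normalization composes the pair into the single char U+00E9.
        String composed = Normalizer.normalize(DECOMPOSED, Normalizer.Form.NFC);
        System.out.println(composed.length());                                 // 1
    }
}
```

Which of these counts is the "length" depends on what the application needs it for.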