Java Forum / General / February 2007

How to check variables for uniqueness?

krislioe@gmail.com - 21 Dec 2006 05:03 GMT
Hi all,

I have eight variables: var1, var2, ... var8, all of type String.
How can I check that each variable has a unique value?

Thank you for your help,
xtanto
Andrew Thompson - 21 Dec 2006 05:08 GMT
krisl...@gmail.com wrote:
...
> I have eight variables: var1, var2, ... var8, all of type String.
> How can I check that each variable has a unique value?

One way would be to create a Map, iterate the
var's and if not present in the map, add the value
as a key, else return false.

Andrew T.
Patricia Shanahan - 21 Dec 2006 06:18 GMT
> krisl...@gmail.com wrote:
> ...
[quoted text clipped - 6 lines]
>
> Andrew T.

Any particular reason for Map, rather than Set?

Note that the result of a Set add call is true if, and only if, the
value is not already in the Set.
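
A minimal sketch of that idea, assuming the eight variables are passed in as an array or varargs (the class and method names are made up for illustration):

```java
import java.util.HashSet;
import java.util.Set;

class UniqueCheck {
    // Returns true only if no two values are equal according to equals().
    static boolean allUnique(String... values) {
        Set<String> seen = new HashSet<String>();
        for (String v : values) {
            // add() returns false when the value is already in the set
            if (!seen.add(v)) {
                return false;
            }
        }
        return true;
    }
}
```

Called as UniqueCheck.allUnique(var1, var2, ..., var8), it stops at the first repeated value.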

Patricia
Andrew Thompson - 21 Dec 2006 06:25 GMT
> > krisl...@gmail.com wrote:
> > ...
[quoted text clipped - 4 lines]
> > var's and if not present in the map, add the value
> > as a key, else return false.
...
> Any particular reason for Map, rather than Set?

You mean besides, 'lack of enough consultation
of the relevant docs.'?   ;-)

> Note that the result of a Set add call is true if, and only if, the
> value is not already in the Set.

A Set sounds the go - it is just right for this task.

Andrew T.
John Ersatznom - 21 Dec 2006 06:33 GMT
>>>krisl...@gmail.com wrote:
>>>...
[quoted text clipped - 17 lines]
>
> A Set sounds the go - it is just right for this task.

HashSet<String> foo = new HashSet<String>();
foo.add(var1);
foo.add(var2);
foo.add(var3);
foo.add(var4);
foo.add(var5);
foo.add(var6);
foo.add(var7);
foo.add(var8);
if (foo.size() < 8)
    duplicateExists();
else
    duplicateDoesNotExist();

If you actually need to identify the specific duplicate pairs, you need
to compare them one by one -- 1 with all the others, 2 with all the
higher-numbered ones, and so on up to 7 and 8, using equals().
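
Roughly, that pairwise comparison could look like the sketch below. It assumes the variables sit in an array with no null entries; the class name, method name and the "i=j" report format are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

class PairwiseDuplicates {
    // Compares every pair (i, j) with i < j using equals() and
    // records each matching pair as "i=j".
    static List<String> findPairs(String[] vars) {
        List<String> pairs = new ArrayList<String>();
        for (int i = 0; i < vars.length - 1; i++) {
            for (int j = i + 1; j < vars.length; j++) {
                if (vars[i].equals(vars[j])) {
                    pairs.add(i + "=" + j);
                }
            }
        }
        return pairs;
    }
}
```

For n variables this does n*(n-1)/2 comparisons, which is nothing for eight strings but grows quickly for large inputs.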

If you want case insensitivity, use e.g.

foo.add(var3.toLowerCase());

or equalsIgnoreCase().
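
Another option, sketched here under the same assumptions, is to let the set do the case-insensitive comparison by giving a TreeSet the String.CASE_INSENSITIVE_ORDER comparator. Note that this comparator works character by character and is not locale-aware. The class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

class CaseInsensitiveUnique {
    // Returns true if two of the values differ at most by case.
    static boolean hasDuplicateIgnoringCase(String... values) {
        // Comparator-based set: "Foo" and "foo" count as the same element.
        Set<String> seen = new TreeSet<String>(String.CASE_INSENSITIVE_ORDER);
        for (String v : values) {
            if (!seen.add(v)) {
                return true;
            }
        }
        return false;
    }
}
```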
Patricia Shanahan - 21 Dec 2006 11:30 GMT
>>>> krisl...@gmail.com wrote:
>>>> ...
[quoted text clipped - 35 lines]
> to compare them one by one -- 1 with all the others, 2 with all the
> higher-numbered ones, and so on up to 7 and 8, using equals().

To save repetitious writing, I'm going to assume the strings are in an
array. The equivalent of your code would be:

HashSet<String> foo = new HashSet<String>();
for(String v:vars){
  foo.add(v);
}
if (foo.size() < vars.length)
    duplicateExists();
else
    duplicateDoesNotExist();

You can simplify finding specific duplicates by checking the foo.add
results:

HashSet<String> foo = new HashSet<String>();
for(int i=0; i<vars.length; i++){
  if(!foo.add(vars[i])){
    for(int j=0; j<i; j++){
      if(vars[i].equals(vars[j])){
        reportDuplicate(i,j);
      }
    }
  }
}

A true result from foo.add means the string was actually added to the
set, so it has no duplicate with a lower index.

Patricia
Ed Kirwan - 21 Dec 2006 12:25 GMT
> You can simplify finding specific duplicates by checking the foo.add
> results:
[quoted text clipped - 14 lines]
>
> Patricia

Perhaps using a List would obviate the need for the nested loop?

    List<String> list = new ArrayList<String>();
    for (int i = 0, n = vars.length; i < n; i++) {
       int duplicateIndex = list.indexOf(vars[i]);
       if (duplicateIndex != -1) {
        reportDuplicate(i, duplicateIndex);
       } else {
        list.add(vars[i]);
       }
    }

.ed

www.EdmundKirwan.com - Home of The Fractal Class Composition.

Download Fractality, free Java code analyzer:
www.EdmundKirwan.com/servlet/fractal/frac-page130.html

Remon van Vliet - 21 Dec 2006 13:28 GMT
> Perhaps using a List would obviate the need for the nested loop?
>
[quoted text clipped - 9 lines]
>
> .ed

The nested loop is only needed to allow reporting of a specific duplicate
pair. I cannot think of many practical examples where that is required
rather than simply reporting that the element to be added is a duplicate. If
it is required then I'd say you're right, using a List does result in
slightly more readable code.

That said, if the collection must not contain duplicate elements then at
least from a design and correctness perspective you should use a Set. I'd
personally do so even if that decision would result in a few extra lines of
code here and there.

Remon
Hemal  Pandya - 22 Dec 2006 05:45 GMT
[...]
> Perhaps using a List would obviate the need for the nested loop?

It will, but will be a lot more expensive. You can use a
Map<String,Integer> to both avoid nested loop and report indexes. Yes,
it will take more memory.
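
A rough sketch of the Map<String,Integer> idea, assuming the strings are in an array. The class name, method name and report wording are invented here, and each duplicate is reported only against the first index at which its value appeared.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IndexedDuplicates {
    // One pass: remember the first index of every value and report
    // later occurrences against that index.
    static List<String> findDuplicates(String[] vars) {
        Map<String, Integer> firstIndex = new HashMap<String, Integer>();
        List<String> report = new ArrayList<String>();
        for (int i = 0; i < vars.length; i++) {
            Integer earlier = firstIndex.get(vars[i]);
            if (earlier != null) {
                report.add(i + " duplicates " + earlier);
            } else {
                firstIndex.put(vars[i], Integer.valueOf(i));
            }
        }
        return report;
    }
}
```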

[....]
Patricia Shanahan - 22 Dec 2006 06:03 GMT
Hemal Pandya wrote:
> [...]
>> Perhaps using a List would obviate the need for the nested loop?

Note that I did NOT write that.

> It will, but will be a lot more expensive. You can use a
> Map<String,Integer> to both avoid nested loop and report indexes. Yes,
> it will take more memory.
>
> [....]
Hemal  Pandya - 22 Dec 2006 06:46 GMT
[....]
> Note that I did NOT write that.

No, you did not. Your lines would have had one more '>' at the
beginning-of-line. I apologize if I caused confusion.
Ed - 30 Dec 2006 16:15 GMT
Hemal  Pandya skrev:

> [...]
> > Perhaps using a List would obviate the need for the nested loop?
>
> It will, but will be a lot more expensive.
> [....]

Thanks for that tip, Hemal. I had no idea that Set-implementations were
so much more efficient (in this case) than List-implementations. The
output from the (no-doubt indent-mashed) code below gives:

522393 duplicated words. Using java.util.HashSet, time = 678ms.
522393 duplicated words. Using java.util.TreeSet, time = 1812ms.
522393 duplicated words. Using java.util.ArrayList, time = 157724ms.
522393 duplicated words. Using java.util.LinkedList, time = 251739ms.

import java.util.*;
import java.io.*;

class Test {
   private static String TEXT_BOOK_NAME = "war-and-peace.txt";

   public static void main(String[] args) {
    try {
       String text = readText();    // Read text into RAM
       countDuplicateWords(text, new HashSet());
       countDuplicateWords(text, new TreeSet());
       countDuplicateWords(text, new ArrayList());
       countDuplicateWords(text, new LinkedList());
    } catch (Throwable t) {
       System.out.println(t.toString());
    }
   }

   private static String readText() throws Throwable {
    BufferedReader reader =
       new BufferedReader(new FileReader(TEXT_BOOK_NAME));
    String line = null;
    StringBuffer text = new StringBuffer();
    while ((line = reader.readLine()) != null) {
       text.append(line + " ");
    }
    return text.toString();
   }

   private static void countDuplicateWords(String text,
                       Collection listOfWords) {
    int numDuplicatedWords = 0;
    long startTime = System.currentTimeMillis();
    for (StringTokenizer i = new StringTokenizer(text);
        i.hasMoreElements();) {
       String word = i.nextToken();
       if (listOfWords.contains(word)) {
        numDuplicatedWords++;
       } else {
        listOfWords.add(word);
       }
    }
    long endTime = System.currentTimeMillis();
    System.out.println(numDuplicatedWords + " duplicated words. " +
              "Using " + listOfWords.getClass().getName() +
              ", time = " + (endTime - startTime) + "ms.");
   }
}

.ed

--

www.EdmundKirwan.com - Home of The Fractal Class Composition
Lew - 30 Dec 2006 18:10 GMT
> Hemal  Pandya skrev:
>
[quoted text clipped - 60 lines]
>     }
> }

(Please do not embed TAB characters in newsgroup postings.)

You could use a HashMap if you wanted to know how many times each word occurred:

Map< String, Integer > concordance = new HashMap< String, Integer > ();
for ( StringTokenizer tok = new StringTokenizer(text);
      tok.hasMoreElements(); )
{
  String word = tok.nextToken();
  Integer kt = concordance.get( word );
  if ( kt == null )
  {
    concordance.put( word, Integer.valueOf( 1 )); // first occurrence counts as 1
  }
  else
  {
    concordance.put( word, Integer.valueOf( kt.intValue() + 1 ));
  }
}

then get total dupes by analyzing the concordance:

int totalDupes = 0;
for ( Map.Entry< String, Integer > entry : concordance.entrySet() )
{
  if ( entry.getValue().intValue() > 1 )
  {
    ++totalDupes;
  }
}

- Lew
Ed - 30 Dec 2006 22:32 GMT
Lew skrev:

> (Please do not embed TAB characters in newsgroup postings.)
>
> You could use a HashMap if you wanted to know how many times each word occurred:

snip
> - Lew

Indeed.

And in case anyone's interested, here are the times for HashMap. Looks
like Map is in the league of Set, and not the slow-moving List. (These
times are longer than the previous times because of current CPU
loading; relativity is the key.)

522393 duplicated words. Using java.util.HashSet, time = 789ms.
522393 duplicated words. Using java.util.TreeSet, time = 2168ms.
522393 duplicated words. Using Map , time = 1180ms.
522393 duplicated words. Using java.util.ArrayList, time = 183795ms.
522393 duplicated words. Using java.util.LinkedList, time = 274781ms.

Apologies to Patricia: I see I mis-attributed her post, yet again. And
Lew, I've now become fast friends with Linux's expand(). Let's see
whether I purged those nasty TABs:

import java.util.*;
import java.io.*;

class Test {
   private static String TEXT_BOOK_NAME = "war-and-peace.txt";

   public static void main(String[] args) {
    try {
       String text = readText();    // Read text into RAM
       countDuplicateWords(text, new HashSet());
       countDuplicateWords(text, new TreeSet());
       countDuplicateWordsMap(text);
       countDuplicateWords(text, new ArrayList());
       countDuplicateWords(text, new LinkedList());
    } catch (Throwable t) {
       System.out.println(t.toString());
    }
   }

   private static String readText() throws Throwable {
    BufferedReader reader =
       new BufferedReader(new FileReader(TEXT_BOOK_NAME));
    String line = null;
    StringBuffer text = new StringBuffer();
    while ((line = reader.readLine()) != null) {
       text.append(line + " ");
    }
    return text.toString();
   }

   private static void countDuplicateWords(String text,
                       Collection listOfWords) {
    int numDuplicatedWords = 0;
    long startTime = System.currentTimeMillis();
    for (StringTokenizer i = new StringTokenizer(text);
        i.hasMoreElements();) {
       String word = i.nextToken();
       if (listOfWords.contains(word)) {
        numDuplicatedWords++;
       } else {
        listOfWords.add(word);
       }
    }
    long endTime = System.currentTimeMillis();
    System.out.println(numDuplicatedWords + " duplicated words. " +
              "Using " + listOfWords.getClass().getName() +
              ", time = " + (endTime - startTime) + "ms.");
   }

   private static void countDuplicateWordsMap(String text) {
    int numDuplicatedWords = 0;
    Map wordsToFrequency = new HashMap();
    long startTime = System.currentTimeMillis();
    for (StringTokenizer i = new StringTokenizer(text);
        i.hasMoreElements();) {
       String word = i.nextToken();
       Integer frequency = (Integer)wordsToFrequency.get(word);
       if (frequency == null) {
        wordsToFrequency.put(word, new Integer(0));
       } else {
        int value = frequency.intValue();
        wordsToFrequency.put(word, new Integer(value + 1));
        numDuplicatedWords++;
       }
    }
    long endTime = System.currentTimeMillis();
    System.out.println(numDuplicatedWords + " duplicated words. " +
              "Using Map " +
              ", time = " + (endTime - startTime) + "ms.");
   }
}

.ed

--

www.EdmundKirwan.com - Home of The Fractal Class Composition
Lew - 31 Dec 2006 13:55 GMT
> And in case anyone's interested, here are the times for HashMap. Looks
> like Map is in the league of Set, and not the slow-moving List. (These
[quoted text clipped - 6 lines]
> 522393 duplicated words. Using java.util.ArrayList, time = 183795ms.
> 522393 duplicated words. Using java.util.LinkedList, time = 274781ms.

These times are extremely interesting.

I speculate that the greater part of the difference between HashMap and
HashSet would be the second loop through the Map. Note that though the Map was
slightly slower than the Set, it delivers more information. With the Set you
only knew how many words were duplicated; with the Map you can also figure out
which words were, and how many times each one occurred.

You could, for example, use the Map to deliver the words in order of
frequency, given the right comparator over the entry set.
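
Something along these lines would do it. This is only a sketch: it assumes a Map<String,Integer> of word counts already exists, the class and method names are made up, and ties are broken alphabetically, which is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class FrequencyOrder {
    // Copies the entry set into a list and sorts it by descending count.
    static List<Map.Entry<String, Integer>> byFrequency(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                int byCount = b.getValue().compareTo(a.getValue()); // highest count first
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            }
        });
        return entries;
    }
}
```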

- Lew
John Ersatznom - 04 Jan 2007 11:08 GMT
>> And in case anyone's interested, here are the times for HashMap. Looks
>> like Map is in the league of Set, and not the slow-moving List. (These
[quoted text clipped - 17 lines]
> You could, for example, use the Map to deliver the words in order of
> frequency, given the right comparator over the entry set.

A lot of the Map slowness is probably the churn of Integer objects
created. Using an int[1] as a "mutable Integer" would work far better
(although mutable objects in collections is normally bad, mutable values
in a map isn't generally a problem, so long as you don't have mutable keys).
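
For illustration, a minimal sketch of the int[1] counter. The class and method names are hypothetical, and whether it is actually faster would need measuring on the same input.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

class MutableCounts {
    // Counts words using a one-element int array as a mutable counter,
    // so no new Integer is created for each increment.
    static Map<String, int[]> countWords(String text) {
        Map<String, int[]> counts = new HashMap<String, int[]>();
        StringTokenizer tok = new StringTokenizer(text);
        while (tok.hasMoreTokens()) {
            String word = tok.nextToken();
            int[] counter = counts.get(word);
            if (counter == null) {
                counter = new int[1];      // starts at 0
                counts.put(word, counter);
            }
            counter[0]++;                  // mutate the value in place
        }
        return counts;
    }
}
```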

On the subject of tabs, my copy of Thunderbird seems to be quietly
converting tabs into spaces, though I can't find the setting for it.
Posts apparently originally containing tabs (e.g. Ed's earlier) have
spaces when I view them, and my own posts written with tabs don't make
you complain. :) The curious thing is that incoming posts seem to have
tab->4 spaces and the editor shows tabs as 8 spaces, but they become 4
in the actual sent posting...and none of the options in Thunderbird say
anything about conversion of tabs at all, either to set their displayed
width or to actually change tabs to certain numbers of spaces. Hrm. The
"online help" doesn't open a help window, but rather hijacks my open
Firefox window, and the search there is useless on this topic too...
Oliver Wong - 21 Dec 2006 21:09 GMT
> If you want case insensitivity, use e.g.
>
> foo.add(var3.toLowerCase());

   This might not actually work, because of the fickleness of certain human
languages.

> or equalsIgnoreCase().

   Yeah, I'd essentially wrap the String in a custom class which overrides
equals to call equalsIgnoreCase, and give that to the Set.

   - Oliver
John Ersatznom - 22 Dec 2006 09:37 GMT
>>If you want case insensitivity, use e.g.
>>
>>foo.add(var3.toLowerCase());
>
>     This might not actually work, because of the fickleness of certain human
> languages.

?

>     Yeah, I'd essentially wrap the String in a custom class which overrides
> equals to call equalsIgnoreCase, and give that to the Set.

What is obviously missing from java.util is an Equalizer:

public interface Equalizer<T> {
    public boolean areEqual (T foo, T bar);
    public int getHash (T foo);
}

and the ability to pass these to collection constructors to use, the way
those that use order comparison can already be handed a custom comparator.

Problems caused by comparators not consistent with an object's equals
method could be avoided by supplying an Equalizer that is consistent
with the comparator, as well as it obviating the need you perceive to
wrap the String class. (Either way, by the way, you need to replace
hashCode() with a case-insensitive version too, or you'll have strings
that compare equal and have different hash codes, at least potentially.
That at least can't happen if you use add(var.toFooCase()) or similar.)
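
A sketch of such a wrapper, with equals and hashCode kept in step. The class name is hypothetical and the wrapped string is assumed to be non-null. The hash folds each char the same way equalsIgnoreCase compares chars (upper-case it, then lower-case it), so two keys that compare equal get the same hash code.

```java
final class CaseInsensitiveKey {
    private final String value;

    CaseInsensitiveKey(String value) {
        this.value = value;
    }

    String value() {
        return value;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof CaseInsensitiveKey
            && value.equalsIgnoreCase(((CaseInsensitiveKey) other).value);
    }

    @Override
    public int hashCode() {
        int h = 0;
        for (int i = 0; i < value.length(); i++) {
            // Same per-char folding that equalsIgnoreCase applies.
            char folded = Character.toLowerCase(Character.toUpperCase(value.charAt(i)));
            h = 31 * h + folded;
        }
        return h;
    }
}
```

A HashSet<CaseInsensitiveKey> then rejects "Foo" after "fOO" without changing the stored strings.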
Oliver Wong - 22 Dec 2006 15:33 GMT
>>>If you want case insensitivity, use e.g.
>>>
[quoted text clipped - 4 lines]
>
> ?

   I'm not a linguist, so this may be linguistically incorrect, but it
illustrates the type of problems you can run into:

assert locale is German; // pseudocode
assert "BEISSEN".toLowerCase().equals("beissen");
assert "BEISSEN".toLowerCase().equals("beißen");

   - Oliver
John Ersatznom - 23 Dec 2006 13:14 GMT
>>>>If you want case insensitivity, use e.g.
>>>>
[quoted text clipped - 11 lines]
> assert "BEISSEN".toLowerCase().equals("beissen");
> assert "BEISSEN".toLowerCase().equals("beißen");

Yeah, and assert "Color".toLowerCase().equals("Colour".toLowerCase()).
Whenever there's multiple legitimate spellings for the same word,
there's going to be trouble if you try to make the computer "smart
enough" to treat them as equal.

Mind you, there ARE lexicographical "distance" measures that are useful
for "fuzzy-matching", such as spell-checker "suggestions" use. (Google
now suggests an alternate if it thinks you've misspelled a query term,
for example.) But you can't use those as an equality test, since they
don't define an equivalence relation -- they aren't transitive, since
you can have a.isCloseTo(b), a.isCloseTo(c), and !b.isCloseTo(c) (e.g.
where the distance is 1 from c to a, 1 from a to b, and 2 from c to b,
and 1 is the threshold). Even a threshold of 1 is too high if the result
is not only to equate "color" with "colour" but also with "colon". :)
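
To make the transitivity problem concrete, here is a small sketch using plain Levenshtein edit distance as the "distance" measure. That choice is an assumption for the example; real spell-checkers may use something else. The class and method names are made up.

```java
class EditDistance {
    // Classic dynamic-programming Levenshtein distance
    // (insertions, deletions and substitutions each cost 1).
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // "Close" means within the given threshold; this is NOT transitive.
    static boolean isCloseTo(String a, String b, int threshold) {
        return distance(a, b) <= threshold;
    }
}
```

With a threshold of 1, "color" is close to both "colour" and "colon", yet "colour" and "colon" are 2 apart, so the relation cannot serve as equals().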

Best to treat distinct spellings as distinct, and perhaps use a
fuzzy-match "suggested alternative" if users enter a query with no
results, e.g. if a search for "beissen" comes up empty.

Of course, if you really want to drive yourself mad, try to program the
computer to identify when two different input strings identify the same
thing in general. Good luck having it compare e.g. "Carrie-Anne Moss"
and "Lead actress in The Matrix" as equal. Sure, go ahead, you'll even
solve the NLP while you're at it so you should become rich and famous.
If you succeed. :)

Of course, all this arose in the context of "foo.equalsIgnoreCase(bar)"
vs. "foo.toLowerCase().equals(bar.toLowerCase())". Those *should* be
equal; both should be transforming words into a canonical
representation. Or else there should be another toFoo() method that
returns a canonical representation that compares equal for words that
compare equalsIgnoreCase, because the usefulness of having such a
representation to use as a key in a hashmap is obvious.
Oliver Wong - 27 Dec 2006 17:39 GMT
>>>>>If you want case insensitivity, use e.g.
>>>>>
[quoted text clipped - 13 lines]
>
> Yeah, and assert "Color".toLowerCase().equals("Colour".toLowerCase()).

{
 String originalA = "color";
 a = originalA; // "color"
 a = a.toUppercase(); // "COLOR"
 a = a.toLowercase(); // "color"
 assert a.equals(originalA);
}
{
 String originalA = "beißen";
 a = originalA; // "beißen"
 a = a.toUppercase(); // "BEISSEN"
 a = a.toLowercase(); // "beissen"
 assert a.equals(originalA);
}

   - Oliver
John Ersatznom - 29 Dec 2006 15:05 GMT
>>>assert locale is German; // pseudocode
>>>assert "BEISSEN".toLowerCase().equals("beissen");
[quoted text clipped - 9 lines]
>   assert a.equals(originalA);
> }

I don't see "colour" (with a U) in there anywhere, Oliver.
Oliver Wong - 29 Dec 2006 15:28 GMT
>>>>assert locale is German; // pseudocode
>>>>assert "BEISSEN".toLowerCase().equals("beissen");
[quoted text clipped - 11 lines]
>
> I don't see "colour" (with a U) in there anywhere, Oliver.

   You weren't intended to.

   - Oliver
John Ersatznom - 04 Jan 2007 11:11 GMT
>>>>>assert locale is German; // pseudocode
>>>>>assert "BEISSEN".toLowerCase().equals("beissen");
[quoted text clipped - 13 lines]
>
>     You weren't intended to.

Then you're missing the point entirely. "COLOR" and "colour" differ only
by capitalization while "beissen" and "beißen" differ by spelling in a
manner similar to "color" vs. "colour". Alternate spellings of the same
word can't in general be identified as identical by a computer -- not
without a trip through a spellchecking dictionary or the like, anyway. I
think you may be expecting too much of Java's humble string classes.
Perhaps Collator is smart enough for you?
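
For reference, a minimal sketch of what java.text.Collator does with case. It assumes an English locale and SECONDARY strength, at which case differences are ignored; the class and method names are hypothetical. It says nothing about how a German collator treats "beißen" vs. "beissen", which I have not checked.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorSketch {
    // Compares two strings with case differences ignored
    // (case is a TERTIARY difference, so SECONDARY strength skips it).
    static boolean sameIgnoringCase(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.SECONDARY);
        return collator.compare(a, b) == 0;
    }
}
```

Variant spellings are still different words to it: "color" and "COLOR" compare equal, "color" and "colour" do not.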
Andrew Thompson - 04 Jan 2007 11:25 GMT
...
> Then you're missing the point entirely. "COLOR" and "colour" differ only
> by capitalization ..

As well as the 'u' in the second word.  And from my
vague recollections of this thread (that I am not prepared
to review at this instant) - a misunderstanding between
the spelling observed, might actually explain this (sub)
thread..(?)

Andrew T.
John Ersatznom - 05 Jan 2007 21:59 GMT
> ...
>
>>Then you're missing the point entirely. "COLOR" and "colour" differ only
>>by capitalization ..
>
> As well as the 'u' in the second word.

That wasn't supposed to be there, though the later "color" vs "colour"
(all lower case) is correct. :P I trust my meaning is still easy to glean.
Oliver Wong - 04 Jan 2007 21:26 GMT
>>>>>>assert locale is German; // pseudocode
>>>>>>assert "BEISSEN".toLowerCase().equals("beissen");
[quoted text clipped - 15 lines]
>
> Then you're missing the point entirely.

   Must be, because I was under the impression I was making a point to you,
as opposed to the other way around. I thought you were curious as to how
manually doing case-insensitive conversions could fail, as opposed to using
the built-in equalsIgnoreCase().

> "COLOR" and "colour" differ only by capitalization while "beissen" and
> "beißen" differ by spelling in a manner similar to "color" vs. "colour".

   I disagree.

> Alternate spellings of the same word can't in general be identified as
> identical by a computer -- not without a trip through a spellchecking
> dictionary or the like, anyway. I think you may be expecting too much of
> Java's humble string classes. Perhaps Collator is smart enough for you?

   You should take the code I posted and put it in your favorite IDE, fix
the compile errors (apparently, it's toLowerCase, not toLowercase), and run
it. You might find the results enlightening. If those results surprise you,
add a few System.out.println(a) to see what's going on.

   - Oliver
John Ersatznom - 06 Jan 2007 12:36 GMT
>>>>I don't see "colour" (with a U) in there anywhere, Oliver.
>>>
[quoted text clipped - 6 lines]
> manually doing case-insensitive conversions could fail, as opposed to using
> the built-in equalsIgnoreCase().

Both will fail when you want words spelled differently to compare equal,
though Collator may have more smarts in that area.

>>"COLOR" and "colour" differ only by capitalization while "beissen" and
>>"beißen" differ by spelling in a manner similar to "color" vs. "colour".
>
>     I disagree.

On what basis? The typo I made? It was meant to say "COLOR" and "color"
differ only by capitalization while "beissen" and "beißen" differ by
spelling in a manner similar to "color" vs. "colour".

In fact the analogy goes so far as for the number of letters in the
latter two examples to differ by one in both cases, and for a two letter
region in one to correspond to a single letter at the same place in the
other in particular. And (presumably -- I don't know the German word(s))
they are in both cases variant spellings of a different word --
differing in more than just capitalization, but used interchangeably or
as regional variants rather than having distinct meanings.

>     You should take the code I posted and put it in your favorite IDE, fix
> the compile errors (apparently, it's toLowerCase, not toLowercase), and run
> it.

It would have been nice if Sun had been consistent about their own
capitalization. There's also Character.isWhitespace (in the same class!
Note lowercase s) and System.arraycopy (note lowercase c), at minimum.
:P Maybe they need to implement an isCamelCase method (note second
capital C)... :)

In any event, I suppose the real lesson here is that String (and
friends) get you primitive ordering and comparisons, perhaps somewhat
Anglocentric, and you need to use Collator and relatives for serious
language-and-locale-sensitive comparisons. I don't know the extent to
which even the latter will cope with variant spellings, mind you. There
is also a where-do-you-draw-the-line issue -- from case to slight
variations in the actual sequence of letters used on to more overt
differences, as between "huge" and "giant" -- when should those be
considered synonyms, and when different? -- and on until if you broaden
your requirements enough solving the NLP seems to be a required
component of any conforming implementation. :) Language has a fuzziness
in it in actual human usage that computers have trouble with. It's
curiously not unlike the problems that arose elsewhere here today with
float and double comparisons. You can't rely usefully on == for the most
part, and using Math.abs(x - y) < someThreshold gives an "equality" test
that's more meaningful in some ways but is not transitive any more.
Eventually linguistic equality loses transitivity too -- you can play
all kinds of games of picking close synonyms of the previous word to
grow a chain that can end in a fairly good approximation to an antonym
for your starting word, in most any language, using either phonemic
proximity or lexical proximity, and get different results with each besides.
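
The floating-point half of that comparison fits in a few lines. A sketch, with a made-up class name and values chosen so they are exactly representable as doubles:

```java
class ApproxEquals {
    // Threshold "equality": true when x and y are less than threshold apart.
    static boolean nearlyEqual(double x, double y, double threshold) {
        return Math.abs(x - y) < threshold;
    }

    public static void main(String[] args) {
        double a = 0.0, b = 0.75, c = 1.5;
        System.out.println(nearlyEqual(a, b, 1.0)); // true
        System.out.println(nearlyEqual(b, c, 1.0)); // true
        System.out.println(nearlyEqual(a, c, 1.0)); // false: not transitive
    }
}
```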

The real upshot is simply "computers, at present, don't have the ability
to really model things in linguistics". But they know about abstract
sequences of discrete, wholly-distinct characters that happen to stand
for graphical squiggles meaningful to humans.

Play to their strengths -- the computers' *and* the humans'. :)
Oliver Wong - 08 Jan 2007 19:17 GMT
>> I thought you were curious as to how manually doing case-insensitive
>> conversions could fail, as opposed to using the built-in
>> equalsIgnoreCase().
>
> Both will fail when you want words spelled differently to compare equal,
> though Collator may have more smarts in that area.

  I don't know how you define "fail" or "not fail" in this context, but the
point that I'm trying to make is that the two methods do not give the same
results and are thus not equivalent. Try running the example I provided
earlier, or try this example:

{
  System.out.println("beißen".equalsIgnoreCase("BEISSEN"));
  System.out.println("beißen".toUpperCase().equals("BEISSEN"));
}

>>>"COLOR" and "colour" differ only by capitalization while "beissen" and
>>>"beißen" differ by spelling in a manner similar to "color" vs. "colour".
>>
>>     I disagree.
>
> On what basis?

   Replace the "beißen" by "colour" and "BEISSEN" by "COLOR", and you will
get different results, thus showing that the difference between "COLOR"
and "colour" is not of the same nature as that between "beißen" and
"BEISSEN".

   - Oliver
John Ersatznom - 08 Jan 2007 23:23 GMT
>>>I thought you were curious as to how manually doing case-insensitive
>>>conversions could fail, as opposed to using the built-in
[quoted text clipped - 24 lines]
> and "colour" is not of the same nature as that between "beißen" and
> "BEISSEN".

This may show that it is "not of the same nature" as defined by certain
Java library functions, but I don't see how this is really meaningful to
people, except insofar as it "means" that the standard library has a bug
or at least a wart or misfeature of some kind. The "equalsIgnoreCase"
method should ignore case, but not spelling. It shouldn't consider
"color" equal to "colour" and it shouldn't consider "beißen" equal to
"beissen" either. Why? Because those pairs differ by spelling and not
just capitalization!
Lew - 09 Jan 2007 04:46 GMT
Oliver Wong wrote:
>> Try this example:
>>
>> {
>>    System.out.println("beißen".equalsIgnoreCase("BEISSEN"));
>>    System.out.println("beißen".toUpperCase().equals("BEISSEN"));
>> }

> ... The "equalsIgnoreCase"
> method should ignore case, but not spelling. It shouldn't consider
> ... "beißen" equal to "beissen" either.
> Why? Because those pairs differ by spelling and not
> just capitalization!

That is how equalsIgnoreCase() works:

"beißen".equalsIgnoreCase("BEISSEN"): false

- Lew
John Ersatznom - 15 Jan 2007 14:11 GMT
> That is how equalsIgnoreCase() works:
>
> "beißen".equalsIgnoreCase("BEISSEN"): false

Well, then, either Wong is completely nuts, or we're using different JDK
versions (1.6 here), or (seems least likely) toUpperCase actually alters
the spelling of some words(!) rather than just changing a-z to A-Z
(likewise accented equivalents) while leaving the rest alone.
Lew - 15 Jan 2007 15:19 GMT
Lew wrote:
>> That is how equalsIgnoreCase() works:
>>
>> "beißen".equalsIgnoreCase("BEISSEN"): false

> Well, then, either Wong is completely nuts,

The result agrees with Oliver's assertion.

> or we're using different JDK
> versions (1.6 here), or (seems least likely) toUpperCase actually alters
> the spelling of some words(!) rather than just changing a-z to A-Z
> (likewise accented equivalents) while leaving the rest alone.

AFAIK, toUpperCase() follows the socially-determined locale rules. What is the
upper case of "beißen" in German? (E.g., what would a German newspaper do?)

- Lew
John Ersatznom - 16 Jan 2007 16:30 GMT
> AFAIK, toUpperCase() follows the socially-determined locale rules. What
> is the upper case of "beißen" in German? (E.g., what would a German
> newspaper do?)

Well, it certainly shouldn't actually use a different spelling. Would an
American newspaper use "color" in article text but "COLOUR" in headlines? :)

Regardless, even if toUpperCase makes changes other than to case, even
altering the number of symbols, what does toLowerCase do, and why isn't
equalsIgnoreCase consistent with them? It should consider any two
strings equal whose toUpperCase()s are equal as decided by equals() or
whose toLowerCase()s are equal likewise, and extend this transitively as
necessary. Otherwise, equalsIgnoreCase is really equalsIgnoreFoo and
toLowerCase and toUpperCase are toLowerBar and toUpperBar -- the word
"case" is not talking about the same thing in the one as it is in the
other, and in at least one it isn't even talking about "case" at all, as
that term is commonly understood. The methods should then be renamed to
make it clear what they are really talking about -- at least the ones
that aren't really talking about "case". In this case (no pun intended),
that set apparently includes toUpperCase, which seems to make other
transformations than capital letter substitution, and should maybe be
named toAllCapsTitle, with a more logically-behaving toUpperCase also
made available.
Ian Wilson - 16 Jan 2007 16:56 GMT
>> AFAIK, toUpperCase() follows the socially-determined locale rules.
>>  What is the upper case of "beißen" in German? (E.g., what would a
>>  German newspaper do?)
>
> Well, it certainly shouldn't actually use a different spelling.

AIUI, it has to, since there is no uppercase version of the lowercase
ß ligature. The uppercase equivalent of the ß ligature "character" is
the two characters SS.
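
This is easy to check. A minimal sketch (the class and method names are made up; the \u00DF escape stands for ß so that the source file encoding does not matter):

```java
import java.util.Locale;

class SharpSDemo {
    static final String LOWER = "bei\u00DFen"; // "beißen", 6 characters

    // Upper-casing replaces the single ß with the two characters SS.
    static String upper() {
        return LOWER.toUpperCase(Locale.GERMAN);
    }

    // Lower-casing the result does not bring the ß back.
    static String roundTrip() {
        return upper().toLowerCase(Locale.GERMAN);
    }

    public static void main(String[] args) {
        System.out.println(upper());                           // BEISSEN
        System.out.println(upper().length() - LOWER.length()); // 1
        System.out.println(LOWER.equalsIgnoreCase("BEISSEN")); // false
        System.out.println(roundTrip());                       // beissen
    }
}
```

equalsIgnoreCase() can never match the two, because strings of different lengths are never equal to it.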

> Would an American newspaper use "color" in article text but "COLOUR"
> in headlines? :)

They might use the 6 character a\uFB04uent in article text but the 8
character AFFLUENT in headlines.
Stefan Ram - 16 Jan 2007 18:15 GMT
>AIUI, it has to, since there is no uppercase version of the lowercase
>ß ligature. The uppercase equivalent of the ß ligature "character" is
>the two characters SS.

 Yes.

 There used to be another rule, requesting to use »SZ« instead,
 when »SS« would be ambiguous. For example,

     »Das Rechnen mit Massen beherrschen«
     »DAS RECHNEN MIT MASSEN BEHERRSCHEN«

 »TO BE PROFICIENT IN CALCULATIONS WITH INVOLVING MASSES«

     »Das Rechnen mit Maßen beherrschen«
     »DAS RECHNEN MIT MASZEN BEHERRSCHEN«

 »TO BE PROFICIENT IN CALCULATIONS WITH INVOLVING MEASURES«

 The »amtliche Regelung« for the language to be used in public
 Schools now is specifying that the uppercase spelling of »ß«
 always is »SS«. But according to polls only 19 % of the
 population use this regulation - most of them should be
 teachers or pupils. Everyone out of school or contracts is
 free to choose the regulation he wants to adhere to.
 Therefore an unknown part of the population might use the
 SZ-rule, although it would be deemed wrong in a public school.
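
 As a rough sketch of what the SZ rule would look like in Java (the
 helper name upperWithSz is hypothetical and not part of any library;
 it simply substitutes before calling the standard toUpperCase):

```java
import java.util.Locale;

final class SharpSUpperCase {
    // Hypothetical helper: applies the older SZ convention instead of the SS mapping.
    static String upperWithSz(String s) {
        return s.replace("\u00DF", "SZ").toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Standard mapping: both words end up spelled the same.
        System.out.println("Massen".toUpperCase(Locale.ROOT));      // MASSEN
        System.out.println("Ma\u00DFen".toUpperCase(Locale.ROOT));  // MASSEN
        // SZ convention: the two words stay distinguishable.
        System.out.println(upperWithSz("Ma\u00DFen"));              // MASZEN
    }
}
```

 Whether one wants this substitution always or only for ambiguous
 words is a policy question the sketch does not try to answer.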

 There are also some official regulations regarding telegraphy
 of the administration which demand to use »SZ« when »SS« might
 introduce ambiguity as of now. (According to a recent Usenet
 post, which I can not find now.)

 A Usenet post from 1997 claims that »sz« is always used for
 »ß« in certain messages of the »Bundeswehr« (German Federal
 Armed Forces) and by news agencies. Another usenet posting
 claims that this spelling is to be used for labels in
 technical drawings of a certain company. So it still seems to
 be used when avoiding ambiguity matters.

 Sometimes, the letter »B« is used, because it vaguely looks
 like »ß«. This is considered wrong, but for fun some people
 even use it in pronunciation, e.g., speaking of a »StraBe«
 (from »Straße« - »street«).
Stefan Ram - 16 Jan 2007 18:21 GMT
>»TO BE PROFICIENT IN CALCULATIONS WITH INVOLVING MASSES«

 .replace( "WITH ", "" )

>Schools now is specifying that the uppercase spelling of »ß«

 .replace( "Sc", "sc" )

 Sorry, I /have/ proof-read my post. But I only spot errors
 after posting it. (I do not dare to think of the error I
 still have not found or I made within this new post.)
Chris Uppal - 16 Jan 2007 20:06 GMT
>   There used to be another rule, requesting to use »SZ« instead,
>   when »SS« would be ambiguous. For example, [...]

Interesting.

And takes the complexity of case-mapping into an entirely different -- word and
meaning sensitive -- direction.

It would be "nice" if we had similar rules in English.  (Not too long ago our
government was trying to introduce VAT on books, and there was a popular
campaign opposed to it.  The local branch of one bookshop had large "Don't Tax
Reading" posters up everywhere, and I rather enjoyed the ambiguity since
"Reading" (pronounced red-ing) was the name of the town....)

   -- chris
John Ersatznom - 18 Jan 2007 22:31 GMT
> They might use the 6 character a\uFB04uent in article text but the 8
> character AFFLUENT in headlines.

??

Encoding not apparently supported at my end, sorry.

Regardless, the correct way to go about doing things is to have the
usual string representation (e.g. "affluent") under the hood, however
it's actually rendered. Representation and presentation are *supposed*
to be kept separate -- that's why we invented things like CSS, also.
Lew - 18 Jan 2007 23:37 GMT
>> They might use the 6 character a\uFB04uent in article text but the 8
>> character AFFLUENT in headlines.
[quoted text clipped - 7 lines]
> it's actually rendered. Representation and presentation are *supposed*
> to be kept separate -- that's why we invented things like CSS, also.

The usual String representation of a word spelled with a ligature character
will be with the ligature character in the spelling, not with the equivalent
double-character pair. As has been stated multiple times in this thread, Java
Strings have no native construct for "word", nor for "correct spelling", nor
for what you or I think should happen. Rather, they have an adaptation of what
the Unicode folks determined should happen.

The reasoning presented in this thread has convinced me that the shortcoming
is not in toUpperCase() or toLowerCase() but in equalsIgnoreCase(), and not in
a language's own practice of how to case-convert ligatures, as "ß" to "SS" or
"SZ", but in the use of UTF-16 encoding internally.

While I agree with your statement that "[r]epresentation and presentation are
*supposed* to be kept separate", it is clear that the representation of "ß"
should be a character for "ß", and not for "ss" nor "sz". In this domain of
discourse, the character represented as "ß" may be presented as "ß", or as
"\u00DF", or "?", as the system warrants. The upper-case transformation of "ß"
is represented by "SS". That's a fact in German, it's a fact in Unicode, and
it's a fact in Java.

So, in fact, what you describe as "the correct way to go about doing things"
is, in fact, what is actually in reality happening. The "usual", in fact, the
*correct* (within the limits of UTF-16) representation is what's "under the
hood, however it's actually rendered". In fact.

- Lew
Lew - 18 Jan 2007 23:42 GMT
>> They might use the 6 character a\uFB04uent in article text but the 8
>> character AFFLUENT in headlines.
>
> ??

> Encoding not apparently supported at my end, sorry.

Ian's quoted snippet uses all 7-bit characters, so it is likely not an
encoding issue on your end.

Ian's point was that the ligature character '\uFB04' would be upper-cased to
"FF" even in the non-computer, social context of newspaper headlines. What
Java does is an echo of that.

- Lew
Lew - 18 Jan 2007 23:43 GMT
> Ian's point was that the ligature character '\uFB04' would be
> upper-cased to "FF" even in the non-computer, social context of
> newspaper headlines. What Java does is an echo of that.
>
> - Lew
Er, "FFL".
Ian Wilson - 19 Jan 2007 10:17 GMT
>> [American newspapers] might use the 6 character a\uFB04uent in article text but the 8
>> character AFFLUENT in headlines.
>
> ??
>
> Encoding not apparently supported at my end, sorry.

Apparently you are wrong.

The quoted text above is all encoded in ASCII. It wasn't intended to
present an ffl ligature on your screen. It contains an ASCII
representation of the Unicode code-point for an ffl ligature, in a form
(\uXXXX) that should be familiar to readers of Java newsgroups.

My message had this encoding:
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Let's look at your headers ..
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Your headers also indicate you're using Thunderbird, as am I.

ISO-8859-1 is a superset of ASCII, so has no problems with the ASCII
text of my message.

Your newsreader only needs to be ASCII compatible to display the six
ASCII characters backslash u F B zero four.

The encodings I used ARE supported at your end, either in Thunderbird or
in Java.
Chris Uppal - 15 Jan 2007 18:48 GMT
> > That is how equalsIgnoreCase() works:
> >
> > "beißen".equalsIgnoreCase("BEISSEN"): false
>
> Well, then, either Wong is completely nuts, or we're using different JDK
> versions (1.6 here),

You mean you've tried this and found that your version gives different results?
I find that hard to believe unless it's a side effect of attempting to use
non-ASCII characters in the input to javac.  Try being explicit about using the
Unicode character (well, UTF16 value).

   public class Test
   {
       public static void
       main(String[] args)
       {
           System.out.println("bei\u00DFen -> " + "bei\u00DFen".toUpperCase());
           System.out.println("BEISSEN".equalsIgnoreCase("bei\u00DFen"));
           System.out.println("BEISSEN".equals("bei\u00DFen".toUpperCase()));

           // or equivalently, but using octal string escapes
           System.out.println("bei\337en -> " + "bei\337en".toUpperCase());
           System.out.println("BEISSEN".equalsIgnoreCase("bei\337en"));
           System.out.println("BEISSEN".equals("bei\337en".toUpperCase()));
       }
   }

(Tested on 1.4.2, 1.5.0, and 1.6.0)

> or (seems least likely) toUpperCase actually alters
> the spelling of some words(!) rather than just changing a-z to A-Z
> (likewise accented equivalents) while leaving the rest alone.

That sounds as if you /haven't/ actually tried it.  (Nor read the documentation
for String.toUpperCase() which expounds on this subject).

String.toUpperCase() does /not/ change the spelling of words (how could it, it
doesn't know anything about words ?).  What it does follow are the correct
(insofar as the Unicode spec is correct) rules for mapping lowercase to
uppercase.  It produces the /same/ word with the /same/ spelling[*], but
(naturally) a different representation.  In this case the number of visually
separable glyphs changes because the U+00DF character (LATIN SMALL LETTER SHARP
S) is a ligature of two logical characters, long s and short s (U+017F and
U+0073), there is no upper case ligature for that combination (compare fi and
FI in English typography), so the correct uppercase version of those (logical)
characters is the sequence SS.  (At least that's the theory the Unicode people
seem to be operating on -- they know more about it than me so I'm willing to
believe them).

It is simply erroneous to expect String.toUpperCase() to map characters
one-to-one in the way that English case mapping works.  It can't, it isn't
supposed to, and it doesn't...

String.equalsIgnoreCase(), on the other hand, is badly broken in that it does
/not/ follow those rules.  Or, since its behaviour is clearly documented,
perhaps "broken" is too strong a term -- "badly misleading" might be preferred.
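
If a comparison that agrees with toUpperCase() is wanted, one possible workaround is sketched below. The names FullCaseCompare, fold() and equalsIgnoreCaseFull() are invented for this example; nothing like them exists in the JDK. Upper-casing and then lower-casing is only an approximation of proper Unicode case folding, so treat it as a starting point rather than a finished answer.

```java
import java.util.Locale;

final class FullCaseCompare {
    // Hypothetical helper: approximates a case fold by upper-casing and then
    // lower-casing the whole string, so multi-character mappings are applied.
    static String fold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    // Hypothetical replacement for equalsIgnoreCase that compares folded forms.
    static boolean equalsIgnoreCaseFull(String a, String b) {
        return fold(a).equals(fold(b));
    }

    public static void main(String[] args) {
        String lower = "bei\u00DFen";
        String upper = "BEISSEN";

        System.out.println(lower.equalsIgnoreCase(upper));        // false
        System.out.println(equalsIgnoreCaseFull(lower, upper));   // true
    }
}
```

Note that this allocates two new strings per argument, which the char-by-char equalsIgnoreCase() avoids.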

   -- chris

[*] Arguably the concept "same spelling" is flawed in the context of Unicode
case mapping.
John Ersatznom - 16 Jan 2007 16:46 GMT
> String.toUpperCase() does /not/ change the spelling of words (how could it, it
> doesn't know anything about words ?).  What it does follow are the correct
[quoted text clipped - 8 lines]
> seem to be operating on -- they know more about it than me so I'm willing to
> believe them).

This seems to be excessively technical when the matter under discussion
is simply capitalizing strings. In any event, equalsIgnoreCase should
collapse these "ligatures" of yours as well. Also, I don't notice "fi"
and "FI" producing strange behavior myself -- even if the letters are
often run together so the 'i' hasn't got a separate dot *when typeset*,
this doesn't affect the representation of a string in a computer, only
the visually displayed output (and then usually only when serious
typesetting software is used). Likewise, it makes sense to represent any
other logical sequence of characters in a sensible way under the hood,
regardless of any rendering fanciness that is done when presenting them
to the user.

> It is simply erroneous to expect String.toUpperCase() to map characters
> one-to-one in the way that English case mapping works.  It can't, it isn't
> supposed to, and it doesn't...

No, it is not erroneous to expect a method to do exactly and only what
its name implies. It is erroneous, of course, to give a method a name
that is misleading. If toUpperCase needs a lengthy documentation block
explaining why its behavior is surprising, then it's a sure bet that it
should not have been named that, since it's apparently really
toUpperCaseAndDoesSomeExtraStuffToo.

> String.equalsIgnoreCase(), on the other hand, is badly broken in that it does
> /not/ follow those rules.

So you at least agree with me that it should be consistent with
toUpperCase (and toLowerCase) -- all strings should have a single
canonical toUpperCase, a single canonical toLowerCase, both should
define equivalence classes on the mixed-case input strings, these should
be the SAME equivalence class, and equalsIgnoreCase should implement and
embody the corresponding equivalence relation.

> Or, since its behaviour is clearly documented,
> perhaps "broken" is too strong a term -- "badly misleading" might be preferred.

It sounds like toUpperCase has a "badly misleading" name since it
(supposedly) does transformations that go well beyond what is normally
meant by everyday blokes by "to upper case", and the method name is
supposed to be a reasonably meaningful capsule summary for everyday
blokes of what the method does. If a method is supposed to do behavior
that's surprising for any English speaker but not for a German speaker,
maybe it should have a German rather than an English name? :) If it's
supposed to do locale-dependent stuff, then it should have a version
that accepts a Locale object. The version that doesn't shouldn't
surprise English speakers; the version that does shouldn't surprise
anyone familiar with its locale-specific behavior for the locale
actually used. Having locale-dependent behavior invoked randomly without
explicit use of Locale objects, and which furthermore doesn't use the
system locale, is by itself a sign of a questionable design as well as a
sure source of bugs and problems.

I've even encountered somewhere a notion that aString.length() is not
even accurate in current Java versions if a string contains obscure
characters. It suggests aString.<something using the obscure term "code
point", apparently just Unicode-geek for "character"> as its
replacement, while of course there's a ton of legacy code using
length(). I don't suppose it occurred to them that the new fancy-whosit
should have been a replacement length() implementation instead of some
new name that doesn't suggest anything to do with the length of a string
to someone who doesn't care about all the Unicode bells and whistles and
just wants to process strings while remaining agnostic about what they
are ultimately used for or contain? Those users will gravitate to
length() (plus all that legacy code), not caring about the actual
storage length of the internal representation but the length in
characters of their data as a general rule. So there should be a
length() method that returns the true length of the string, and if
necessary a getSize() method that returns the representation's size in
bytes or whatever in case someone needs such low level data. (If they
persist strings as UTF-8 in a text format file that is parsed, or use
serialization, then they don't.)
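
For reference, the method being alluded to is presumably String.codePointCount(int, int), available since Java 5. A small sketch of the difference follows; the class name LengthVsCodePoints and the helper codePoints() are made up for the example, and U+1D538 is just one arbitrary character outside the Basic Multilingual Plane.

```java
final class LengthVsCodePoints {
    // Counts Unicode code points rather than UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'x' followed by U+1D538, which needs a surrogate pair in UTF-16.
        String s = new StringBuilder("x").appendCodePoint(0x1D538).toString();

        System.out.println(s.length());     // 3 (UTF-16 code units)
        System.out.println(codePoints(s));  // 2 (code points)
    }
}
```

For strings containing only Basic Multilingual Plane characters the two numbers are the same.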

> [*] Arguably the concept "same spelling" is flawed in the context of Unicode
> case mapping.

A concept like "same spelling" can't be flawed. It's generally accepted
that "color" and "colour" are the same word, but have different
spellings, right? While "two" and "too" are different words spelled
differently that sound the same, "tomato" and "tomato" are the same word
spelled the same but pronounced differently, and "ant" (the bug) and
"ant" (the build tool) are different words both spelled and pronounced
the same.
Oliver Wong - 16 Jan 2007 17:45 GMT
> This seems to be excessively technical when the matter under discussion is
> simply capitalizing strings.

   The above sentence, as perceived by a linguist, is probably akin to the
statement "This seems to be excessively technical, when the matter under
discussion is simply not putting bugs into our software in the first place."
stated by a pointy-haired boss, as perceived by a typical programmer.

[...]

>> It is simply erroneous to expect String.toUpperCase() to map characters
>> one-to-one in the way that English case mapping works.  It can't, it isn't
>> supposed to, and it doesn't...
>
> No, it is not erroneous to expect a method to do exactly and only what its
> name implies.

   Note that the name of the method is not
"String.mapCharactersOneToOneInTheWayThatEnglishCaseMappingWorks()" but
rather "String.toUpperCase()". Perhaps due to your limited exposure to
languages (e.g. only English), you are unable to conceive of scenarios where
converting a text from lower case to uppercase might not work in the same
way that it does in English? That is why I gave an example to you, and
repeatedly asked you not to simply take my word for it, but to run it yourself,
to see what the results were.

> It is erroneous, of course, to give a method a name that is misleading. If
> toUpperCase needs a lengthy documentation block explaining why its
> behavior is surprising, then it's a sure bet that it should not have been
> named that, since it's apparently really
> toUpperCaseAndDoesSomeExtraStuffToo.

   I believe that having "ß".toUpperCase() yield "SS" is surprising only to
those who are unfamiliar with the ß character. Probably to most German
speakers, this behaviour is very non-surprising, and in fact, expected.

[...]

> It sounds like toUpperCase has a "badly misleading" name since it
> (supposedly) does transformations that go well beyond what is normally
> meant by everyday blokes by "to upper case", and the method name is
> supposed to be a reasonably meaningful capsule summary for everyday blokes
> of what the method does.

   I think "everyday blokes" are unqualified to have any expectations of
what the concepts of upper case and lower case mean in an international
setting. These blokes may have a good idea of what these concepts mean in
their particular language, but unless they are linguists, they probably have
no idea what these concepts might mean in other languages. Such blokes are
probably unqualified to request that the linguists and the unicode
consortium redefine their concept of uppercase and lowercase to suit said
blokes.

   Similarly, an everyday bloke might be surprised about the output of the
following Java program:

<code>
public class Test {
public static void main(String args[]) {
 System.out.println(0.1 + 0.2 == 0.3);
}
}
</code>

   But unless said bloke studied numerical computing, or at the very least,
has an understanding of the binary representation system for numbers, said
bloke is probably unqualified to request that the computer scientists and
IEEE redefine floating point computation to suit said blokes.

> If a method is supposed to do behavior that's surprising for any English
> speaker but not for a German speaker, maybe it should have a German rather
> than an English name? :)

   I claim that there exists at least one English speaker for which its
behaviour is not surprising (me).

> If it's supposed to do locale-dependent stuff, then it should have a
> version that accepts a Locale object.

   It does. See the JavaDocs.
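
   A short sketch of the Locale-taking overload, in case it helps. The
class name LocaleUpperCase and the helper upper() are invented for the
example. The no-argument toUpperCase() is documented as equivalent to
toUpperCase(Locale.getDefault()), so it does use the system locale.

```java
import java.util.Locale;

final class LocaleUpperCase {
    // Thin wrapper around the Locale-taking overload of toUpperCase.
    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        System.out.println(upper("title", Locale.ROOT));          // TITLE
        // In Turkish, lowercase i maps to dotted capital I (U+0130).
        System.out.println(upper("title", turkish));
        System.out.println(upper("bei\u00DFen", Locale.GERMAN));  // BEISSEN
    }
}
```

   The sharp s mapping is not tied to the German locale; Locale.ROOT
gives the same "BEISSEN".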

> The version that doesn't shouldn't surprise English speakers;

   It doesn't surprise me.

   Are you basically saying that it should not surprise ANY English
speaker? What if I had a cousin, "Surprised Sally" we call her, who is
surprised at everything. And she's an English speaker. No matter what the
implementation of toUpperCase is, it would surprise her.

   Or are you basically saying that it should not surprise *you*? If so,
then maybe you should apply for a position on the unicode consortium, so
that when the next version of Unicode comes out (6.0?), perhaps you will
have exerted enough influence on the standard such that toUpperCase will no
longer surprise you.

> the version that does shouldn't surprise anyone familiar with its
> locale-specific behavior for the locale actually used. Having
> locale-dependent behavior invoked randomly without explicit use of Locale
> objects, and which furthermore doesn't use the system locale, is by itself
> a sign of a questionable design as well as a sure source of bugs and
> problems.

   What locale were you using, and what did you expect the uppercase form
of "ß" to be in that locale?

[...]

>> [*] Arguably the concept "same spelling" is flawed in the context of
>> Unicode
[quoted text clipped - 3 lines]
> that "color" and "colour" are the same word, but have different spellings,
> right?

   You'll have to define the terms "spelling" and "word" outside of the
context of any one particular language (e.g. you can't assume only the Latin
alphabet) before I can agree or disagree with your claim.

> While "two" and "too" are different words spelled differently that sound
> the same, "tomato" and "tomato" are the same word spelled the same but
> pronounced differently

   Ditto.

> and "ant" (the bug) and "ant" (the build tool) are different words both
> spelled and pronounced the same.

   Could we possibly get a bigger hint? =P

   - Oliver
Martin Gregorie - 16 Jan 2007 22:03 GMT
>     Similarly, an everyday bloke might be surprised about the output of the
> following Java program:
[quoted text clipped - 11 lines]
> bloke is probably unqualified to request that the computer scientists and
> IEEE redefine floating point computation to suit said blokes.

An ordinary bloke might be surprised but any programmer who, in the last
40 years, would test equality that way rather than this:

   if (Math.abs((0.1 + 0.2) - 0.3) < 0.05)
   {
       // 0.05 is an arbitrary constant: its value depends
       // on the value of the least significant digit in the
       // numbers being compared. It should be half the value
       // of the LSD.
       System.out.println("Equal");
   }

doesn't know his trade. This isn't some numeric esotericism: it is basic
knowledge about the representation of real numbers and is absolutely
required of anybody handling real number computation.

Using a simple equality is every bit as inexcusable as using floats or
doubles to hold monetary values. Both mistakes result from the same
misunderstanding.
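
For decimal quantities such as money, one commonly used alternative is java.math.BigDecimal built from strings. A minimal sketch follows; the class name ExactDecimalSum and the helper sumsTo() are made up for illustration.

```java
import java.math.BigDecimal;

final class ExactDecimalSum {
    // Adds two decimal literals exactly and compares the sum with a third.
    // compareTo is used rather than equals so that scale differences are ignored.
    static boolean sumsTo(String a, String b, String expected) {
        BigDecimal sum = new BigDecimal(a).add(new BigDecimal(b));
        return sum.compareTo(new BigDecimal(expected)) == 0;
    }

    public static void main(String[] args) {
        System.out.println(0.1 + 0.2 == 0.3);             // false
        System.out.println(sumsTo("0.1", "0.2", "0.3"));  // true
    }
}
```

The string constructor matters here: new BigDecimal(0.1) would carry over the binary approximation of the double.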


martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |

John Ersatznom - 18 Jan 2007 22:43 GMT
> That is why I gave an example to you, and
> repeatedly asked you not to simply take my word for it, but to run it yourself,
> to see what the results were.

What you have not done is explain why you attacked one of my posts
earlier in the thread. That is what started this whole sideline, which
is irrelevant to the OP's problem.

>     I believe that having "ß".toUpperCase() yield "SS" is surprising only to
> those who are unfamiliar with the ß character. Probably to most German
> speakers, this behaviour is very non-surprising, and in fact, expected.

What is surprising (and violates the Principle of Least Surprise) is the
following:

x.toFooCase().equals(y.toFooCase()) != x.equalsIgnoreCase(y)
x.toFooCase().length() != x.length()

for some choices of x, y, and Foo.

You may argue that it is equalsIgnoreCase that is broken, but that still
doesn't resolve the issue that strings might *change length*
unexpectedly as well.

>     I think "everyday blokes" are unqualified to have any expectations of
> what the concepts of upper case and lower case mean in an international
[quoted text clipped - 20 lines]
> bloke is probably unqualified to request that the computer scientists and
> IEEE redefine floating point computation to suit said blokes.

I don't think this is relevant here. Someone familiar with FP math won't
be surprised by the behavior of the above. But a programmer using
toUpperCase on strings to key a hash table for case-insensitive lookup
is going to be surprised if they do weird things like change length,
compare equal for strings that aren't equalsIgnoreCase(), and the like.
Remember, most programmers a) are English speaking and b) have
backgrounds in various programming languages, often including ones with
ASCII string classes and case-transforming methods that behave in the
"usual" way -- that is, each output letter corresponds to 1 input letter
under a fairly basic transformation rule.

Principle of Least Surprise is being violated.
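
To make the hash-table scenario concrete, here is a sketch comparing two ways of getting case-insensitive keys. The class name CaseInsensitiveKeys and the helper key() are invented for the example; String.CASE_INSENSITIVE_ORDER is the standard JDK comparator, which compares char by char.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class CaseInsensitiveKeys {
    // Hypothetical key normaliser: upper-cases with a fixed locale.
    static String key(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Keys normalised with toUpperCase: the sharp s spelling and the
        // SS spelling collapse into one entry.
        Map<String, Integer> byUpper = new HashMap<>();
        byUpper.put(key("stra\u00DFe"), 1);
        byUpper.put(key("STRASSE"), 2);
        System.out.println(byUpper.size());       // 1

        // Keys compared char by char: the two spellings stay separate,
        // while "Strasse" and "STRASSE" still share an entry.
        Map<String, Integer> byComparator = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        byComparator.put("stra\u00DFe", 1);
        byComparator.put("STRASSE", 2);
        byComparator.put("Strasse", 3);
        System.out.println(byComparator.size());  // 2
    }
}
```

Which of the two behaviours is the right one depends on the application; the point is only that they differ.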

>     I claim that there exists at least one English speaker for which its
> behaviour is not surprising (me).

Yes, but you're weird, and apparently multilingual rather than
*unilingual English*.

>>If it's supposed to do locale-dependent stuff, then it should have a
>>version that accepts a Locale object.
>
>     It does. See the JavaDocs.

In which case the version that doesn't shouldn't behave in a surprising
way, unless your system default locale is surprising, and of course THAT
shouldn't happen.

>>A concept like "same spelling" can't be flawed. It's generally accepted
>>that "color" and "colour" are the same word, but have different spellings,
[quoted text clipped - 3 lines]
> context of any one particular language (e.g. you can't assume only the Latin
> alphabet) before I can agree or disagree with your claim.

It suffices to mention the axiom that words with different numbers of
letters are spelled differently. So if x.length() != y.length() (excuse
me, codePointCount :P) then x and y are spelled differently.

Or are you now going to claim that the same spelling can have different
lengths? (Encodings such as zipping the text up, UTF-8 etc. don't count.)
Lew - 18 Jan 2007 23:47 GMT
> What is surprising (and violates the Principle of Least Surprise) is the
> following:

The documented behavior of the Java API methods String.toUpperCase() and
String.toLowerCase() is completely unsurprising, at least to a practitioner of
the Java art. Arguing that it should differ from what it is will yield no
sweet fruit. It is what it's supposed to be.

- Lew
John W. Kennedy - 19 Jan 2007 04:52 GMT
> I don't think this is relevant here. Someone familiar with FP math won't
> be surprised by the behavior of the above. But a programmer using
[quoted text clipped - 6 lines]
> "usual" way -- that is, each output letter corresponds to 1 input letter
> under a fairly basic transformation rule.

Ineducable.

*PLONK*


John W. Kennedy
"The blind rulers of Logres
Nourished the land on a fallacy of rational virtue."
  -- Charles Williams.  "Taliessin through Logres: Prelude"

Oliver Wong - 19 Jan 2007 17:22 GMT
>> That is why I gave an example to you, and repeatedly asked you not to
>> simply take my word for it, but to run it yourself, to see what the results
[quoted text clipped - 3 lines]
> in the thread. That is what started this whole sideline, which is
> irrelevant to the OP's problem.

   I fear I'm going to open up a whole can of twisty little worms with this
one, but... Can you cite what it is I said that you consider to be an
"attack"?

>>     I believe that having "ß".toUpperCase() yield "SS" is surprising only
>> to those who are unfamiliar with the ß character. Probably to most German
[quoted text clipped - 7 lines]
>
> for some choices of x, y, and Foo.

   If you are not surprised by the fact that "ß".toUpperCase() yield "SS",
then you should not be surprised that there exists some values for x such
that x.toUpperCase().length() != x.length().

[Snip "everyday blokes" argument]

> I don't think this is relevant here.

   The relevancy is thus: You claim that the behaviour of toUpperCase
should change because it's surprising to every day blokes. I am arguing that
this is not a valid reason for changing the behaviour of toUpperCase,
because every day blokes, not being linguists, are unqualified to make
linguistic rules that may have widespread implication for languages other
than their own.

[...]
> Remember, most programmers a) are English speaking and b) have backgrounds
> in various programming languages, often including ones with ASCII string
> classes and case-transforming methods that behave in the "usual" way --  
> that is, each output letter corresponds to 1 input letter under a fairly
> basic transformation rule.

   Are you sure about these assertions? Do you not think that there might
be more Chinese/Japanese programmers than English programmers, given the
huge population of Asia as compared to the western countries, and the recent
economic growth spurt in Asia? And what about India?

>>     I claim that there exists at least one English speaker for which its
>> behaviour is not surprising (me).
>
> Yes, but you're weird, and apparently multilingual rather than *unilingual
> English*.

   I claim I am not the only programmer in the world who is unilingual
English.

[...]

>>>A concept like "same spelling" can't be flawed. It's generally accepted
>>>that "color" and "colour" are the same word, but have different
[quoted text clipped - 6 lines]
> It suffices to mention the axiom that words with different numbers of
> letters are spelled differently.

   Two issues:

   (1) Your axiom fails to satisfy my requirement that your definition must
be outside the context of any one particular language. Chinese characters,
for example, are not composed of letters, and so speaking about "number of
letters in a word" is meaningless there.

   (2) That wasn't what I was reluctant to agree with anyway. I am not
arguing against the idea that "color" and "colour" are spelt differently.
However, I *AM* arguing against the idea that "color" and "colour" are the
same word (depending on your definition of "word" which I am awaiting), and
I am arguing against the idea that "a concept like 'same spelling' can't be
flawed" (depending on your definition of spelling, which I am awaiting).

   Recall that there exist languages where words are not written using
letters. So any definition of "spelling" which depends on "letters" is
inherently flawed.

   - Oliver
Mark Thornton - 19 Jan 2007 19:30 GMT
> What is surprising (and violates the Principle of Least Surprise) is the
> following:
[quoted text clipped - 3 lines]
>
> for some choices of x, y, and Foo.

The trouble is that some (human) languages are evidently surprising to
those not aware of them. Java can't change the fact that German and
Georgian exist, nor can it change how these languages behave. For me, to
not uppercase ß as SS would be surprising. (Although English is my
native tongue, I did learn German at school some 30 years ago.)

> x.toFooCase().equals(y.toFooCase()) != x.equalsIgnoreCase(y)

I believe this problem arises because some languages effectively have
more than two cases. An identity that seems obvious in a two case world,
ceases to be meaningful in a more complex situation.
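
One concrete instance of a third case, assuming that is the kind of thing meant here, is Unicode's titlecase form of certain digraphs. A small sketch using only java.lang.Character (the class name ThreeCases and the helper titleOf() are made up):

```java
final class ThreeCases {
    // Returns the titlecase form of a char, which may differ from its uppercase form.
    static char titleOf(char c) {
        return Character.toTitleCase(c);
    }

    public static void main(String[] args) {
        char lower = '\u01C6';   // lowercase dz-with-caron digraph
        char title = titleOf(lower);
        char upper = Character.toUpperCase(lower);

        System.out.println(Integer.toHexString(title));            // 1c5
        System.out.println(Integer.toHexString(upper));            // 1c4
        System.out.println(Character.isTitleCase(title));          // true
        System.out.println(Character.toLowerCase(title) == lower); // true
    }
}
```

Here the lowercase, titlecase and uppercase forms are three different characters.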

Mark Thornton
John W. Kennedy - 16 Jan 2007 18:56 GMT
> This seems to be excessively technical when the matter under discussion
> is simply capitalizing strings. In any event, equalsIgnoreCase should
> collapse these "ligatures" of yours as well. Also, I don't notice "fi"
> and "FI" producing strange behavior myself -- even if the letters are
> often run together so the 'i' hasn't got a separate dot *when typeset*,
> this doesn't affect the representation of a string in a computer,

It does if Unicode U+FB01 is used.

Look, you are /way/ out of your depth on this. All you're doing is
making repeated assertions about the way things "ought to" work, when in
plain fact they don't work that way, and aren't supposed to. Please
either get a book about Unicode and read it through, or else drop the
subject.

public class FI {
    public static void main(String[] args) {
        System.out.println("\uFB01".toUpperCase()); // Result: "FI"
    }
}


John W. Kennedy
"The blind rulers of Logres
Nourished the land on a fallacy of rational virtue."
  -- Charles Williams.  "Taliessin through Logres: Prelude"

John Ersatznom - 18 Jan 2007 22:47 GMT
>> This seems to be excessively technical when the matter under
>> discussion is simply capitalizing strings. In any event,
[quoted text clipped - 7 lines]
>
> Look, you are /way/ out of your depth on this.

Maybe so, but I was *dragged down* by people piling onto my earlier,
innocuous posting. What do you want me to do, simply concede and let you
win? Why was I attacked to begin with?

I checked the history of this thread again and saw that it started with
a post by one Oliver Wong. I then googled this bloke, and found in this
same newsgroup a thread of around 500 articles half of them authored by
him. I get the impression he's an extremely argumentative, arrogant and
condescending man whose primary mission in life is to find postings in
this newsgroup and attack them accusing the author of making mistakes if
he finds anything in them that differs in the slightest from his
personal beliefs.

That is not a useful way to discuss things, and serves only to put
various people on the defensive and start long argumentative threads
apropos of nothing. He really should cut it out, and I think I may just
go and killfile him now, along with this thread and any others that he
has polluted with his incessant pedantry and unsolicited criticism.
Chris Uppal - 19 Jan 2007 02:20 GMT
> I checked the history of this thread again and saw that it started with
> a post by one Oliver Wong. I then googled this bloke, and found in this
[quoted text clipped - 4 lines]
> he finds anything in them that differs in the slightest from his
> personal beliefs.

Look, we all know who you are.

I, personally, was willing to assume that your new nom de plume reflected a
desire on your part to start afresh here, without the baggage of your previous
(occasionally atrocious) behaviour.  I have been, despite slight misgivings,
happy to interact with "John Ersatznom" as if he were a brand new member of
this community.

I would /still/ be willing to act on that assumption, even if you want to
provoke acrimonious dispute (though somehow I doubt if you'd find it easy to
persuade Oliver to join in), but this kind of glove-puppetry is just sickening.
The point is not the slur against Oliver (although I respect him, and don't
want to see him slagged off, I respect him enough to think that he can look
after himself in these matters) but the above quoted paragraph is an insult to
every reader's intelligence.

How, or even whether, other people choose to react is their affair, but you
have passed the bounds of /my/ tolerance.

   -- chris
Lew - 19 Jan 2007 13:49 GMT
John Ersatznom wrote:

>> I checked the history of this thread again and saw that it started with
>> a post by one Oliver Wong. I then googled this bloke, and found in this
[quoted text clipped - 4 lines]
>> he finds anything in them that differs in the slightest from his
>> personal beliefs.

> Look, we all know who you are.

I *thought* so!

- Lew
Chris Uppal - 16 Jan 2007 19:50 GMT
[me:]
> > String.toUpperCase() does /not/ change the spelling of words (how could
> > it, it doesn't know anything about words ?).  What it does follow are
[quoted text clipped - 12 lines]
> This seems to be excessively technical when the matter under discussion
> is simply capitalizing strings.

'fraid not.  Case mapping is /NOT SIMPLE/, it never has been simple, and never
will be.  The fact that case mapping in English /is/ simple is neither here nor
there.  That fact has misled many English-speaking programmers into making
invalid assumptions about the complexity of case mapping (and other
orthographical operations), and in the process either creating software which
is inherently broken (in implementation or API design) or which is restricted
to English text.  One example of that unfortunate process is
String.equalsIgnoreCase() -- which would be better named something like
equalsWhileIgnoringCaseAccordingToTheRulesOfEnglish(), except that it doesn't
actually implement the contract implied by that name /either/.  In fact there
is no sensible name for what String.equalsIgnoreCase() does.
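
A minimal sketch of the mismatch, assuming a reasonably recent JDK (the class
and method names are made up for illustration).  toUpperCase() applies the full
Unicode case mappings, so one char can become two, while equalsIgnoreCase()
compares char by char and so can never match strings of different lengths:

```java
import java.util.Locale;

// Illustrative class name; not taken from the thread.
class CaseMappingSketch {
    // Locale.ROOT keeps the result independent of the machine's default locale.
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String sharpS = "\u00DF"; // LATIN SMALL LETTER SHARP S
        String fiLig = "\uFB01";  // LATIN SMALL LIGATURE FI

        // Full case mapping: one char in, two chars out.
        System.out.println(upper(sharpS).length());      // 2
        System.out.println(upper(fiLig).equals("FI"));   // true

        // equalsIgnoreCase() works one char at a time, so it disagrees.
        System.out.println(sharpS.equalsIgnoreCase("SS"));      // false
        System.out.println(upper(sharpS).equals(upper("SS")));  // true
    }
}
```

Comparing the upper-cased forms is only a rough stand-in for proper case
folding, which the String class does not offer directly.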

> Also, I don't notice "fi"
> and "FI" producing strange behavior myself -- even if the letters are
> often run together so the 'i' hasn't got a separate dot *when typeset*,
> this doesn't affect the representation of a string in a computer, only
> the visually displayed output (and then usually only when serious
> typesetting software is used)

That is a fair criticism of the Unicode position.  It may even be correct (I
don't know).  The Unicode position is that it ignores ligatures (as a purely
display issue), /except/ where ligature characters are needed in order to
support round-tripping with other existing character sets.  In this case U+00DF
/is/ needed for that purpose (and may also be well established as a regularly
used "character" even outside typographically advanced contexts -- I don't
know).

The fact is that there are rules to follow.  If those rules strike you as
unnecessarily complicated, then that is your problem, not anyone else's (but
you are certainly not alone).  But even if you do dislike the rules, do you
also want to write buggy software ?  If you do write buggy software (in this
respect) then, again, you are certainly not alone -- but that doesn't make it
right.

> > It is simply erroneous to expect String.toUpperCase() to map characters
> > one-to-one in the way that English case mapping works.  It can't, it
> > isn't supposed to, and it doesn't...
>
> No, it is not erroneous to expect a method to do exactly and only what
> its name implies.

But it /does/ do exactly what its name implies.  Only if you have an incomplete
idea of what case-mapping involves would you fail to understand the name and
its implications.

> > String.equalsIgnoreCase(), on the other hand, is badly broken in that
> > it does /not/ follow those rules.
[quoted text clipped - 5 lines]
> be the SAME equivalence class, and equalsIgnoreCase should implement and
> embody the corresponding equivalence relation.

But where does the "should" come from ?   You can set up that kind of structure
for English, no problem, but it doesn't generalise to other languages.  No
matter how much you may /want/ it to, it simply doesn't...

> The version that doesn't shouldn't
> surprise English speakers; the version that does shouldn't surprise
> anyone familiar with its locale-specific behavior for the locale
> actually used.

But there is /nothing/ about Java which implies that instances of
java.lang.String hold English text.  Indeed there is everything to suggest
otherwise (why use Unicode at all, for instance).

Once you add in Locales then you get /another/ layer of complexity, in that the
case mapping may be Locale-dependent /as well/ as not fitting with the
preconceptions of English (only) speakers.
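
The usual illustration of Locale-dependent case mapping is Turkish, where the
upper-case form of "i" is a dotted capital I (U+0130).  A small sketch, with a
hypothetical class name, assuming Java 7 or later for Locale.forLanguageTag():

```java
import java.util.Locale;

// Illustrative class name; not taken from the thread.
class LocaleCaseSketch {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    public static void main(String[] args) {
        String word = "title";

        // Locale-neutral mapping: plain ASCII "TITLE".
        String neutral = word.toUpperCase(Locale.ROOT);
        // Turkish mapping: the 'i' becomes U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE.
        String turkish = word.toUpperCase(TURKISH);

        System.out.println(neutral.equals("TITLE"));   // true
        System.out.println(turkish.equals("TITLE"));   // false
        System.out.println((int) turkish.charAt(1));   // 304, i.e. 0x130
    }
}
```

The no-argument toUpperCase() uses the default locale, so the same call can
give either result depending on where the program runs.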

> Having locale-dependent behavior invoked randomly without
> explicit use of Locale objects, and which furthermore doesn't use the
> system locale, is by itself a sign of a questionable design as well as a
> sure source of bugs and problems.

There's a good deal to be said for the idea that Locale-dependent operations
should either take an explicit Locale as a parameter, or should use a single,
/invariant/, default Locale (not installation dependent).  Just as a great deal
of bother would be saved if String<->byte[] conversions didn't use an implicit,
and installation-dependent, character encoding.  But even if the Java class
library was in that ideal state, case mapping would not be simple and would not
conform to the expectations of some English speaking programmers.
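
For the String<->byte[] half of that point, a sketch of what "explicit" looks
like in practice, assuming Java 7 or later for StandardCharsets (the class and
method names are invented for the example):

```java
import java.nio.charset.StandardCharsets;

// Illustrative class name; not taken from the thread.
class ExplicitCharsetSketch {
    // Always name the encoding instead of relying on the platform default.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "na\u00EFve"; // five chars, one of them non-ASCII

        byte[] utf8 = encode(text);
        System.out.println(utf8.length);                // 6: the i-diaeresis takes two bytes
        System.out.println(decode(utf8).equals(text));  // true on any installation
    }
}
```

The same habit applies to case mapping: pass a Locale explicitly rather than
calling the no-argument overloads.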

There are two problems here.  One is that too many programmers expect complex
things to be more simple than they are (which is odd when you consider how
eager programmers and designers often are to make simple things complex).  The
other is that we are using legacy libraries which in parts were designed by
programmers who were still holding on to that forlorn hope.  The use of default
Locales is one example of that.  String.equalsIgnoreCase() is another, and far
worse, example.

> I've even encountered somewhere a notion that aString.length() is not
> even accurate in current Java versions if a string contains obscure
> characters.

It depends on what you mean.  String.length() returns, correctly, the number of
Java "char"s in the String.  No problem there.  What /is/ a problem is that
that is not the same as the number of characters in the Unicode text.  That's a
problem caused by the mis-specification of Java's chars to be 16-bit
quantities.  It is highly unfortunate, but there is very little that can be
done about it now.  It means that correct programming is more difficult than it
looks, and also more difficult than it /should/ be.  There is nothing in the
problem space that makes this difficult (well, actually there is, but we'll
pretend there isn't for now[*]), it's not an /inherently/ complex problem, but
historical mistakes in Java's design mean that the API mostly works in terms of
UTF-16 encoding (sequences of 16-bit values) rather than in terms of real
Unicode characters.
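
To make the difference concrete, here is a small sketch (hypothetical class
name) using a character from outside the 16-bit range.  U+1D50A is picked only
as an example of a supplementary character; any other would do:

```java
// Illustrative class name; not taken from the thread.
class LengthSketch {
    // Number of Unicode code points, as opposed to UTF-16 chars.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a" followed by U+1D50A, which needs a surrogate pair in UTF-16.
        String s = "a" + new String(Character.toChars(0x1D50A));

        System.out.println(s.length());     // 3 Java chars
        System.out.println(codePoints(s));  // 2 Unicode code points
    }
}
```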

> It suggests aString.<something using the obscure term "code
> point", apparently just Unicode-geek for "character"> as its
> replacement, while of course there's a ton of legacy code using
> length().

For the most part, such code will remain correct.  One way to think of it is
that instances of java.lang.String do not, despite the name, directly represent
Unicode strings (sequences of Unicode characters), but are UTF-16.   I.e. only
the name of the class is wrong.   Most operations on UTF-16 data "do the right
thing" for the Unicode information it represents.  For instance concatenating
two UTF-16 sequences.  It's only operations which mess around taking strings
apart[**] which are likely to do something invalid unexpectedly, and even there
they quite often work correctly.
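
A sketch of where taking strings apart goes wrong and one way around it,
assuming Java 7 or later.  The class and method names are invented, and
U+1D50A again stands in for any supplementary character:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative class name; not taken from the thread.
class CodePointWalkSketch {
    // Walk the string one code point at a time, stepping over surrogate pairs.
    static List<Integer> codePointsOf(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            i += Character.charCount(cp); // 1 or 2 chars
        }
        return out;
    }

    // Keep the first n code points without cutting through a surrogate pair.
    static String firstCodePoints(String s, int n) {
        return s.substring(0, s.offsetByCodePoints(0, n));
    }

    public static void main(String[] args) {
        String g = new String(Character.toChars(0x1D50A));
        String s = "x" + g + "y"; // concatenation is always safe

        // A naive cut at char index 2 leaves half a surrogate pair behind.
        String naive = s.substring(0, 2);
        System.out.println(Character.isHighSurrogate(naive.charAt(1))); // true

        System.out.println(firstCodePoints(s, 2).equals("x" + g));      // true
        System.out.println(codePointsOf(s).size());                     // 3
    }
}
```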

The situation is unfortunate, but it's not really fatal.  If any programmer is
capable of understanding the difference between a sequence of characters and a
sequence of bytes in some encoding, in the first place (necessary to do textual
IO in Java at all), then adjusting to the deficiencies of the String class
should not be overwhelmingly difficult.

There are issues to understand, and knowledge to be acquired; that's all...

> I don't suppose it occurred to them that the new fancy-whosit
> should have been a replacement length() implementation instead of some
> new name that doesn't suggest anything to do with the length of a string
> to someone who doesn't care about all the Unicode bells and whistles and
> just wants to process strings while remaining agnostic about what they
> are ultimately used for or contain?

I think they did the best they could.  A better (but impossible in practice)
solution would have been to redefine "char" to be a >=24 bit quantity (I'd have
chosen 32-bit signed, myself), and redefine String to contain the new "char"s.
It would have been nice to refactor String to separate the physical (internal)
representation of the data from the logical character-based API.
Unfortunately, that would have been impossible unless they made the change
/very/ early -- and they missed the short window of opportunity for that.   The
scheme they came up with, effectively redefining what "String" and "char" mean,
is probably the best possible solution.  It doesn't break existing code -- in
the sense that what worked before continues to work -- all that has changed is
the interpretation of that code.

Code which /looks/ as if it will cope with all meaningful inputs does not (but
then, it never would have done).  Not a satisfactory position, but the best we
are going to get.

There are issues to understand, and knowledge to be acquired; that's all...

   -- chris

[*] The "length" of a Unicode string is somewhat problematical since some
characters qualify others (diacritical marks etc), and some "characters" are
not even characters at all.  These issues are probably better thought of as
technical problems caused by the (unavoidable) compromises in Unicode's design
than something inherent to the problem space, but they are still issues for
creators of text-aware applications (few Java applications /are/ text-aware to
that degree).
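
As a rough illustration of the diacritics point, assuming Java 6 or later for
java.text.Normalizer (the class and method names are invented): the same
visible letter can be stored as one code point or two, and neither length()
nor a code point count says how many characters a reader would see.
BreakIterator gives an approximation of that user-perceived count:

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.Locale;

// Illustrative class name; not taken from the thread.
class CombiningMarkSketch {
    // Approximate count of user-perceived characters.
    static int perceivedLength(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int n = 0;
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // e-acute as a single code point
        String decomposed = "e\u0301"; // 'e' followed by COMBINING ACUTE ACCENT

        System.out.println(composed.equals(decomposed));  // false
        System.out.println(decomposed.length());          // 2
        System.out.println(perceivedLength(decomposed));  // 1

        // Normalising both to the same form makes them comparable.
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.equals(composed));         // true
    }
}
```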

[**] I should note that taking sequences of logical Unicode characters apart is
also non-trivial, quite independently of Java's representational deficiencies,
and may not fit with English speaking programmers' preconceptions.  However,
that's a different kett