Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / January 2007

Tip: Looking for answers? Try searching our database.

Problem with searching in TreeSet expressions starting with a given text

Thread view: 
setar - 04 Jan 2007 11:49 GMT
I store in TreeSet expressions in some natural language. This can be
any natural language. The content of a set is sorted in this language
by language collator:
collator = Collator.getInstance(locale);
collator.compare(string1, string2)

Sorting is working correctly. For example in English the letters have
following order: ..., c, d, e, ... so words: cat, dog, ear will be
sorted in the given order.

Sorting is also working correctly in other languages. In non-English
languages there can be additional letters. For example in Polish there
is a 'ć' letter (c with an accent). This letters change the order of
other letter. In Polish the order is: ..., c, ć, d, e, .... and for
example following words have given order: cena (en. price), ćma (en.
moth), duma (en. pride), echo (en. echo).

So index is build correctly for any language and it is sorted
correctly.

Now I want to receive a range of its sorted values which are strings
beginning with a given string.
For English it is easy. If I want to get all expressions beginning with
a string "cat" I can invoke subSet method with the first argument
equal to this string and the second argument the same as the first but
with last letter replaced with its SUCCESOR. Successors in English are
easy to determine:
TreeSet index;
index.subSet("cat", "cau");

But for other languages determining successors of letters isn't easy.
For example in Polish the successor of letter 'c' is 'ć' not 'd'. I
could create tables with orders of letters for all languages, but there
is too many languages... I was looking in Java API for method
retrieving successor of given character but I found nothing. Does
anybody know if there is such a method?

Second way of retrieving expressions beginning with a given string is
invoking subSet with following arguments:
TreeSet index;
index.subSet("cat", "cat" + Character.MAX_VALUE);

I tested it with few string and everything seemed to work ok. But
recently I tried to return all expressions starting with a word
"cat", not only with a string. So I want to receive string "cat
is running", but I don't want to receive expression "catch". So I
invoked subSet with space after string "cat":
TreeSet index;
index.subSet("cat ", "cat " + Character.MAX_VALUE);

Unfortunately method returned also the "catch" word. I don't know
why subSet is working in this way.

Does anybody know how to correct described ideas or maybe problem can
be solved in other way?
Robert Klemme - 04 Jan 2007 12:42 GMT
> I store in TreeSet expressions in some natural language. This can be
> any natural language. The content of a set is sorted in this language
[quoted text clipped - 48 lines]
> Unfortunately method returned also the "catch" word. I don't know
> why subSet is working in this way.

It may be due to the workings of the Comparator you are using.  Try
comparing "cat ", "catch" and "cat " + Character.MAX_VALUE with your
Comparator and see the outcome.  After all Character.MAX_VALUE might not
be a valid Unicode character.

IIRC there is a method in Character that checks whether a char is a
valid character.  You could use that to find the max legal character and
try again with that.

> Does anybody know how to correct described ideas or maybe problem can
> be solved in other way?

Some thoughts: it may be that you stretch the TreeSet too far.  Could be
that with better access to internals you are better off, so you might
have to craft your own data structure (Tries come to mind).

A two phase approach might work as well: use subSet() as the first line
to reduce the set's size and then iterate through it and use some other
mechanism like a regexp to further narrow down your result.

Kind regards

    robert
setar - 04 Jan 2007 14:11 GMT
>> Second way of retrieving expressions beginning with a given string is
>> invoking subSet with following arguments:
[quoted text clipped - 30 lines]
>reduce the set's size and then iterate through it and use some other
>mechanism like a regexp to further narrow down your result.

Yes, you are right. These strings are sorted in this order: "cat ", "catch",
"cat " + Character.MAX_VALUE, so this is the reason why subSet returns also
"catch ". My collator ignores white spaces, maybe it is true for all
collators. In my project it was useful feature to ignore white spaces while
sorting set so I forgot about this. I will try to apply yours ideas.

Thanks
setar - 04 Jan 2007 23:51 GMT
> Yes, you are right. These strings are sorted in this order: "cat ",
> "catch", "cat " + Character.MAX_VALUE, so this is the reason why subSet
> returns also "catch ". My collator ignores white spaces, maybe it is true
> for all collators. In my project it was useful feature to ignore white
> spaces while sorting set so I forgot about this. I will try to apply yours
> ideas.

Strings returned by subSet should be additionally filtered. If we get only
these strings which start with searching string (String.startsWith)
everything will work. We must filter because RuleBasedCollaror "ignores"
some characters while sorting. Among these characters is space.


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.