Java Forum / General / February 2007
Most efficient algorithm for matching
Timasmith - 13 Feb 2007 02:27 GMT Suppose I have a list of 10,000 items that you can order, each order item has an ID and a Name.
If you type in a string and I want to find every item that CONTAINS that string, what is the most efficient way?
I happen to be using java but I am open to general solutions.
Obviously I could just loop through the list and do a match on each string but I cant help thinking one could be more efficient thatn O(n).
A traditional database btree index doesnt help, but I think search engines must use something fast.
thanks
Tim
Knute Johnson - 13 Feb 2007 02:49 GMT > Suppose I have a list of 10,000 items that you can order, each order > item has an ID and a Name. [quoted text clipped - 14 lines] > > Tim If your data isn't sorted just looping through it is the fastest way.
 Signature Knute Johnson email s/nospam/knute/
Timasmith - 13 Feb 2007 03:25 GMT On Feb 12, 9:49 pm, Knute Johnson <nos...@rabbitbrush.frazmtn.com> wrote:
> Timasmithwrote: > > Suppose I have a list of 10,000 items that you can order, each order [quoted text clipped - 24 lines] > > - Show quoted text - Maybe. What if I constrained the search to say that the string must have a least 2 characters. So I guess you could take all 10,000 entries, determine all of the sequences and put those into a hashtable. So if I search on chair it would find all the indexed entries with ch in the them, then look through the results. If <5% of the entries have ch in them, I would imagine that would help.
Lew - 13 Feb 2007 03:48 GMT > On Feb 12, 9:49 pm, Knute Johnson <nos...@rabbitbrush.frazmtn.com> > wrote: >> Timasmithwrote: >>> Suppose I have a list of 10,000 items that you can order, each order >>> item has an ID and a Name. Maybe some more outré structure like a trie will help: <http://en.wikipedia.org/wiki/Trie> in particular a suffix tree: <http://en.wikipedia.org/wiki/Suffix_tree>
other structures: <http://en.wikipedia.org/wiki/List_of_data_structures>
- Lew
Eric Sosman - 13 Feb 2007 03:54 GMT > On Feb 12, 9:49 pm, Knute Johnson <nos...@rabbitbrush.frazmtn.com> > wrote: [quoted text clipped - 26 lines] > entries with ch in the them, then look through the results. If <5% of > the entries have ch in them, I would imagine that would help. If you're doing just a few searches, a straight-line pass over all the data is probably best.
If you're doing many searches, it may make sense to invest a fairly large amount of work in initialization to make the searches themselves run faster. Here's one way:
Map<String,Set<Item>> map = new HashMap<String,Set<Item>>(); for (Item item : itemCollection) { for (String key : item.setOfAllKeys()) { Set<Item> set = map.get(key); if (set == null) { set = new Set<Item>(); map.put(key, set); } set.add(item); } }
That is, you have a map that associates each substring with a set of all the Items having that substring. Each Item appears in as many sets as it has distinct substrings.
It may be desirable to endow setOfAllKeys() with a little more intelligence than a simple brute-force decomposition would exhibit. If one of your Items is "LEFT HANDED MONKEY WRENCH" there's probably not much to be gained by indexing it under "FT" and "D M" and "AND" -- yet indexing under "WRENCHES" might be a good idea, even though it's not a substring of the name.
See also "The Art of Computer Programming, Volume III: Sorting and Searching" by D.E. Knuth. IIRC, one approach he describes for this kind of search is to start with "LEFT HANDED MONKEY WRENCH" and enter it in all of the lists LE, EF, FT, ... DE, ED. Then given the search key "DED MONK" you would form the intersection of the lists for DE, ED, ..., NK (keeping the lists sorted would be helpful, or you could use bit vectors) and finish by inspecting only those Items in the intersection.
 Signature Eric Sosman esosman@acm-dot-org.invalid
Knute Johnson - 13 Feb 2007 05:38 GMT > On Feb 12, 9:49 pm, Knute Johnson <nos...@rabbitbrush.frazmtn.com> > wrote: [quoted text clipped - 26 lines] > entries with ch in the them, then look through the results. If <5% of > the entries have ch in them, I would imagine that would help. With 10,000 items, I doubt very much that you can beat a linear search by very much with a complicated data structure. With 10,000,00 I can see where it would be a different story. A linear search is pretty easy to code. 10,000 classes of 1000 bytes only takes up 10mb. Easy to store in memory. 10 million items on the other hand would probably have to be kept on some kind of permanent storage. Or you might want to use some real database manager instead and use its searching capabilities.
 Signature Knute Johnson email s/nospam/knute/
Daniel Pitts - 13 Feb 2007 20:35 GMT > On Feb 12, 9:49 pm, Knute Johnson <nos...@rabbitbrush.frazmtn.com> > wrote: [quoted text clipped - 34 lines] > entries with ch in the them, then look through the results. If <5% of > the entries have ch in them, I would imagine that would help. You could also look into a third-party indexing library.
Check out: Lucene. or for a web service: Solr
Lucene was developed by an (ex?) Google employee. Solr was developed at, and is used extensively at, CNET Networks.
bugbear - 13 Feb 2007 09:57 GMT > Suppose I have a list of 10,000 items that you can order, each order > item has an ID and a Name. > > If you type in a string and I want to find every item that CONTAINS > that string, what is the most efficient way? Just to clarify; you wish to detect your target string (e.g. "har") within the Name of your item object?
e.g. "harold" is a hit, "harry morse" is a hit, "mr haroldson" is a hit, "nigel smith" is a miss?
BugBear
Timasmith - 13 Feb 2007 11:45 GMT > Timasmithwrote: > > Suppose I have a list of 10,000 items that you can order, each order [quoted text clipped - 10 lines] > > BugBear Correct.
bugbear - 13 Feb 2007 12:27 GMT >> Timasmithwrote: >>> Suppose I have a list of 10,000 items that you can order, each order [quoted text clipped - 8 lines] > > Correct. OK. Any other constraints on alphabet (i.e. number of distinct characters allowed for use in "Name"), or search targets (whole words?), number of insertions to the item list compared with number of retrievals (searches)?
BugBear
Timasmith - 13 Feb 2007 20:20 GMT > Timasmithwrote: > >> Timasmithwrote: [quoted text clipped - 15 lines] > > BugBear No constraints on alphabet, most of the search targets are single words from 3-30 characters.
Insertions is rare, the list does not change from day to day, retreivals is measured in thousands or even tens of thousands per hour.
bugbear - 15 Feb 2007 15:51 GMT > No constraints on alphabet, most of the search targets are single > words from 3-30 characters. > > Insertions is rare, the list does not change from day to day, > retreivals is measured in thousands or even tens of thousands per hour. Given the other info in the thread, I would try the following hybrid hack.
Concatanate all the Names into a single array of characters, separated by some "impossible" character (likely nul) with a second array containing the offsets within the character "pool" back to your Items.
Use a fast text search to locate your target within the pool.
I would suggest:
A Fast Generic Sequence Matching Algorithm David R. Musser, Gor V. Nishanov http://citeseer.ist.psu.edu/586484.html
Having all the data in a memory contiguous pool should speed things along, and the search should be very fast.
Then use the (sorted, of course) offset array to turn the offset into Items references.
BugBear
bugbear - 15 Feb 2007 15:55 GMT > A Fast Generic Sequence Matching Algorithm > David R. Musser, Gor V. Nishanov > http://citeseer.ist.psu.edu/586484.html Source - not JAVA
http://www.cs.rpi.edu/~musser/gp/gensearch/index.html
BugBear
Christian - 13 Feb 2007 20:57 GMT Timasmith schrieb:
>> Timasmithwrote: >>> Suppose I have a list of 10,000 items that you can order, each order [quoted text clipped - 10 lines] > > Correct. search for ukkonen's algorithm ... its an algorithm for substring trees
building is in O(n) storage needed in O(n) and searching in O(1) that may fit your needs ...
Christian - 15 Feb 2007 20:08 GMT Christian schrieb:
> Timasmith schrieb: >>> Timasmithwrote: [quoted text clipped - 19 lines] > and searching in O(1) > that may fit your needs ... if you are not willing to index you could also try do use
Boyer Moore ... that needs no indexing and should be in average case
O(m +n/m)
with n textlenght and m length of the searched substring
(with substringtrees searching is not in O(1) its only independent from n which in most cases should be good enough)
Jeff Higgins - 13 Feb 2007 12:35 GMT > Suppose I have a list of 10,000 items that you can order, each order > item has an ID and a Name. [quoted text clipped - 3 lines] > > I happen to be using java but I am open to general solutions. http://lucene.apache.org/java/docs/
> Obviously I could just loop through the list and do a match on each > string but I cant help thinking one could be more efficient thatn [quoted text clipped - 6 lines] > > Tim David Segall - 14 Feb 2007 05:56 GMT >> Suppose I have a list of 10,000 items that you can order, each order >> item has an ID and a Name. [quoted text clipped - 5 lines] > >http://lucene.apache.org/java/docs/ I don't think so. From the docs "In order to prevent extremely slow WildcardQueries, a Wildcard term should not start with one of the wildcards * or ?".
Chris Uppal - 14 Feb 2007 19:10 GMT > > http://lucene.apache.org/java/docs/ > I don't think so. From the docs "In order to prevent extremely slow > WildcardQueries, a Wildcard term should not start with one of the > wildcards * or ?". I think you must be misinterpretting the Lucene documentation. Lucene is specifically /for/ searching large volumes of text.
I suspect that what it is trying to warn of is that it you want to look for the needle, "needle", in a multiGB haystack, you should ask Lucent to look for "needle" not for "*needle".
-- chris
David Segall - 15 Feb 2007 15:11 GMT >> > http://lucene.apache.org/java/docs/ >> I don't think so. From the docs "In order to prevent extremely slow [quoted text clipped - 3 lines] >I think you must be misinterpretting the Lucene documentation. Lucene is >specifically /for/ searching large volumes of text. True, but the OP does not have large volumes of text. He does not specify the amount but, to me, "a name" implies at most a few hundred characters.
>I suspect that what it is trying to warn of is that it you want to look for the >needle, "needle", in a multiGB haystack, you should ask Lucent to look for >"needle" not for "*needle". The OP emphasised that "I want to find every item that CONTAINS that string". He asked for the ability to efficiently look for "*needle". It is clear from the documentation that Lucene is optimised to look for "words", that it can cope with "wor*" but that it is not the right tool for "*ords".
Lew - 15 Feb 2007 16:06 GMT >>>> http://lucene.apache.org/java/docs/ >>> I don't think so. From the docs "In order to prevent extremely slow >>> WildcardQueries, a Wildcard term should not start with one of the >>> wildcards * or ?".
> The OP emphasised that "I want to find every item that CONTAINS that > string". He asked for the ability to efficiently look for "*needle". > It is clear from the documentation that Lucene is optimised to look > for "words", that it can cope with "wor*" but that it is not the right > tool for "*ords". But the /search/ term doesn't start with a wildcard, the /search/ term is "ords" or "needle".
Consider
String s = "Text base in which to search for words."; boolean containsSearchTerm = (s.indexOf( "ords" ) >= 0);
Note that this finds if the item "CONTAINS that string", but that the /search/ term does not start with a wildcard. The OP was asking for this capability across secondary storage.
- Lew
David Segall - 16 Feb 2007 12:08 GMT >>>>> http://lucene.apache.org/java/docs/ >>>> I don't think so. From the docs "In order to prevent extremely slow [quoted text clipped - 9 lines] >But the /search/ term doesn't start with a wildcard, the /search/ term is >"ords" or "needle". True, but Lucene's /index/ only contains "words".
>Consider > >String s = "Text base in which to search for words."; >boolean containsSearchTerm = (s.indexOf( "ords" ) >= 0); Exactly! For Lucene, that is a wild card search which must be performed on _every_ word in the index. The OP would be far better off using that code on his 10,000 raw strings.
Anyway, our debate over using Lucene has been overtaken by Christians suggestion to use Ukkonen's algorithm. Apart from the thorny problem of interpreting multi-word searches, which applies to all the suggestions, the algorithm seems ideal. An early result of a Google search contains a lucid explanation and some source code <http://marknelson.us/1996/08/01/suffix-trees/>.
Lew - 16 Feb 2007 14:25 GMT > Anyway, our debate over using Lucene has been overtaken by Christians > suggestion to use Ukkonen's algorithm. Apart from the thorny problem > of interpreting multi-word searches, which applies to all the > suggestions, the algorithm seems ideal. An early result of a Google > search contains a lucid explanation and some source code > <http://marknelson.us/1996/08/01/suffix-trees/>. Lew wrote on 2002-02-10:
>> Maybe some more outré structure like a trie will help: >> <http://en.wikipedia.org/wiki/Trie> [quoted text clipped - 3 lines] >> other structures: >> <http://en.wikipedia.org/wiki/List_of_data_structures> - Lew
Lew - 16 Feb 2007 23:02 GMT > Lew wrote on 2002-02-10: Oops. That was on 2/12, not 2/10.
- Lew
Daniel Pitts - 14 Feb 2007 19:39 GMT > >> Suppose I have a list of 10,000 items that you can order, each order > >> item has an ID and a Name. [quoted text clipped - 9 lines] > WildcardQueries, a Wildcard term should not start with one of the > wildcards * or ?". You would use stemming to index the substrings, rather than wildcards to search for it.
David Segall - 15 Feb 2007 15:37 GMT >> >> Suppose I have a list of 10,000 items that you can order, each order >> >> item has an ID and a Name. [quoted text clipped - 12 lines] >You would use stemming to index the substrings, rather than wildcards >to search for it. Stemming in Lucene reduces token terms to their "_linguistic_ root form. e.g. reduces 'fishing' and 'fishes' to 'fish'". It would not help find 'shing' in 'fishing'.
Chris Uppal - 15 Feb 2007 19:23 GMT > Stemming in Lucene reduces token terms to their "_linguistic_ root > form. e.g. reduces 'fishing' and 'fishes' to 'fish'". It would not > help find 'shing' in 'fishing'. Ah, I see what you mean now. Thanks.
-- chris
Daniel Pitts - 15 Feb 2007 20:45 GMT > >> >> Suppose I have a list of 10,000 items that you can order, each order > >> >> item has an ID and a Name. [quoted text clipped - 16 lines] > form. e.g. reduces 'fishing' and 'fishes' to 'fish'". It would not > help find 'shing' in 'fishing'. Linguistic roots are only ONE implementation of stemming. Its perfectly acceptible to stem "suffixes".
Chris Uppal - 13 Feb 2007 16:51 GMT > Suppose I have a list of 10,000 items that you can order, each order > item has an ID and a Name. > > If you type in a string and I want to find every item that CONTAINS > that string, what is the most efficient way? Efficient text searching is a /big/ subject. No, let me be more precise: it's /HUGE/ subject.
If your list of items is reasonably short, and the items themselves are reasonably short, then you can do yourself a big favour by ignoring efficiency and just searching linearly. A lot depends on the application, but I'm not convinced that 10K items is enough to justify heavier-weight solutions.
Next up, there are efficient text searching algorithms, such as Boyer-Moore, and the interesting newish breed of "bit-parallel" or "bitmap" searching algorithms. If you have enough text, and if the overhead (in time, space, or convenience) of maintaining an index is unacceptable, then looking into those algorithms would be sensible. Note that there variants on these algorithms which do approximate matching. You might look for "agrep" for instance.
Also Wikipeadia has quite a few pages on text searching. http://en.wikipedia.org/wiki/Category:Search_algorithms
Lastly, if you need greater speed than such methods can provide (and search engines /'definitely/ do ;-) then you'd need to use some sort of indexing.
The simplest way to approach that is to ensure that your database has something like "full text search and indexing" in its feature set -- and use that.
Alternatively, there are off-the-shelf packages for bulk test searching. Jeff Higgins has already pointed you towards Apache Lucene, which is one such package.
-- chris
S Perryman - 15 Feb 2007 09:00 GMT >> Suppose I have a list of 10,000 items that you can order, each order >> item has an ID and a Name.
>> If you type in a string and I want to find every item that CONTAINS >> that string, what is the most efficient way?
> Efficient text searching is a /big/ subject. No, let me be more precise: > it's > /HUGE/ subject.
> If your list of items is reasonably short, and the items themselves are > reasonably short, then you can do yourself a big favour by ignoring > efficiency > and just searching linearly. A lot depends on the application, but I'm > not > convinced that 10K items is enough to justify heavier-weight solutions.
> Next up, there are efficient text searching algorithms, such as > Boyer-Moore, > and the interesting newish breed of "bit-parallel" or "bitmap" searching > algorithms. I have heard of the "bitmap" algorithms, but don't know how they compare to the really good std ones like KMP (Boyer-Moore is subject to degeneracy to O(M * N) for certain alphabets while KMP should have fixed overhead except probably for massively long search patterns - computing the "failure" function etc) .
Regards, Steven Perryman
H. S. Lahman - 13 Feb 2007 17:07 GMT Responding to Timasmith...
> Suppose I have a list of 10,000 items that you can order, each order > item has an ID and a Name. The key will be ordering, but for just 10K items one probably doesn't want to get too elaborate.
> If you type in a string and I want to find every item that CONTAINS > that string, what is the most efficient way? I assume you mean in the Name rather than the ID. I also assume that the data gets stored once but is searched often.
Is there a limit on the lengths of the Names? What is the average length? The answers to these questions will drive tailoring the solutions below to trade-off speed vs. effort. Can you define the ID (e.g., an index for the order the data items are stored)?
> I happen to be using java but I am open to general solutions. > > Obviously I could just loop through the list and do a match on each > string but I cant help thinking one could be more efficient thatn > O(n). I would be inclined to make use of a couple of practical characteristics. In searching ASCII keys most rejections will occur in the first few characters. (It has been many, many moons since I've done searches so I forget the actual numbers for the distribution.) Also, string matching algorithms for a given starting point in each string are quite efficient.
I think one "cheap" approach would be to keep an ordered list of digraphs or trigraphs that appear anywhere in the Name. That list gets updated whenever an item is added/removed from the data. The list would contain multiple entries consisting of the item ID and the start position of the digraph/trigraph in the Name. (Maybe the remaining length if the search keys are likely to be long so that you can eliminate embedded substrings that would extend beyond the Name length.)
The data itself would be ordered in an array and the "ID" above would be an index into the storage position (either the original ID or one you create when storing the data). Your algorithm would find the starting digraph/trigraph in the list above and "walk" the entries for that digraph/trigraph. For each one it would do a lookup to get the data and use a conventional substring matching algorithm for the full search string at the indicated start point.
There are even less elaborate versions. You can simply keep a list of data items that contain the starting letter of the search string somewhere and search just them. Then all you need is to store the data so you can access items by an index referenced in the starting letter list.
Bottom line: you have a trade-off between how fast you want to be and how much work you want to put into storing the data items. As a fairly general rule the search will be in direct proportional to how much space you have for storing supporting infrastructure like index tables.
For example, you could create 26 trees whose nodes were letters. The root node in each tree would be one of the possible first letters of the Name. Each child node would be a character that immediately followed the parent as the next letter of the Name (pretty much like a B-Tree). Each node would have a list of items that contained the substring of letters leading to that node. Then you could create a similar set of trees where the root nodes were the second letter of the name and so on. The search algorithm would "walk" the trees in each set from the node with the proper starting letter. This would be very fast but it would require substantially more storage space than the original data. [One could get even more carried away if one uses a digraph or trigraph as the root node in each tree. B-)]
************* There is nothing wrong with me that could not be cured by a capful of Drano.
H. S. Lahman hsl@pathfindermda.com Pathfinder Solutions http://www.pathfindermda.com blog: http://pathfinderpeople.blogs.com/hslahman "Model-Based Translation: The Next Step in Agile Development". Email info@pathfindermda.com for your copy. Pathfinder is hiring: http://www.pathfindermda.com/about_us/careers_pos3.php. (888)OOA-PATH
Timasmith - 13 Feb 2007 20:50 GMT > Responding toTimasmith... > [quoted text clipped - 79 lines] > Pathfinder is hiring:http://www.pathfindermda.com/about_us/careers_pos3.php. > (888)OOA-PATH So I can point you to a file which is near exactly what I would be searching.
http://www.fda.gov/cder/ndc/ziptext.zip
It is a list of medications. There are 50,000 which is a reasonable representation. Look at column G in listings.txt. If you take that it is 1.5MB in size.
If I load up the items into an ArrayList and then search linearly for items matching WELLBUTRIN
then do to 1000 searches takes 9-11 seconds. So 100 searches a second seems rather fast, maybe I am worrying about nothing.
<pre> // build file String contents = FileUtil.getTextContents("C:\\temp\\a.txt"); java.util.StringTokenizer st = new java.util.StringTokenizer(contents,"\n\r"); ArrayList<String> list = new ArrayList<String>(50000); while (st.hasMoreTokens()) { list.add(st.nextToken()); }
// start timer System.out.println(Timer.getNow()); String match = "WELLBUTRIN"; ArrayList<Integer> matches = new ArrayList<Integer>(); for (int i=0; i<1000; i++) { matches.clear(); for(int j=0; j<list.size(); j++) { if (list.get(j).indexOf(match) > -1) { matches.add(j); } } } //stop timer System.out.println(Timer.getNow()); //System.out.println(matches); </pre>
Knute Johnson - 14 Feb 2007 00:22 GMT >> Responding toTimasmith... >> [quoted text clipped - 117 lines] > //System.out.println(matches); > </pre> I'll bet if you do that in a String[] it will be even faster.
 Signature Knute Johnson email s/nospam/knute/
H. S. Lahman - 14 Feb 2007 17:49 GMT Responding to Timasmith...
> So I can point you to a file which is near exactly what I would be > searching. [quoted text clipped - 10 lines] > then do to 1000 searches takes 9-11 seconds. So 100 searches a second > seems rather fast, maybe I am worrying about nothing. It depends upon the context. If you only need the search as a response to particular interactive user request, then that 10 ms will not be noticeable compared to the hundreds of milliseconds for the user's typing. (It will probably take longer just to read the data from disk.)
OTOH, if your software is in the bowels of an HMO's multi-user application processing 1K pharmacy coverage requests a second, you probably need to speed things up a tad. [Though I don't know why a pharmacy would be providing partial keys; maybe they can't read the doctor's writing. B-)] In fact, then you probably don't want to read the DB at all for an individual request. You probably want to read the data into memory at startup and answer the queries from memory. If the data is only using 1.5 Mb, you could probably have another 5-10 Mb of supporting data structures for the searches w/o getting into page faulting trouble.
Bottom line: I think you need to look at the situation and determine what is acceptable performance and then do any trade-offs from there.
************* There is nothing wrong with me that could not be cured by a capful of Drano.
H. S. Lahman hsl@pathfindermda.com Pathfinder Solutions http://www.pathfindermda.com blog: http://pathfinderpeople.blogs.com/hslahman "Model-Based Translation: The Next Step in Agile Development". Email info@pathfindermda.com for your copy. Pathfinder is hiring: http://www.pathfindermda.com/about_us/careers_pos3.php. (888)OOA-PATH
Ron Jeffries - 13 Feb 2007 21:43 GMT >Suppose I have a list of 10,000 items that you can order, each order >item has an ID and a Name. [quoted text clipped - 10 lines] >A traditional database btree index doesnt help, but I think search >engines must use something fast. Before doing anything else, I'd suggest that you code up the simple search and time it. My guess is it'll fly.
Ron Jeffries www.XProgramming.com
bugbear - 14 Feb 2007 13:43 GMT > Before doing anything else, I'd suggest that you code up the simple > search and time it. My guess is it'll fly. Agreed in principle, but it depends what "fast enough" means.
For human interaction 1/100 of a second is good. For human interaction with multiple users (e.g. webserver) you might want more 1/1000.
For the evaluation loop inside a genetic algorithm you might be aiming for micro seconds.
BugBear
Lew - 14 Feb 2007 20:17 GMT > For human interaction 1/100 of a second is good. > For human interaction with multiple users (e.g. webserver) > you might want more 1/1000. No reason to be faster for any one transaction if they interleave well.
Where did you come up with the .01s figure? Is there ergonomic research behind it?
- Lew
bugbear - 15 Feb 2007 15:53 GMT >> For human interaction 1/100 of a second is good. >> For human interaction with multiple users (e.g. webserver) [quoted text clipped - 4 lines] > Where did you come up with the .01s figure? Is there ergonomic research > behind it? No ; I'm afraid I made it up, on the grounds that 1/100th is clearly "good enough". Slower might also be good enough, but I'm confident that 1/100 is OK.
(hell, it's quicker than many people's monitor refresh!)
BugBear
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|