Java Forum / General / December 2005
Help: what is the quickest way to find out whether a string contains another string?
tuweiwen@gmail.com - 05 Dec 2005 02:47 GMT Hi all, can anyone give me a hint on what approach I should take to this simple problem: what is the quickest way to find out whether a string contains another string? For example,
String a = "a very long string, UP TO 100k"; String b = "a string of two or three words";
//---- now I need to write this function boolean a_contains_b (String a, String b) { //I don't need to worry about the location of b, just want to know //whether a has b inside it. Returns true or false }
This function will be called Thousands of times in an app.
What's best way to handle this to achieve best performance?
Currently I am using "if (a.indexOf(b) !=-1) " in the above "a_contains_b()" function. But I don't like it at all. It feels dumb, isn't it? Do I have the wrong approach at the outset?
Thanks a lot. I appreciate your help.
Brandon McCombs - 05 Dec 2005 04:36 GMT > Hi all, can anyone give me a hint on what approach I should take to > this simple problem: what is the quickest way to find out whether a [quoted text clipped - 18 lines] > > Thanks a lot. I appreciate your help. Unless I'm missing something wouldn't the contains() method of String class be what you want??
boolean contains(CharSequence s) Returns true if and only if this string contains the specified sequence of char values.
Rhino - 05 Dec 2005 05:00 GMT > Hi all, can anyone give me a hint on what approach I should take to > this simple problem: what is the quickest way to find out whether a [quoted text clipped - 18 lines] > > Thanks a lot. I appreciate your help. One approach would be to use "regular expressions", often abbreviated "regexp". If you look at the Pattern class in the J2SE API, you'll find an overview describing how to use them. A full tutorial on regular expressions can be found in the online version of the Java Tutorial at http://java.sun.com/docs/books/tutorial/extra/regex/index.html.
Rhino
Chris Smith - 05 Dec 2005 19:47 GMT > One approach would be to use "regular expressions", often abbreviated > "regexp". If you look at the Pattern class in the J2SE API, you'll find an > overview describing how to use them. Why would anyone want to use regular expressions to look for a plain String? Regular expressions are a tool for deciding whether a String or some subset thereof matches a defined regular language. Although string searching is a trivial degeneration of the problem addressed by regular expressions, there's no need to introduce the added complexity.
If you are set on using regular expressions, converting the arbitrary input string into a pattern that matches only itself is the first task. Prior to Java 1.5, this was very difficult. In 1.5, it's encapsulated in a method called Pattern.quote(String). Once that's done, you can search for that pattern. It's far easier, though, to use String.indexOf (prior to 1.5) or String.contains (as of 1.5).
 Signature www.designacourse.com The Easiest Way To Train Anyone... Anywhere.
Chris Smith - Lead Software Developer/Technical Trainer MindIQ Corporation
Rhino - 05 Dec 2005 22:23 GMT >> One approach would be to use "regular expressions", often abbreviated >> "regexp". If you look at the Pattern class in the J2SE API, you'll find [quoted text clipped - 13 lines] > search for that pattern. It's far easier, though, to use String.indexOf > (prior to 1.5) or String.contains (as of 1.5). I didn't say regular expressions were the _best_ way, just that they were an option.
The original poster mentioned that he was searching very long strings for other strings. I don't know if regular contains() or similar methods have length limitations and I've never looked into whether the more traditional methods are more or less efficient than techniques based on regular expressions. Offhand, I suspected that regular expressions might be more efficient on longer strings but the other replies to this thread indicate that regular expressions are actually _less_ efficient that contains() or its brethren.
Therefore, I learned something along the way, too.
Rhino
Alan Moore - 06 Dec 2005 11:40 GMT I'd like to add a note about the inefficiency of regexes. The regex ".*is not.*" is inherently slow, and not a fair test of regex speed. Adding the dot-stars before and after the real search string is a hack for when you can only use String#matches(). When using the find() method, you can get rid of them and the search will go much faster. When I ran the OP's second test program with the regex "is not", it took 1052ms (compared to 4486ms for ".*is not.*"). That was still three times as long as the contains() or indexOf() approaches, so I'm not pushing for regex use here. Just setting the record straight.
Roedy Green - 05 Dec 2005 07:16 GMT >Hi all, can anyone give me a hint on what approach I should take to >this simple problem: what is the quickest way to find out whether a >string contains another string? For example, an approach is to do a quick check before you scan.
You break your big string into words with a regex split and put each word into a HashSet.
Then you break you short string into words with that same precompiled regex and look up each word in turn in the HashSet. If any word is not found, there is no point in further processing. If you do find all three, then use indexOf or contains.
If you want to get fancy, use a hashMap to track the offset of the FIRST ocurrence of each word. Then you take the min of the three offsets for starting off your indexof search and the max of the three for your stopping search point.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
tuweiwen@gmail.com - 05 Dec 2005 07:50 GMT Thanks a lot, guys. Roedy, I will try that approach later.
I did a little testing on the performance and got an interesting result. Here is the program.
import java.util.*; /* This is an un-scientific, probably way off-mark, testing comparing the speed of three functions: String.contains(), String.indexOf() and String.matches(). Note that String.matches() use regular expression approach according to the java doc
Result: the speeds of indexOf() and contains() are about the same, but matches() is at least ten times slower. Is regular expression bad in terms of performance?
Problems with this tesing: 1. The target string a is not long enough. As I said, in my app, it can be up to 100k. The performance may change if string a gets bigger. 2. In all loops, string a and string b are the same. In my app, at the least string a is always changing, such as newly downloaded web pages. 3. Right now, it always matches inside the loop. should do a randomized String a and string b.
Testing machine: WinXP Pro, P3 500, 256MB Ram, JDK 1.5.0_05 */
public class Test {
public static void main(String args[]) {
int loop = 1000000;
int matches = 0;
String a = "This is a testing string. It is not very long."; String b = "is not";
long startTime = System.currentTimeMillis();
for (int i = 0; i < loop; i++) { if (a.contains(b)) matches++; //took about 1620 miliseconds //if (a.matches(".*not.*")) matches++; //took about 25200 miliseconds //if (a.indexOf(b) != -1) matches++; //took about 1520 miliseconds }
long endTime = System.currentTimeMillis();
System.out.println("\nRunning " + loop + " times took " + (endTime - startTime) + " miliseconds."); }
}
right now, this part of my app (processing the downloaded web pages) takes lots of time. so i am looking for ways to speed it up.
tuweiwen@gmail.com - 05 Dec 2005 08:09 GMT This time I use explicit regular expression instead of String.matches(), and pre-compile the regex. Speed gets better a little bit. Just replace the old main() method with this new one. Also need import java.util.regex.*; at the top.
public static void main(String args[]) { int loop = 1000000; int matches = 0; String a = "This is a testing string. It is not very long."; String b = "is not";
Pattern pat = Pattern.compile(".*is not.*"); Matcher mat;
long startTime = System.currentTimeMillis();
for (int i = 0; i < loop; i++) { //if (a.contains(b)) matches++; //took about 1620 miliseconds //if (a.indexOf(b) != -1) matches++; //took about 1520 miliseconds if (pat.matcher(a).find()) matches++; //took about 18400 milis
} long endTime = System.currentTimeMillis(); System.out.println("\nRunning " + loop + " times took " + (endTime - startTime) + " miliseconds."); }
Chris Uppal - 05 Dec 2005 10:27 GMT > Result: the speeds of indexOf() and contains() are about the same, > but matches() is at least ten times slower. > Is regular expression bad in terms of performance? Yes.
Regexp matching is an order of magnitude more complex than a simple scan of the string. Although the time (for normal regexps) is linear in the length of the String, just as for naive scanning, the constants are significantly greater -- as you have found.
Another problem, by the way, is that if you are looking for a string like "this is *not* easy", then you'll have to pre-process the search string to escape the special characters.
You've already been pointed at better algorithms for searching strings than a naive scan. I just wanted to add a few warnings. One is that you should verify that string matching is /actually/ the bottleneck in your application before wasing too much time looking for alternatives. Another is that all these algorithms have significant start-up time, so you should be careful to measure using representative data -- a string literal that you can type into a Java program is /not/ representative of a 100K String. (BTW it's distinctly unusual to have such long strings in a program, how do they come above ?) The last is that if you are looking at Roedy's BMP implementation, that is artificially restricted[*] to chararacters in the range 0...255 in both the search and target strings.
([*] Or it was when I last looked, and the restriction is -- in fact -- justifiable, but it may not rule out that implementation for your purposes.)
-- chris
Roedy Green - 05 Dec 2005 18:16 GMT On Mon, 5 Dec 2005 10:27:51 -0000, "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly quoted someone who said :
>You've already been pointed at better algorithms for searching strings than a >naive scan. I just wanted to add a few warnings. see http://mindprod.com/products1.html#BOYER for a faster indexOf
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Roedy Green - 05 Dec 2005 18:15 GMT >Result: the speeds of indexOf() and contains() are about the same, >but matches() is at least ten times slower. >Is regular expression bad in terms of performance? Don't you believe your own experiment?
In general the more specialised a tool is to the task at hand the faster it will be.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Thomas Weidenfeller - 05 Dec 2005 08:35 GMT > What's best way to handle this to achieve best performance? "best" always depends on the context, the word as such is rather meaningless. I assume you mean fast. Then, contrary to other suggestions, regexps are not a good choice at all. We have discussed this in the past.
Your problem is a standard problem in CS, it has been extensively studied and many textbooks discuss it. Look for the following keywords in books, Google, etc.:
Boyer-Moore, Knuth-Morris-Pratt or Rabin-Karp.
There are for sure more such algorithms.
/Thomas
 Signature The comp.lang.java.gui FAQ: ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/
zero - 05 Dec 2005 12:21 GMT Thomas Weidenfeller <nobody@ericsson.invalid> wrote in news:dn0u58$ika$1 @news.al.sw.ericsson.se:
> Your problem is a standard problem in CS, it has been extensively > studied and many textbooks discuss it. Look for the following keywords > in books, Google, etc.: > > Boyer-Moore, Knuth-Morris-Pratt or Rabin-Karp. I find my knowledge of such standard algorithms to be quite limited. Can you suggest a good source - online or in print - to broaden my knowledge?
TIA zero
 Signature Beware the False Authority Syndrome
Chris Uppal - 05 Dec 2005 14:28 GMT > > Boyer-Moore, Knuth-Morris-Pratt or Rabin-Karp.
> I find my knowledge of such standard algorithms to be quite limited. Can > you suggest a good source - online or in print - to broaden my knowledge? The wikipedia article:
http://en.wikipedia.org/wiki/String_searching_algorithm
seems like a fair starting point on string searching. It has links to the wikipedia entries on BMP and RK algorithms, and som external links too.
If you'd rather read a book (or would like to read a book too), then I recommend Robert Sedgewick's algorithm books as particularly readable surveys of algorithms in common use. His 1-volumne "Algorithms in <language>" books have several pages on string searching, and /lots/ of other stuff that you might find interesting. He's also in the process of putting out an expanded N-volume version of "Algorithms in <language>", unfortunately only two volumes have come out so far, and I think his expanded coverage of string searching won't appear until in vol.3 comes out (which Amazon seem to think will be in next July).
Knuth is, of course, the ultimate algorithm book, but Sedgewick is also an authoritative and knowledgable specialist -- and his books are a /lot/ easier to read being (a) better explained and motivated, (b) aimed at practising programmers rather than (near-)mathematicians, (c) much shorter...
-- chris
tuweiwen@gmail.com - 05 Dec 2005 15:32 GMT Thomas thanks for the tip. I went on to download and run Roedy's Boyer program and it works great. I used quite large strings of Chinese characters (target string of size 7KB and pattern string of 50 character-long, to avoid it falling back to String.indexOf() . I am running a Chinese version of WinXP) to test it, and it works too. Thanks Roedy.
Chris, the strings are downloaded web pages in memory. What my app (a spider) does is to find out if a page contains one of several pattern strings, if it is, then it's saved to disk. I would like to know if my current approach is an efficient one.
Chris Smith - 05 Dec 2005 19:50 GMT > Chris, the strings are downloaded web pages in memory. What my app (a > spider) does is to find out if a page contains one of several pattern > strings, if it is, then it's saved to disk. I would like to know if my > current approach is an efficient one. (Realizing your addressing your comment to Chris Uppal, but I'll respond anyway. Hope you don't mind.)
In this case, you can be practically guaranteed that String matching is not your bottleneck. It will be much faster than retrieving those same strings over a network connection.
 Signature www.designacourse.com The Easiest Way To Train Anyone... Anywhere.
Chris Smith - Lead Software Developer/Technical Trainer MindIQ Corporation
Hiran Chaudhuri - 06 Dec 2005 13:04 GMT >> What's best way to handle this to achieve best performance? > > "best" always depends on the context, the word as such is rather > meaningless. I assume you mean fast. Then, contrary to other suggestions, > regexps are not a good choice at all. We have discussed this in the past. Agree, best is ambiguous. But the message subject line contains 'quickest', so the question must be runtime.
Hiran
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|