Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / December 2005

Tip: Looking for answers? Try searching our database.

Help: what is the quickest way to find out whether a string contains another string?

Thread view: 
tuweiwen@gmail.com - 05 Dec 2005 02:47 GMT
Hi all, can anyone give me a hint on what approach I should take to
this simple problem: what is the quickest way to find out whether a
string contains another string? For example,

String a = "a very long string, UP TO 100k";
String b = "a string of two or three words";

//---- now I need to write this function
boolean a_contains_b (String a, String b) {
 //I don't need to worry about the location of b, just want to know
 //whether a has b inside it. Returns true or false
}

This function will be called Thousands of times in an app.

What's best way to handle this to achieve best performance?

Currently I am using "if (a.indexOf(b) !=-1) " in the above
"a_contains_b()"  function. But I don't like it at all. It feels dumb,
isn't it? Do I have the wrong approach at the outset?

Thanks a lot. I appreciate your help.
Brandon McCombs - 05 Dec 2005 04:36 GMT
> Hi all, can anyone give me a hint on what approach I should take to
> this simple problem: what is the quickest way to find out whether a
[quoted text clipped - 18 lines]
>
> Thanks a lot. I appreciate your help.

Unless I'm missing something wouldn't the contains() method of String
class be what you want??

boolean    contains(CharSequence s)
          Returns true if and only if this string contains the
specified      sequence of char values.
Rhino - 05 Dec 2005 05:00 GMT
> Hi all, can anyone give me a hint on what approach I should take to
> this simple problem: what is the quickest way to find out whether a
[quoted text clipped - 18 lines]
>
> Thanks a lot. I appreciate your help.

One approach would be to use "regular expressions", often abbreviated
"regexp". If you look at the Pattern class in the J2SE API, you'll find an
overview describing how to use them. A full tutorial on regular expressions
can be found in the online version of the Java Tutorial at
http://java.sun.com/docs/books/tutorial/extra/regex/index.html.

Rhino

Chris Smith - 05 Dec 2005 19:47 GMT
> One approach would be to use "regular expressions", often abbreviated
> "regexp". If you look at the Pattern class in the J2SE API, you'll find an
> overview describing how to use them.

Why would anyone want to use regular expressions to look for a plain
String?  Regular expressions are a tool for deciding whether a String or
some subset thereof matches a defined regular language.  Although string
searching is a trivial degeneration of the problem addressed by regular
expressions, there's no need to introduce the added complexity.

If you are set on using regular expressions, converting the arbitrary
input string into a pattern that matches only itself is the first task.  
Prior to Java 1.5, this was very difficult.  In 1.5, it's encapsulated
in a method called Pattern.quote(String).  Once that's done, you can
search for that pattern.  It's far easier, though, to use String.indexOf
(prior to 1.5) or String.contains (as of 1.5).

Signature

www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation

Rhino - 05 Dec 2005 22:23 GMT
>> One approach would be to use "regular expressions", often abbreviated
>> "regexp". If you look at the Pattern class in the J2SE API, you'll find
[quoted text clipped - 13 lines]
> search for that pattern.  It's far easier, though, to use String.indexOf
> (prior to 1.5) or String.contains (as of 1.5).

I didn't say regular expressions were the _best_ way, just that they were
an option.

The original poster mentioned that he was searching very long strings for
other strings. I don't know if regular contains() or similar methods have
length limitations and I've never looked into whether the more traditional
methods are more or less efficient than techniques based on regular
expressions. Offhand, I suspected that regular expressions might be more
efficient on longer strings but the other replies to this thread indicate
that regular expressions are actually _less_ efficient that contains() or
its brethren.

Therefore, I learned something along the way, too.

Rhino
Alan Moore - 06 Dec 2005 11:40 GMT
I'd like to add a note about the inefficiency of regexes.  The regex
".*is not.*" is inherently slow, and not a fair test of regex speed.
Adding the dot-stars before and after the real search string is a hack
for when you can only use String#matches().  When using the find()
method, you can get rid of them and the search will go much faster.
When I ran the OP's second test program with the regex "is not", it
took 1052ms (compared to 4486ms for ".*is not.*").  That was still
three times as long as the contains() or indexOf() approaches, so I'm
not pushing for regex use here.  Just setting the record straight.
Roedy Green - 05 Dec 2005 07:16 GMT
>Hi all, can anyone give me a hint on what approach I should take to
>this simple problem: what is the quickest way to find out whether a
>string contains another string? For example,

an approach is to do a quick check before you scan.

You break your big string into words with a regex split and put each
word into a HashSet.

Then you break you short string into words with that same precompiled
regex and look up each word in turn in the HashSet.  If any word is
not found, there is no point in further processing. If you do find all
three, then use indexOf or contains.

If you want to get fancy, use a hashMap to track the offset of the
FIRST ocurrence of each word.  Then you take the min of the three
offsets for starting off your indexof search and the max of the three
for your stopping search point.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

tuweiwen@gmail.com - 05 Dec 2005 07:50 GMT
Thanks a lot, guys. Roedy, I will try that approach later.

I did a little testing on the performance and got an interesting
result. Here is the program.

import java.util.*;
/*
This is an un-scientific, probably way off-mark, testing
comparing the speed of three functions:
String.contains(), String.indexOf() and String.matches().
Note that String.matches() use regular expression approach
according to the java doc

Result: the speeds of indexOf() and contains() are about the same,
but matches() is at least ten times slower.
Is regular expression bad in terms of performance?

Problems with this tesing:
1. The target string a is not long enough. As I said, in my app, it
can be up to 100k. The performance may change if string a gets bigger.
2. In all loops, string a and string b are the same. In my app, at the
least string a is always changing, such as newly downloaded web pages.
3. Right now, it always matches inside the loop. should do a randomized
String a and string b.

Testing machine: WinXP Pro, P3 500, 256MB Ram, JDK 1.5.0_05
*/

public class Test {

    public static void main(String args[]) {

        int loop = 1000000;

        int matches = 0;

        String a = "This is a testing string. It is not very long.";
        String b = "is not";

        long startTime = System.currentTimeMillis();

        for (int i = 0; i < loop; i++) {
            if (a.contains(b)) matches++; //took about 1620 miliseconds
            //if (a.matches(".*not.*")) matches++; //took about 25200
miliseconds
            //if (a.indexOf(b) != -1) matches++; //took about 1520 miliseconds
        }

        long endTime = System.currentTimeMillis();

        System.out.println("\nRunning " + loop + " times took " + (endTime -
startTime) + " miliseconds.");
    }

}

right now, this part of my app (processing the downloaded web pages)
takes lots of time. so i am looking for ways to speed it up.
tuweiwen@gmail.com - 05 Dec 2005 08:09 GMT
This time I use explicit regular expression instead of
String.matches(), and pre-compile the regex. Speed gets better a little
bit.
Just replace the old main() method with this new one.  Also need import
java.util.regex.*; at the top.

    public static void main(String args[]) {
        int loop = 1000000;
        int matches = 0;
        String a = "This is a testing string. It is not very long.";
        String b = "is not";

        Pattern pat = Pattern.compile(".*is not.*");
        Matcher mat;

        long startTime = System.currentTimeMillis();

        for (int i = 0; i < loop; i++) {
            //if (a.contains(b)) matches++; //took about 1620 miliseconds
            //if (a.indexOf(b) != -1) matches++; //took about 1520
miliseconds
            if (pat.matcher(a).find()) matches++; //took about 18400 milis

        }
        long endTime = System.currentTimeMillis();
        System.out.println("\nRunning " + loop + " times took " + (endTime -
startTime) + " miliseconds.");
    }
Chris Uppal - 05 Dec 2005 10:27 GMT
> Result: the speeds of indexOf() and contains() are about the same,
> but matches() is at least ten times slower.
> Is regular expression bad in terms of performance?

Yes.

Regexp matching is an order of magnitude more complex than a simple scan of the
string.  Although the time (for normal regexps) is linear in the length of the
String, just as for naive scanning, the constants are significantly greater --
as you have found.

Another problem, by the way, is that if you are looking for a string like "this
is *not* easy", then you'll have to pre-process the search string to escape the
special characters.

You've already been pointed at better algorithms for searching strings than a
naive scan.  I just wanted to add a few warnings.  One is that you should
verify that string matching is /actually/ the bottleneck in your application
before wasing too much time looking for alternatives.  Another is that all
these algorithms have significant start-up time, so you should be careful to
measure using representative data -- a string literal that you can type into a
Java program is /not/ representative of a 100K String.  (BTW it's distinctly
unusual to have such long strings in a program, how do they come above ?)  The
last is that if you are looking at Roedy's BMP implementation, that is
artificially restricted[*] to chararacters in the range 0...255 in both the
search and target strings.

([*] Or it was when I last looked, and the restriction is -- in fact --
justifiable, but it may not rule out that implementation for your purposes.)

   -- chris
Roedy Green - 05 Dec 2005 18:16 GMT
On Mon, 5 Dec 2005 10:27:51 -0000, "Chris Uppal"
<chris.uppal@metagnostic.REMOVE-THIS.org> wrote, quoted or indirectly
quoted someone who said :

>You've already been pointed at better algorithms for searching strings than a
>naive scan.  I just wanted to add a few warnings.
see http://mindprod.com/products1.html#BOYER
for a faster indexOf
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Roedy Green - 05 Dec 2005 18:15 GMT
>Result: the speeds of indexOf() and contains() are about the same,
>but matches() is at least ten times slower.
>Is regular expression bad in terms of performance?

Don't you believe your own experiment?  

In general the more specialised a tool is to the task at hand the
faster it will be.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Thomas Weidenfeller - 05 Dec 2005 08:35 GMT
> What's best way to handle this to achieve best performance?

"best" always depends on the context, the word as such is rather
meaningless. I assume you mean fast. Then, contrary to other
suggestions, regexps are not a good choice at all. We have discussed
this in the past.

Your problem is a standard problem in CS, it has been extensively
studied and many textbooks discuss it. Look for the following keywords
in books, Google, etc.:

Boyer-Moore, Knuth-Morris-Pratt or Rabin-Karp.

There are for sure more such algorithms.

/Thomas
Signature

The comp.lang.java.gui FAQ:
ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq
http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/

zero - 05 Dec 2005 12:21 GMT
Thomas Weidenfeller <nobody@ericsson.invalid> wrote in news:dn0u58$ika$1
@news.al.sw.ericsson.se:

> Your problem is a standard problem in CS, it has been extensively
> studied and many textbooks discuss it. Look for the following keywords
> in books, Google, etc.:
>
> Boyer-Moore, Knuth-Morris-Pratt or Rabin-Karp.

I find my knowledge of such standard algorithms to be quite limited.  Can
you suggest a good source - online or in print - to broaden my knowledge?

TIA
zero

Signature

Beware the False Authority Syndrome

Chris Uppal - 05 Dec 2005 14:28 GMT
> > Boyer-Moore, Knuth-Morris-Pratt or Rabin-Karp.

> I find my knowledge of such standard algorithms to be quite limited.  Can
> you suggest a good source - online or in print - to broaden my knowledge?

The wikipedia article:

   http://en.wikipedia.org/wiki/String_searching_algorithm

seems like a fair starting point on string searching.  It has links to the
wikipedia entries on BMP and RK algorithms, and som external links too.

If you'd rather read a book (or would like to read a book too), then I
recommend Robert Sedgewick's algorithm books as particularly readable surveys
of algorithms in common use.  His 1-volumne "Algorithms in <language>" books
have several pages on string searching, and /lots/ of other stuff that you
might find interesting.  He's also in the process of putting out an expanded
N-volume version of "Algorithms in <language>", unfortunately only two volumes
have come out so far, and I think his expanded coverage of string searching
won't appear until in vol.3 comes out (which Amazon seem to think will be in
next July).

Knuth is, of course, the ultimate algorithm book, but Sedgewick is also an
authoritative and knowledgable specialist -- and his books are a /lot/ easier
to read being (a) better explained and motivated, (b) aimed at practising
programmers rather than (near-)mathematicians, (c) much shorter...

   -- chris
tuweiwen@gmail.com - 05 Dec 2005 15:32 GMT
Thomas thanks for the tip. I went on to download and run Roedy's Boyer
program and it works great.  I used quite large strings of Chinese
characters (target string of size 7KB and pattern string of 50
character-long, to avoid it falling back to String.indexOf() . I am
running a Chinese version of WinXP) to test it, and it works too.
Thanks Roedy.

Chris, the strings are downloaded web pages in memory. What my app (a
spider) does is to find out if a page contains one of several pattern
strings, if it is, then it's saved to disk.  I would like to know if my
current approach is an efficient one.
Chris Smith - 05 Dec 2005 19:50 GMT
> Chris, the strings are downloaded web pages in memory. What my app (a
> spider) does is to find out if a page contains one of several pattern
> strings, if it is, then it's saved to disk.  I would like to know if my
> current approach is an efficient one.

(Realizing your addressing your comment to Chris Uppal, but I'll respond
anyway.  Hope you don't mind.)

In this case, you can be practically guaranteed that String matching is
not your bottleneck.  It will be much faster than retrieving those same
strings over a network connection.

Signature

www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation

Hiran Chaudhuri - 06 Dec 2005 13:04 GMT
>> What's best way to handle this to achieve best performance?
>
> "best" always depends on the context, the word as such is rather
> meaningless. I assume you mean fast. Then, contrary to other suggestions,
> regexps are not a good choice at all. We have discussed this in the past.

Agree, best is ambiguous. But the message subject line contains 'quickest',
so the question must be runtime.

Hiran


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.