Java Forum / General / June 2007

Is matching against several regexes really this clumsy?

joosteto@gmail.com - 22 Jun 2007 12:05 GMT
/*
I'd like to search for several regexes in a (large) String, walking
through the string.
In order not to copy the String all the time, I thought I'd use
matcherObject.find(position), where
position is set to matcherObject.end() whenever a regex is found.
For example, search for the regexes:
    ABLEWORD:   \b\S*able\b
    FULWORD:    \b\S*ful\b
    ANYWORD:    \b\S+\b
    SPACE:      \s+

The only way I found was to create a Pattern and a Matcher for each
regex I want to search for, and use \\G
to make the matcherObject.find(position) start at position (not the
"previous match" as the documentation
claims), as I do in the code below.

Now, my question is: does it really have to be this clumsy?
(declaring two objects for each regex, having to copy end position
from last match, etc)

And, does "\G" really mean match from start index for
matcherObject.find(index), and not match from end
of previous match, as claimed by the documentation
http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html
*/

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Scan {
   public Scan() {
   }
   public static void main(String[] args){

       int pos=0;
       String s="a beautiful string with matchable words";

       Pattern able=Pattern.compile("\\G\\b(\\S*able)\\b");
       Matcher matchAble=able.matcher(s);

       Pattern ful=Pattern.compile("\\G\\b(\\S*ful)\\b");
       Matcher matchFul=ful.matcher(s);

       Pattern any=Pattern.compile("\\G(\\S+)");
       Matcher matchAny=any.matcher(s);

       Pattern space=Pattern.compile("\\G(\\s+)");
       Matcher matchSpace=space.matcher(s);

       while(pos<s.length()){
           if(matchAble.find(pos)){
               pos=matchAble.end();
               System.out.print("ABLE:  \""+matchAble.group(1)+"\", ");
           } else if(matchFul.find(pos)){
               pos=matchFul.end();
               System.out.print("FUL: \""+matchFul.group(1)+"\", ");
           } else if(matchAny.find(pos)){
               pos=matchAny.end();
               System.out.print("ANY: \""+matchAny.group(1)+"\", ");
           } else if(matchSpace.find(pos)){
               pos=matchSpace.end();
               System.out.print("SPACE: \""+matchSpace.group(1)+"\", ");
           } else {
               System.out.println("No match found at: \""+s.substring(pos)+"\"");
               break;
           }
       }
   }
}
Oliver Wong - 22 Jun 2007 18:05 GMT
> /*
> I'd like to search for several regex's in a (large) String, walking
[quoted text clipped - 17 lines]
> (declaring two objects for each regex, having to copy end position
> from last match, etc)

   It looks like you're reinventing lexical analysis. You may find it
less clumsy to reuse the existing algorithms and tools:
http://en.wikipedia.org/wiki/Lexical_analysis
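
As a rough sketch of what a small hand-rolled lexer for the four token classes above might look like (an illustration only, not code posted in this thread): a single Pattern holds the alternatives as numbered groups, and \G forces each find() to start exactly where the previous token ended. The class name MiniLexer, the tokenize helper and the "NAME:text" output format are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiniLexer {
    // Token names, in the same order as the capturing groups in TOKEN.
    private static final String[] NAMES = {"ABLE", "FUL", "ANY", "SPACE"};

    // One pattern for all token classes. Alternatives are tried left to
    // right, so the more specific ones (able, ful) come before the catch-all.
    private static final Pattern TOKEN = Pattern.compile(
        "\\G(?:(\\b\\S*able\\b)|(\\b\\S*ful\\b)|(\\S+)|(\\s+))");

    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        // Successive find() calls continue from the end of the last match.
        while (m.find()) {
            for (int g = 1; g <= NAMES.length; g++) {
                if (m.group(g) != null) {   // the group that actually matched
                    tokens.add(NAMES[g - 1] + ":" + m.group(g));
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String t : tokenize("a beautiful string with matchable words")) {
            System.out.println(t);
        }
    }
}
```

This keeps one Pattern and one Matcher instead of a pair per regex, and no position has to be copied by hand. A generated lexer (from a tool such as JFlex) would probably still be the better choice once the token set grows.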

> And, does "\G" really mean match from start index for
> matcherObject.find(index), and not match from end
> of previous match, as claimed by the documentation
> http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html
> */

   I'd assume the documentation is correct, but I haven't verified it
personally.

   - Oliver
joosteto@gmail.com - 25 Jun 2007 10:34 GMT
> >     ABLEWORD:   \b\S*able\b
> >     FULWORD:    \b\S*ful\b
> >     ANYWORD:    \b\S+\b
> >     SPACE:      \s+

>     It looks like you're reinventing lexical analysis. You may find it
True:). Also, I'm learning java.

> less clumsy to reuse the existing algorithms and tools: http://en.wikipedia.org/wiki/Lexical_analysis

Thanks, I'll take a look at JFlex.
Joshua Cranmer - 22 Jun 2007 21:01 GMT
> And, does "\G" really mean match from start index for
> matcherObject.find(index), and not match from end of previous match, as
> claimed by the documentation
> http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html */

I would trust the documentation, especially given your code:

> [snip]
>         Pattern able=Pattern.compile("\\G\\b(\\S*able)\\b"); Matcher
[quoted text clipped - 3 lines]
>         matchFul=ful.matcher(s);
> [snip]

"\G" probably means from the end of the previous match, but you're using
four different matchers, so the end of the "previous" match that the
Matcher sees is not the one you're thinking of.
joosteto@gmail.com - 25 Jun 2007 10:25 GMT
> > And, does "\G" really mean match from start index for
> > matcherObject.find(index), and not match from end of previous match, as
[quoted text clipped - 14 lines]
> four different matchers, so the end of the "previous" match that the
> Matcher sees is not the one you're thinking of.

The code works perfectly OK, and \G matches not from the end of
the previous match, but from the index passed to matchFul.find(index).
That is indeed not how it is described in the manual.
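
A small check of this behaviour, offered as a sketch rather than a definitive statement about the library: the Javadoc for Matcher.find(int) says the method resets the matcher before searching, which would explain the observation, since after a reset there is no previous match left for \G to refer to. The class name GAnchorDemo and its helper methods are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GAnchorDemo {
    // Formats a match result as "text@start", or "no match".
    private static String describe(Matcher m, boolean found) {
        return found ? m.group() + "@" + m.start() : "no match";
    }

    // find(index) on a fresh matcher.
    static String findFrom(String regex, String input, int index) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return describe(m, m.find(index));
    }

    // find(index) on a matcher that has already matched once with find().
    static String findFromAfterEarlierMatch(String regex, String input, int index) {
        Matcher m = Pattern.compile(regex).matcher(input);
        m.find();                       // earlier match, discarded by the reset below
        return describe(m, m.find(index));
    }

    public static void main(String[] args) {
        String s = "ab cd";
        // \G anchors at index 3, so "cd" is found.
        System.out.println(findFrom("\\G\\S+", s, 3));
        // Index 2 is a space; \G forbids skipping ahead, so nothing is found.
        System.out.println(findFrom("\\G\\S+", s, 2));
        // Without \G the search may skip the space and find "cd".
        System.out.println(findFrom("\\S+", s, 2));
        // An earlier match ending at 2 does not stop \G from anchoring at 3.
        System.out.println(findFromAfterEarlierMatch("\\G\\S+", s, 3));
    }
}
```

When find() is called without an argument, the matcher is not reset, and \G then refers to the end of that same matcher's previous match.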
Roedy Green - 23 Jun 2007 10:41 GMT
On Fri, 22 Jun 2007 11:05:00 -0000, "joosteto@gmail.com"
<joosteto@gmail.com> wrote, quoted or indirectly quoted someone who
said :

>I'd like to search for several regex's in a (large) String, walking
>through the string.
[quoted text clipped - 6 lines]
>     ANYWORD:    \b\S+\b
>     SPACE:      \s+

What you might do if you need more speed is use a Boyer-Moore
algorithm to search for several strings simultaneously.  When you find
a decent candidate, then fire up your regexes.

I have written a single-string Boyer-Moore search you could use to get
started.

See http://mindprod.com/products1.html#BOYER
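
The "cheap literal search first, regex second" idea can be sketched roughly as follows. Note the assumptions: String.indexOf stands in for a real Boyer-Moore search (it is not the Boyer class linked above), only the "able" token is handled, and the class name PrefilterDemo and the method ableWords are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefilterDemo {
    // Confirms that a candidate really is a word ending in "able".
    private static final Pattern ABLE_WORD = Pattern.compile("\\S*able\\b");

    static List<String> ableWords(String s) {
        List<String> words = new ArrayList<>();
        Matcher m = ABLE_WORD.matcher(s);
        // Literal search for candidates; indexOf is a stand-in for Boyer-Moore.
        int hit = s.indexOf("able");
        while (hit >= 0) {
            // Walk back to the start of the whitespace-delimited word.
            int start = hit;
            while (start > 0 && !Character.isWhitespace(s.charAt(start - 1))) {
                start--;
            }
            int next = hit + 1;
            // Only now run the regex, restricted to the candidate position.
            m.region(start, s.length());
            if (m.lookingAt()) {
                words.add(m.group());
                next = m.end();
            }
            hit = s.indexOf("able", next);
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(ableWords("a beautiful string with matchable words"));
    }
}
```

Whether this is actually faster than running the regex alone would need measuring on the real input.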

Regexes are for lightweight parsing tasks.  You may need a
parser.  See http://mindprod.com/jgloss/parser.html
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

