Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / November 2005

Tip: Looking for answers? Try searching our database.

Regular expression question -- exclude substring

Thread view: 
dreamerbin@gmail.com - 07 Nov 2005 23:19 GMT
Hi,

I'm having trouble extracting substrings using regular expression. Here
is my problem:

Want to find the substring that is immediately before a given
substring. For example: from
"00 noise1 01 noise2 00 target 01 target_mark",
want to get
"00 target 01"
which is before
"target_mark".
My regular expression
"(00.*?01) target_mark"
will extract
"00 noise1 01 noise2 00 target 01".

I'm thinking that the solution to my problem might be to use a regular
expression to exclude the substring "target_mark", which will replace
the part of ".*" above. However, I don't know how to exclude a
substring. Can anyone help on this? Or maybe give another solution to
my problem? Thanks very much.
hiwa - 07 Nov 2005 23:40 GMT
I think your description of requirement is vague and imprecise.
Here is a stub in the dark:

/*
from
"00 noise1 01 noise2 00 target 01 target_mark",
want to get
"00 target 01"
which is before
"target_mark".
*/
import java.util.regex.*;

public class Dreamer{

 public static void main(String[] args){
   String input = "00 noise1 01 noise2 00 target 01 target_mark";
   String regex = "00\\s\\S+\\s01(?=\\starget_mark)";

   Pattern pat = Pattern.compile(regex);
   Matcher mat = pat.matcher(input);
   while (mat.find()){
     System.out.println(mat.group());
   }
 }
}
Benji - 07 Nov 2005 23:50 GMT
> I think your description of requirement is vague and imprecise.
> Here is a stub in the dark:

stab?  =)

>     String regex = "00\\s\\S+\\s01(?=\\starget_mark)";

this will not work if "noise" contains any whitespace.  (I'm not sure
if it can or not - like you said, it was vague.)

Signature

Of making better designs there is no end,
 and much refactoring wearies the body.

dreamerbin@gmail.com - 08 Nov 2005 02:45 GMT
Great! Thanks a lot, guys. I appologize for my vague problem
description. Actually "noise" does contain whitespace since it is
normally a long string. I'm not sure if hiwa's solution can be tuned to
meet this requirement. For Benji's solution, fortunately "00" is short.
So this does solve my problem. But I'm curious to know how to solve the
general problem if the start symbol ("00" here) is long, such as
"startsymbol". Do we have to write down all the combinations in the
regular expression? Why did you say 'you can't say "anything that
doesn't contain 00"'? Is it a common sense?
Benji - 08 Nov 2005 03:07 GMT
> "startsymbol". Do we have to write down all the combinations in the
> regular expression? Why did you say 'you can't say "anything that
> doesn't contain 00"'? Is it a common sense?

Yes, you would have to write all of the combinations that it could not be.

Because of the way that regular expressions are turned into DFAs,
implementing a generalized "not" would be exactly as complicated as
doing what I did.  (the size of the DFA would be 2^n, where n is the
length of the string)

While it would be possible for them to implement, they chose not to,
since if someone did something like

"^(this really long string)", then the compiled DFA could be much larger
than the user intended if they didn't know what they were doing.

Signature

Of making better designs there is no end,
 and much refactoring wearies the body.

John C. Bollinger - 10 Nov 2005 12:30 GMT
>                                But I'm curious to know how to solve the
> general problem if the start symbol ("00" here) is long, such as
> "startsymbol". Do we have to write down all the combinations in the
> regular expression? Why did you say 'you can't say "anything that
> doesn't contain 00"'? Is it a common sense?

Benji gave a good answer to your actual question.  I just wanted to
reiterate Chris Uppal's recent assertion in another thread that regexes
are often not the right tool for the job.  It may be faster, clearer,
and all-around better to tokenize the input into grammatical tokens
first ("startsymbol" would then be a single token, for instance) and
then parse it with appropriate code (often not very difficult).  For
parsing complex languages you might want to look into the class of tools
that are designed for this job, lexical analyzers and parsers / compiler
compilers.

Signature

John Bollinger
jobollin@indiana.edu

Benji - 07 Nov 2005 23:41 GMT
> My regular expression
> "(00.*?01) target_mark"
> will extract
> "00 noise1 01 noise2 00 target 01".

> I'm thinking that the solution to my problem might be to use a regular
> expression to exclude the substring "target_mark", which will replace
> the part of ".*" above. However, I don't know how to exclude a
> substring. Can anyone help on this? Or maybe give another solution to
> my problem? Thanks very much.

in order to get the smallest substring out of that, you have to exclude
"00" from appearing in .*.  (I'm not sure why you have the question mark -
* intrinsically includes 0 repetitions)

since you can't say "anything that doesn't contain 00", my best whack at
it would be
"(00((0[^0])|([^0]))*01) target_mark"

Signature

Of making better designs there is no end,
 and much refactoring wearies the body.

Benji - 07 Nov 2005 23:44 GMT
> My regular expression
> "(00.*?01) target_mark"
> will extract
> "00 noise1 01 noise2 00 target 01".

> I'm thinking that the solution to my problem might be to use a regular
> expression to exclude the substring "target_mark", which will replace
> the part of ".*" above. However, I don't know how to exclude a
> substring. Can anyone help on this? Or maybe give another solution to
> my problem? Thanks very much.

in order to get the smallest substring out of that, you have to exclude
"00" from appearing in .*.  (I'm not sure why you have the question mark -
* intrinsically includes 0 repetitions)

since you can't say "anything that doesn't contain 00", you have to say
"(either 0 followed by not 0, or not 0) repeated"
my best whack at it would be
"(00((0[^0])|([^0]))*01) target_mark"

Signature

Of making better designs there is no end,
 and much refactoring wearies the body.

John C. Bollinger - 08 Nov 2005 03:03 GMT
>>My regular expression
>>"(00.*?01) target_mark"
[quoted text clipped - 10 lines]
> "00" from appearing in .*.  (I'm not sure why you have the question mark -
> * intrinsically includes 0 repetitions)

The ? makes the * a "reluctant quantifier".  Together, *? means the
smallest number of repetitions necessary for a successful match, and it
is a concise way of solving the type problem you raise.  Unfortunately,
it needs help to work in a situation where the opening and closing
delimiters are different.  Understanding why that's the case goes hand
in hand with finding a solution; the key here is that the regex engine
works from the left, and only backtracks when it absolutely has to.  In
the example, the regex engine latches on to the first 00 as the start of
the group, and never has to backtrack (to start the group at the second
00 instead) because it can obtain a complete match without doing so.

Here's an example of a pattern that works:
    Pattern pat = Pattern.compile(".*(00.*?01) target_mark");
    Matcher matcher = pat.matcher(
            "00 noise1 01 noise2 00 target 01 target_mark");
    if (matcher.matches()) {  // or .find() or .lookingAt()
        System.out.println("The pattern selected this group: '"
            + matcher.group(1) + "'");
    } else {
        System.out.println("The pattern did not match");
    }

Output is:
The pattern selected this group: '00 target 01'

*Why* that works is left as an exercise for the reader :-)

Extra credit: is the reluctant quantifier necessary in the above
pattern?  Would a greedy ("normal") one do the job as well?  (Hint:
yes.) Why or why not?

> since you can't say "anything that doesn't contain 00",  you have to say
> "(either 0 followed by not 0, or not 0) repeated"
> my best whack at it would be
> "(00((0[^0])|([^0]))*01) target_mark"

And that one does the job for the sample data too, though it can fail if
the target or noise is permitted to abut the 01 delimiter without
intervening whitespace.

Signature

John Bollinger
jobollin@indiana.edu



Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.