Java Forum / General / May 2006

regexp lookahead

Michael Powe - 03 May 2006 17:30 GMT
I experimented a bit with the Java regexp lookahead functionality, and
the results don't make sense to me.  The test is below.

8<========================================>8
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateTest {
    public static void main(String[] args) {
        // negative lookahead
        String re = "(.*)\\[(?!\\S+)\\](.*)";
        // positive lookahead (swap with the line above to test it)
        //String re = "(.*)\\[(?=\\S+)\\](.*)";
        String test = "this is [sometext] and some more";
        String test2 = "this is [] and some more";

        Pattern p = Pattern.compile(re);
        Matcher m = p.matcher(test);
        if (m.find()) {
            System.out.println("success match one");
            for (int i = 0; i <= m.groupCount(); i++) {
                System.out.println("Group " + i + " " + m.group(i));
            }
        } else {
            System.out.println("fail match one");
        }
        Matcher m2 = p.matcher(test2);
        if (m2.find()) {
            System.out.println("success match two");
            for (int i = 0; i <= m2.groupCount(); i++) {
                System.out.println("Group " + i + " " + m2.group(i));
            }
        } else {
            System.out.println("fail match two");
        }
    } // end main
}

8<========================================>8
Here's the output for positive lookahead:

cd /home/powem/src/java/
/opt/jdk1.5/bin/java DateTest

fail match one
success match two
Group 0 this is [] and some more
Group 1 this is
Group 2  and some more

And for negative lookahead:

cd /home/powem/src/java/
/opt/jdk1.5/bin/java DateTest

fail match one
fail match two
8<========================================>8

Thus, negative lookahead appears to be useless, since it fails whether
the text is there or not.  Positive lookahead appears to do the
opposite of what you would expect: it fails if the condition is true
(text is there) and succeeds if the condition is false.

Things that make me go "hmmm."

Am I making some fundamental error here?

Thanks.

mp

Signature

Michael Powe        michael@trollope.org        Naugatuck CT USA
"We had pierced the veneer of outside things.  We had `suffered,
starved, and triumphed, groveled down yet grasped at glory, grown
bigger in the bigness of the whole.'  We had seen God in his
splendors, heard the text that Nature renders.  We had reached the
naked soul of man." -- Sir Ernest Shackleton, <South>

Jussi Piitulainen - 03 May 2006 21:40 GMT
> I experimented a bit with the Java regexp lookahead functionality,
> and the results don't make sense to me. The test is below.

...

Negative lookahead:

> String re = "(.*)\\[(?!\\S+)\\](.*)";

The look-ahead pattern and the following pattern match at the same
position: (?!\S+) matches the empty string between the \[ and
something that _fails_ to match \S+ at that position, and that
something should start with the \]. Where can this happen?

Positive lookahead:

> String re = "(.*)\\[(?=\\S+)\\](.*)";

The look-ahead pattern and the following pattern match at the same
position: (?=\S+) matches the empty string between the \[ and before
an \S+, and that \S+ should start with the \]. Where can this happen?
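A minimal sketch of the point about position, assuming a helper named
firstMatch (hypothetical, not from the thread). It shows that the
lookahead adds no characters to the match, so whatever follows it in
the pattern has to start right after the "[".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadPosition {
    // Returns the text of the first match, or null if the regex matches nowhere.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "this is [sometext] and some more";
        // The lookahead contributes no characters: only "[" is matched.
        System.out.println(firstMatch("\\[(?=\\S+)", input));
        // What follows the lookahead starts matching right after the "[".
        System.out.println(firstMatch("\\[(?=\\S+)\\w+", input));
        // So a "]" after the lookahead must also sit right after the "[",
        // which is not the case here: this prints null.
        System.out.println(firstMatch("\\[(?=\\S+)\\]", input));
    }
}
```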

(Javadoc for 1.4.2 was not too helpful here, so I experimented a bit,
never having used these myself.)

Michael Powe - 04 May 2006 11:58 GMT
>>>>> "Jussi" == Jussi Piitulainen <jpiitula@ling.helsinki.fi> writes:

   Jussi> Michael Powe writes:

   >> I experimented a bit with the Java regexp lookahead
   >> functionality, and the results don't make sense to me. The test
   >> is below.

   Jussi> ...

   Jussi> Negative lookahead:

   >> String re = "(.*)\\[(?!\\S+)\\](.*)";

   Jussi> The look-ahead pattern and the following pattern match at
   Jussi> the same position: (?!\S+) matches the empty string between
   Jussi> the \[ and something that _fails_ to match \S+ at that
   Jussi> position, and that something should start with the
   Jussi> \]. Where can this happen?

In my test, it happens everywhere -- the regexp fails when there's
nothing there and when there's text there.

   Jussi> Positive lookahead:

   >> String re = "(.*)\\[(?=\\S+)\\](.*)";

   Jussi> The look-ahead pattern and the following pattern match at
   Jussi> the same position: (?=\S+) matches the empty string between
   Jussi> the \[ and before an \S+, and that \S+ should start with
   Jussi> the \]. Where can this happen?

The reason for my testing was because the regexp fails to match the
case where there is nothing between the brackets.  Note that the
brackets are not included in the group, they are part of the original
text string only:

[(\\S+)]

This fails as indicated in my example.  It's explainable but, to me
anyway, counterintuitive that "positive" lookahead -- which is
supposed to *confirm* the existence of a match -- fails when there is
a match and succeeds when there isn't.

In the real-world case that led me to examine the lookahead option, I
had a regexp matching a long string (9 group captures) that failed
when one of the expected groups, inside a bracket pair, was
empty. \\S+ does not match inside [] and thus caused the whole regex
to fail.  I would like to see a useful, nontrivial application of
lookahead.  It doesn't appear to me that there is one.

And the negative lookahead just appears broken.

   Jussi> (Javadoc for 1.4.2 was not too helpful here, so I
   Jussi> experimented a bit, never having used these myself.)

I actually have Habibi's book, _Java Regular Expressions_, but IMO it
is not very useful if you already have good knowledge of regex.  It
does have some value as a method reference and for information about
how things work behind the scenes.  However, I don't know that I
needed to spend that much money for that amount of information.  Its
explanations and sample code for lookahead, however, are incomplete
and trivial, respectively.  And, finding a typographical error on page
2 and another on page 3 is really off-putting.

Ironically, Habibi criticizes perl's conditional construct in regex,
and it is exactly that construct that I need in the case described
here.

Thanks.

mp

Signature

Michael Powe        michael@trollope.org        Naugatuck CT USA
War is a sociological safety valve that cleverly diverts popular
hatred for the ruling classes into a happy occasion to mutilate or
kill foreign enemies. - Ernest Becker

Jussi Piitulainen - 04 May 2006 17:21 GMT
> >>>>> "Jussi" == Jussi Piitulainen writes:
>
[quoted text clipped - 10 lines]
> In my test, it happens everywhere -- the regexp fails when there's
> nothing there and when there's text there.

Right, except I would say _nowhere_ rather than everywhere. If (?!\S+)
matches, \] does not. If \] matches, (?!\S+) does not.
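A small sketch of why the two conditions exclude each other, using a
hypothetical helper named found. The "]" is itself a non-whitespace
character, so (?!\S+) already fails in front of it; the lookahead only
succeeds after "[" when whitespace (or the end of input) comes next.
The third input string is made up for illustration.

```java
import java.util.regex.Pattern;

class NegativeLookaheadDemo {
    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "[" followed by "]": the "]" satisfies \S+, so (?!\S+) fails.
        System.out.println(found("\\[(?!\\S+)", "this is [] and some more"));
        // "[" followed by text: \S+ matches, so (?!\S+) fails again.
        System.out.println(found("\\[(?!\\S+)", "this is [sometext] and some more"));
        // "[" followed by a space: \S+ cannot match, so (?!\S+) succeeds.
        System.out.println(found("\\[(?!\\S+)", "this is [ padded ] and some more"));
        // With "\\]" required right after the lookahead, nothing can match.
        System.out.println(found("\\[(?!\\S+)\\]", "this is [] and some more"));
    }
}
```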

>     Jussi> Positive lookahead:
>
[quoted text clipped - 7 lines]
> The reason for my testing was because the regexp fails to match the
> case where there is nothing between the brackets.  Note that the

I thought that was the case that succeeded. That pattern is just like
(.*)\[\](.*) with an extra condition that the part of input that
matches \](.*) must also match \S+, which it does, since the ] is
there.

Are you sure that you understand that a lookahead pattern always
consumes an empty string? So your whole pattern can only match a pair
of brackets [], with the two groups on each side of it.
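To check the claim that the positive-lookahead pattern behaves just
like (.*)\[\](.*), here is a rough sketch that runs both patterns on
the same input and compares the results. The class and method names
are hypothetical; only the two test strings come from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EquivalentPatterns {
    static final Pattern WITH_LOOKAHEAD = Pattern.compile("(.*)\\[(?=\\S+)\\](.*)");
    static final Pattern PLAIN = Pattern.compile("(.*)\\[\\](.*)");

    // True if both patterns agree on whether they match and on both groups.
    static boolean sameOutcome(String input) {
        Matcher a = WITH_LOOKAHEAD.matcher(input);
        Matcher b = PLAIN.matcher(input);
        boolean foundA = a.find();
        boolean foundB = b.find();
        if (foundA != foundB) {
            return false;
        }
        if (!foundA) {
            return true; // both failed, which counts as agreement
        }
        return a.group(1).equals(b.group(1)) && a.group(2).equals(b.group(2));
    }

    public static void main(String[] args) {
        System.out.println(sameOutcome("this is [sometext] and some more"));
        System.out.println(sameOutcome("this is [] and some more"));
    }
}
```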

> In the real-world case that led me to examine the lookahead option,
> I had a regexp matching a long string (9 group captures) that failed
> when one of the expected groups, inside a bracket pair, was empty.
> \\S+ does not match inside [] and thus caused the whole regex to
> fail.

\S matches the right bracket, and eats it, too. (?=\S+) also matches
the right bracket but doesn't eat it.
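The difference between eating and not eating the bracket can be seen
in a short sketch (helper name hypothetical): \S+ takes the "]" into
the match, the lookahead does not, and a "\]" placed after \S+ then
has nothing left to match inside an empty pair.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsumeVersusPeek {
    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "this is [] and some more";
        // \S+ consumes the right bracket: the match is "[]".
        System.out.println(firstMatch("\\[\\S+", input));
        // (?=\S+) only looks at it: the match is "[".
        System.out.println(firstMatch("\\[(?=\\S+)", input));
        // \S+ needs at least one character and the only candidate is "]",
        // so no "]" is left for the closing "\\]": prints null.
        System.out.println(firstMatch("\\[\\S+\\]", input));
    }
}
```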

Nine groups sounds rather complicated. Do you need to do it all in one
expression?

> I would like to see a useful, nontrivial application of lookahead.
> It doesn't appear to me that there is one.

I think there is a candidate in the other post I made, this morning I
think, where someone wanted to split a certain file at each <?xml...>
thingamajig in it.
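A sketch of that kind of split, under the assumption that the input is
simply several XML documents concatenated into one string (the other
post is not reproduced here, so the sample data and names are made
up). Splitting on the zero-width lookahead (?=<\?xml) cuts in front of
each declaration without removing it.

```java
import java.util.ArrayList;
import java.util.List;

class SplitAtDeclarations {
    // Splits in front of every "<?xml" and drops any empty piece,
    // since older Java versions may return an empty leading string.
    static List<String> splitDocuments(String text) {
        List<String> parts = new ArrayList<String>();
        for (String part : text.split("(?=<\\?xml)")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "<?xml version=\"1.0\"?><a/><?xml version=\"1.0\"?><b/>";
        for (String doc : splitDocuments(text)) {
            System.out.println(doc);
        }
    }
}
```

Because the lookahead consumes nothing, each piece still begins with
its own <?xml ...> declaration.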

(Which reminds me, you might consider the use of non-greedy patterns,
like .*?, since those .* try to eat the bracket pairs, too, and that
may lead to something that feels unintuitive.)
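To illustrate the greedy versus non-greedy remark, here is a sketch on
a made-up input with two bracket pairs. The patterns use \S* inside
the brackets, which is an assumption for the example and not the
pattern from the original post.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVersusReluctant {
    // Returns capture groups 1..n of the first match, or null if none.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            out[i - 1] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "a [x] b [y] c";
        // Greedy: the first .* swallows the first bracket pair as well,
        // so the captured bracket content is "y".
        System.out.println(Arrays.toString(groups("(.*)\\[(\\S*)\\](.*)", input)));
        // Reluctant: the first .*? stops at the first "[",
        // so the captured bracket content is "x".
        System.out.println(Arrays.toString(groups("(.*?)\\[(\\S*?)\\](.*)", input)));
    }
}
```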

> And the negative lookahead just appears broken.

Let me contrive an example of sorts: a maximal digit sequence not
bounded by a . or a - or an e.

import java.util.regex.Matcher;
import java.util.regex.Pattern;
class NonLook {
 public static void main(String[] args) { // "_" is not a legal identifier since Java 9
   Matcher m = Pattern
     .compile("(?<![.e\\-\\d])\\d++(?![.e\\-])")
     .matcher("pi 3.14 314e-2 1024 e 2.7 27e-1 31415926");
   while (m.find()) {
     System.out.println(m.group(0));
   }
 }
}

Ok, I had to throw in a lookbehind, a possessive quantifier in \d++,
and a \d inside the lookbehind. This does not eat the preceding or
following character, and matches even where there is no following
character at all. It seems to work.
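The reason the possessive quantifier is needed can be sketched by
comparing \d++ with plain \d+ on one of the tokens from the example.
With plain \d+ the matcher backs off one digit, and the shorter run
then passes the lookahead because the next character is a digit. The
helper name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveDigits {
    // Collects every match of the regex in the input, in order.
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "314e-2";
        // Greedy \d+ gives back the "4", and "31" followed by "4" is accepted.
        System.out.println(allMatches("(?<![.e\\-\\d])\\d+(?![.e\\-])", input));
        // Possessive \d++ never gives digits back, so nothing matches.
        System.out.println(allMatches("(?<![.e\\-\\d])\\d++(?![.e\\-])", input));
    }
}
```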

>     Jussi> (Javadoc for 1.4.2 was not too helpful here, so I
>     Jussi> experimented a bit, never having used these myself.)
>
> I actually have Habibi's book, _Java Regular Expressions_, but IMO
> it is not very useful if you already have good knowledge of regex.

Does it tell what (?>X) does? Sun's doc says it matches "X, as an
independent, non-capturing group". I have no idea what an independent
group is. (I know that I'm not looking at the latest documentation.)
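For what it is worth, (?>X) is what other regex flavours call an
atomic group: once X has matched, the matcher does not go back into it
to try a different alternative. A minimal sketch with made-up
patterns, comparing it with an ordinary non-capturing group:

```java
import java.util.regex.Pattern;

class AtomicGroupDemo {
    // True if the regex matches the entire input.
    static boolean matchesAll(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Ordinary group: "bc" is tried first, the final "c" then fails,
        // so the matcher backtracks to the alternative "b" and succeeds.
        System.out.println(matchesAll("a(?:bc|b)c", "abc"));
        // Atomic group: "bc" is kept once matched, no "c" is left, no match.
        System.out.println(matchesAll("a(?>bc|b)c", "abc"));
        // With a second "c" in the input the atomic version matches too.
        System.out.println(matchesAll("a(?>bc|b)c", "abcc"));
    }
}
```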

...

> Ironically, Habibi criticizes perl's conditional construct in regex,
> and it is exactly that construct that I need in the case described
> here.

There are likely to be other ways.

If your problem is that a pair of brackets in your input may contain
an empty string that you need to match, then you need to match an
empty string there. There is no way around that.
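One way to do that, sketched here as an assumption about what the
real pattern needs, is to use \S* instead of \S+ inside the brackets,
so that zero characters are allowed. The helper name is hypothetical;
the two inputs are the test strings from the first post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyBrackets {
    static final Pattern BRACKETED = Pattern.compile("(.*)\\[(\\S*)\\](.*)");

    // Returns the text between the brackets (possibly empty),
    // or null if the input has no matching bracket pair.
    static String bracketContent(String input) {
        Matcher m = BRACKETED.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Prints the content wrapped in quotes so the empty case is visible.
        System.out.println("'" + bracketContent("this is [sometext] and some more") + "'");
        System.out.println("'" + bracketContent("this is [] and some more") + "'");
    }
}
```

The group for the bracket content still exists when the brackets are
empty; it just holds the empty string, so the remaining groups keep
their numbers.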

