Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / June 2007

Tip: Looking for answers? Try searching our database.

buggy regexp

Thread view: 
ilyabo@gmail.com - 31 May 2007 15:33 GMT
Hello all!

The following line hangs in Java:

Pattern.matches("((<BR>)|([^<>]+))*", "aaaaaaaaaaaaaaaaaaaaaaaa
<BR><Bx>")

What's wrong with the regular expression?

Thanks in advance
Ilya
Robert Klemme - 31 May 2007 15:40 GMT
> Hello all!
>
[quoted text clipped - 4 lines]
>
> What's wrong with the regular expression?

You nest "+" and "*" which can lead to bad backtracking (which you seem
to experience).  If you wait long enough you'll see the result.

If you just want to test for the presence of "<BR>" you can do this:

boolean match = str.indexOf("<BR>") != -1;

Or as regexp

Pattern p = Pattern.compile("<BR>");
boolean match = p.matcher(s).find();

Kind regards

    robert
ilyabo@gmail.com - 31 May 2007 18:43 GMT
Hello Robert,

> You nest "+" and "*" which can lead to bad backtracking (which you seem
> to experience).  If you wait long enough you'll see the result.

You are right it doesn't hang, it just works very slowly - it takes
about two minutes to finish this match. And the more "a"-letters
before the <BR>s are in the input string, the longer it takes.

> If you just want to test for the presence of "<BR>" you can do this:
>
[quoted text clipped - 4 lines]
> Pattern p = Pattern.compile("<BR>");
> boolean match = p.matcher(s).find();

I want actually to check the validness of an HTML string where only
the cerain tags are allowed. The original expression was:
(?ui)(([^<>]+)|(<\\/?(p|br|em|strong|strike|i|b|ul|ol|li)\\s*\\/?\
\s*>))*
but I reduced it to make the problem clearer.

Interestingly, the same regular expression match works instantly in
Perl:
$_ = "aaaaaaaaaaaaaaaaaaaaaaaa <BR><Bx>";
print "yes" if (m/^((<BR>)|([^<>]+))*$/);

And if I move <Bx> to the beginning of the string or change it to <BR>
then it works instantly also in Java.

Ilya
Oliver Wong - 31 May 2007 22:25 GMT
[snip regexp stuff]

> I want actually to check the validness of an HTML string where only
> the cerain tags are allowed. The original expression was:
> (?ui)(([^<>]+)|(<\\/?(p|br|em|strong|strike|i|b|ul|ol|li)\\s*\\/?\
> \s*>))*
> but I reduced it to make the problem clearer.

   Note that it's impossible to check the validity of HTML using
"traditional" regular expressions, as HTML is not a regular language. Perl
regular expressions are actually more powerful than traditional regular
expressions; I'm not sure if Perl regular expressions are powerful enough
to recognize context-free languages or not. HTML is a context-free
language.

   I think you'll have better luck using a pre-written HTML parser, or at
worst, writing your own full blown parser for checking your strings for
validity.

   - Oliver
ilyabo@gmail.com - 01 Jun 2007 08:41 GMT
Hi Oliver,

in fact I don't want to really check the validity of HTML, but just
ensure that only allowed tags are used. So the logic of the regular
expression should be very simple: if there is something between <>
then it must be one of the predefined tags.

This is surely not impossible: the regular expression which I cited
works in Java too - it just takes too much time in Java, that's the
problem.

Regards
Ilya

>     Note that it's impossible to check the validity of HTML using
> "traditional" regular expressions, as HTML is not a regular language. Perl
> regular expressions are actually more powerful than traditional regular
> expressions; I'm not sure if Perl regular expressions are powerful enough
> to recognize context-free languages or not. HTML is a context-free
> language.
Robert Klemme - 01 Jun 2007 12:34 GMT
> in fact I don't want to really check the validity of HTML, but just
> ensure that only allowed tags are used. So the logic of the regular
> expression should be very simple: if there is something between <>
> then it must be one of the predefined tags.

Actually that logic does not work because it won't handle <> in comments
and attributes properly.  Using a full blown HTML parser (of which there
are plenty around) is certainly the safer choice.

> This is surely not impossible: the regular expression which I cited
> works in Java too - it just takes too much time in Java, that's the
> problem.

Often, changing approaches is more efficient.  Why don't you do

// this regexp should be improved!
private static final Pattern TAG = Pattern.compile("<(\\w+)[^>]*>");
private static final Set valid = createValidTags();

static boolean isValid(CharSequence s) {
  Matcher m = TAG.matcher(s);

  while ( m.find() ) {
    if ( !valid.contains(m.group(1).lowerCase()) ) {
      return false;
    }
  }

  return true;
}

Kind regards

    robert
ilyabo@gmail.com - 01 Jun 2007 13:21 GMT
Hello Robert,

> Actually that logic does not work because it won't handle <> in comments
> and attributes properly.

It doesn't have to.

> Often, changing approaches is more efficient.  Why don't you do
[code omitted]

Yes, that's just the same as what Ingo suggested and it works well
also on very long input strings.

Thank you!

Best regards
Ilya
ilyabo@gmail.com - 31 May 2007 19:15 GMT
And thanks, Robert, removing "+" helps alot, at least, for this input
string. But if I try a longer one, it still takes ages to match it.

Regards
Ilya

> > Pattern.matches("((<BR>)|([^<>]+))*", "aaaaaaaaaaaaaaaaaaaaaaaa
> > <BR><Bx>")
[quoted text clipped - 3 lines]
> You nest "+" and "*" which can lead to bad backtracking (which you seem
> to experience).  If you wait long enough you'll see the result.
Ingo Menger - 01 Jun 2007 09:39 GMT
> And thanks, Robert, removing "+" helps alot, at least, for this input
> string. But if I try a longer one, it still takes ages to match it.

This is because your RE is a bit silly.
You try to match the whole document, which is going to take long.
I'd do it like this:

p = Pattern.compile("<(\\w+)\\b[^>]*>");  // matches also <A
href="foo">
m = p.matcher(s);
correct = true;
while (correct && m.find()) {
 correct = allowed(m.group(1));
}
ilyabo@gmail.com - 01 Jun 2007 12:06 GMT
Thanks a lot, that's a good hint!

Regards
Ilya

> p = Pattern.compile("<(\\w+)\\b[^>]*>");  // matches also <A
> href="foo">
[quoted text clipped - 3 lines]
>   correct = allowed(m.group(1));
> }


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.