Hello Robert,
> You nest "+" and "*" which can lead to bad backtracking (which you seem
> to experience). If you wait long enough you'll see the result.
You are right it doesn't hang, it just works very slowly - it takes
about two minutes to finish this match. And the more "a"-letters
before the <BR>s are in the input string, the longer it takes.
> If you just want to test for the presence of "<BR>" you can do this:
>
[quoted text clipped - 4 lines]
> Pattern p = Pattern.compile("<BR>");
> boolean match = p.matcher(s).find();
I want actually to check the validness of an HTML string where only
the cerain tags are allowed. The original expression was:
(?ui)(([^<>]+)|(<\\/?(p|br|em|strong|strike|i|b|ul|ol|li)\\s*\\/?\
\s*>))*
but I reduced it to make the problem clearer.
Interestingly, the same regular expression match works instantly in
Perl:
$_ = "aaaaaaaaaaaaaaaaaaaaaaaa <BR><Bx>";
print "yes" if (m/^((<BR>)|([^<>]+))*$/);
And if I move <Bx> to the beginning of the string or change it to <BR>
then it works instantly also in Java.
Ilya
Oliver Wong - 31 May 2007 22:25 GMT
[snip regexp stuff]
> I want actually to check the validness of an HTML string where only
> the cerain tags are allowed. The original expression was:
> (?ui)(([^<>]+)|(<\\/?(p|br|em|strong|strike|i|b|ul|ol|li)\\s*\\/?\
> \s*>))*
> but I reduced it to make the problem clearer.
Note that it's impossible to check the validity of HTML using
"traditional" regular expressions, as HTML is not a regular language. Perl
regular expressions are actually more powerful than traditional regular
expressions; I'm not sure if Perl regular expressions are powerful enough
to recognize context-free languages or not. HTML is a context-free
language.
I think you'll have better luck using a pre-written HTML parser, or at
worst, writing your own full blown parser for checking your strings for
validity.
- Oliver
ilyabo@gmail.com - 01 Jun 2007 08:41 GMT
Hi Oliver,
in fact I don't want to really check the validity of HTML, but just
ensure that only allowed tags are used. So the logic of the regular
expression should be very simple: if there is something between <>
then it must be one of the predefined tags.
This is surely not impossible: the regular expression which I cited
works in Java too - it just takes too much time in Java, that's the
problem.
Regards
Ilya
> Note that it's impossible to check the validity of HTML using
> "traditional" regular expressions, as HTML is not a regular language. Perl
> regular expressions are actually more powerful than traditional regular
> expressions; I'm not sure if Perl regular expressions are powerful enough
> to recognize context-free languages or not. HTML is a context-free
> language.
Robert Klemme - 01 Jun 2007 12:34 GMT
> in fact I don't want to really check the validity of HTML, but just
> ensure that only allowed tags are used. So the logic of the regular
> expression should be very simple: if there is something between <>
> then it must be one of the predefined tags.
Actually that logic does not work because it won't handle <> in comments
and attributes properly. Using a full blown HTML parser (of which there
are plenty around) is certainly the safer choice.
> This is surely not impossible: the regular expression which I cited
> works in Java too - it just takes too much time in Java, that's the
> problem.
Often, changing approaches is more efficient. Why don't you do
// this regexp should be improved!
private static final Pattern TAG = Pattern.compile("<(\\w+)[^>]*>");
private static final Set valid = createValidTags();
static boolean isValid(CharSequence s) {
Matcher m = TAG.matcher(s);
while ( m.find() ) {
if ( !valid.contains(m.group(1).lowerCase()) ) {
return false;
}
}
return true;
}
Kind regards
robert
ilyabo@gmail.com - 01 Jun 2007 13:21 GMT
Hello Robert,
> Actually that logic does not work because it won't handle <> in comments
> and attributes properly.
It doesn't have to.
> Often, changing approaches is more efficient. Why don't you do
[code omitted]
Yes, that's just the same as what Ingo suggested and it works well
also on very long input strings.
Thank you!
Best regards
Ilya
And thanks, Robert, removing "+" helps alot, at least, for this input
string. But if I try a longer one, it still takes ages to match it.
Regards
Ilya
> > Pattern.matches("((<BR>)|([^<>]+))*", "aaaaaaaaaaaaaaaaaaaaaaaa
> > <BR><Bx>")
[quoted text clipped - 3 lines]
> You nest "+" and "*" which can lead to bad backtracking (which you seem
> to experience). If you wait long enough you'll see the result.
Ingo Menger - 01 Jun 2007 09:39 GMT
> And thanks, Robert, removing "+" helps alot, at least, for this input
> string. But if I try a longer one, it still takes ages to match it.
This is because your RE is a bit silly.
You try to match the whole document, which is going to take long.
I'd do it like this:
p = Pattern.compile("<(\\w+)\\b[^>]*>"); // matches also <A
href="foo">
m = p.matcher(s);
correct = true;
while (correct && m.find()) {
correct = allowed(m.group(1));
}
ilyabo@gmail.com - 01 Jun 2007 12:06 GMT
Thanks a lot, that's a good hint!
Regards
Ilya
> p = Pattern.compile("<(\\w+)\\b[^>]*>"); // matches also <A
> href="foo">
[quoted text clipped - 3 lines]
> correct = allowed(m.group(1));
> }