> I have the following problem to solve with regular expressions:
>
[quoted text clipped - 13 lines]
> But I don't know how to combine this with the other rules. I'd
> appreciate any help.
This is hard to get right, even for someone who's not worried
about the syntax of regular expressions. For example, the rules
you've given are very simple to write as a RE, but unfortunately
allow a user to insert arbitrary Javascript into the anchor tag.
Oops.
If you're doing this for real, don't. Find a third-party
implementation that you trust and that you can, through
a relaxed license or cash money, swipe.
If it's homework, the easiest way to write a complicated
regular expression is to write simpler ones first. I'd start
by detecting pairs of (<,>), and build from there. Edi
Weitz's Regex Coach is pretty cool for trying patterns out:
http://weitz.de/regex-coach/
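As a rough illustration of that first step, here is a minimal Java sketch that only finds "<...>" pairs and does nothing else. The class and method names are made up for this example; the pattern is one simple reading of "pairs of (<,>)", not a complete tag grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects every "<...>" span in a string.
final class AngleBracketScanner {
    // A '<', then any run of characters that are not angle brackets, then '>'.
    private static final Pattern TAG_LIKE = Pattern.compile("<[^<>]*>");

    /** Returns every "<...>" span found in the input, in order of appearance. */
    static List<String> findTagLike(String input) {
        List<String> spans = new ArrayList<>();
        Matcher m = TAG_LIKE.matcher(input);
        while (m.find()) {
            spans.add(m.group());
        }
        return spans;
    }
}
```

Once the spans are isolated like this, each one can be compared against a short list of allowed tags, which is a much smaller problem than matching the whole text at once.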

Signature
Mark Jeffcoat
Austin, TX
> I have the following problem to solve with regular expressions:
>
[quoted text clipped - 13 lines]
> But I don't know how to combine this with the other rules. I'd
> appreciate any help.
I would suggest using an existing XML parser to validate tags; it's much
less error-prone than an RE, and more flexible when your business rules
change.
A SAX parser would probably be "good enough" in this case.
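A minimal sketch of what that could look like with the JDK's built-in SAX parser. Assumptions, none of which come from the original question: the input is wrapped in a made-up <root> element so it parses as one document, tag names are compared case-sensitively, and attributes are only tolerated on <a>. It does not check the ISO-8859-1 rule, and it does not inspect what the <a> attributes contain.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical validator: accepts only well-formed fragments made of <b>, <br/> and <a>.
final class SaxTagCheck {
    private static final Set<String> ALLOWED = new HashSet<>(Arrays.asList("b", "br", "a"));

    static boolean isValid(String fragment) {
        // Wrapper element (an assumption of this sketch) so the fragment has a single root.
        String doc = "<root>" + fragment + "</root>";
        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0;

            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) throws SAXException {
                boolean isWrapper = (depth == 0);
                depth++;
                if (isWrapper) {
                    return; // the outermost element is our own <root>
                }
                if (!ALLOWED.contains(qName)) {
                    throw new SAXException("tag not allowed: " + qName);
                }
                if (!qName.equals("a") && atts.getLength() > 0) {
                    throw new SAXException("attributes not allowed on: " + qName);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                depth--;
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(doc)), handler);
            return true;
        } catch (Exception e) {
            // Malformed markup, a forbidden tag, or a parser setup problem.
            return false;
        }
    }
}
```

Note that this is stricter than a tag whitelist: unclosed or badly nested tags, and entities such as &nbsp; that XML does not define, are rejected as well.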
"birgit" <kbirgit@gmx.de> wrote or quoted in
Message-ID: <1163698138.575981.138400@h54g2000cwb.googlegroups.com>:
> - only chars of the ISO-8859-1 charset are allowed
> - only <b>, </b>, <br />, <a ...> and </a> tags are allowed (no other
> HTML Tags)
Probably ...
<code>
// Untested ..
//
// Precompile ..
//
String NotISO_8859_1 = "[^\u0000-\u00FF]+";
String HTML_MARKUP = "</?\\w+[^>]*>";
// Pattern.COMMENTS (see below) makes the regex ignore the spaces around '|'.
String HTML_ALLOWED = "(<b>) | " +
                      "(</b>) | " +
                      "(<br\\s+/>) | " +
                      // "<a" must be followed by '>' or whitespace,
                      // so that <applet>, <abbr> etc. do not slip through.
                      "(<a(\\s[^>]*)?>) | " +
                      "(</a>) ";
Pattern pNotISO_8859_1 = Pattern.compile(NotISO_8859_1);
Pattern pHTML_MARKUP = Pattern.compile(HTML_MARKUP);
Pattern pHTML_ALLOWED =
    Pattern.compile(HTML_ALLOWED,
                    Pattern.CASE_INSENSITIVE |
                    Pattern.COMMENTS);
//
// checking routine.
//
String src = ... // user input
Matcher m;
m = pNotISO_8859_1.matcher(src);
if (m.find()) {
    // return Error.
}
m = pHTML_MARKUP.matcher(src);
while (m.find()) {
    // matches() rather than find(): the whole tag must be an allowed one,
    // not merely contain one (e.g. "<x <b>").
    Matcher ha_m = pHTML_ALLOWED.matcher(m.group());
    if (!ha_m.matches()) {
        // return Error.
    }
}
// return OK
</code>
birgit - 17 Nov 2006 08:17 GMT
Thanks a lot for the answers so far.
I know that regular expressions are not optimal for checking HTML tags,
but in my case I would really like to use them anyway.
I am using OpenCms structured contents, and in the XML schema
definitions I can define validation rules as regular expressions; the
content then gets checked automatically, which is very convenient. Usually an
editor should not be able to enter any HTML tags other than the ones I
provide through buttons, except if he copies and pastes something or if I
allow him to view and edit the source.
So if I don't want to modify the source code of OpenCms, I need to use
regular expressions: if possible one which checks everything, or
maybe two or three which can be executed one after the other.
Unfortunately the last suggestion can't be used, although I am sure it
would work.
Does anyone have more suggestions?
Red Orchid - 17 Nov 2006 10:08 GMT
"birgit" <kbirgit@gmx.de> wrote or quoted in
Message-ID: <1163751466.422799.185550@h54g2000cwb.googlegroups.com>:
> [snip]
> So if I don't want to modify the source code of OpenCms I need to use
> regular expressions and if possible one which checks everything or
> maybe two or three which can be executed one after the other.
> Unfortunately the last suggestion can't be used although I am sure it
> would work.
Maybe it is possible to write one regex which checks everything
using conditionals and lookaround.
But, as far as I know,
the Java regex library does not support conditionals.
(I don't know the reason.)
If you think the regex is possible with conditionals,
it would be worth searching for a library that supports
them.
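For what it's worth, lookaround alone may already be enough. The following is only a sketch under two assumptions: that the validator applies the expression to the whole text (as Matcher.matches() does), and that "<a" may be followed by arbitrary attributes. The class name is invented for the example. The idea is a negative lookahead that fails as soon as some '<' does not start one of the allowed tags, followed by a character class for ISO-8859-1.

```java
import java.util.regex.Pattern;

// Hypothetical single-expression check built from nested negative lookaheads.
final class LookaroundCheck {
    private static final Pattern VALID = Pattern.compile(
        "(?s)"                                                  // let '.' cross line breaks
      + "(?!.*<(?!(?:/?b|br\\s+/|a(?:\\s[^<>]*)?|/a)>))"        // no '<' that fails to open an allowed tag
      + "[\\u0000-\\u00FF]*");                                  // only ISO-8859-1 characters

    static boolean isValid(String text) {
        return VALID.matcher(text).matches();
    }
}
```

With this shape a lone '>' in normal text is accepted, because only '<' is inspected. It says nothing about what is inside the attributes of an <a ...> tag.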
birgit - 17 Nov 2006 12:04 GMT
So far, I don't want to change any of the code which uses the regular
expression and so this is unfortunately no solution.
I am not really good with regular expressions! But I've got something
which works in most situations. However, if there is invalid HTML at
the end of the text, it seems to run into an endless loop. Can anyone correct
this regular expression?
((<b>)??|(<\/b>)??|(<br\s+\/>)??|(<a[^>]*>)??|(<\/a>)??|([\u0000-\u003B\u003D\u003F-\u00FF]*)??)*?
I also know that with this expression no '>' and '<' are allowed in
normal text. Can I solve this somehow?
Thanks for your help!
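A likely cause of the apparent endless loop: every alternative in the expression above is optional (the ?? and the inner *), and the whole group is repeated again, so when the text cannot match, the engine tries an enormous number of equivalent ways to split it before giving up. A sketch of a variant that avoids this, assuming the expression is applied to the whole text: each alternative consumes at least one character, no two alternatives can start the same way, and the quantifiers are possessive so nothing is retried. The class name is made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical rewrite of the single-expression check without nested optional quantifiers.
final class SinglePassCheck {
    // One ISO-8859-1 character other than '<' and '>'.
    private static final String CH = "[\\u0000-\\u003B\\u003D\\u003F-\\u00FF]";

    private static final Pattern VALID = Pattern.compile(
        "(?:"
      + CH + "++"                       // a run of ordinary text, taken possessively
      + "|</?b>"                        // <b> or </b>
      + "|<br\\s+/>"                    // <br />
      + "|<a(?:\\s" + CH + "*)?>"       // <a> or <a ...attributes...>
      + "|</a>"
      + ")*+");                         // possessive: never re-split what was already matched

    static boolean isValid(String text) {
        return VALID.matcher(text).matches();
    }
}
```

Like the original expression, this still rejects '<' and '>' in normal text, and it does not look inside the attributes of <a ...>.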
Daniel Pitts - 17 Nov 2006 19:29 GMT
> So far, I don't want to change any of the code which uses the regular
> expression and so this is unfortunately no solution.
[quoted text clipped - 10 lines]
>
> Thanks for your help!
Actually, I think the proper way to handle your problem is to HTML-escape
the whole thing, and then have "pseudo tags" similar to BBCode.
[b]bold[/b] [url=http://lala/]Link[/url]
That way, everything is safe for HTML, and you have control of what is
added.
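A minimal sketch of that idea in Java, with everything beyond [b] and [url=...] left out. The class and method names are invented for the example, and restricting links to http/https is an extra assumption of this sketch, not something stated above.

```java
import java.util.regex.Pattern;

// Hypothetical renderer: escape everything first, then expand a fixed set of pseudo tags.
final class PseudoTags {
    private static final Pattern BOLD =
        Pattern.compile("\\[b\\](.*?)\\[/b\\]", Pattern.DOTALL);
    // Only http:// and https:// targets are expanded (an assumption of this sketch).
    private static final Pattern URL =
        Pattern.compile("\\[url=(https?://[^\\]\\s]+)\\](.*?)\\[/url\\]", Pattern.DOTALL);

    /** Replaces the characters that have special meaning in HTML. */
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")   // must come first
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
    }

    /** Escapes the input, then turns the known pseudo tags into real HTML. */
    static String render(String userInput) {
        String safe = escapeHtml(userInput);
        safe = BOLD.matcher(safe).replaceAll("<b>$1</b>");
        safe = URL.matcher(safe).replaceAll("<a href=\"$1\">$2</a>");
        return safe;
    }
}
```

Because the quotes and angle brackets are escaped before any tag is generated, the user never gets to write raw attribute text; a pseudo tag that does not match one of the patterns simply stays visible as plain text.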
Mark Jeffcoat - 17 Nov 2006 09:47 GMT
> "birgit" <kbirgit@gmx.de> wrote or quoted in
> Message-ID: <1163698138.575981.138400@h54g2000cwb.googlegroups.com>:
[quoted text clipped - 7 lines]
> <code>
> // Untested ..
[snip code]
In some sense, this is technically excellent. I pasted your
code into a method object, replacing '//return Error' and
'//return OK' with 'return false' and 'return true', and
it worked exactly as specified, first time out of the box.
Not bad for untested. In fact, it makes a big improvement
over the original spec: a naive RE matcher for '<a ...>' would
have accepted strings like '<a></a><i>Anything</i> can go in
here as long as I finish with an angle bracket>', and your
pattern rejects that sort of thing nicely.
However, it also accepts this string as valid:
<a id="code" expr="alert('0wn3d.')"
style="background:url('javascript:eval(document.all.code.expr)')"></a>
This is not just a theoretical attack. When I put that "link"
on my own webpage, Safari and Firefox ignored the code, but
IE happily executed it.
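For anyone who wants to repeat the experiment, here is roughly what such a method object could look like. This is a restatement rather than the posted code: the patterns are written from scratch along the same lines (non-ISO-8859-1 check, find every tag, compare each against a whitelist), and the class name is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical method object for the whitelist check discussed above.
final class WhitelistDemo {
    private static final Pattern NOT_LATIN1 = Pattern.compile("[^\\u0000-\\u00FF]");
    private static final Pattern MARKUP = Pattern.compile("</?\\w+[^>]*>");
    private static final Pattern ALLOWED = Pattern.compile(
        "<b>|</b>|<br\\s+/>|<a(\\s[^>]*)?>|</a>", Pattern.CASE_INSENSITIVE);

    static boolean isValid(String src) {
        if (NOT_LATIN1.matcher(src).find()) {
            return false; // a character outside ISO-8859-1
        }
        Matcher m = MARKUP.matcher(src);
        while (m.find()) {
            // Every tag found must be, in its entirety, one of the allowed tags.
            if (!ALLOWED.matcher(m.group()).matches()) {
                return false;
            }
        }
        return true;
    }
}
```

Fed the '<a></a><i>Anything</i> ...' string it returns false, and fed the <a id="code" ...> string it returns true: the whitelist is doing its job on tag names, and the hole is entirely in what an <a> tag is allowed to carry.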
Better add another RE to strip out "javascript", right?
Meditate on this story:
http://namb.la/popular/tech.html
It's possible that for this application, the right
cost/benefit trade-off is to leave the holes open, and
hope that nobody abuses them. Just be aware of what
you're doing.
(If I wanted to do this, I'd take a look at Slashcode; they're
solving exactly this problem in the comment submissions, and
their solution has been tested by years of attacks. It's
GPL'd (and in Perl), so you can't just paste their solution in
directly, but if they have a solution you can re-implement in
Java as a set of RE tests, you'd likely have a winner.)

Signature
Mark Jeffcoat
Austin, TX