Java Forum / General / November 2006

Problem with regular expressions

birgit - 16 Nov 2006 17:28 GMT
I have the following problem to solve with regular expressions:

An editor can enter data through a web front-end editor. Certain HTML tags
are allowed, others are not. When he saves the data, the content should
be checked using regular expressions.
- only chars of the ISO-8859-1 charset are allowed
- only <b>, </b>, <br />, <a ...> and </a> tags are allowed (no other
HTML Tags)

The HTML tags don't need to be deleted. The user only gets an error
message telling him to change his data according to the rules.

I know how to check for the charset:
[\u0000-\u00FF]*

But I don't know how to combine this with the other rules. I'd
appreciate any help.
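
Editor's note: a minimal sketch of the charset rule on its own, assuming the check runs in Java. The class and method names are hypothetical. It shows the character-class approach from the post next to the JDK's own ISO-8859-1 encoder, which should agree with it.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

final class Latin1Check {
    // Matches only if every char is in U+0000..U+00FF (the ISO-8859-1 range).
    private static final Pattern LATIN1 = Pattern.compile("[\\u0000-\\u00FF]*");

    static boolean isLatin1ByRegex(String s) {
        return LATIN1.matcher(s).matches();
    }

    static boolean isLatin1ByEncoder(String s) {
        // A fresh encoder per call, because CharsetEncoder is not thread-safe.
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(isLatin1ByRegex("Gr\u00FC\u00DFe"));  // u-umlaut and sharp s are Latin-1
        System.out.println(isLatin1ByRegex("5 \u20AC"));         // the euro sign is not
    }
}
```

Note that matches() has to cover the whole input; find() would succeed on any string, because the pattern also matches the empty string.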
Mark Jeffcoat - 16 Nov 2006 19:09 GMT
> I have the following problem to solve with regular expressions:
>
[quoted text clipped - 13 lines]
> But I don't know how to combine this with the other rules. I'd
> appreciate any help.

This is hard to get right, even for someone who's not worried
about the syntax of regular expressions. For example, the rules
you've given are very simple to write as a RE, but unfortunately
allow a user to insert arbitrary JavaScript into the anchor tag.
Oops.

If you're doing this for real, don't. Find a third-party
implementation that you trust and that you can, through
a relaxed license or cash money, swipe.

If it's homework .. the easiest way to write a complicated
regular expression is to write simpler ones first. I'd start
by detecting pairs of (<,>), and build from there. Edi
Weitz's Regex Coach is pretty cool:
   http://weitz.de/regex-coach/
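
Editor's note: the "start with pairs of (<,>)" step could look roughly like the sketch below. The class and method names are made up for illustration. It only lists everything that sits between angle brackets, which is the simplest building block before deciding which tags to allow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagFinder {
    // One '<', then any run of characters that are not angle brackets, then one '>'.
    private static final Pattern BRACKET_PAIR = Pattern.compile("<[^<>]*>");

    static List<String> findBracketPairs(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = BRACKET_PAIR.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [<b>, </b>, <i>, </i>, <br />]
        System.out.println(findBracketPairs("a <b>bold</b> and <i>italic</i><br />"));
    }
}
```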

Mark Jeffcoat
Austin, TX

Daniel Pitts - 16 Nov 2006 19:34 GMT
> I have the following problem to solve with regular expressions:
>
[quoted text clipped - 13 lines]
> But I don't know how to combine this with the other rules. I'd
> appreciate any help.

I would suggest using an existing XML parser to validate the tags. It's much
less error-prone than RE, and more flexible when your business rules
change.
A SAX parser would probably be "good enough" in this case.
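
Editor's note: a rough sketch of the SAX idea, assuming the editor's input is well-formed XML once it is wrapped in a single root element. The class name, the wrapper element name "root" and the allowed-tag set are assumptions for illustration. Attributes are not inspected here, and HTML-only constructs such as &nbsp; or uppercase tag names would be rejected.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxTagCheck {
    private static final String ROOT = "root";                  // hypothetical wrapper element
    private static final Set<String> USER_TAGS = Set.of("b", "br", "a");

    static boolean isValid(String fragment) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Refuse DOCTYPE declarations so user input cannot define entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            SAXParser parser = factory.newSAXParser();
            String wrapped = "<" + ROOT + ">" + fragment + "</" + ROOT + ">";
            parser.parse(new InputSource(new StringReader(wrapped)), new DefaultHandler() {
                private int depth = 0;

                @Override
                public void startElement(String uri, String localName, String qName,
                                         Attributes atts) throws SAXException {
                    // The wrapper is only legal at the top; user tags only below it.
                    boolean ok = (depth == 0) ? qName.equals(ROOT) : USER_TAGS.contains(qName);
                    if (!ok) {
                        throw new SAXException("tag not allowed: " + qName);
                    }
                    depth++;
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    depth--;
                }
            });
            return true;
        } catch (Exception e) {
            // Malformed markup and disallowed tags both end up here.
            return false;
        }
    }
}
```

Changing the business rules then means editing the set of allowed names rather than a regular expression.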
Red Orchid - 16 Nov 2006 21:57 GMT
"birgit" <kbirgit@gmx.de> wrote or quoted in
Message-ID: <1163698138.575981.138400@h54g2000cwb.googlegroups.com>:

> - only chars of the ISO-8859-1 charset are allowed
> - only <b>, </b>, <br />, <a ...> and </a> tags are allowed (no other
> HTML Tags)

Probably ...

<code>
//  Untested ..

//
// Precompile ..
//

String NotISO_8859_1 = "[^\u0000-\u00FF]+";

String HTML_MARKUP = "</?\\w+[^>]*>";

String HTML_ALLOWED = "(<b>)              | " +
                      "(</b>)             | " +
                      "(<br\\s+/>)        | " +
                      // "<a>" or "<a ...>", but not "<abbr>" or "<applet>"
                      "(<a(\\s[^>]*)?>)   | " +
                      "(</a>)               ";

   
Pattern pNotISO_8859_1 = Pattern.compile(NotISO_8859_1);

Pattern pHTML_MARKUP = Pattern.compile(HTML_MARKUP);

Pattern pHTML_ALLOWED =
             Pattern.compile(HTML_ALLOWED,
                                  Pattern.CASE_INSENSITIVE |
                                  Pattern.COMMENTS            );

//
// checking routine.
//

String src = ... // user input

Matcher m;
       
m = pNotISO_8859_1.matcher(src);

if (m.find()) {
           
   // return Error.
}

m = pHTML_MARKUP.matcher(src);

while (m.find()) {

   Matcher ha_m = pHTML_ALLOWED.matcher(m.group());

   // matches(), not find(): the whole tag must be one of the allowed forms.
   if (!ha_m.matches()) {

       // return Error.
   }
}  

// return OK

</code>
birgit - 17 Nov 2006 08:17 GMT
Thanks a lot for the answers so far.

I know that regular expressions are not optimal for checking HTML tags,
but in my case I would really like to use them anyway.
I am using OpenCms structured contents, and in the XML schema
definitions I can define validation rules as regular expressions and the
content gets checked automatically, which is very convenient. Usually an
editor should not be able to enter any HTML tags other than the ones I
provide through buttons, except if he copies and pastes something or if I
allow him to view and edit the source.
So if I don't want to modify the source code of OpenCms I need to use
regular expressions and if possible one which checks everything or
maybe two or three which can be executed one after the other.
Unfortunately the last suggestion can't be used although I am sure it
would work.

Does anyone have more suggestions?
Red Orchid - 17 Nov 2006 10:08 GMT
"birgit" <kbirgit@gmx.de> wrote or quoted in
Message-ID: <1163751466.422799.185550@h54g2000cwb.googlegroups.com>:

>  [snip]
> So if I don't want to modify the source code of OpenCms I need to use
> regular expressions and if possible one which checks everything or
> maybe two or three which can be executed one after the other.
> Unfortunately the last suggestion can't be used although I am sure it
> would work.

Maybe it is possible to write one regex that checks everything
using conditionals and lookaround.

But, as far as I know,
the Java regex library does not support conditionals.
(I don't know the reason.)

If you think the regex is possible with conditionals,
it would be worth searching for a library that supports
them.
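
Editor's note: lookaround alone may be enough here, without conditionals. The sketch below is one possible single pattern, assuming the validation rule has to match the entire value. The class name is hypothetical. A negative lookahead rejects the input if any '<' is not the start of an allowed tag, and the rest of the pattern enforces the ISO-8859-1 range. It places no restriction on the attributes of <a ...>.

```java
import java.util.regex.Pattern;

final class SingleRegexCheck {
    private static final Pattern RULE = Pattern.compile(
            "(?is)"                                                   // case-insensitive, '.' matches newlines
            + "(?!.*<(?!b>|/b>|br\\s+/>|a(?:\\s[^<>]*)?>|/a>))"       // no '<' that starts a disallowed tag
            + "[\\u0000-\\u00FF]*");                                  // only ISO-8859-1 characters

    static boolean isValid(String s) {
        return RULE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Hello <b>world</b><br />")); // allowed tags only
        System.out.println(isValid("<i>x</i>"));                  // contains a disallowed tag
    }
}
```

With this pattern a literal '>' is accepted in normal text, while a literal '<' has to be written as &lt;.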
birgit - 17 Nov 2006 12:04 GMT
So far, I don't want to change any of the code which uses the regular
expression and so this is unfortunately no solution.

I am not very good with regular expressions, but I've got something
that works in most situations. However, if there is invalid HTML at
the end of the text, it seems to go into an endless loop. Can anyone
correct this regular expression?

((<b>)??|(<\/b>)??|(<br\s+\/>)??|(<a[^>]*>)??|(<\/a>)??|([\u0000-\u003B\u003D\u003F-\u00FF]*)??)*?

I also know that this expression does not allow '>' and '<' in
normal text. Can I solve this somehow?

Thanks for your help!
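
Editor's note: the apparent endless loop is most likely catastrophic backtracking. Every alternative in the expression above is optional and the text alternative has its own '*', so one run of text can be split up in a huge number of ways, and the engine tries them all when the match fails near the end. The sketch below is one way to avoid that, assuming Java's regex syntax: every alternative consumes at least one character and the quantifiers are possessive, so the engine never goes back to re-split what it has already matched. The class name is hypothetical.

```java
import java.util.regex.Pattern;

final class NoBacktrackCheck {
    private static final Pattern RULE = Pattern.compile(
            "(?i)(?:"
            + "[\\u0000-\\u003B\\u003D-\\u00FF]++"   // a run of ISO-8859-1 text, anything but '<'
            + "|<b>|</b>|<br\\s+/>"                    // fixed allowed tags
            + "|<a(?:\\s[^<>]*)?>|</a>"                // anchor tags
            + ")*+");

    static boolean isValid(String s) {
        return RULE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("text <b>bold</b><br /> more > ok"));
        System.out.println(isValid("some text with a bad tag at the end <i>"));
    }
}
```

Because the text class only excludes '<', a '>' is allowed in normal text. A literal '<' still has to be entered as &lt;.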
Daniel Pitts - 17 Nov 2006 19:29 GMT
> So far, I don't want to change any of the code which uses the regular
> expression and so this is unfortunately no solution.
[quoted text clipped - 10 lines]
>
> Thanks for your help!

Actually, I think the proper way to handle your problem is to HTML-escape
the whole thing, and then have "pseudo tags" similar to BBCode.
[b]bold[/b]   [url=http://lala/]Link[/url]

That way, everything is safe for HTML, and you have control of what is
added.
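
Editor's note: a minimal sketch of that approach, assuming only the two pseudo tags from the example. The class and method names are hypothetical, and restricting links to http/https is an added assumption. Everything is escaped first, so the only markup in the output is what the two replacements produce. Proper nesting of the pseudo tags is not enforced.

```java
import java.util.regex.Pattern;

final class PseudoTags {
    private static final Pattern BOLD =
            Pattern.compile("\\[b\\](.*?)\\[/b\\]", Pattern.DOTALL);
    // Only http/https targets become links; anything else stays as plain text.
    private static final Pattern URL =
            Pattern.compile("\\[url=(https?://[^\\[\\]\\s]+)\\](.*?)\\[/url\\]", Pattern.DOTALL);

    static String escapeHtml(String s) {
        // '&' must be replaced first so later entities are not escaped twice.
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
    }

    static String render(String input) {
        String safe = escapeHtml(input);
        safe = URL.matcher(safe).replaceAll("<a href=\"$1\">$2</a>");
        safe = BOLD.matcher(safe).replaceAll("<b>$1</b>");
        return safe;
    }

    public static void main(String[] args) {
        System.out.println(render("[b]bold[/b]   [url=http://lala/]Link[/url] <i>x</i>"));
    }
}
```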
Mark Jeffcoat - 17 Nov 2006 09:47 GMT
> "birgit" <kbirgit@gmx.de> wrote or quoted in
> Message-ID: <1163698138.575981.138400@h54g2000cwb.googlegroups.com>:
[quoted text clipped - 7 lines]
> <code>
> //  Untested ..

[snip code]

In some sense, this is technically excellent. I pasted your
code into a method object, replacing '//return Error' and
'//return OK' with 'return false' and 'return true', and
it worked exactly as specified, first time out of the box.

Not bad for untested. In fact, it makes a big improvement
over the original spec: a naive RE matcher for '<a ...>' would
have accepted strings like '<a></a><i>Anything</i> can go in
here as long as I finish with an angle bracket>', and your
pattern rejects that sort of thing nicely.
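
Editor's note: the difference can be reproduced with the small sketch below. The class and constant names are made up. The "naive" pattern is an assumed example of a whole-input match with '.*', compared with a pattern whose attribute part cannot run past the first '>'.

```java
import java.util.regex.Pattern;

final class NaiveAnchor {
    static final Pattern NAIVE = Pattern.compile("<a.*>");      // '.*' happily runs across other tags
    static final Pattern BOUNDED = Pattern.compile("<a[^>]*>"); // stops at the first '>'

    public static void main(String[] args) {
        String s = "<a></a><i>Anything</i> can go in here as long as I finish with an angle bracket>";
        System.out.println(NAIVE.matcher(s).matches());   // true
        System.out.println(BOUNDED.matcher(s).matches()); // false
    }
}
```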

However, it also accepts this string as valid:

   <a id="code" expr="alert('0wn3d.')"
   style="background:url('javascript:eval(document.all.code.expr)')"></a>

This is not just a theoretical attack. When I put that "link"
on my own webpage, Safari and Firefox ignored the code, but
IE happily executed it.

Better add another RE to strip out "javascript", right?

Meditate on this story:
     http://namb.la/popular/tech.html
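
Editor's note: instead of blacklisting "javascript", one option is to whitelist what an anchor may contain. The sketch below is an assumption, not a vetted filter: it accepts an opening anchor only if it has exactly one double-quoted href attribute pointing at an http or https URL. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class StrictAnchor {
    // Accepts only <a href="http(s)://..."> and nothing else inside the tag.
    private static final Pattern STRICT_A = Pattern.compile(
            "<a\\s+href=\"https?://[^\"<>\\s]*\"\\s*>", Pattern.CASE_INSENSITIVE);

    static boolean isAllowedAnchor(String tag) {
        return STRICT_A.matcher(tag).matches();
    }

    public static void main(String[] args) {
        String attack = "<a id=\"code\" expr=\"alert('0wn3d.')\" "
                + "style=\"background:url('javascript:eval(document.all.code.expr)')\">";
        System.out.println(isAllowedAnchor(attack));                                      // false
        System.out.println(isAllowedAnchor("<a href=\"http://weitz.de/regex-coach/\">")); // true
    }
}
```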

It's possible that for this application, the right
cost/benefit trade-off is to leave the holes open, and
hope that nobody abuses them. Just be aware of what
you're doing.

(If I wanted to do this, I'd take a look at Slashcode; they're
solving exactly this problem in the comment submissions, and
their solution has been tested by years of attacks. It's
GPL'd (and in Perl), so you can't just paste their solution in
directly, but if they have a solution you can re-implement in
Java as a set of RE tests, you'd likely have a winner.)

Mark Jeffcoat
Austin, TX


