Java Forum / General / July 2007

regex: find semicolon that is not part of an entity

Robert Watkins - 16 Jul 2007 17:44 GMT
Okay, I have something that works, but I don't like it:

String SEMICOLON_NOT_ENTITY_REGEX = "(?<!&#?\\w{1,20}+);";

The part of the regex that says, "at least 1 but not more than 20" is a
horrible addition that is the only way I could find around the odd
Exception:

java.util.regex.PatternSyntaxException: Look-behind group does not have an
obvious maximum length near index 9
(?<!&#?\w+);
        ^

Is there any way around this?
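
For readers following along, here is a minimal sketch of how the bounded look-behind above behaves when passed to String.split. The class name, method name and sample input are made up for illustration; only the regex comes from the post.

```java
import java.util.Arrays;
import java.util.List;

public class BoundedLookbehindSplit {
    // Regex from the post: a semicolon NOT preceded by "&", an optional "#",
    // and between 1 and 20 word characters. The explicit upper bound (20)
    // gives the look-behind an obvious maximum length.
    static final String SEMICOLON_NOT_ENTITY_REGEX = "(?<!&#?\\w{1,20}+);";

    // Hypothetical helper: split on semicolons that do not terminate an entity.
    static List<String> split(String input) {
        return Arrays.asList(input.split(SEMICOLON_NOT_ENTITY_REGEX));
    }

    public static void main(String[] args) {
        // The semicolons ending "&amp;" and "&#38;" are left alone;
        // only the one after "b" acts as a separator.
        System.out.println(split("a&amp;b;c&#38;d")); // [a&amp;b, c&#38;d]
    }
}
```

Note that String.split drops trailing empty strings unless a negative limit is passed, and that an entity name longer than 20 characters would slip past the look-behind.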
Daniel Pitts - 16 Jul 2007 20:58 GMT
On Jul 16, 9:44 am, Robert Watkins <rwatkinsNOS...@NOSPAMfoo-bar.org>
wrote:
> Okay, I have something that works, but I don't like it:
>
[quoted text clipped - 10 lines]
>
> Is there any way around this?
Does this regex work for you: "(&#?[^;]*;[^&]*)?;"

Alternatively, you can use an XML parser, and wherever you find a ; in
character data, you found a ; not part of an entity :-)
Robert Watkins - 17 Jul 2007 13:40 GMT
> Does this regex work for you: "(&#?[^;]*;[^&]*)?;"

Nope -- that one also finds semicolons that /are/ part of entities.

> Alternatively, you can use an XML parser, and wherever you find a ; in
> character data, you found a ; not part of an entity :-)

Hmmm. I suppose, but that's way more heavy-handed than I was hoping for (and
probably far less performant than the regex I've got that works). All I'm
doing is splitting a String on semicolons, while keeping entities intact.

Thanks,
-- Robert
Roedy Green - 16 Jul 2007 21:42 GMT
On Mon, 16 Jul 2007 16:44:17 GMT, Robert Watkins
<rwatkinsNOSPAM@NOSPAMfoo-bar.org> wrote, quoted or indirectly quoted
someone who said :

>String SEMICOLON_NOT_ENTITY_REGEX

I have written some classes for interconverting entities and chars.
They use manual parsers. See
http://mindprod.com/products.html#ENTITIES
They also include big tables of known entities.
Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Robert Watkins - 17 Jul 2007 13:42 GMT
> On Mon, 16 Jul 2007 16:44:17 GMT, Robert Watkins
> <rwatkinsNOSPAM@NOSPAMfoo-bar.org> wrote, quoted or indirectly quoted
[quoted text clipped - 6 lines]
> http://mindprod.com/products.html#ENTITIES
> They also include big tables of known entities.

Thanks, but I don't want to convert the entities. All I'm doing is
splitting a String on semicolons, while keeping entities intact.

Thanks,
-- Robert
Oliver Wong - 17 Jul 2007 18:09 GMT
> All I'm doing is
> splitting a String on semicolons, while keeping entities intact.

   It might be easier to solve the inverse problem, then: Find the
strings that are separated by semicolons that are not part of an entity.
I think the regular expression would look something like:

([^&;]|&[^&;]*;)*(&[^&;]*)?

   Where the last bit "(&[^&;]*)?" is only necessary if you want to allow
for malformed XML where you have an unterminated entity (e.g.
"<BadXML>Hello World &unterminated</BadXML>").

   What the regexp basically says is:

<pseudoRegExp>
(
   Any character except '&' and ';'
 OR
   an entity; that is, '&' followed by zero or more characters other than
'&' and ';', followed by ';'
) zero or more times
optionally followed by an unterminated entity.
</pseudoRegExp>

   - Oliver
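
A rough sketch of how the "inverse" expression above might be driven from Java, collecting the matched fields with Matcher.find() instead of splitting. The class and method names are hypothetical. Because the expression can match the empty string (at each separating semicolon and at the end of the input), this sketch skips empty matches, which also means genuinely empty fields are dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldMatcher {
    // The expression from the post, unchanged.
    static final Pattern FIELD =
            Pattern.compile("([^&;]|&[^&;]*;)*(&[^&;]*)?");

    // Hypothetical helper: return every non-empty field found in the input.
    static List<String> fields(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            // Skip the zero-length matches found at separators and at the end.
            if (!m.group().isEmpty()) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The last field ends in an unterminated entity, which the
        // optional trailing group picks up.
        System.out.println(fields("a&amp;b;c&lt;d;e &unterminated"));
        // [a&amp;b, c&lt;d, e &unterminated]
    }
}
```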
Robert Watkins - 18 Jul 2007 19:10 GMT
>> All I'm doing is
>> splitting a String on semicolons, while keeping entities intact.
[quoted text clipped - 4 lines]
>
> ([^&;]|&[^&;]*;)*(&[^&;]*)?

Thanks for this approach. It does work. Given that I will not allow
malformed entities, I've changed the regex to:

 ([^&;]|&#?\\w+;)*

which also restricts the entity-specific regex a bit. It took me a while
to respond because (along with all my other work!) I did a fair bit of
testing with your approach, my original approach and with yet another
approach: splitting on all semicolons, then reconstructing the strings
that were split at the end of entities. What surprised me was that there
weren't any hugely significant differences in the performance of all
three approaches. I expected the string reconstruction to be way slower
than the others, but the greatest difference in timing was a mere 8%
(which could certainly be considered more significant in different
contexts).
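
The third approach mentioned here (split on every semicolon, then stitch back together the pieces that were cut at the end of an entity) was not posted, so the following is only a guess at its shape. All names are hypothetical, and the entity test reuses the &#?\w+ form from the regex above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitAndRejoin {
    // A piece that ends like this was cut in the middle of an entity.
    static final Pattern ENTITY_TAIL = Pattern.compile("&#?\\w+\\z");

    // Hypothetical helper: naive split, then repair the broken entities.
    static List<String> split(String input) {
        String[] pieces = input.split(";", -1); // -1 keeps trailing empties
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < pieces.length; i++) {
            current.append(pieces[i]);
            boolean semicolonFollows = i < pieces.length - 1;
            if (semicolonFollows && ENTITY_TAIL.matcher(pieces[i]).find()) {
                // The semicolon belonged to an entity: restore it, keep going.
                current.append(';');
            } else {
                // A real separator (or end of input): the field is complete.
                fields.add(current.toString());
                current.setLength(0);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("a&amp;b;c")); // [a&amp;b, c]
    }
}
```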

I'm a bit reluctant to admit that I started out as a Perl programmer --
and as such have always fancied myself fairly good with regular
expressions -- but you've got me stumped here. While I was able to parse
your original regex easily enough not to need your kindly provided
pseudoRegExp, I have to admit that I can't figure out why the first
character class needs to be [^&;]. Why does the & have to figure in; why
could it not simply be:

<pseudoRegExp>
 (
   any character not a semicolon
   OR
   any entity
 ) zero or more times
</pseudoRegExp>

I tried this and it simply doesn't work, but I can't think why.
Robert Watkins - 18 Jul 2007 19:34 GMT
Don't it always happen that way? I answered my own question moments
after posting this response to you.

It's a matter of order. The alternatives are tried left to right, so
without the & in the first character class, [^;] consumes the & and the
entity name one character at a time, and the match then stops at the
entity's semicolon before the entity alternative ever gets a chance.
Just switching things around a bit:

 (&#?\\w+;|[^;])*

Gives the expected results and is (for me) much clearer, being
essentially what I was looking for in my question to you:

<pseudoRegExp>
 (
   any entity
   OR
   any character not a semicolon
 ) zero or more times
</pseudoRegExp>
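
To make the ordering effect concrete, here is a small sketch comparing the two orderings on the same input. The class name, constant names and sample string are invented for the example; the two regexes are the ones discussed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrder {
    // Entity alternative first: entities are consumed whole.
    static final Pattern ENTITY_FIRST = Pattern.compile("(&#?\\w+;|[^;])*");
    // Character alternative first: "&amp" is eaten one character at a time.
    static final Pattern CHAR_FIRST = Pattern.compile("([^;]|&#?\\w+;)*");

    // Hypothetical helper: the text matched at the very start of the input.
    static String firstField(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.lookingAt() ? m.group() : "";
    }

    public static void main(String[] args) {
        String input = "a&amp;b;c";
        System.out.println(firstField(ENTITY_FIRST, input)); // a&amp;b
        System.out.println(firstField(CHAR_FIRST, input));   // a&amp
    }
}
```

With the character class first, the match ends at the semicolon inside the entity, which is the failure described above.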

In any case, thank you again, you certainly pointed me in the right
direction, having me look at the problem the other way 'round.

-- Robert

>>> All I'm doing is
>>> splitting a String on semicolons, while keeping entities intact.
[quoted text clipped - 38 lines]
>
> I tried this and it simply doesn't work, but I can't think why.
Roedy Green - 17 Jul 2007 20:06 GMT
On Tue, 17 Jul 2007 12:42:03 GMT, Robert Watkins
<rwatkinsNOSPAM@NOSPAMfoo-bar.org> wrote, quoted or indirectly quoted
someone who said :

>Thanks, but I don't want to convert the entities. All I'm doing is
>splitting a String on semicolons, while keeping entities intact.

One way to skin that cat would be to convert the entities back to
chars, then split on semicolons, then put the entities back.
Consider there are also decimal and hex entities.

You could use your code to find entities, or use mine.


©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.