
Signature
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com
> On Mon, 16 Jul 2007 16:44:17 GMT, Robert Watkins
> <rwatkinsNOSPAM@NOSPAMfoo-bar.org> wrote, quoted or indirectly quoted
[quoted text clipped - 6 lines]
> http://mindprod.com/products.html#ENTITIES
> They also include big tables of known entities.
Thanks, but I don't want to convert the entities. All I'm doing is
splitting a String on semicolons, while keeping entities intact.
Thanks,
-- Robert
Oliver Wong - 17 Jul 2007 18:09 GMT
> All I'm doing is
> splitting a String on semicolons, while keeping entities intact.
It might be easier to solve the inverse problem, then: Find the
strings that are separated by semicolons that are not part of an entity.
I think the regular expression would look something like:
([^&;]|&[^&;]*;)*(&[^&;]*)?
Where the last bit "(&[^&;]*)?" is only necessary if you want to allow
for malformed XML where you have an unterminated entity (e.g.
"<BadXML>Hello World &unterminated</BadXML>").
What the regexp basically says is:
<pseudoRegExp>
(
Any character except '&' and ';'
OR
an entity; that is, '&' followed by any character except '&' and ';'
followed by ';'
) zero or more times
optionally followed by an unterminated entity.
</pseudoRegExp>
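A minimal Java sketch of how that expression might be applied, assuming the fields are pulled out with repeated Matcher.find() calls. The class and method names (EntityAwareTokens, fields) are made up for illustration. Because the pattern can match the empty string, find() also reports an empty match at each separator and at the end of the input; this sketch skips those, which means genuinely empty fields are dropped as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the thread: collects the fields matched by the regex.
class EntityAwareTokens {
    // The expression from the post, unchanged.
    private static final Pattern FIELD =
            Pattern.compile("([^&;]|&[^&;]*;)*(&[^&;]*)?");

    static List<String> fields(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            // Empty matches occur at each separating semicolon and at end of input.
            if (!m.group().isEmpty()) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String field : fields("a&amp;b;c&lt;d;x &unterminated")) {
            System.out.println(field);
        }
    }
}
```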
- Oliver
Robert Watkins - 18 Jul 2007 19:10 GMT
>> All I'm doing is
>> splitting a String on semicolons, while keeping entities intact.
[quoted text clipped - 4 lines]
>
> ([^&;]|&[^&;]*;)*(&[^&;]*)?
Thanks for this approach. It does work. Given that I will not allow
malformed entities, I've changed the regex to:
([^&;]|&#?\\w+;)*
which also restricts the entity-specific regex a bit. It took me a while
to respond because (along with all my other work!) I did a fair bit of
testing with your approach, my original approach and with yet another
approach: splitting on all semicolons, then reconstructing the strings
that were split at the end of entities. What surprised me was that there
weren't any hugely significant differences in the performance of all
three approaches. I expected the string reconstruction to be way slower
than the others, but the greatest difference in timing was a mere 8%
(which could certainly be considered more significant in different
contexts).
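For reference, the third approach (split on every semicolon, then rejoin the pieces that were cut at the end of an entity) could look roughly like the sketch below. This is not the code that was timed; the names SplitAndRejoin and OPEN_ENTITY are invented for illustration, and it assumes a piece ending in "&name" or "&#digits" was cut at an entity's terminating semicolon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of "split everywhere, then stitch entities back together".
class SplitAndRejoin {
    // A piece ending like "&amp" or "&#59" is assumed to be an entity missing its ';'.
    private static final Pattern OPEN_ENTITY = Pattern.compile("&#?\\w+\\z");

    static List<String> split(String input) {
        String[] pieces = input.split(";", -1); // -1 keeps trailing empty pieces
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < pieces.length; i++) {
            current.append(pieces[i]);
            boolean cutInsideEntity = i < pieces.length - 1
                    && OPEN_ENTITY.matcher(pieces[i]).find();
            if (cutInsideEntity) {
                current.append(';'); // restore the terminator and keep accumulating
            } else {
                result.add(current.toString());
                current.setLength(0);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("a&amp;b;c&lt;d"));
    }
}
```

Unlike a find() loop that skips empty matches, this version keeps empty fields, so "a;;b" yields three pieces.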
I'm a bit reluctant to admit that I started out as a Perl programmer --
and as such have always fancied myself fairly good with regular
expressions -- but you've got me stumped here. While I was able to parse
your original regex easily enough not to need your kindly provided
pseudoRegExp, I have to admit that I can't figure out why the first
character class needs to be [^&;]. Why does the & have to figure in; why
could it not simply be:
<pseudoRegExp>
(
any character not a semicolon
OR
any entity
) zero or more times
</pseudoRegExp>
I tried this and it simply doesn't work, but I can't think why.
Robert Watkins - 18 Jul 2007 19:34 GMT
Don't it always happen that way? I answered my own question moments
after posting this response to you.
It's a matter of order. The regex you provided, and which I modified,
tries to match the sole semicolons first, but without the & in the
character class it finds the semicolons in entities before the regex has
tried to match entities. Just switching things around a bit:
(&#?\\w+;|[^;])*
Gives the expected results and is (for me) much clearer, being
essentially what I was looking for in my question to you:
<pseudoRegExp>
(
any entity
OR
any character not a semicolon
) zero or more times
</pseudoRegExp>
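The ordering effect can be reproduced with a small Java sketch (class and method names are made up for illustration). Alternatives are tried left to right, so with [^;] first the '&' is consumed as an ordinary character, and when the engine reaches the entity's ';' neither alternative can match there, so the match simply ends. With the entity alternative first, the whole entity is consumed before [^;] gets a chance.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of how alternation order changes the first match.
class AlternationOrder {
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "a&amp;b;c";
        // Character class first: stops at the entity's semicolon.
        System.out.println(firstMatch("([^;]|&#?\\w+;)*", input)); // a&amp
        // Entity first: the entity is kept intact.
        System.out.println(firstMatch("(&#?\\w+;|[^;])*", input)); // a&amp;b
    }
}
```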
In any case, thank you again, you certainly pointed me in the right
direction, having me look at the problem the other way 'round.
-- Robert
>>> All I'm doing is
>>> splitting a String on semicolons, while keeping entities intact.
[quoted text clipped - 38 lines]
>
> I tried this and it simply doesn't work, but I can't think why.
Roedy Green - 17 Jul 2007 20:06 GMT
On Tue, 17 Jul 2007 12:42:03 GMT, Robert Watkins
<rwatkinsNOSPAM@NOSPAMfoo-bar.org> wrote, quoted or indirectly quoted
someone who said :
>Thanks, but I don't want to convert the entities. All I'm doing is
>splitting a String on semicolons, while keeping entities intact.
One way to skin that cat would be to convert the entities back to
chars, then split on semicolons, then put the entities back.
Consider there are also decimal and hex entities.
You could use your code to find entities, or use mine.
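As a rough illustration of the decimal and hex point, here is a Java sketch of a pattern that recognizes named, decimal and hex references. The pattern and the names (NumericEntities, findEntities) are assumptions for this example, not code from either poster. Note that &#59; and &#x3B; both stand for the semicolon itself, so decoding before splitting would introduce extra separators unless those are handled specially.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: locate named, decimal and hex entity references.
class NumericEntities {
    private static final Pattern ENTITY = Pattern.compile(
            "&(?:#[0-9]+"                  // decimal, e.g. &#59;
            + "|#[xX][0-9a-fA-F]+"         // hex, e.g. &#x3B;
            + "|[A-Za-z][A-Za-z0-9]*);");  // named, e.g. &amp;

    static List<String> findEntities(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = ENTITY.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findEntities("a&amp;b&#59;c&#x3B;d;e"));
    }
}
```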