Java Forum / First Aid / June 2005

Problem with java.util.regex.Matcher? - Test.java (0/1)

*bicker* - 18 Jun 2005 11:26 GMT
We've isolated a problem we're encountering to java.util.regex.
We're trying to apply the Pattern ^[^~!@#$%^&*|]+$ to
strings passed in from a method that returns data from an
XML file.  Unfortunately, the method that obtains the data
is in the platform we develop on, and we don't have the
source code.  The platform provider has reviewed the problem
and has indicated that they feel that it is a bug in Java.  

The string you see in the attached Java file is windows-1253
character (decimal) 146.  As you can see, it is not in the
Pattern, and since the Pattern is trying to find strings
that do not have any of the indicated characters, we should
get a Pattern match.  We don't.  Not with that character.
The same seems to be the case with all characters between
131 and 160.  Beyond that it seems to be okay.  

Does anyone have any insight into why this would occur?
Perhaps what we're doing wrong in our Pattern, or what the
platform developer may be doing wrong in providing the data
from the XML file?

Thanks!

(I'll paste the Java code here, but understand that we've
found that pasting the data tends to translate it to a
code-page other than the one the customer data was
originally in, so the problem seems to magically "go away".
Of course, it doesn't, because we still have to work with
the actual customer data, and preserve the integrity of the
data the customer actually entered.)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test
{
   public static boolean test(String strInput)
   {
       strInput = "’";
       System.out.println("strInput="+strInput);
       Pattern patternTitle =
           Pattern.compile("^[^~!@#$%^&*|]+$");
       Matcher m = patternTitle.matcher(strInput);
       boolean result = m.find();
       return result;
   }
}
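
For anyone who wants to check the whole range mentioned above (131 to 160) rather than just character 146, here is a minimal sketch. It decodes each byte value and runs it through the same Pattern. The class name RangeCheck, the helper failingBytes, and the default charset windows-1252 are assumptions made for illustration; pass a different charset name as the first argument if your data uses another code page. It also assumes a reasonably recent JDK.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RangeCheck {
    // Same pattern as in the post above.
    static final Pattern TITLE = Pattern.compile("^[^~!@#$%^&*|]+$");

    // Returns the byte values in [from, to] whose decoded character
    // does NOT match the pattern (hypothetical helper, not from the post).
    static List<Integer> failingBytes(int from, int to, Charset charset) {
        List<Integer> failures = new ArrayList<>();
        for (int b = from; b <= to; b++) {
            String s = new String(new byte[] { (byte) b }, charset);
            if (!TITLE.matcher(s).find()) {
                failures.add(b);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 unless a charset name is given.
        String name = args.length > 0 ? args[0] : "windows-1252";
        Charset charset = Charset.forName(name);
        System.out.println("failing bytes: " + failingBytes(131, 160, charset));
    }
}
```

None of these byte values decodes to a character listed in the Pattern, so any value reported as failing points at the JVM rather than at the data.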
Alan Krueger - 18 Jun 2005 21:28 GMT
> We've isolated a problem we're encountering to java.util.regex.
> We're trying to apply the Pattern ^[^~!@#$%^&*|]+$ to
[quoted text clipped - 3 lines]
> source code.  The platform provider has reviewed the problem
> and has indicated that they feel that it is a bug in Java.  

You didn't specify the platform on which you're having the problem.

> The string you see in the attached Java file is windows-1253
> character (decimal) 146.  As you can see, it is not in the
> Pattern, and since the Pattern is trying to find strings
> that do not have any of the indicated characters, we should
> get a Pattern match.  We don't.  Not with that character.

Running the code you posted under jdk1.5.0_03 on Windows XP, it returns
true.  What result are you seeing and under which JVM?
*bicker* - 18 Jun 2005 21:39 GMT
On Sat, 18 Jun 2005 15:28:09 -0500, Alan Krueger
<wgzkid502@sneakemail.com> wrote:
> You didn't specify the platform on which you're having the problem.

Sorry.  This is JDK1.4.2_08 on Windows XP.

> Running the code you posted under jdk1.5.0_03 on Windows XP, it returns
> true.  What result are you seeing and under which JVM?

We're getting false.  

How did you access the code?  If you copied and pasted it
out of the message, you'll indeed get the correct result
(true).  It may be necessary to replace the character in
strInput manually (using the key-code Alt-0146).

--
bicker®
Dale King - 20 Jun 2005 03:59 GMT
> On Sat, 18 Jun 2005 15:28:09 -0500, Alan Krueger
> <wgzkid502@sneakemail.com> wrote:
[quoted text clipped - 8 lines]
>
> We're getting false.  

I get true as well in JDK1.5. Perhaps it was a bug in 1.4.

> How did you access the code?  If you copied and pasted it
> out of the message, you'll indeed get the correct result
> (true).  It may be necessary to replace the character in
> strInput manually (using the key-code Alt-0146).

Putting any non-ASCII character into a Java source file without escaping
it is a very bad idea. It means that your code can have different
behavior depending on which machine the code is compiled on. The Java
source file is just a stream of bytes. That stream must be translated
into characters using some character encoding. If you don't specify an
encoding on the command line it will use the default one for the
platform. The particular byte you are using (0x92) will be translated
very differently on Windoze and on Linux. On Windoze that curly
quote will translate to the Unicode character U+2019. On Linux (which
likely uses ISO-8859-1) the 0x92 will get translated into Unicode U+0092,
which is the PU2 control character.
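
A minimal sketch of that difference, assuming a recent JDK and taking windows-1252 as a stand-in for the Windows default encoding (the class and method names here are made up for illustration). It decodes the single byte 0x92 under each encoding and prints the resulting code point:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SourceByteDecoding {
    // Decodes one byte with the given charset and returns the code point.
    static int decodeByte(int b, Charset charset) {
        String s = new String(new byte[] { (byte) b }, charset);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 represents the Windows platform default.
        Charset windows = Charset.forName("windows-1252");
        System.out.printf("windows-1252: U+%04X%n", decodeByte(0x92, windows));
        System.out.printf("ISO-8859-1:   U+%04X%n",
                decodeByte(0x92, StandardCharsets.ISO_8859_1));
    }
}
```

This prints U+2019 for windows-1252 and U+0092 for ISO-8859-1, which is the same split the compiler makes when it reads the source file.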

This is the reason that Sun includes the native2ascii program.
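
As a rough sketch of what the escaped form looks like, here is the test from the original post with the curly quote written as a Unicode escape, so the source file is pure ASCII and compiles the same way everywhere. The class name EscapedTest and the second sample string are assumptions added for illustration:

```java
import java.util.regex.Pattern;

class EscapedTest {
    // Same pattern as in the original post.
    static final Pattern TITLE = Pattern.compile("^[^~!@#$%^&*|]+$");

    static boolean test(String input) {
        return TITLE.matcher(input).find();
    }

    public static void main(String[] args) {
        // Right single quotation mark, written as an ASCII-only escape.
        System.out.println(test("\u2019"));
        // Contains '|', which the pattern excludes.
        System.out.println(test("a|b"));
    }
}
```

On a JVM without the regex bug the first call should report a match and the second should not.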

If it is a bug in 1.4.2 it is probably that it did not properly handle
the full Unicode set for negated character classes. I see bug 4872664
<http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4872664> in the
database that sounds exactly like what you describe, but it was
supposedly fixed in 1.4.2_04.

--
Dale King
*bicker* - 20 Jun 2005 10:52 GMT
On Mon, 20 Jun 2005 02:59:03 GMT, Dale King
<DaleWKing@insightbb.nospam.com> wrote:
> Putting any non-ASCII character into a Java source file without escaping
> it is a very bad idea. It means that your code can have different
> behavior depending on which machine the code is compiled on. The Java
> source file is just a stream of bytes. That stream must be translated
> into characters using some character encoding.

That's why I immediately went to the platform vendor.  They
provide us a facility to bring XML data into our
application.  They assured me that they read the encoding
from the XML file (in this case "windows-1250") and use
that.  I confirmed what you suggested, that 0x92 is
converted to 0x2019, so it seems they're doing the right
thing there.

> If it is a bug in 1.4.2 it is probably that it did not properly handle
> the full Unicode set for negated character classes. I see bug 4872664
> <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4872664> in the
> database that sounds exactly like what you describe, but it was
> supposedly fixed in 1.4.2_04.

Thank you!  I just checked my IDE, and I'm using
JDK1.4.2_01.  I have JDK1.4.2_08 installed on my unit test
box, but all our customers have JVMs at 1.4.2_01 and all the
rest of my team is on JDK1.4.2_01.  This must be the problem
we're encountering.  I'll have everyone upgrade.
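
Since the IDE, the unit test box, and the customer machines can each run a different JVM, it may help to log which one is actually executing the code. A small sketch using standard system properties (the class and method names are made up for illustration):

```java
class RuntimeInfo {
    // Builds a one-line summary of the JVM that is executing this code.
    static String describeRuntime() {
        return "java.version=" + System.getProperty("java.version")
                + ", java.vendor=" + System.getProperty("java.vendor")
                + ", java.home=" + System.getProperty("java.home");
    }

    public static void main(String[] args) {
        System.out.println(describeRuntime());
    }
}
```

Running this from inside the IDE and again on the test box shows whether the two are really on the same update level.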

--
bicker®
Dale King - 21 Jun 2005 04:14 GMT
> On Mon, 20 Jun 2005 02:59:03 GMT, Dale King
> <DaleWKing@insightbb.nospam.com> wrote:
[quoted text clipped - 12 lines]
> converted to 0x2019, so it seems they're doing the right
> thing there.

The character conversion I was talking about had nothing to do with your
XML vendor; it happens in the Java compiler. Your example showed this line:

        strInput = "’";

which had the byte 0x92 in it. How the compiler handles that will differ
from one platform to another. In reality you probably don't have that
text in your program and it comes from your XML parser, but I just
wanted to make sure you knew that anything other than ASCII may not work
the way you think it will in a Java source file.

>>If it is a bug in 1.4.2 it is probably that it did not properly handle
>>the full Unicode set for negated character classes. I see bug 4872664
[quoted text clipped - 7 lines]
> rest of my team is on JDK1.4.2_01.  This must be the problem
> we're encountering.  I'll have everyone upgrade.

In the future when you suspect a bug you might want to do what I did and
search the bug database.

--
Dale King
*bicker* - 21 Jun 2005 12:04 GMT
On Tue, 21 Jun 2005 03:14:34 GMT, Dale King
<DaleWKing@insightbb.nospam.com> wrote:
> > Thank you!  I just checked my IDE, and I'm using
> > JDK1.4.2_01.  I have JDK1.4.2_08 installed on my unit test
[quoted text clipped - 3 lines]
> In the future when you suspect a bug you might want to do what I did and
> search the bug database.

To be honest, I was still convinced that either we or our
vendor was doing something wrong.  I'll be sure to not make
such a hasty conclusion again!  <grin>

--
bicker®

