Java Forum / General / October 2005

Regular Expression and string Matching/Replace

sanjay010@yahoo.com - 05 Oct 2005 06:54 GMT
I have a list of keywords. It has around 1000 keywords now but can
grow to 5000 keywords.

My web application displays a lot of text stored in the
database. My requirement is to scan each text for the occurrence of any
of the above keywords. If any keyword is present I have to replace it
with a custom value before showing the text to the user.

I was thinking of using a regular expression to replace the
keywords in the text with the matcher.replaceAll method, as follows:
          Pattern pattern = Pattern.compile(patternStr);
          Matcher matcher = pattern.matcher(inputStr);
          String output = matcher.replaceAll(replacementStr);

But my pattern string will have around 5000 keywords joined with the 'OR'
logical operator, like: keyword1 | keyword2 | keyword3 | ..........

Will such a big pattern string adversely affect the performance?  What
can I do to speed it up? (Since my keyword list is not
static I would prefer to do the replacement just before showing the
text to the user.)
Any suggestions are most welcome.
Eric Sosman - 05 Oct 2005 17:10 GMT
sanjay010@yahoo.com wrote On 10/05/05 01:54,:
> I have a list of key words. It has around 1000 key word now but can
> grow to 5000 keywords.
[quoted text clipped - 18 lines]
> text to the user)
> Any suggestions are most welcome.

   Can you separate the inputStr into a sequence of
"words," some of which are keywords and some not?  If
so, you could prepare a HashMap containing the keywords
and their replacements, and look up each "word" in the
Map: if it's there, you output the replacement instead
of the word, but if it's not you output the word itself.
(If all the keywords have the same replacement text, you
might use a HashSet instead of a HashMap.)
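
A minimal sketch of the lookup described above, assuming a "word" is any run of letters, digits, or underscores (the regex \w+). The class and method names are made up for illustration; they are not from any library mentioned in this thread.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordLookupReplacer {
    // Assumption: a "word" is a run of letters, digits, or underscores.
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Replaces every word that is a key in the map; other words and all
    // punctuation/whitespace are copied through unchanged.
    static String replaceKeywords(String input, Map<String, String> replacements) {
        Matcher m = WORD.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = replacements.get(m.group());
            String text = (replacement != null) ? replacement : m.group();
            // quoteReplacement stops '$' and '\' in the text from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Each word costs one hash lookup, so the work grows with the length of the text rather than with the number of keywords. Multi-word keywords would not be found by this sketch, since it only looks up single tokens.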

Signature

Eric.Sosman@sun.com

Chris Smith - 05 Oct 2005 18:32 GMT
> I have a list of key words. It has around 1000 key word now but can
> grow to 5000 keywords.

My first thought is that if you have a variable number of up to 5000
keywords, you don't want to try to statically maintain a regular
expression that matches all of them.  Instead, you probably want to keep
them in a configuration file of some type.

From that point, you have several options:

1. If the people editing this file can be trusted to understand regular
expressions, then you could have them put regular expressions into the
file.  You are probably better off walking through the list and calling
replaceAll up to 5000 times... but if you insist, then you could
concatenate the expressions, remembering to use non-grouping parentheses
to contain each subexpression and ensure that it doesn't fall prey to
unexpected operator precedence.

2. If it would be better to store plain text keywords in the
configuration file, then you need a plain text search-and-replace.  It
doesn't take too long to write one, but there are some performance
pitfalls.  Instead, Google for "skeetutil" and use Jon's utility package
for the job.  Jon is very knowledgeable, and has thought through a lot of
the performance quirks that you might run into by creating this yourself.  
It's a shame that the standard API doesn't solve this trivial problem,
but unfortunately everyone at Sun was blinded by regexp-mania when they
added this stuff to the java.lang.String API.
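
For what the concatenated approach in option 1 might look like, here is a rough sketch. It assumes the file holds plain-text keywords rather than regular expressions, so each one is escaped with Pattern.quote before being wrapped in a non-capturing group. The class and method names are hypothetical, and this is not the skeetutil API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationReplacer {
    // Builds (?:\Qk1\E)|(?:\Qk2\E)|... from plain-text keywords.
    static Pattern buildPattern(Iterable<String> keywords) {
        List<String> sorted = new ArrayList<>();
        for (String k : keywords) {
            sorted.add(k);
        }
        // Java tries alternatives left to right, so put longer keywords
        // first; otherwise "key" would win over "keyword".
        sorted.sort(Comparator.comparingInt(String::length).reversed());
        StringBuilder sb = new StringBuilder();
        for (String k : sorted) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append("(?:").append(Pattern.quote(k)).append(')');
        }
        return Pattern.compile(sb.toString());
    }

    static String replace(String input, Map<String, String> replacements) {
        if (replacements.isEmpty()) {
            return input; // an empty alternation would match everywhere
        }
        Matcher m = buildPattern(replacements.keySet()).matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // The matched text is always one of the quoted keys.
            String replacement = replacements.get(m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Note that this matches keywords anywhere, including inside longer words ("key" inside "monkey"). In real use you would compile the pattern once per keyword-list change, not once per call as replace does here.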

Signature

www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation

sanjay010@yahoo.com - 05 Oct 2005 21:01 GMT
Hi Chris,
I can concatenate the 5000 keywords to create a regular expression as
you have suggested. I can store this in a table and update it when
the list of keywords changes.
But my question is: will this long regular expression affect the
performance of matching and replacing the strings?

> > I have a list of key words. It has around 1000 key word now but can
> > grow to 5000 keywords.
[quoted text clipped - 30 lines]
> Chris Smith - Lead Software Developer/Technical Trainer
> MindIQ Corporation
Roedy Green - 06 Oct 2005 06:41 GMT
>But my question is will this long regular expression affect the
>performance of matching and replacement of string.

Regular expressions are for finding patterns, not for doing
translation.  An analogy would be using a garage door opener as a
hammer.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Again taking new Java programming contracts.

John C. Bollinger - 07 Oct 2005 04:36 GMT
> Hi Chris,
> I can concatenate the 5000 keywords to create a regular expression as
> you have suggested. I can store this in a table and update this when
> the list of keywords change.
> But my question is will this long regular expression affect the
> performance of matching and replacement of string.

Relative to what?  Every line of executable code affects performance, so
yes, your approach will too.  I'm guessing that what you really want to
know is "will it be fast enough?"  There's no way to be sure other than
to test it and see.  I'd guess, however, that it will be slow.  Quite
possibly slower than using 5000 single-keyword regular expressions to do
the same job.  That, and ugly, too.

On the other hand, if it fits your requirements then Eric's suggestion
is brilliant.  Easy to write, easy to maintain, and as fast as you're
likely to get.  Tokenizing the input document might be a bottleneck
there, but the keyword matching would be lightning fast, and the
tokenization is still much lighter than your proposed huge-regex approach.

I don't suppose it would work to do the substitution once for each
document, in advance, and remember the result?  That is the fastest
possible solution at service time.
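
One way to "remember the result" while still coping with a keyword list that changes is to cache the substituted text and throw the cache away whenever the keywords are updated. The sketch below assumes each document has a stable string id; the class name and the substituter function are placeholders for whatever replacement routine is chosen.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class RenderedTextCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Hypothetical hook: any function that performs the keyword replacement.
    private final Function<String, String> substituter;

    RenderedTextCache(Function<String, String> substituter) {
        this.substituter = substituter;
    }

    // Runs the substitution at most once per document id until invalidated.
    String render(String documentId, String rawText) {
        return cache.computeIfAbsent(documentId, id -> substituter.apply(rawText));
    }

    // Call this whenever the keyword list changes.
    void invalidateAll() {
        cache.clear();
    }
}
```

This sketch never evicts entries and does not notice when a document's own text changes, so a real version would need a size bound and per-document invalidation.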

Signature

John Bollinger
jobollin@indiana.edu

Roedy Green - 06 Oct 2005 06:38 GMT
>Will such a big pattern string adversely affect the performance?

I think that will be impractical on many grounds.

What you do instead is split your text into words, perhaps using a
regex split.
See http://mindprod.com/jgloss/regex.html

Then you look up the word in a HashMap. If it is in there, replace it
with the corresponding value, building up your new String in a
StringBuilder.
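
A small sketch of the split-and-lookup idea, assuming that splitting at regex word boundaries (\b) is an acceptable definition of "word". Splitting there keeps spaces and punctuation as their own tokens, so the output can be rebuilt without losing anything. The class name is made up for the example.

```java
import java.util.Map;

class SplitAndLookup {
    static String translate(String text, Map<String, String> replacements) {
        StringBuilder sb = new StringBuilder(text.length());
        // "\\b" is zero-width, so the tokens concatenate back to the
        // original text exactly when nothing is replaced.
        for (String token : text.split("\\b")) {
            String replacement = replacements.get(token);
            sb.append(replacement != null ? replacement : token);
        }
        return sb.toString();
    }
}
```

Lookups are case-sensitive as written; lower-casing the token before the lookup (and the keys when loading them) would be one way to relax that.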

Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Again taking new Java programming contracts.


