Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / August 2006

Tip: Looking for answers? Try searching our database.

Non-cpu intensive indexing...

Thread view: 
jagonzal@gmail.com - 07 Aug 2006 17:00 GMT
Hi,

I have a bunch of String of fixed length, and a function F such that
F(String) returns one of a small set of possible values.

Right now the work is being done with regular expressions - it does the
job (identify patterns that result in a given return value) but it is
quite heavy in CPU terms, and this operation must be done at a peak of
100 hits per second.

Any ideas on an alternative to the regexps?

(the possible fixed length strings are about 100000)

Thanks in advance,

JGN
Chris - 07 Aug 2006 17:31 GMT
> Hi,
>
[quoted text clipped - 9 lines]
>
> (the possible fixed length strings are about 100000)

This depends entirely on your app's requirements. What, exactly, are
your regexes doing? Can it be done more simply?

Some things to think about:

-- Jakarta ORO instead of the JDK 1.4 regex library. I have no idea if
it's any faster, but you could try it.

-- You may be better off using arrays of char, or one giant array of
char containing all strings, than the String class. String creates a lot
of garbage.

-- Make sure you compile your regexes in advance. Don't recompile before
each comparison.

-- Depending on your app, you might be able to use a lexer instead. See
http://www.jflex.de/. There's a learning curve, but it's *really* fast.
jagonzal@gmail.com - 07 Aug 2006 17:41 GMT
> This depends entirely on your app's requirements. What, exactly, are
> your regexes doing? Can it be done more simply?

They are evaluating a given string for a match, so that finally a
function can return which classification this string is in.

Something along the lines of:

if(incoming string matches any of {X1} set of regexpes)
 return "Value1"
if(incoming string matches any of {x2} set of regexps)
 return "Value2"
...

there are about three or four possible values, but the regexps must
cover about 100000 different cases of "incoming string".

> -- You may be better off using arrays of char, or one giant array of
> char containing all strings, than the String class. String creates a lot
> of garbage.

The strings are a restriction that I cannot remove :)

I'm thinking of making an index with the 100000 Strings.

> -- Make sure you compile your regexes in advance. Don't recompile before
> each comparison.

It's already done this way.

> -- Depending on your app, you might be able to use a lexer instead. See
> http://www.jflex.de/. There's a learning curve, but it's *really* fast.

Will check it out, thank you!

JGN
Oliver Wong - 07 Aug 2006 19:57 GMT
>> This depends entirely on your app's requirements. What, exactly, are
>> your regexes doing? Can it be done more simply?
[quoted text clipped - 12 lines]
> there are about three or four possible values, but the regexps must
> cover about 100000 different cases of "incoming string".

   Please give examples of:

   (*) The incoming strings.
   (*) The regular expressions.
   (*) The values returned.

   Otherwise, I can only offer the vague advice "look at decision trees".

   - Oliver
Robert Klemme - 07 Aug 2006 17:44 GMT
>> Hi,
>>
[quoted text clipped - 27 lines]
> -- Depending on your app, you might be able to use a lexer instead. See
> http://www.jflex.de/. There's a learning curve, but it's *really* fast.

-- Optimize the RX.

-- Preprocess and cache results (do those strings change? if yes and
there's repetition some form of LRU cache might help)

We certainly have too few information to narrow it down.

    robert
Boris Stumm - 07 Aug 2006 17:50 GMT
> Hi,
>
[quoted text clipped - 9 lines]
>
> (the possible fixed length strings are about 100000)

If you know the 100000 strings in advance, a simple
Map<String, String> with the strings as key and the function
value as value will probably fastest, and you only have to compute
the function values once.

If you are afraid of memory consumption, use a cache instead of a
simple map.
Chris Uppal - 08 Aug 2006 09:51 GMT
> Right now the work is being done with regular expressions - it does the
> job (identify patterns that result in a given return value) but it is
> quite heavy in CPU terms, and this operation must be done at a peak of
> 100 hits per second.

Just as a side-note, if you are comparing 100 strings a second using regexps,
and are seeing significant CPU load as a result, then (unless the strings are
very long) there must be something wrong with either the regexp implementation,
or the way you are using it (or perhaps both if the implementation is such that
it's forcing you to use it in the "wrong" way).

It should be possible to transform a task like this into a single DFA
(deterministic finite state machine) which will run in time proportional to the
length of the input string (not the complexity of the regexps).  The constants
of proportionality depend on the DFA implementation, but should not be /huge/
even for a naive implementation.  I would /hope/ that the Java regexp
implementation builds and uses such a DFA, but if not then the thing to do is
to do so yourself.  Probably the easiest way to do that is to use a
pre-existing scanner-generator, such as the jflex generator which has already
been mentioned.

   -- chris


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.