Java Forum / First Aid / September 2004

fastest way to do very very many character substitutions

anon - 24 Sep 2004 06:07 GMT
hi,

I am wondering what approaches you all might suggest for the following
scenario. I'm not looking for code, but just general thoughts about
the best way to approach this problem.

I need to substitute hundreds, possibly thousands, of character
sequences (English words or phrases) in texts that are up to about
100KB or so in size, and I need to do this as fast as humanly possible
(well, actually, faster). Some of the substitution terms require
regular expressions, but some can be handled by a regular replace.

Any advice at all about Java resources that might be available would
be very much appreciated. I can think of a few naive ways of doing
this, but perhaps there are some lesser known classes than String and
StringBuffer that would be useful, or perhaps there is some
open-source utility class that offers a mutable character-array type
object with powerful search-and-replace/regex abilities, or maybe
something else altogether.

Thanks in advance for any pointers...
Paul Lutus - 24 Sep 2004 06:31 GMT
> hi,
>
[quoted text clipped - 7 lines]
> (well, actually, faster). Some of the substitution terms require
> regular expressions, but some can be handled by a regular replace.

The clearly defined terms (words) can be handled by a hashtable mapping to
their replacements. This leaves the issue of delimiting the terms, probably
best handled by a conventional stream scanner.
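As a rough illustration of the hashtable-plus-scanner idea, here is a minimal sketch. The class name WordSubstituter, the method name, and the rule that a "word" is a run of letters or digits are all assumptions made for the example, not something stated in the post. It makes one pass over the text and does one hash lookup per word, so the cost does not grow with the number of substitution terms.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: single-pass whole-word substitution via a hash lookup.
final class WordSubstituter {

    // Scans the text once; each run of letters/digits is treated as a word
    // (an assumption) and replaced if the map contains it.
    static String substitute(String text, Map<String, String> replacements) {
        StringBuilder out = new StringBuilder(text.length());
        int n = text.length();
        int i = 0;
        while (i < n) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                int start = i;
                while (i < n && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                String word = text.substring(start, i);
                String repl = replacements.get(word);
                out.append(repl != null ? repl : word);
            } else {
                // Delimiters are copied through unchanged.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> m = new HashMap<>();
        m.put("colour", "color");
        m.put("lorry", "truck");
        System.out.println(substitute("The colour of the lorry.", m));
    }
}
```

Multi-word phrases would need a different delimiting rule than this one; the sketch only covers single words.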

Ambiguous terms might be handled by a closest-match lookup, using a binary
search over a sorted list of terms to find the nearby candidates.

But frankly, if you are after raw speed and adamant about getting maximum
performance, remember that with Java you are trading some performance for
ease of development and a rich library. Weigh these factors carefully
against each other.

> Any advice at all about Java resources that might be available would
> be very much appreciated. I can think of a few naive ways of doing
[quoted text clipped - 3 lines]
> object with powerful search-and-replace/regex abilities, or maybe
> something else altogether.

You need to realize that regex in general is not the fastest approach to
matching, although here too there are ways to gain speed, such as compiling
each pattern once (java.util.regex.Pattern) and reusing it.
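To make the precompiling point concrete, here is a small sketch using java.util.regex. The class name PrecompiledReplace and the sample pattern are made up for illustration. The Pattern is compiled once into a static field and reused for every text, instead of calling String.replaceAll, which compiles the regex again on each call.

```java
import java.util.regex.Pattern;

// Hypothetical example: compile the regex once, reuse it for every input.
final class PrecompiledReplace {

    // Matches the whole words "grey" or "greys"; group 1 keeps the optional "s".
    private static final Pattern GREY = Pattern.compile("\\bgrey(s?)\\b");

    static String americanize(String text) {
        // A Matcher is cheap to create; the expensive compile step is not repeated.
        return GREY.matcher(text).replaceAll("gray$1");
    }

    public static void main(String[] args) {
        System.out.println(americanize("grey skies and greys, but not greyhound"));
    }
}
```

With hundreds of regex terms, keeping the compiled Pattern objects in a list built at startup follows the same idea.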

Paul Lutus
http://www.arachnoid.com

William Brogden - 24 Sep 2004 12:31 GMT
> hi,
>
[quoted text clipped - 17 lines]
>
> Thanks in advance for any pointers...

I suggest two things:
1. look at the algorithms used in spelling checkers.
2. avoid characters and stay in bytes if you are sure
   your text can be represented that way.

Bill
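One possible reading of the two suggestions above, offered only as a sketch: a trie (a dictionary structure often used for word lookup in spell checkers) keyed on bytes, with longest-match replacement in a single pass. The class ByteTrieReplacer and its methods are hypothetical, and the code assumes the text fits a single-byte encoding such as ISO-8859-1. Note that it matches substrings, not whole words.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: longest-match substitution over raw bytes using a trie.
final class ByteTrieReplacer {

    private static final class Node {
        final Map<Byte, Node> next = new HashMap<>();
        byte[] replacement; // non-null when a term ends at this node
    }

    private final Node root = new Node();

    // Assumes term and replacement are representable in ISO-8859-1.
    void add(String term, String replacement) {
        Node node = root;
        for (byte b : term.getBytes(StandardCharsets.ISO_8859_1)) {
            node = node.next.computeIfAbsent(b, k -> new Node());
        }
        node.replacement = replacement.getBytes(StandardCharsets.ISO_8859_1);
    }

    byte[] replace(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
        int i = 0;
        while (i < input.length) {
            Node node = root;
            byte[] best = null;
            int bestEnd = -1;
            // Walk the trie from position i, remembering the longest term seen.
            for (int j = i; j < input.length; j++) {
                node = node.next.get(input[j]);
                if (node == null) {
                    break;
                }
                if (node.replacement != null) {
                    best = node.replacement;
                    bestEnd = j + 1;
                }
            }
            if (best != null) {
                out.write(best, 0, best.length);
                i = bestEnd;
            } else {
                out.write(input[i]); // no term starts here; copy the byte
                i++;
            }
        }
        return out.toByteArray();
    }
}
```

A per-node HashMap keeps the sketch short; a 256-entry array per node would trade memory for faster lookups.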

