Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / May 2007

Tip: Looking for answers? Try searching our database.

String.replace() in practice

Thread view: 
Aljosa Mohorovic - 22 May 2007 12:25 GMT
i currently have this:

String tmp = "this is text for flash player";
tmp = tmp.replace("š", "š");
tmp = tmp.replace("Š", "Š");
tmp = tmp.replace("<br />", "<br>");
...

and this works but i was wondering if there is a better way to do
this?

Aljosa
Tom Hawtin - 22 May 2007 12:54 GMT
> String tmp = "this is text for flash player";
> tmp = tmp.replace("&scaron;", "š");
[quoted text clipped - 4 lines]
> and this works but i was wondering if there is a better way to do
> this?

Better in what sense?

In terms of making it easier to read, you could write it something like:

    String escaped = "this is text for flash player"
        .replace("&scaron;", "š")
        .replace("&Scaron;", "Š")
        .replace("<br />", "<br>")
        ...

For performance, it depends how much effort versus how much performance
you want.

Writing an in-place replacement with a StringBuilder is easy enough
using indexof(String,int) and replace(int,int,String). Slightly more
sophisticated would be copying for one StringBuilder to another.

If all your replacement match strings start with & and <, you might want
to write your own optimised code using a char array.

Tom Hawtin
Robert Klemme - 22 May 2007 14:15 GMT
>> String tmp = "this is text for flash player";
>> tmp = tmp.replace("&scaron;", "š");
[quoted text clipped - 21 lines]
> using indexof(String,int) and replace(int,int,String). Slightly more
> sophisticated would be copying for one StringBuilder to another.

For example using the algorithm suggested by stdlib documentation (hint
to OP):

http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Matcher.html#appendRepla
cement(java.lang.StringBuffer,%20java.lang.String
)

> If all your replacement match strings start with & and <, you might want
> to write your own optimised code using a char array.
>
> Tom Hawtin

Kind regards

    robert
Lew - 22 May 2007 14:22 GMT
Aljosa Mohorovic wrote:
>> tmp = tmp.replace("<br />", "<br>");

Apparently you are dealing with HTML.  If that's so, why in the world would
you replace a correct "<br />" with an incorrect "<br>"?

Signature

Lew

Aljosa Mohorovic - 22 May 2007 17:42 GMT
> Apparently you are dealing with HTML.  If that's so, why in the world would
> you replace a correct "<br />" with an incorrect "<br>"?

because content will be rendered in flash player and maybe some
devices where <br /> is not recognized but <br> is.
i do many things i would like to avoid but because of platforms i work
with i don't have that choice.

thanks to everybody for tips.
Aljosa Mohorovic - 22 May 2007 19:17 GMT
> Apparently you are dealing with HTML.  If that's so, why in the world would
> you replace a correct "<br />" with an incorrect "<br>"?

because content will be rendered in flash player and maybe some
devices where <br /> is not recognized but <br> is.
i do many things i would like to avoid but because of platforms i work
with i don't have that choice.

thanks to everybody for tips.
angrybaldguy@gmail.com - 22 May 2007 20:03 GMT
> Aljosa Mohorovic wrote:
> >> tmp = tmp.replace("<br />", "<br>");
[quoted text clipped - 4 lines]
> --
> Lew

Because <br /> is incorrect HTML and correct XHTML, and XHTML so far
hasn't reached critical mass: it's still impossible to serve correctly
to IE.
Joshua Cranmer - 22 May 2007 23:35 GMT
> Because <br /> is incorrect HTML and correct XHTML, and XHTML so far
> hasn't reached critical mass: it's still impossible to serve correctly
> to IE.

Not really. Haven't you heard of the doctype? IE can't serve the
MIME-type correctly, but it can properly parse an XHTML-style doctype,
even if its delivered as a text/html.

Furthermore, XHTML was made a W3C recommendation on January 4, 2000, so
it's been around 7 years and anything that can handle HTML and not XHTML
is horrendously outdated.

Finally, the XML <br /> close sequence was actually derived from the
erroneous common handling of the /proper/ SGML markup <br/ (most
browsers actually don't handle HTML properly...). Most pre-XHTML
browsers were handling <hr /> as <hr> instead of the correct <hr >>
John W. Kennedy - 24 May 2007 22:16 GMT
>> Because <br /> is incorrect HTML and correct XHTML, and XHTML so far
>> hasn't reached critical mass: it's still impossible to serve correctly
[quoted text clipped - 3 lines]
> MIME-type correctly, but it can properly parse an XHTML-style doctype,
> even if its delivered as a text/html.

Nope. IE parses XHTML served as text/html as though it were HTML.

> Furthermore, XHTML was made a W3C recommendation on January 4, 2000, so
> it's been around 7 years and anything that can handle HTML and not XHTML
> is horrendously outdated.

Like IE. Except, of course, that "outdated" isn't the issue. Microsoft
has deliberately and maliciously killed XHTML.

Signature

John W. Kennedy
"Compact is becoming contract,
Man only earns and pays."
  -- Charles Williams.  "Bors to Elayne:  On the King's Coins"
* TagZilla 0.066 * http://tagzilla.mozdev.org

Daniel Pitts - 22 May 2007 16:41 GMT
On May 22, 4:25 am, Aljosa Mohorovic <aljosa.mohoro...@gmail.com>
wrote:
> i currently have this:
>
[quoted text clipped - 8 lines]
>
> Aljosa

There isn't a simpler way to do this, if thats what your question
was.  There are more efficient methods, but most of them are quite a
bit more complex.

Oh, also, you should consider the fact that String.replace only
replaces the FIRST instance in the string.  you can use replaceAll,
but replaceAll uses regex (so you have to make sure that your search/
replace text is escaped appropriately).  Also, being regex, it does
actually slow things down somewhat.

I say, stick with the easiest code unless it has been proven to be too
inefficient.  If you find your program is taking up to many resources,
you could then consider using some sort of finite state machine to
parse and replace everything in one go.   Actually, I wish Sun would
add to the JDK a method String.translate(java.util.Map<String, String>
translations);  The intent would be that any key in the String would
be replaced by the corresponding value.  This could be somewhat
optimized.
Tom Hawtin - 22 May 2007 16:54 GMT
> I say, stick with the easiest code unless it has been proven to be too
> inefficient.  If you find your program is taking up to many resources,
> you could then consider using some sort of finite state machine to
> parse and replace everything in one go.

The API docs for String.replace say "Replaces each substring", so it
shouldn't be just the first occurrence that is replaced.
String.replaceFirst does that, only with a regex (nice API).

>                                           Actually, I wish Sun would
> add to the JDK a method String.translate(java.util.Map<String, String>
> translations);  The intent would be that any key in the String would
> be replaced by the corresponding value.  This could be somewhat
> optimized.

You could write you own. If it becomes popular, it could be incorporated
into the library. My hunch is that it wouldn't in fact be that useful in
practice.

Tom Hawtin
Piotr Kobzda - 22 May 2007 16:56 GMT
> Oh, also, you should consider the fact that String.replace only
> replaces the FIRST instance in the string.

Quote from replace() documentation:

"Replaces *each* substring of this string that matches the literal
target sequence with the specified literal replacement sequence."

> you can use replaceAll,
> but replaceAll uses regex (so you have to make sure that your search/
> replace text is escaped appropriately).  Also, being regex, it does
> actually slow things down somewhat.

replace() uses reqex as well, it's current Sun's JRE implementation is:

public String replace(CharSequence target, CharSequence replacement) {
    return Pattern.compile(target.toString(), Pattern.LITERAL).matcher(
        this).replaceAll(Matcher.quoteReplacement(replacement.toString()));
}

piotr
Tom Hawtin - 22 May 2007 18:07 GMT
> replace() uses reqex as well, it's current Sun's JRE implementation is:
>
> public String replace(CharSequence target, CharSequence replacement) {
>     return Pattern.compile(target.toString(), Pattern.LITERAL).matcher(
>         this).replaceAll(Matcher.quoteReplacement(replacement.toString()));
> }

The implementation uses the regex library. However, the arguments are
not treated as such.

Note the Pattern.LITERAL. Current API docs read:

   "When this flag is specified then the input string that specifies
the pattern is treated as a sequence of literal characters.
Metacharacters or escape sequences in the input sequence will be given
no special meaning."

More bad APIs...

Tom Hawtin
Piotr Kobzda - 22 May 2007 23:43 GMT
>> public String replace(CharSequence target, CharSequence replacement) {
>>     return Pattern.compile(target.toString(), Pattern.LITERAL).matcher(
[quoted text clipped - 4 lines]
> The implementation uses the regex library. However, the arguments are
> not treated as such.

The second argument is still treated as such.

In fact, there is no significant difference in efficiency between a
current implementation, and the following "full" regex:

    return Pattern.compile(Matcher.quoteReplacement(target.toString()))
            .matcher(this).replaceAll(
                    Matcher.quoteReplacement(replacement.toString()));

I don't get, why regex is in use there?  Couldn't a replace() be
implemented in a more "straightforward" way?  E.g.:

    public String replace(CharSequence target, CharSequence replacement) {
        String tgt = target.toString();
        String rep = replacement.toString();
        StringBuilder sb = new StringBuilder(length());
        int s = 0;
        for(int e; (e = indexOf(tgt, s)) != -1; s = e + tgt.length()) {
            sb.append(substring(s, e)).append(rep);
        }
        return sb.append(substring(s)).toString();
    }

My simple benchmarks shows that it's about twice faster then current
implementation.  Thus, is it a good idea to use regex library for that?

piotr
Arne Vajhøj - 23 May 2007 03:36 GMT
> In fact, there is no significant difference in efficiency between a
> current implementation, and the following "full" regex:

> I don't get, why regex is in use there?  Couldn't a replace() be
> implemented in a more "straightforward" way?  E.g.:

> My simple benchmarks shows that it's about twice faster then current
> implementation.  Thus, is it a good idea to use regex library for that?

To me this is an obvious goof. I can see it for me.

The Java team had mile long list of complaints over .replaceAll not
working with backslashes, periods etc.. The tech lead looked
over at the newest team member and told him to create a .replace
that did not use regex syntax. He then copy pasted the .replaceAll
code and used Pattern.LITERAL. And due to time constraints they had
to cancel a few code reviews and this simple code was picked not to
be reviewed.

Note that I do not have any indication that it was what actually
happened. But stuff like this happen all the time in software
development.

Arne
Tom Hawtin - 23 May 2007 04:01 GMT
>>> public String replace(CharSequence target, CharSequence replacement) {
>>>     return Pattern.compile(target.toString(), Pattern.LITERAL).matcher(
[quoted text clipped - 6 lines]
>
> The second argument is still treated as such.

In what way? It is escaped properly by quoteReplacement.

> In fact, there is no significant difference in efficiency between a
> current implementation, and the following "full" regex:
>
>     return Pattern.compile(Matcher.quoteReplacement(target.toString()))
                             ^^^^^^^^^^^^^^^^^^^^^^^^Pattern.quote
>             .matcher(this).replaceAll(
>                     Matcher.quoteReplacement(replacement.toString()));
>
> I don't get, why regex is in use there?  Couldn't a replace() be
> implemented in a more "straightforward" way?  E.g.:

I don't know the reason. Probably not a priority. The way it is
implemented has less source and object code, and the methods it calls
may already be compiled.

Tom Hawtin
Piotr Kobzda - 23 May 2007 08:55 GMT
>> In fact, there is no significant difference in efficiency between a
>> current implementation, and the following "full" regex:
[quoted text clipped - 3 lines]
>>             .matcher(this).replaceAll(
>>                     Matcher.quoteReplacement(replacement.toString()));

Oh, you're right.  However, it doesn't change its general efficiency...

>> I don't get, why regex is in use there?  Couldn't a replace() be
>> implemented in a more "straightforward" way?  E.g.:
>
> I don't know the reason. Probably not a priority. The way it is
> implemented has less source and object code, and the methods it calls
> may already be compiled.

Sounds reasonable for me.  Hope that most of its users will follow that,
and won't self implement it, just because of regex...

piotr
Piotr Kobzda - 23 May 2007 11:16 GMT
>>> The implementation uses the regex library. However, the arguments are
>>> not treated as such.
>>
>> The second argument is still treated as such.
>
> In what way? It is escaped properly by quoteReplacement.

It's properly escaped.  But is treated the same way as the other
replacement strings by regex library.

In other words, it's treated the same way as the second argument of
replaceAll() in:

   str.replaceAll(Pattern.quote(target),
Matcher.quoteReplacement(replacement));

Regex parser logic is still involved.

piotr
Ingo Menger - 23 May 2007 13:40 GMT
> I don't get, why regex is in use there?  

Because "replace a constant substring with another constant string" is
an instance of the more general problem: "replace a substring that
looks like regex with a string that looks like replacement"?
Piotr Kobzda - 23 May 2007 23:58 GMT
>> I don't get, why regex is in use there?  
>
> Because "replace a constant substring with another constant string" is
> an instance of the more general problem: "replace a substring that
> looks like regex with a string that looks like replacement"?

Maybe...  However, as my simple implementation shows, that's not
difficult to solve it without solving the more general problem.  Why
then should we pay for the overheads of that general solution, when a
simple and more efficient solution exits?

OTOH, having to compute the area of the square whose sides have length
S, what would you choose to solve the problem:
   a) well known S^2 formula,
   b) computation of the integral between 0 and S of function f(x)=S,
because you have the best integral evaluation algorithm implemented
?

Current replace() implementation looks like choosing b) to me.

piotr
Twisted - 24 May 2007 07:31 GMT
> OTOH, having to compute the area of the square whose sides have length
> S, what would you choose to solve the problem:
[quoted text clipped - 4 lines]
>
> Current replace() implementation looks like choosing b) to me.

Is it? Or is it more like applying the MxN rectangle-area formula to
the special case where the sides happen to be equal? That's clearly
not as "bad" in terms of either overgeneralizing or losing efficiency
(although with bignums, at least, a squaring can be twice the speed of
a generic multiply).
Piotr Kobzda - 24 May 2007 15:09 GMT
>> OTOH, having to compute the area of the square whose sides have length
>> S, what would you choose to solve the problem:
[quoted text clipped - 10 lines]
> (although with bignums, at least, a squaring can be twice the speed of
> a generic multiply).

Well, for each input S, when M=N=S, the MxN formula gives exactly the
same result what applying the S^2 squaring formula gives.  So we can
safely use it to achieve a squaring expected results.  Using integral
computation algorithms we can of course also achieve the same results,
but most probably it requires an extra overhead (like forming a function
based on input, etc.).

Similar kind of overheads I see in replace() implemented using regex.

To solve replace() problem a "simple search" is enough.  Regex library
performs "pattern search", which is more advanced and complicated, than
a "simple" one.  Thus we have to add some extra overhead, to make it
"simple" again, performing completely unnecessary operations, e.g.
escaping and parsing a replacement string...

That's why I think it looks more like b).

piotr
Ingo Menger - 24 May 2007 10:50 GMT
> >> I don't get, why regex is in use there?  
>
[quoted text clipped - 4 lines]
> Maybe...  However, as my simple implementation shows, that's not
> difficult to solve it without solving the more general problem.

But the more general problem (pattern matching) is already solved.

> Why
> then should we pay for the overheads of that general solution, when a
> simple and more efficient solution exits?

We don't have to.
String.replace is but a convenience method. If you use it in a tight
loop, it's your fault.
Try using String.replaceAll and make sure you compile the pattern just
once *before* the loop. I guess we'll see that your simple
implementation will not be that much faster.

> OTOH, having to compute the area of the square whose sides have length
> S, what would you choose to solve the problem:
>     a) well known S^2 formula,
>     b) computation of the integral between 0 and S of function f(x)=S,
> because you have the best integral evaluation algorithm implemented?

> Current replace() implementation looks like choosing b) to me.

See answer from Twisted, whom I fully support on this issue.

The situation is also different from this contrived example in that we
all know how requirements change. Today, you are happy to replace all
<br/> by <br>, tomorrow, you notice, that sometimes <BR/> and even <Br/
> appear, and the next day the same happens for some other tag.

Do you think that
 pattern = Pattern.compile("?i:<(\\w+)/>>");
 s = s.replaceAll(pattern, "<$1>");

is still slower than:
 s.simplereplace("<br/>", "<br>")
   .simplereplace("<BR/>", "<br>")
   .simplereplace("<bR/>", "<br>")
   .simplereplace("<Br/>", "<br>")
   .simplereplace // all the variants for other tags in question
Piotr Kobzda - 24 May 2007 15:39 GMT
> String.replace is but a convenience method. If you use it in a tight
> loop, it's your fault.

What is my fault when I need it?  For example:

    for(String line; (line = readLine()) != null; ) {
        writeLine(line.replace(getReplaceTarget(), getReplacement()));
    }

Why should I care about replace() efficiency here?  When replace()
efficiency is not as good as expected, it's replace() implementation
fault, not mine I think.

> Try using String.replaceAll and make sure you compile the pattern just
> once *before* the loop. I guess we'll see that your simple
> implementation will not be that much faster.

I can't do that, replace() compiles my string each time I call it.

Compiling pattern once before calling replaceAll() is not as convenient
as using replace(), and is not necessarily helpful in all situations
(code above is the example).  However, even having each pattern compiled
once, my simple implementation is about two times faster (check it!).

As long as regex pattern search is implemented in linear fashion (I mean
no optimized search algorithm in used, e.g. Boyer-Moore's), using regex
for "simple replace" will probably be at the most as fast as my simple
implementation, not better.

> Do you think that
>   pattern = Pattern.compile("?i:<(\\w+)/>>");
[quoted text clipped - 6 lines]
>     .simplereplace("<Br/>", "<br>")
>     .simplereplace // all the variants for other tags in question

I don't think so.  But that's not a matter of our discussion.  Poor
replace() implementation, that's what our debate is about.

piotr
Robert Klemme - 24 May 2007 15:55 GMT
>> String.replace is but a convenience method. If you use it in a tight
>> loop, it's your fault.
[quoted text clipped - 8 lines]
> efficiency is not as good as expected, it's replace() implementation
> fault, not mine I think.

I am sorry - are you kidding?  If you write a program and it's too slow,
*you* are responsible.  It does not matter whether the slowness is
caused by the code you wrote or caused by picking the wrong library
class or method.

If you have a large set of objects and need to frequently check whether
something is member of that set you would not use a LinkedList for the
set and blame the list implementation for the slowness, would you?

Knowing semantics and performance characteristics (where necessary) of
library code is part of software engineering like being capable of
writing code yourself.

Please reconsider your statement.

    robert
Piotr Kobzda - 24 May 2007 23:14 GMT
>> Why should I care about replace() efficiency here?  When replace()
>> efficiency is not as good as expected, it's replace() implementation
>> fault, not mine I think.
>
> I am sorry - are you kidding?

No.

> If you write a program and it's too slow,
> *you* are responsible.  It does not matter whether the slowness is
> caused by the code you wrote or caused by picking the wrong library
> class or method.

Of course.  But notice, that I'm not talking about fault of my program,
that's just speculation what is possible fault cause.

> If you have a large set of objects and need to frequently check whether
> something is member of that set you would not use a LinkedList for the
> set and blame the list implementation for the slowness, would you?

No.  I wouldn't.  I would choose more appropriate list based on my
experience and library documentation.

> Knowing semantics and performance characteristics (where necessary) of
> library code is part of software engineering like being capable of
> writing code yourself.
>
> Please reconsider your statement.

My statement is just a complain on replace() implementation, which I
think could be better.

Semantics of that method are well documented and meets my expectations,
but with its performance characteristics (which I know only because I've
seen its implementation and run some tests) I'm not fully satisfied.

Don't take me wrong.  I don't try to say that current implementation is
completely bad.  It just fails some of my personal acceptance tests, and
I'm wondering about possible improvement of it.  That's all.

piotr
Ingo Menger - 24 May 2007 16:05 GMT
> > String.replace is but a convenience method. If you use it in a tight
> > loop, it's your fault.
[quoted text clipped - 8 lines]
> efficiency is not as good as expected, it's replace() implementation
> fault, not mine I think.

I don't see why you couldn't use replaceAll here also. Of course, when
the search pattern changes on each search, one cannot save pattern
compilation time.

> > Try using String.replaceAll and make sure you compile the pattern just
> > once *before* the loop. I guess we'll see that your simple
> > implementation will not be that much faster.
>
> I can't do that, replace() compiles my string each time I call it.

For the example of the OP (replace <br/> by <br>) replaceAll with
precompiled pattern is appropriate.

> Compiling pattern once before calling replaceAll() is not as convenient
> as using replace(),

Sure. You decide if you want fast code or convenient code.

> and is not necessarily helpful in all situations
> (code above is the example).

Quite contrived example. It implies, that the data contains a
description of how it is to be changed.

> As long as regex pattern search is implemented in linear fashion (I mean
> no optimized search algorithm in used, e.g. Boyer-Moore's), using regex
> for "simple replace" will probably be at the most as fast as my simple
> implementation, not better.

Not better, sure, because your simple implmentation is the most
degenerate case of the search/replace problem. And also the least
mighty, by the way.

> I don't think so.  But that's not a matter of our discussion.  Poor
> replace() implementation, that's what our debate is about.

It's poor, no question. But for two good reasons:
a) the more general solution is at hand.
b) fixed string search/replace will most likely be replaced sooner or
later by regex-search replace anyway in every non trivial project. For
trivial projects or ad hoc replace()s, it's good enough.
Piotr Kobzda - 24 May 2007 23:24 GMT
>> Compiling pattern once before calling replaceAll() is not as convenient
>> as using replace(),
>
> Sure. You decide if you want fast code or convenient code.

Why not to have both, fast and convenient code?

>> and is not necessarily helpful in all situations
>> (code above is the example).
>
> Quite contrived example. It implies, that the data contains a
> description of how it is to be changed.

That's true.  But it's just an example of situations when I can't
predict what replace pattern I will use.  Even simpler example is a need
of replace something once.  Convenience don't have to be compromised in
such a cases.

> It's poor, no question. But for two good reasons:
> a) the more general solution is at hand.

Better is at hand also, a few posts before. ;)

> b) fixed string search/replace will most likely be replaced sooner or
> later by regex-search replace anyway in every non trivial project. For
> trivial projects or ad hoc replace()s, it's good enough.

Agreed.  I wish all having that king of problems only...  :)

piotr


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.