Java Forum / General / May 2007
String.replace() in practice
Aljosa Mohorovic - 22 May 2007 12:25 GMT i currently have this:
String tmp = "this is text for flash player"; tmp = tmp.replace("š", "š"); tmp = tmp.replace("Š", "Š"); tmp = tmp.replace("<br />", "<br>"); ...
and this works but i was wondering if there is a better way to do this?
Aljosa
Tom Hawtin - 22 May 2007 12:54 GMT > String tmp = "this is text for flash player"; > tmp = tmp.replace("š", "š"); [quoted text clipped - 4 lines] > and this works but i was wondering if there is a better way to do > this? Better in what sense?
In terms of making it easier to read, you could write it something like:
String escaped = "this is text for flash player" .replace("š", "š") .replace("Š", "Š") .replace("<br />", "<br>") ...
For performance, it depends how much effort versus how much performance you want.
Writing an in-place replacement with a StringBuilder is easy enough using indexof(String,int) and replace(int,int,String). Slightly more sophisticated would be copying for one StringBuilder to another.
If all your replacement match strings start with & and <, you might want to write your own optimised code using a char array.
Tom Hawtin
Robert Klemme - 22 May 2007 14:15 GMT >> String tmp = "this is text for flash player"; >> tmp = tmp.replace("š", "š"); [quoted text clipped - 21 lines] > using indexof(String,int) and replace(int,int,String). Slightly more > sophisticated would be copying for one StringBuilder to another. For example using the algorithm suggested by stdlib documentation (hint to OP):
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Matcher.html#appendRepla cement(java.lang.StringBuffer,%20java.lang.String)
> If all your replacement match strings start with & and <, you might want > to write your own optimised code using a char array. > > Tom Hawtin Kind regards
robert
Lew - 22 May 2007 14:22 GMT Aljosa Mohorovic wrote:
>> tmp = tmp.replace("<br />", "<br>"); Apparently you are dealing with HTML. If that's so, why in the world would you replace a correct "<br />" with an incorrect "<br>"?
 Signature Lew
Aljosa Mohorovic - 22 May 2007 17:42 GMT > Apparently you are dealing with HTML. If that's so, why in the world would > you replace a correct "<br />" with an incorrect "<br>"? because content will be rendered in flash player and maybe some devices where <br /> is not recognized but <br> is. i do many things i would like to avoid but because of platforms i work with i don't have that choice.
thanks to everybody for tips.
Aljosa Mohorovic - 22 May 2007 19:17 GMT > Apparently you are dealing with HTML. If that's so, why in the world would > you replace a correct "<br />" with an incorrect "<br>"? because content will be rendered in flash player and maybe some devices where <br /> is not recognized but <br> is. i do many things i would like to avoid but because of platforms i work with i don't have that choice.
thanks to everybody for tips.
angrybaldguy@gmail.com - 22 May 2007 20:03 GMT > Aljosa Mohorovic wrote: > >> tmp = tmp.replace("<br />", "<br>"); [quoted text clipped - 4 lines] > -- > Lew Because <br /> is incorrect HTML and correct XHTML, and XHTML so far hasn't reached critical mass: it's still impossible to serve correctly to IE.
Joshua Cranmer - 22 May 2007 23:35 GMT > Because <br /> is incorrect HTML and correct XHTML, and XHTML so far > hasn't reached critical mass: it's still impossible to serve correctly > to IE. Not really. Haven't you heard of the doctype? IE can't serve the MIME-type correctly, but it can properly parse an XHTML-style doctype, even if its delivered as a text/html.
Furthermore, XHTML was made a W3C recommendation on January 4, 2000, so it's been around 7 years and anything that can handle HTML and not XHTML is horrendously outdated.
Finally, the XML <br /> close sequence was actually derived from the erroneous common handling of the /proper/ SGML markup <br/ (most browsers actually don't handle HTML properly...). Most pre-XHTML browsers were handling <hr /> as <hr> instead of the correct <hr >>
John W. Kennedy - 24 May 2007 22:16 GMT >> Because <br /> is incorrect HTML and correct XHTML, and XHTML so far >> hasn't reached critical mass: it's still impossible to serve correctly [quoted text clipped - 3 lines] > MIME-type correctly, but it can properly parse an XHTML-style doctype, > even if its delivered as a text/html. Nope. IE parses XHTML served as text/html as though it were HTML.
> Furthermore, XHTML was made a W3C recommendation on January 4, 2000, so > it's been around 7 years and anything that can handle HTML and not XHTML > is horrendously outdated. Like IE. Except, of course, that "outdated" isn't the issue. Microsoft has deliberately and maliciously killed XHTML.
 Signature John W. Kennedy "Compact is becoming contract, Man only earns and pays." -- Charles Williams. "Bors to Elayne: On the King's Coins" * TagZilla 0.066 * http://tagzilla.mozdev.org
Daniel Pitts - 22 May 2007 16:41 GMT On May 22, 4:25 am, Aljosa Mohorovic <aljosa.mohoro...@gmail.com> wrote:
> i currently have this: > [quoted text clipped - 8 lines] > > Aljosa There isn't a simpler way to do this, if thats what your question was. There are more efficient methods, but most of them are quite a bit more complex.
Oh, also, you should consider the fact that String.replace only replaces the FIRST instance in the string. you can use replaceAll, but replaceAll uses regex (so you have to make sure that your search/ replace text is escaped appropriately). Also, being regex, it does actually slow things down somewhat.
I say, stick with the easiest code unless it has been proven to be too inefficient. If you find your program is taking up to many resources, you could then consider using some sort of finite state machine to parse and replace everything in one go. Actually, I wish Sun would add to the JDK a method String.translate(java.util.Map<String, String> translations); The intent would be that any key in the String would be replaced by the corresponding value. This could be somewhat optimized.
Tom Hawtin - 22 May 2007 16:54 GMT > I say, stick with the easiest code unless it has been proven to be too > inefficient. If you find your program is taking up to many resources, > you could then consider using some sort of finite state machine to > parse and replace everything in one go. The API docs for String.replace say "Replaces each substring", so it shouldn't be just the first occurrence that is replaced. String.replaceFirst does that, only with a regex (nice API).
> Actually, I wish Sun would > add to the JDK a method String.translate(java.util.Map<String, String> > translations); The intent would be that any key in the String would > be replaced by the corresponding value. This could be somewhat > optimized. You could write you own. If it becomes popular, it could be incorporated into the library. My hunch is that it wouldn't in fact be that useful in practice.
Tom Hawtin
Piotr Kobzda - 22 May 2007 16:56 GMT > Oh, also, you should consider the fact that String.replace only > replaces the FIRST instance in the string. Quote from replace() documentation:
"Replaces *each* substring of this string that matches the literal target sequence with the specified literal replacement sequence."
> you can use replaceAll, > but replaceAll uses regex (so you have to make sure that your search/ > replace text is escaped appropriately). Also, being regex, it does > actually slow things down somewhat. replace() uses reqex as well, it's current Sun's JRE implementation is:
public String replace(CharSequence target, CharSequence replacement) { return Pattern.compile(target.toString(), Pattern.LITERAL).matcher( this).replaceAll(Matcher.quoteReplacement(replacement.toString())); }
piotr
Tom Hawtin - 22 May 2007 18:07 GMT > replace() uses reqex as well, it's current Sun's JRE implementation is: > > public String replace(CharSequence target, CharSequence replacement) { > return Pattern.compile(target.toString(), Pattern.LITERAL).matcher( > this).replaceAll(Matcher.quoteReplacement(replacement.toString())); > } The implementation uses the regex library. However, the arguments are not treated as such.
Note the Pattern.LITERAL. Current API docs read:
"When this flag is specified then the input string that specifies the pattern is treated as a sequence of literal characters. Metacharacters or escape sequences in the input sequence will be given no special meaning."
More bad APIs...
Tom Hawtin
Piotr Kobzda - 22 May 2007 23:43 GMT >> public String replace(CharSequence target, CharSequence replacement) { >> return Pattern.compile(target.toString(), Pattern.LITERAL).matcher( [quoted text clipped - 4 lines] > The implementation uses the regex library. However, the arguments are > not treated as such. The second argument is still treated as such.
In fact, there is no significant difference in efficiency between a current implementation, and the following "full" regex:
return Pattern.compile(Matcher.quoteReplacement(target.toString())) .matcher(this).replaceAll( Matcher.quoteReplacement(replacement.toString()));
I don't get, why regex is in use there? Couldn't a replace() be implemented in a more "straightforward" way? E.g.:
public String replace(CharSequence target, CharSequence replacement) { String tgt = target.toString(); String rep = replacement.toString(); StringBuilder sb = new StringBuilder(length()); int s = 0; for(int e; (e = indexOf(tgt, s)) != -1; s = e + tgt.length()) { sb.append(substring(s, e)).append(rep); } return sb.append(substring(s)).toString(); }
My simple benchmarks shows that it's about twice faster then current implementation. Thus, is it a good idea to use regex library for that?
piotr
Arne Vajhøj - 23 May 2007 03:36 GMT > In fact, there is no significant difference in efficiency between a > current implementation, and the following "full" regex:
> I don't get, why regex is in use there? Couldn't a replace() be > implemented in a more "straightforward" way? E.g.:
> My simple benchmarks shows that it's about twice faster then current > implementation. Thus, is it a good idea to use regex library for that? To me this is an obvious goof. I can see it for me.
The Java team had mile long list of complaints over .replaceAll not working with backslashes, periods etc.. The tech lead looked over at the newest team member and told him to create a .replace that did not use regex syntax. He then copy pasted the .replaceAll code and used Pattern.LITERAL. And due to time constraints they had to cancel a few code reviews and this simple code was picked not to be reviewed.
Note that I do not have any indication that it was what actually happened. But stuff like this happen all the time in software development.
Arne
Tom Hawtin - 23 May 2007 04:01 GMT >>> public String replace(CharSequence target, CharSequence replacement) { >>> return Pattern.compile(target.toString(), Pattern.LITERAL).matcher( [quoted text clipped - 6 lines] > > The second argument is still treated as such. In what way? It is escaped properly by quoteReplacement.
> In fact, there is no significant difference in efficiency between a > current implementation, and the following "full" regex: > > return Pattern.compile(Matcher.quoteReplacement(target.toString())) ^^^^^^^^^^^^^^^^^^^^^^^^Pattern.quote
> .matcher(this).replaceAll( > Matcher.quoteReplacement(replacement.toString())); > > I don't get, why regex is in use there? Couldn't a replace() be > implemented in a more "straightforward" way? E.g.: I don't know the reason. Probably not a priority. The way it is implemented has less source and object code, and the methods it calls may already be compiled.
Tom Hawtin
Piotr Kobzda - 23 May 2007 08:55 GMT >> In fact, there is no significant difference in efficiency between a >> current implementation, and the following "full" regex: [quoted text clipped - 3 lines] >> .matcher(this).replaceAll( >> Matcher.quoteReplacement(replacement.toString())); Oh, you're right. However, it doesn't change its general efficiency...
>> I don't get, why regex is in use there? Couldn't a replace() be >> implemented in a more "straightforward" way? E.g.: > > I don't know the reason. Probably not a priority. The way it is > implemented has less source and object code, and the methods it calls > may already be compiled. Sounds reasonable for me. Hope that most of its users will follow that, and won't self implement it, just because of regex...
piotr
Piotr Kobzda - 23 May 2007 11:16 GMT >>> The implementation uses the regex library. However, the arguments are >>> not treated as such. >> >> The second argument is still treated as such. > > In what way? It is escaped properly by quoteReplacement. It's properly escaped. But is treated the same way as the other replacement strings by regex library.
In other words, it's treated the same way as the second argument of replaceAll() in:
str.replaceAll(Pattern.quote(target), Matcher.quoteReplacement(replacement));
Regex parser logic is still involved.
piotr
Ingo Menger - 23 May 2007 13:40 GMT > I don't get, why regex is in use there? Because "replace a constant substring with another constant string" is an instance of the more general problem: "replace a substring that looks like regex with a string that looks like replacement"?
Piotr Kobzda - 23 May 2007 23:58 GMT >> I don't get, why regex is in use there? > > Because "replace a constant substring with another constant string" is > an instance of the more general problem: "replace a substring that > looks like regex with a string that looks like replacement"? Maybe... However, as my simple implementation shows, that's not difficult to solve it without solving the more general problem. Why then should we pay for the overheads of that general solution, when a simple and more efficient solution exits?
OTOH, having to compute the area of the square whose sides have length S, what would you choose to solve the problem: a) well known S^2 formula, b) computation of the integral between 0 and S of function f(x)=S, because you have the best integral evaluation algorithm implemented ?
Current replace() implementation looks like choosing b) to me.
piotr
Twisted - 24 May 2007 07:31 GMT > OTOH, having to compute the area of the square whose sides have length > S, what would you choose to solve the problem: [quoted text clipped - 4 lines] > > Current replace() implementation looks like choosing b) to me. Is it? Or is it more like applying the MxN rectangle-area formula to the special case where the sides happen to be equal? That's clearly not as "bad" in terms of either overgeneralizing or losing efficiency (although with bignums, at least, a squaring can be twice the speed of a generic multiply).
Piotr Kobzda - 24 May 2007 15:09 GMT >> OTOH, having to compute the area of the square whose sides have length >> S, what would you choose to solve the problem: [quoted text clipped - 10 lines] > (although with bignums, at least, a squaring can be twice the speed of > a generic multiply). Well, for each input S, when M=N=S, the MxN formula gives exactly the same result what applying the S^2 squaring formula gives. So we can safely use it to achieve a squaring expected results. Using integral computation algorithms we can of course also achieve the same results, but most probably it requires an extra overhead (like forming a function based on input, etc.).
Similar kind of overheads I see in replace() implemented using regex.
To solve replace() problem a "simple search" is enough. Regex library performs "pattern search", which is more advanced and complicated, than a "simple" one. Thus we have to add some extra overhead, to make it "simple" again, performing completely unnecessary operations, e.g. escaping and parsing a replacement string...
That's why I think it looks more like b).
piotr
Ingo Menger - 24 May 2007 10:50 GMT > >> I don't get, why regex is in use there? > [quoted text clipped - 4 lines] > Maybe... However, as my simple implementation shows, that's not > difficult to solve it without solving the more general problem. But the more general problem (pattern matching) is already solved.
> Why > then should we pay for the overheads of that general solution, when a > simple and more efficient solution exits? We don't have to. String.replace is but a convenience method. If you use it in a tight loop, it's your fault. Try using String.replaceAll and make sure you compile the pattern just once *before* the loop. I guess we'll see that your simple implementation will not be that much faster.
> OTOH, having to compute the area of the square whose sides have length > S, what would you choose to solve the problem: > a) well known S^2 formula, > b) computation of the integral between 0 and S of function f(x)=S, > because you have the best integral evaluation algorithm implemented?
> Current replace() implementation looks like choosing b) to me. See answer from Twisted, whom I fully support on this issue.
The situation is also different from this contrived example in that we all know how requirements change. Today, you are happy to replace all <br/> by <br>, tomorrow, you notice, that sometimes <BR/> and even <Br/
> appear, and the next day the same happens for some other tag. Do you think that pattern = Pattern.compile("?i:<(\\w+)/>>"); s = s.replaceAll(pattern, "<$1>");
is still slower than: s.simplereplace("<br/>", "<br>") .simplereplace("<BR/>", "<br>") .simplereplace("<bR/>", "<br>") .simplereplace("<Br/>", "<br>") .simplereplace // all the variants for other tags in question
Piotr Kobzda - 24 May 2007 15:39 GMT > String.replace is but a convenience method. If you use it in a tight > loop, it's your fault. What is my fault when I need it? For example:
for(String line; (line = readLine()) != null; ) { writeLine(line.replace(getReplaceTarget(), getReplacement())); }
Why should I care about replace() efficiency here? When replace() efficiency is not as good as expected, it's replace() implementation fault, not mine I think.
> Try using String.replaceAll and make sure you compile the pattern just > once *before* the loop. I guess we'll see that your simple > implementation will not be that much faster. I can't do that, replace() compiles my string each time I call it.
Compiling pattern once before calling replaceAll() is not as convenient as using replace(), and is not necessarily helpful in all situations (code above is the example). However, even having each pattern compiled once, my simple implementation is about two times faster (check it!).
As long as regex pattern search is implemented in linear fashion (I mean no optimized search algorithm in used, e.g. Boyer-Moore's), using regex for "simple replace" will probably be at the most as fast as my simple implementation, not better.
> Do you think that > pattern = Pattern.compile("?i:<(\\w+)/>>"); [quoted text clipped - 6 lines] > .simplereplace("<Br/>", "<br>") > .simplereplace // all the variants for other tags in question I don't think so. But that's not a matter of our discussion. Poor replace() implementation, that's what our debate is about.
piotr
Robert Klemme - 24 May 2007 15:55 GMT >> String.replace is but a convenience method. If you use it in a tight >> loop, it's your fault. [quoted text clipped - 8 lines] > efficiency is not as good as expected, it's replace() implementation > fault, not mine I think. I am sorry - are you kidding? If you write a program and it's too slow, *you* are responsible. It does not matter whether the slowness is caused by the code you wrote or caused by picking the wrong library class or method.
If you have a large set of objects and need to frequently check whether something is member of that set you would not use a LinkedList for the set and blame the list implementation for the slowness, would you?
Knowing semantics and performance characteristics (where necessary) of library code is part of software engineering like being capable of writing code yourself.
Please reconsider your statement.
robert
Piotr Kobzda - 24 May 2007 23:14 GMT >> Why should I care about replace() efficiency here? When replace() >> efficiency is not as good as expected, it's replace() implementation >> fault, not mine I think. > > I am sorry - are you kidding? No.
> If you write a program and it's too slow, > *you* are responsible. It does not matter whether the slowness is > caused by the code you wrote or caused by picking the wrong library > class or method. Of course. But notice, that I'm not talking about fault of my program, that's just speculation what is possible fault cause.
> If you have a large set of objects and need to frequently check whether > something is member of that set you would not use a LinkedList for the > set and blame the list implementation for the slowness, would you? No. I wouldn't. I would choose more appropriate list based on my experience and library documentation.
> Knowing semantics and performance characteristics (where necessary) of > library code is part of software engineering like being capable of > writing code yourself. > > Please reconsider your statement. My statement is just a complain on replace() implementation, which I think could be better.
Semantics of that method are well documented and meets my expectations, but with its performance characteristics (which I know only because I've seen its implementation and run some tests) I'm not fully satisfied.
Don't take me wrong. I don't try to say that current implementation is completely bad. It just fails some of my personal acceptance tests, and I'm wondering about possible improvement of it. That's all.
piotr
Ingo Menger - 24 May 2007 16:05 GMT > > String.replace is but a convenience method. If you use it in a tight > > loop, it's your fault. [quoted text clipped - 8 lines] > efficiency is not as good as expected, it's replace() implementation > fault, not mine I think. I don't see why you couldn't use replaceAll here also. Of course, when the search pattern changes on each search, one cannot save pattern compilation time.
> > Try using String.replaceAll and make sure you compile the pattern just > > once *before* the loop. I guess we'll see that your simple > > implementation will not be that much faster. > > I can't do that, replace() compiles my string each time I call it. For the example of the OP (replace <br/> by <br>) replaceAll with precompiled pattern is appropriate.
> Compiling pattern once before calling replaceAll() is not as convenient > as using replace(), Sure. You decide if you want fast code or convenient code.
> and is not necessarily helpful in all situations > (code above is the example). Quite contrived example. It implies, that the data contains a description of how it is to be changed.
> As long as regex pattern search is implemented in linear fashion (I mean > no optimized search algorithm in used, e.g. Boyer-Moore's), using regex > for "simple replace" will probably be at the most as fast as my simple > implementation, not better. Not better, sure, because your simple implmentation is the most degenerate case of the search/replace problem. And also the least mighty, by the way.
> I don't think so. But that's not a matter of our discussion. Poor > replace() implementation, that's what our debate is about. It's poor, no question. But for two good reasons: a) the more general solution is at hand. b) fixed string search/replace will most likely be replaced sooner or later by regex-search replace anyway in every non trivial project. For trivial projects or ad hoc replace()s, it's good enough.
Piotr Kobzda - 24 May 2007 23:24 GMT >> Compiling pattern once before calling replaceAll() is not as convenient >> as using replace(), > > Sure. You decide if you want fast code or convenient code. Why not to have both, fast and convenient code?
>> and is not necessarily helpful in all situations >> (code above is the example). > > Quite contrived example. It implies, that the data contains a > description of how it is to be changed. That's true. But it's just an example of situations when I can't predict what replace pattern I will use. Even simpler example is a need of replace something once. Convenience don't have to be compromised in such a cases.
> It's poor, no question. But for two good reasons: > a) the more general solution is at hand. Better is at hand also, a few posts before. ;)
> b) fixed string search/replace will most likely be replaced sooner or > later by regex-search replace anyway in every non trivial project. For > trivial projects or ad hoc replace()s, it's good enough. Agreed. I wish all having that king of problems only... :)
piotr
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|