Java Forum / General / November 2007
Java text compression
Chris - 18 Nov 2007 19:44 GMT What's the fastest way to compress/decompress text?
We're doing that over really large datasets in our app. We're currently converting char arrays to byte arrays using our own UTF-8 conversion code, and then compressing the bytes using java.util.zip. The code is pretty old.
I don't like this two-step process, and the profiler shows that this is a bottleneck in our app.
Is anyone aware of any code that compresses chars directly? Perhaps a third-party library that does it faster?
In our particular situation, decompression speed is a lot more important than compression speed.
Joshua Cranmer - 18 Nov 2007 20:13 GMT > What's the fastest way to compress/decompress text? I know of a compression method that can compress in constant time: it replaces the entire text with a single `0' (hey, it's lossy!). To what degree of compression do you need?
> We're doing that over really large datasets in our app. We're currently > converting char arrays to byte arrays using our own UTF-8 conversion [quoted text clipped - 3 lines] > I don't like this two-step process, and the profiler shows that this is > a bottleneck in our app. Questions: 1. From whence are the char arrays coming? 2. Where are the byte arrays going? 3. Which part of the process is the bottleneck? The UTF-8 conversion, or the compression?
> Is anyone aware of any code that compresses chars directly? Perhaps a > third-party library that does it faster? > > In our particular situation, decompression speed is a lot more important > than compression speed.
 Signature Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth
Chris - 18 Nov 2007 22:19 GMT >> What's the fastest way to compress/decompress text? > > I know of a compression method that can compress in constant time: it > replaces the entire text with a single `0' (hey, it's lossy!). To what > degree of compression do you need? As with all compression apps, as much as possible. It's hard to pick a number until you know the available tradeoffs. We already know the tradeoffs with java.util.zip, so obviously we're looking for something faster.
> Questions: > 1. From whence are the char arrays coming? Memory. They get streamed from an external source, compressed, and written to disk as they come in.
> 2. Where are the byte arrays going? Also memory, although ultimately to the UI.
> 3. Which part of the process is the bottleneck? The UTF-8 conversion, or > the compression? Both, though the UTF-8 is the bigger burden.
Eric Sosman - 18 Nov 2007 20:16 GMT > What's the fastest way to compress/decompress text? If you're really interested in "the fastest way" to the exclusion of all other concerns, then don't compress at all. Bingo! Problem solved!
You might be happier with a compression scheme that did a little better at reducing the size of the data, but now you can't get a sensible answer until you describe the trade-offs you're willing to make. For example, if you were offered a compression scheme that ran ten percent faster than your current method but emitted fifteen percent more data, would you take it or reject it?
> We're doing that over really large datasets in our app. We're currently > converting char arrays to byte arrays using our own UTF-8 conversion [quoted text clipped - 6 lines] > Is anyone aware of any code that compresses chars directly? Perhaps a > third-party library that does it faster? How badly do you need your own idiosyncratic UTF-8 conversion? If you can use standard methods, consider wrapping the compressed streams, somewhat like
Writer w = new OutputStreamWriter( new GZIPOutputStream(...)); w.write("Hello, world!");
BufferedReader r = new BufferedReader( new InputStreamReader( new GZIPInputStream(...))); String s = r.readLine();
You'll have to make your own assessment of the speeds and the degree of compression.
> In our particular situation, decompression speed is a lot more important > than compression speed. "You'll have to make your own assessment ..."
 Signature Eric Sosman esosman@ieee-dot-org.invalid
Robert Klemme - 18 Nov 2007 21:36 GMT >> What's the fastest way to compress/decompress text? > [quoted text clipped - 9 lines] > method but emitted fifteen percent more data, would you take it > or reject it? Bonus question for OP: what is the size of data sets and how are they used? Especially, where are they stored?
>> We're doing that over really large datasets in our app. We're >> currently converting char arrays to byte arrays using our own UTF-8 [quoted text clipped - 13 lines] > Writer w = new OutputStreamWriter( > new GZIPOutputStream(...)); Minor detail: The encoding is missing, so in this case it would rather be
new OutputStreamWriter(new GZIPOutputstream(...), "UTF-8")
> w.write("Hello, world!"); > [quoted text clipped - 5 lines] > You'll have to make your own assessment of the speeds and the > degree of compression. This is the solution I was going to suggest as well. If the custom encoding yields the same results as Java's built in UTF-8 then I would immediately switch to this approach of stacked streams. If the custom encoding yields slightly different results I am sure you can plug in the custom encoding into the standard Java io and nio classes with a little effort.
If data is to be stored in a BLOB via JDBC you can even extend this approach to directly stream into the database.
>> In our particular situation, decompression speed is a lot more >> important than compression speed. > > "You'll have to make your own assessment ..." Decompression is generally much faster than compression. I believe there is not much difference in decompression speed when decompressing a GZIP stream that was compressed with highest and lowest level. If you dig a bit into compression theory then it's pretty obvious that finding a small compressed representation is significantly harder than converting a compressed data set back.
Kind regards
robert
Chris - 18 Nov 2007 22:35 GMT > Bonus question for OP: what is the size of data sets and how are they > used? Especially, where are they stored? Multi-terabyte sized, split across multiple machines. On a single machine, generally not more than a few hundred Gb. One or two disks per machine, SATA, no RAID.
At compression time, the data is streamed from an external source, transformed in memory, and written to disk.
At decompression time, the app seeks to the particular block of text of interest and decompresses it. Seek time dominates decompression time, *except* when we do heavy caching, in which case the decompression becomes the bottleneck. Storing the decompressed text in memory takes up too much space. Has to be cached in compressed form.
Eric Sosman - 18 Nov 2007 23:30 GMT >> Bonus question for OP: what is the size of data sets and how are they >> used? Especially, where are they stored? [quoted text clipped - 11 lines] > becomes the bottleneck. Storing the decompressed text in memory takes up > too much space. Has to be cached in compressed form. Cutting the text into blocks usually makes the compression less effective. Most compression schemes nowadays (and I believe DEFLATE is among them) are adaptive, meaning that they adjust to the characteristics of the data stream as they process it. Thus, they compress relatively poorly at first, then improve as they learn more about the statistical profile of the data.
Implications: (1) You'll get better compression if you can keep the blocks "fairly long." (1a) You might make the blocks "long" by concatenating multiple sub-blocks, at the expense of needing to decompress from the start of a block even if you only need the sub-block at its end. (2) If the blocks simply must be small, you probably shouldn't waste effort on BEST_COMPRESSION.
Some interesting experiments are in order.
 Signature Eric Sosman esosman@ieee-dot-org.invalid
George Neuner - 19 Nov 2007 06:51 GMT >>> Bonus question for OP: what is the size of data sets and how are they >>> used? Especially, where are they stored? >> >> Multi-terabyte sized, split across multiple machines. On a single >> machine, generally not more than a few hundred Gb. One or two disks per >> machine, SATA, no RAID. This sounds like it might be DNA sequences.
>> At compression time, the data is streamed from an external source, >> transformed in memory, and written to disk. [quoted text clipped - 11 lines] >they compress relatively poorly at first, then improve as they >learn more about the statistical profile of the data. Most LZ based implementations (including DEFLATE) limit codes to 16-bits (I've heard of 32-bit LZ, but I've never seen it). Compression studies have shown that, on average, the 16-bit code dictionary will be filled after processing 200KB of input.
If the remainder of the input is characteristically similar to the part already encoded, the full dictionary will compress the rest of the input pretty well. But most input varies as it goes, sometimes rapidly and drastically, so it does make sense to segment the input to take advantage of the variation.
> Implications: (1) You'll get better compression if you can >keep the blocks "fairly long." (1a) You might make the blocks [quoted text clipped - 4 lines] > > Some interesting experiments are in order. Unfortunately, the best segment length depends on the input. For general text I would probably use relatively small blocks, 16-64KB depending on the input - the more variation in the input, the smaller the blocks need to be to take advantage of it.
If we are talking about DNA sequences, then I would probably go for 256KB - once the base nucleotides and amino acid sequences are in the dictionary (and you can guarantee this by preloading them), compression is typically very good (80+%), so it makes sense to not worry about it and just pick a convenient sized buffer to work with.
George -- for email reply remove "/" from address
rossum - 19 Nov 2007 09:59 GMT >This sounds like it might be DNA sequences. If it is DNA sequences, then you might be able to use something like Base64 for initial coding:
A -> 00 C -> 01 G -> 10 T -> 11
AAA -> 000000 ... ACG -> 000110 ... TTT -> 111111
Those six bit codes are a straight match for Base64 coding. That will compress three bytes "AAC" into one "B" (000001 -> B in Base64).
rossum
Andrew Thompson - 19 Nov 2007 11:14 GMT >>This sounds like it might be DNA sequences. >If it is DNA sequences, .. Sounds more like a heap of speculation over a 'cagey' problem specification.
To clarify the 'cagey' aspect, perhaps the OP can indeed express what the ulitmate point of this exercise is, rather than doling out 'precious little tid-bits' that supposedly represent the actual problem.
(Ultimately, the OP's own posts identify that they are entirely unaware of the subtleties of the problems they face, and until they fill in more detail, it really constitutes an 'interesting but inefficient use of bandwidth' to speculate further.)
 Signature Andrew Thompson http://www.physci.org/
Chris - 20 Nov 2007 03:55 GMT >>> This sounds like it might be DNA sequences. >> If it is DNA sequences, .. [quoted text clipped - 10 lines] > they fill in more detail, it really constitutes an 'interesting but > inefficient use of bandwidth' to speculate further.) I started to write a long, irritable response to this, but I just don't have time. Maybe I'll start a new thread later on twerps whose snotty, fatuous responses waste bandwidth.
I'll say only this:
1. The problem is fully specified. Text compression is a known problem, and I just asked if anyone knew of a Java library that had better tradeoffs than UTF-8 + zip.
2. Text means text, not DNA. Written words.
3. I'm fully aware of the subtleties of this particular problem space, and there is nothing in any of the posts so far that I haven't considered and already benchmarked. (Including block sizes. For this app, ~10k is optimal).
4. You don't need to know about the environment or the rest of the app, because I'm not asking for a damn consultant.
5. Asking "how much compression I want" is just stupid.
In short, a narrow question is an invitation for a narrow answer, not an invitation for you to tell me how you think I should write this app.
Eric Sosman - 20 Nov 2007 04:15 GMT > [...] > 1. The problem is fully specified. Text compression is a known problem, > and I just asked if anyone knew of a Java library that had better > tradeoffs than UTF-8 + zip. > > 2. Text means text, not DNA. Written words. Elsethread you've explained that the compressed stream gets read back into a companion program and decompressed there; this suggests that it doesn't need to be exchanged with "foreign" programs. In which case, I ask again: Does UTF-8 encoding buy you enough additional compression to justify its expense? How bad would things be if you just handed 16-bit chars to the compressor with no "intelligence" whatsoever?
> 5. Asking "how much compression I want" is just stupid. Well, you asked about compression speed. Other things being equal, faster compressors compress less well and "looser" compressors compress faster, so the question of "how much" must eventually arise when you weigh alternatives.
 Signature Eric Sosman esosman@ieee-dot-org.invalid
Chris - 20 Nov 2007 04:51 GMT >> [...] >> 1. The problem is fully specified. Text compression is a known [quoted text clipped - 10 lines] > bad would things be if you just handed 16-bit chars to the > compressor with no "intelligence" whatsoever? I'd like to try that. Unfortunately, java.util.zip.Deflater accepts only byte arrays, not char arrays. I suppose it might be faster to copy the chars to 2-byte sequences and compress, rather than run the UTF-8 compressor. An extra step, but worth a try.
>> 5. Asking "how much compression I want" is just stupid. > > Well, you asked about compression speed. Other things > being equal, faster compressors compress less well and "looser" > compressors compress faster, so the question of "how much" must > eventually arise when you weigh alternatives. Of course. It just reminded of walking into a store and having the clerk ask "how much do you want to pay?" The right answer is, "show me the merchandise and I'll figure out what the tradeoffs are on my own".
Andrew Thompson - 20 Nov 2007 05:02 GMT >>> [...] ...
>Of course. It just reminded of walking into a store and having the clerk >ask "how much do you want to pay?" ... That might be an appropriate comparison if this were a help desk. It is not a help desk. (Though you seem to be treating the responders as though they were your own personal servants - so perhaps you are confused about the differences between 'help-desk' and 'usenet').
 Signature Andrew Thompson http://www.physci.org/
Roger Lindsjö - 20 Nov 2007 11:16 GMT >>> 5. Asking "how much compression I want" is just stupid. >> [quoted text clipped - 6 lines] > ask "how much do you want to pay?" The right answer is, "show me the > merchandise and I'll figure out what the tradeoffs are on my own". But don't you think it is an appropriate question? At least to get in the ballpark? If you answered that you have regular text and you need ~10X compression, then people can suggest algorithms that has been shown to give that compression. With no specification you will get suggestions that are very fast (no compression was suggested) to methods that will give very high compression rates but run very slowly.
Similar to your analogy, the electronics store where I have been looking at a new TV has TVs ranging from about $100 to $120000. I doubt that when someone walks and says "I want a new TV" that the first suggestion is to look at the very expensive on (I heard a rumor that 1 was sold in Sweden so far). Size and price are common to narrow down the number of choices.
//Roger Lindsjö
Lew - 20 Nov 2007 13:59 GMT Chris wrote:
>> Of course. It just reminded of walking into a store and having the >> clerk ask "how much do you want to pay?" The right answer is, "show me >> the merchandise and I'll figure out what the tradeoffs are on my own".
> But don't you think it is an appropriate question? At least to get in > the ballpark? ... [quoted text clipped - 5 lines] > Sweden so far). Size and price are common to narrow down the number of > choices. Car dealers, clothing stores, flower shops - all routinely ask me "how much do I want to spend?"
 Signature Lew
Lasse Reichstein Nielsen - 20 Nov 2007 17:30 GMT > Car dealers, clothing stores, flower shops - all routinely ask me "how > much do I want to spend?" ... and this is where you lie! :) /L
 Signature Lasse Reichstein Nielsen - lrn@hotpop.com DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html> 'Faith without judgement merely degrades the spirit divine.'
Lew - 21 Nov 2007 01:50 GMT >> Car dealers, clothing stores, flower shops - all routinely ask me "how >> much do I want to spend?" > > .... and this is where you lie! :) Which way? Does one claim to be a big spender or a skinflint?
 Signature Lew
Lasse Reichstein Nielsen - 21 Nov 2007 05:59 GMT >>> Car dealers, clothing stores, flower shops - all routinely ask me "how >>> much do I want to spend?" >> .... and this is where you lie! :) > > Which way? Does one claim to be a big spender or a skinflint? OK, I guess it's mainly car dealers where you don't want to tell them your limit. Car shopping is a negotiation, and it's harder to press the price if you have already told them what you are willing to spend. I'd say something at least 20% lower than my real limit. I.e., neither a big spender or a skinflint, but someone more restricted in your spending than you actually are. But then again, I'm not a great negotiatior, so don't necessarily trust my opinion :)
Flower shops are ok.
/L
 Signature Lasse Reichstein Nielsen - lrn@hotpop.com DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html> 'Faith without judgement merely degrades the spirit divine.'
Eric Sosman - 20 Nov 2007 13:37 GMT >>> [...] >>> 1. The problem is fully specified. Text compression is a known [quoted text clipped - 15 lines] > chars to 2-byte sequences and compress, rather than run the UTF-8 > compressor. An extra step, but worth a try. Might it instead be the removal of a step? You've also mentioned that the data are "streamed from an external source;" in what form do they arrive? Unless you're using something like RMI they probably don't arrive as full-fledged String objects, but are converted to Strings from some more primitive form -- like, perhaps, streams of bytes?
>>> 5. Asking "how much compression I want" is just stupid. >> [quoted text clipped - 6 lines] > ask "how much do you want to pay?" The right answer is, "show me the > merchandise and I'll figure out what the tradeoffs are on my own". ... and to push the analogy perhaps just a little bit too far, the kindly clerk dumps the 600-page inventory list in your lap and walks away. That is, a search is more efficient if the searcher is an active participant instead of a filter-feeder.
 Signature Eric Sosman esosman@ieee-dot-org.invalid
Roedy Green - 21 Nov 2007 03:41 GMT On Mon, 19 Nov 2007 23:15:12 -0500, Eric Sosman <esosman@ieee-dot-org.invalid> wrote, quoted or indirectly quoted someone who said :
> Well, you asked about compression speed. Other things >being equal, faster compressors compress less well and "looser" >compressors compress faster, so the question of "how much" must >eventually arise when you weigh alternatives. The proper answer would have been use X if you want fast but not so hot compression use Y if you want slow but deep compression.
The various answers seemed to be disguising the fact nobody had any algorithms other that the minor modification of GZIP to offer. It gave me the impression there were one-upmanship games in play. I can see why Chris felt so pissed to be jerked around that way.
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Eric Sosman - 21 Nov 2007 13:54 GMT > On Mon, 19 Nov 2007 23:15:12 -0500, Eric Sosman > <esosman@ieee-dot-org.invalid> wrote, quoted or indirectly quoted [quoted text clipped - 8 lines] > use X if you want fast but not so hot compression > use Y if you want slow but deep compression. Quite aside from the point that enumerating the set of possible compression techniques may be significantly harder than counting to two, I don't think that's the "proper" answer at all. The O.P. does not have a compression problem, but a systemic problem -- even in the original posting he mentioned the custom- written UTF-8 conversion as part of the bottleneck. The "proper" answer is to take a step back and get a wider view of the context. Global optimization, not peephole optimization.
If someone says he's spending a hundred dollars a week putting motor oil in his car and asks for advice on how to get inexpensive oil, the proper answer is not to help him shave a penny here and a penny there but to find out what's wrong with the car!
> The various answers seemed to be disguising the fact nobody had any > algorithms other that the minor modification of GZIP to offer. It gave > me the impression there were one-upmanship games in play. I can see > why Chris felt so pissed to be jerked around that way. I'll admit to directing some sarcasm at Chris that perhaps an exceptionally thin-skinned person might have seen as ridicule, but I deny jerking him around.
 Signature Eric Sosman esosman@ieee-dot-org.invalid
Lew - 21 Nov 2007 14:02 GMT >> The various answers seemed to be disguising the fact nobody had any >> algorithms other that the minor modification of GZIP to offer. It gave [quoted text clipped - 4 lines] > an exceptionally thin-skinned person might have seen as ridicule, > but I deny jerking him around. There seems to be a nawful lot of thin-skinnedness around lately. Every FAQ I've read, every essay on how best to use it, points out (almost pridefully) that Usenet discussions tend toward lurid prose, even (Gods forfend!) an occasional touch of sarcasm. People were warned (or would have been, if they'd RTFM).
So people: Get over it. Get on with it. Did the information help? Good. Did it not? Ask more questions. Did you feel offended by some turn of rhetoric? Awwwwww.
 Signature Lew
Chris - 21 Nov 2007 23:33 GMT > On Mon, 19 Nov 2007 23:15:12 -0500, Eric Sosman > <esosman@ieee-dot-org.invalid> wrote, quoted or indirectly quoted [quoted text clipped - 13 lines] > me the impression there were one-upmanship games in play. I can see > why Chris felt so pissed to be jerked around that way. Thank you, thank you. I could not have said it better.
I've been thinking for some time about posting on this topic. The problem is that many have a tendency to offer broad programming advice, rather than deal with the poster's question directly.
The problem with this (aside from the irritation factor) is that it tends to crowd out substantive advice. When a thread has a number of answers, many participants tend to skip it, assuming that the question has been answered.
Whenever a poster asks a "How do I make X faster?" question, the non-substantive answers come in droves. Usually they take the form "It depends" (without any discussion of what it depends on), or "Run a profiler/benchmark it first" (without any knowledge about whether the poster has already done that), or just "You don't need to make it fast, just make it work first and optimize later" (which assumes that the poster is so foolish as to have asked the question without first having seen slow behavior in his program).
These sorts of responses are condescending, and really very damaging when they crowd out real answers. I'm sure most of us in the forum have real work that needs to get done, real deadlines, and our income or jobs depend on the insights we get in this and other forums.
We really ought to apply a bit of social pressure to get the egotists and trolls to stuff it.
Eric Sosman - 22 Nov 2007 02:09 GMT > [...] > We really ought to apply a bit of social pressure to get the egotists > and trolls to stuff it. Good idea. Let us also apply some pressure to those who ask strangers for help and then complain when they receive it.
 Signature Eric Sosman esosman@ieee-dot-org.invalid
Roedy Green - 22 Nov 2007 07:05 GMT On Wed, 21 Nov 2007 21:09:13 -0500, Eric Sosman <esosman@ieee-dot-org.invalid> wrote, quoted or indirectly quoted someone who said :
> Good idea. Let us also apply some pressure to those who >ask strangers for help and then complain when they receive it. People tend to be much more candid on the net than they would be in person. That's just the way it is. There is no reason to read a specific-to-you insult into it.
The late Ken Keyes put it this way: "You add suffering to the world just as much when you take offense as when you give offense."
On the other hand, I think several answerers routinely take a rather sadistic joy is putting down newbies and questioners, playing a sort of "Mother May I" game to humiliate them into begging in precisely the correct form, then giving them only a crumb.
This whole compression thing with Chris smells of such a tease because nobody appeared to even know of a potential algorithm.
I think it quite reasonable to request someone produce an SSCCE to help diagnose a problem, but it is not OK to chastise someone new to the group for not already having done so.
Imagine a Monty Pythonesque sketch along these lines:
customer: I would like a beer?
barkeep: would that be a chilled or room temperature?
customer: chilled
barkeep: you stupid sod. Wrong answer. Nobody whose anyone drinks chilled beer. Where do you think you are, Texas?
customer: ok, room temperature.
barkeep: that's better. What brand?
customer: Rickard's Red?
barkeep: don't have it.
customer: Guinness, surely you have a Guinness ale.
barkeep: nope. Try again.
... etc. etc.
customer: what would you recommend?
barkeep: It depends. What's your astrological sign?
... etc. etc.
barkeep: what a silly twit you are! We don't have a liquor licence. You should have read the fine print on the back of the napkins. We don't serve any kind of alcohol here, you stupid, naive, newbie, drunken lout. What do you think this is, a pub? Get out before I call the Raging Grannies.
(not to be taken too literally)
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Andrew Thompson - 22 Nov 2007 10:34 GMT ...
>I think it quite reasonable to request someone produce an SSCCE to >help diagnose a problem, but it is not OK to chastise someone new to >the group for not already having done so. As someone that has a Google alert for the term 'SSCCE', I can assure you - that does not happen.
 Signature Andrew Thompson http://www.physci.org/
Andrew Thompson - 22 Nov 2007 10:49 GMT >... >>I think it quite reasonable to request someone produce an SSCCE to [quoted text clipped - 3 lines] >As someone that has a Google alert for the term 'SSCCE', >I can assure you - that does not happen. Or at least, it does not happen on *usenet*, and if I see any so much as questionable references to produce an SSCCE, I will usually reply and say so.
I *have* seen some rather odd references to SSCCEs on *Sun* forums, but I am not in a position to either get regular updates to the term, nor respond through those (painfully slow) forums.
I do not moniter any other 'web based' forums and am in no position to comment on them.
 Signature Andrew Thompson http://www.physci.org/
Lew - 22 Nov 2007 15:03 GMT >> [...] >> We really ought to apply a bit of social pressure to get the egotists >> and trolls to stuff it. > > Good idea. Let us also apply some pressure to those who > ask strangers for help and then complain when they receive it. It is treatment of Usenet as one's personal help desk that is the height of arrogance.
This is a discussion group. Chris seems to think that we aren't allowed to discuss his topics, but have to jump to his finger snap to answer his problem, and only that, and that any other reaction is egotism and trollishness. Well, la-di-effing-dah.
If you offer a topic here, it's for discussion. If we didn't serve Your Precious Highness exactly as desired, that's the risk with voluntary participants who ourselves are trying to learn.
So get off your high horse, Chris.
 Signature Lew
Robert Klemme - 27 Nov 2007 19:49 GMT > Whenever a poster asks a "How do I make X faster?" question, the > non-substantive answers come in droves. Usually they take the form "It > depends" (without any discussion of what it depends on), I find this one a bit odd. When I do not get enough information to provide a reasonable answer, I rather say "it depends" and ask for more details about the problem to solve. Sometimes there are a few "depends" that can be pointed at and sometimes the problem is so underspecified that it's much more reasonable to poke for more information instead of starting to list all the alternatives.
The alternative might be to just not respond. But that way an OP won't know whether people do not answer because they have no idea or whether they need more input. Sometimes one is so absorbed from the problem at hand that one does not realize that a newsgroup posting contained too little information. In those cases it can be helpful to be asked to provide more input.
Kind regards
robert
George Neuner - 20 Nov 2007 17:58 GMT >2. Text means text, not DNA. Written words. > [quoted text clipped - 5 lines] >4. You don't need to know about the environment or the rest of the app, >because I'm not asking for a damn consultant. There's no need to get nasty. Just in general you'll get better answers when people have a general idea of where you're headed. Storing gigabytes of "text" is not a typical problem other than for search engines, data mining, or DNA research.
>5. Asking "how much compression I want" is just stupid. No, this question is FAR from stupid. For all you knew, the alternatives to "UTF-8 + zip" may have been much worse for your application.
This forum is open to all and people frequently tackle things that are way beyond their capabilities. You are now claiming to have the ability to evaluate suggestions yourself, but we had no way of knowing that until now - you gave us no clues as to your own capabilities and precious little information about what you are attempting.
>In short, a narrow question is an invitation for a narrow answer, not an >invitation for you to tell me how you think I should write this app. A narrow question is frequently a sign of a newbie in over his head. Beyond that, answers to questions frequently create new questions ... if not from the original poster, then from a reader seeking more information about a response.
If you're looking for a simple answer, go find yourself a programming oracle ... on Usenet any halfway interesting subject (of which compression is one) is likely to invite a debate.
George -- for email reply remove "/" from address
Roedy Green - 21 Nov 2007 03:44 GMT On Tue, 20 Nov 2007 12:58:38 -0500, George Neuner <gneuner2/@/comcast.net> wrote, quoted or indirectly quoted someone who said :
>Storing gigabytes of "text" is not a typical problem other than for >search engines, data mining, or DNA research. The more you know about the text the more options you have for compression. For example if the text used some limited character set or some limited word vocabulary specialised compression is possible.
"Text" can vary from instrument readings in ASCII to armoured encrypted data.
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Roedy Green - 19 Nov 2007 11:41 GMT On Mon, 19 Nov 2007 01:51:11 -0500, George Neuner <gneuner2/@/comcast.net> wrote, quoted or indirectly quoted someone who said :
>>> Multi-terabyte sized, split across multiple machines. On a single >>> machine, generally not more than a few hundred Gb. One or two disks per >>> machine, SATA, no RAID. > >This sounds like it might be DNA sequences. If they are, convert each letter to 3 bits, then pack them 21 to a long, wasting the sign bit.
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
rossum - 19 Nov 2007 16:38 GMT >On Mon, 19 Nov 2007 01:51:11 -0500, George Neuner ><gneuner2/@/comcast.net> wrote, quoted or indirectly quoted someone [quoted text clipped - 7 lines] > >If they are, convert each letter to 3 bits, Two bits, there are only four letters.
rossum
>then pack them 21 to a >long, wasting the sign bit. Roedy Green - 21 Nov 2007 03:49 GMT >>If they are, convert each letter to 3 bits, >Two bits, there are only four letters. I presume you would have some meta information embedded as well, e.g. names and locations of genes. You need an extra bit to escape. You could also handle meta information by encoding lengths in front of 2-bit encoded nucleotide sequences.
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Eric Sosman - 19 Nov 2007 14:44 GMT >>>> Bonus question for OP: what is the size of data sets and how are they >>>> used? Especially, where are they stored? [quoted text clipped - 21 lines] > compression is typically very good (80+%), so it makes sense to not > worry about it and just pick a convenient sized buffer to work with. If you're right, the alphabet is so small that I'd question the need for the UTF-8 conversion. DEFLATE will quickly learn that every other byte is a zero, and will compress them very well.
 Signature Eric Sosman esosman@ieee-dot-org.invalid
Chris - 18 Nov 2007 22:28 GMT >> What's the fastest way to compress/decompress text? > > If you're really interested in "the fastest way" to the > exclusion of all other concerns, then don't compress at all. > Bingo! Problem solved! Um, actually no. When you're dealing with really large datasets it's generally faster to compress than not to compress. The reason is that compressed data requires less disk I/O.
Not always, of course; really processor-intensive compression schemes can make you processor-bound. But in general, that's the case.
Yup, we've benchmarked it.
> You might be happier with a compression scheme that did > a little better at reducing the size of the data, but now you [quoted text clipped - 3 lines] > method but emitted fifteen percent more data, would you take it > or reject it? Yes, but I don't think that knowing that is essential to answering the question. The question is really about finding better tradeoffs than we can get with char conversion + zip conversion.
Robert Klemme - 21 Nov 2007 20:57 GMT >>> What's the fastest way to compress/decompress text? >> [quoted text clipped - 22 lines] > question. The question is really about finding better tradeoffs than we > can get with char conversion + zip conversion. If you have strings already then serialization is one form to convert them into binary data (which can be compressed with Java's standard lib). If you can get rid of UTF-8 compression this might be an alternative.
robert
Roedy Green - 18 Nov 2007 20:41 GMT >What's the fastest way to compress/decompress text? You can try GZIP. See http://mindprod.com/applet/fileio.html for sample code. It is simpler. It might be faster.
Note that zip has a compression number where you can trade off speed for compression.
If you have a limited vocabulary, you might try just looking up words and converting to 16-bit index numbers.
Each number is a word + a trailing space.
see http://mindprod.com/project/supercompressor.html
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Arne Vajhøj - 18 Nov 2007 21:25 GMT > On Sun, 18 Nov 2007 13:44:44 -0600, Chris <spam_me_not@goaway.com> >> What's the fastest way to compress/decompress text? Not quoted: #We're currently converting char arrays to byte arrays using our own #UTF-8 conversion code, and then compressing the bytes using #java.util.zip.
> You can try GZIP. See http://mindprod.com/applet/fileio.html > for sample code. It is simpler. It might be faster. ZIP and GZIP uses the same basic algorithm - the difference is just in the packaging where zip supports multiple files and store file meta data while gzip does not.
Arne
Mark Rafn - 19 Nov 2007 03:54 GMT >What's the fastest way to compress/decompress text? As others have said, this is going to vary a lot based on exact requirements.
>We're doing that over really large datasets in our app. We're currently >converting char arrays to byte arrays using our own UTF-8 conversion >code, and then compressing the bytes using java.util.zip. The code is >pretty old. >I don't like this two-step process, and the profiler shows that this is >a bottleneck in our app. I'd start by benchmarking standard Java encoding and compression against your two-step process. This STILL buffers stuff internally, so I wouldn't expect huge gains unless your home-grown code is broken somehow.
Writer w = new OutputStreamWriter(new GZipOutputStream(real_out), "UTF-8");
This will at least get you a baseline, and it'll be fairly easy to use other DeflaterOutputStream implementations.
>Is anyone aware of any code that compresses chars directly? Perhaps a >third-party library that does it faster? Keep in mind that UTF-8 is actually a compression for many datasets. It's mildly expensive, but it's also a 50% size reduction from the 16-bit internal representation if you don't have many non-ascii characters.
It's certainly worth benchmarking UTF-16 encoding on your data. This should be much cheaper (AFAIK, all common JVMs use UTF-16 as their internal storage, so UTF-16 encoding is a no-op). But it'll give you a lot more bytes to compress, which may or may not wash out in the compression mechanism and copying from buffer to buffer. -- Mark Rafn dagon@dagon.net <http://www.dagon.net/>
Roedy Green - 19 Nov 2007 08:50 GMT >This will at least get you a baseline, and it'll be fairly easy to use other >DeflaterOutputStream implementations. Google points out:
http://teatrove.sourceforge.net/javadoc/com/go/trove/util/DeflaterOutputStream.html
 Signature Roedy Green Canadian Mind Products The Java Glossary http://mindprod.com
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|