Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / November 2007

Tip: Looking for answers? Try searching our database.

Java text compression

Thread view: 
Chris - 18 Nov 2007 19:44 GMT
What's the fastest way to compress/decompress text?

We're doing that over really large datasets in our app. We're currently
converting char arrays to byte arrays using our own UTF-8 conversion
code, and then compressing the bytes using java.util.zip. The code is
pretty old.

I don't like this two-step process, and the profiler shows that this is
a bottleneck in our app.

Is anyone aware of any code that compresses chars directly? Perhaps a
third-party library that does it faster?

In our particular situation, decompression speed is a lot more important
than compression speed.
Joshua Cranmer - 18 Nov 2007 20:13 GMT
> What's the fastest way to compress/decompress text?

I know of a compression method that can compress in constant time: it
replaces the entire text with a single `0' (hey, it's lossy!). To what
degree of compression do you need?

> We're doing that over really large datasets in our app. We're currently
> converting char arrays to byte arrays using our own UTF-8 conversion
[quoted text clipped - 3 lines]
> I don't like this two-step process, and the profiler shows that this is
> a bottleneck in our app.

Questions:
1. From whence are the char arrays coming?
2. Where are the byte arrays going?
3. Which part of the process is the bottleneck? The UTF-8 conversion, or
the compression?

> Is anyone aware of any code that compresses chars directly? Perhaps a
> third-party library that does it faster?
>
> In our particular situation, decompression speed is a lot more important
> than compression speed.

Signature

Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth

Chris - 18 Nov 2007 22:19 GMT
>> What's the fastest way to compress/decompress text?
>
> I know of a compression method that can compress in constant time: it
> replaces the entire text with a single `0' (hey, it's lossy!). To what
> degree of compression do you need?

As with all compression apps, as much as possible. It's hard to pick a
number until you know the available tradeoffs. We already know the
tradeoffs with java.util.zip, so obviously we're looking for something
faster.

> Questions:
> 1. From whence are the char arrays coming?

Memory. They get streamed from an external source, compressed, and
written to disk as they come in.

> 2. Where are the byte arrays going?

Also memory, although ultimately to the UI.

> 3. Which part of the process is the bottleneck? The UTF-8 conversion, or
> the compression?

Both, though the UTF-8 is the bigger burden.
Eric Sosman - 18 Nov 2007 20:16 GMT
> What's the fastest way to compress/decompress text?

    If you're really interested in "the fastest way" to the
exclusion of all other concerns, then don't compress at all.
Bingo!  Problem solved!

    You might be happier with a compression scheme that did
a little better at reducing the size of the data, but now you
can't get a sensible answer until you describe the trade-offs
you're willing to make.  For example, if you were offered a
compression scheme that ran ten percent faster than your current
method but emitted fifteen percent more data, would you take it
or reject it?

> We're doing that over really large datasets in our app. We're currently
> converting char arrays to byte arrays using our own UTF-8 conversion
[quoted text clipped - 6 lines]
> Is anyone aware of any code that compresses chars directly? Perhaps a
> third-party library that does it faster?

    How badly do you need your own idiosyncratic UTF-8 conversion?
If you can use standard methods, consider wrapping the compressed
streams, somewhat like

    Writer w = new OutputStreamWriter(
       new GZIPOutputStream(...));
    w.write("Hello, world!");

    BufferedReader r = new BufferedReader(
       new InputStreamReader(
           new GZIPInputStream(...)));
    String s = r.readLine();

    You'll have to make your own assessment of the speeds and the
degree of compression.
   
> In our particular situation, decompression speed is a lot more important
> than compression speed.

    "You'll have to make your own assessment ..."

Signature

Eric Sosman
esosman@ieee-dot-org.invalid

Robert Klemme - 18 Nov 2007 21:36 GMT
>> What's the fastest way to compress/decompress text?
>
[quoted text clipped - 9 lines]
> method but emitted fifteen percent more data, would you take it
> or reject it?

Bonus question for OP: what is the size of data sets and how are they
used?  Especially, where are they stored?

>> We're doing that over really large datasets in our app. We're
>> currently converting char arrays to byte arrays using our own UTF-8
[quoted text clipped - 13 lines]
>     Writer w = new OutputStreamWriter(
>         new GZIPOutputStream(...));

Minor detail: The encoding is missing, so in this case it would rather be

new OutputStreamWriter(new GZIPOutputstream(...), "UTF-8")

>     w.write("Hello, world!");
>
[quoted text clipped - 5 lines]
>     You'll have to make your own assessment of the speeds and the
> degree of compression.

This is the solution I was going to suggest as well.  If the custom
encoding yields the same results as Java's built in UTF-8 then I would
immediately switch to this approach of stacked streams.  If the custom
encoding yields slightly different results I am sure you can plug in the
custom encoding into the standard Java io and nio classes with a little
effort.

If data is to be stored in a BLOB via JDBC you can even extend this
approach to directly stream into the database.

>> In our particular situation, decompression speed is a lot more
>> important than compression speed.
>
>     "You'll have to make your own assessment ..."

Decompression is generally much faster than compression.  I believe
there is not much difference in decompression speed when decompressing a
GZIP stream that was compressed with highest and lowest level.  If you
dig a bit into compression theory then it's pretty obvious that finding
a small compressed representation is significantly harder than
converting a compressed data set back.

Kind regards

    robert
Chris - 18 Nov 2007 22:35 GMT
> Bonus question for OP: what is the size of data sets and how are they
> used?  Especially, where are they stored?

Multi-terabyte sized, split across multiple machines. On a single
machine, generally not more than a few hundred Gb. One or two disks per
machine, SATA, no RAID.

At compression time, the data is streamed from an external source,
transformed in memory, and written to disk.

At decompression time, the app seeks to the particular block of text of
interest and decompresses it. Seek time dominates decompression time,
*except* when we do heavy caching, in which case the decompression
becomes the bottleneck. Storing the decompressed text in memory takes up
too much space. Has to be cached in compressed form.
Eric Sosman - 18 Nov 2007 23:30 GMT
>> Bonus question for OP: what is the size of data sets and how are they
>> used?  Especially, where are they stored?
[quoted text clipped - 11 lines]
> becomes the bottleneck. Storing the decompressed text in memory takes up
> too much space. Has to be cached in compressed form.

    Cutting the text into blocks usually makes the compression
less effective.  Most compression schemes nowadays (and I believe
DEFLATE is among them) are adaptive, meaning that they adjust to
the characteristics of the data stream as they process it.  Thus,
they compress relatively poorly at first, then improve as they
learn more about the statistical profile of the data.

    Implications: (1) You'll get better compression if you can
keep the blocks "fairly long."  (1a) You might make the blocks
"long" by concatenating multiple sub-blocks, at the expense of
needing to decompress from the start of a block even if you only
need the sub-block at its end.  (2) If the blocks simply must be
small, you probably shouldn't waste effort on BEST_COMPRESSION.

    Some interesting experiments are in order.

Signature

Eric Sosman
esosman@ieee-dot-org.invalid

George Neuner - 19 Nov 2007 06:51 GMT
>>> Bonus question for OP: what is the size of data sets and how are they
>>> used?  Especially, where are they stored?
>>
>> Multi-terabyte sized, split across multiple machines. On a single
>> machine, generally not more than a few hundred Gb. One or two disks per
>> machine, SATA, no RAID.

This sounds like it might be DNA sequences.

>> At compression time, the data is streamed from an external source,
>> transformed in memory, and written to disk.
[quoted text clipped - 11 lines]
>they compress relatively poorly at first, then improve as they
>learn more about the statistical profile of the data.

Most LZ based implementations (including DEFLATE) limit codes to
16-bits (I've heard of 32-bit LZ, but I've never seen it).
Compression studies have shown that, on average, the 16-bit code
dictionary will be filled after processing 200KB of input.  

If the remainder of the input is characteristically similar to the
part already encoded, the full dictionary will compress the rest of
the input pretty well.  But most input varies as it goes, sometimes
rapidly and drastically, so it does make sense to segment the input to
take advantage of the variation.

>     Implications: (1) You'll get better compression if you can
>keep the blocks "fairly long."  (1a) You might make the blocks
[quoted text clipped - 4 lines]
>
>     Some interesting experiments are in order.

Unfortunately, the best segment length depends on the input.  For
general text I would probably use relatively small blocks, 16-64KB
depending on the input - the more variation in the input, the smaller
the blocks need to be to take advantage of it.

If we are talking about DNA sequences, then I would probably go for
256KB - once the base nucleotides and amino acid sequences are in the
dictionary (and you can guarantee this by preloading them),
compression is typically very good (80+%), so it makes sense to not
worry about it and just pick a convenient sized buffer to work with.

George
--
for email reply remove "/" from address
rossum - 19 Nov 2007 09:59 GMT
>This sounds like it might be DNA sequences.
If it is DNA sequences, then you might be able to use something like
Base64 for initial coding:

A -> 00
C -> 01
G -> 10
T -> 11

AAA -> 000000
...
ACG -> 000110
...
TTT -> 111111

Those six bit codes are a straight match for Base64 coding.  That will
compress three bytes "AAC" into one "B" (000001 -> B in Base64).

rossum
Andrew Thompson - 19 Nov 2007 11:14 GMT
>>This sounds like it might be DNA sequences.
>If it is DNA sequences, ..

Sounds more like a heap of speculation over a 'cagey' problem
specification.

To clarify the 'cagey' aspect, perhaps the OP can indeed express
what the ulitmate point of this exercise is, rather than doling out
'precious little tid-bits'  that supposedly represent the actual problem.

(Ultimately, the OP's own posts identify that they are entirely
unaware of the subtleties of the problems they face, and until
they fill in more detail, it really constitutes an 'interesting but
inefficient use of bandwidth' to speculate further.)

Signature

Andrew Thompson
http://www.physci.org/

Chris - 20 Nov 2007 03:55 GMT
>>> This sounds like it might be DNA sequences.
>> If it is DNA sequences, ..
[quoted text clipped - 10 lines]
> they fill in more detail, it really constitutes an 'interesting but
> inefficient use of bandwidth' to speculate further.)

I started to write a long, irritable response to this, but I just don't
have time. Maybe I'll start a new thread later on twerps whose snotty,
fatuous responses waste bandwidth.

I'll say only this:

1. The problem is fully specified. Text compression is a known problem,
and I just asked if anyone knew of a Java library that had better
tradeoffs than UTF-8 + zip.

2. Text means text, not DNA. Written words.

3. I'm fully aware of the subtleties of this particular problem space,
and there is nothing in any of the posts so far that I haven't
considered and already benchmarked. (Including block sizes. For this
app, ~10k is optimal).

4. You don't need to know about the environment or the rest of the app,
because I'm not asking for a damn consultant.

5. Asking "how much compression I want" is just stupid.

In short, a narrow question is an invitation for a narrow answer, not an
invitation for you to tell me how you think I should write this app.
Eric Sosman - 20 Nov 2007 04:15 GMT
> [...]
> 1. The problem is fully specified. Text compression is a known problem,
> and I just asked if anyone knew of a Java library that had better
> tradeoffs than UTF-8 + zip.
>
> 2. Text means text, not DNA. Written words.

    Elsethread you've explained that the compressed stream
gets read back into a companion program and decompressed there;
this suggests that it doesn't need to be exchanged with "foreign"
programs.  In which case, I ask again: Does UTF-8 encoding buy
you enough additional compression to justify its expense?  How
bad would things be if you just handed 16-bit chars to the
compressor with no "intelligence" whatsoever?

> 5. Asking "how much compression I want" is just stupid.

    Well, you asked about compression speed.  Other things
being equal, faster compressors compress less well and "looser"
compressors compress faster, so the question of "how much" must
eventually arise when you weigh alternatives.

Signature

Eric Sosman
esosman@ieee-dot-org.invalid

Chris - 20 Nov 2007 04:51 GMT
>> [...]
>> 1. The problem is fully specified. Text compression is a known
[quoted text clipped - 10 lines]
> bad would things be if you just handed 16-bit chars to the
> compressor with no "intelligence" whatsoever?

I'd like to try that. Unfortunately, java.util.zip.Deflater accepts only
byte arrays, not char arrays. I suppose it might be faster to copy the
chars to 2-byte sequences and compress, rather than run the UTF-8
compressor. An extra step, but worth a try.

>> 5. Asking "how much compression I want" is just stupid.
>
>     Well, you asked about compression speed.  Other things
> being equal, faster compressors compress less well and "looser"
> compressors compress faster, so the question of "how much" must
> eventually arise when you weigh alternatives.

Of course. It just reminded of walking into a store and having the clerk
ask "how much do you want to pay?" The right answer is, "show me the
merchandise and I'll figure out what the tradeoffs are on my own".
Andrew Thompson - 20 Nov 2007 05:02 GMT
>>> [...]
...
>Of course. It just reminded of walking into a store and having the clerk
>ask "how much do you want to pay?"  ...

That might be an appropriate comparison if this were
a help desk.  It is not a help desk.  (Though you seem
to be treating the responders as though they were your
own personal servants - so perhaps you are confused
about the differences between 'help-desk' and 'usenet').

Signature

Andrew Thompson
http://www.physci.org/

Roger Lindsjö - 20 Nov 2007 11:16 GMT
>>> 5. Asking "how much compression I want" is just stupid.
>>
[quoted text clipped - 6 lines]
> ask "how much do you want to pay?" The right answer is, "show me the
> merchandise and I'll figure out what the tradeoffs are on my own".

But don't you think it is an appropriate question? At least to get in
the ballpark? If you answered that you have regular text and you need
~10X compression, then people can suggest algorithms that has been shown
to give that compression. With no specification you will get suggestions
that are very fast (no compression was suggested) to methods that will
give very high compression rates but run very slowly.

Similar to your analogy, the electronics store where I have been looking
at a new TV has TVs ranging from about $100 to $120000. I doubt that
when someone walks and says "I want a new TV" that the first suggestion
is to look at the very expensive on (I heard a rumor that 1 was sold in
Sweden so far). Size and price are common to narrow down the number of
choices.

//Roger Lindsjö
Lew - 20 Nov 2007 13:59 GMT
Chris wrote:
>> Of course. It just reminded of walking into a store and having the
>> clerk ask "how much do you want to pay?" The right answer is, "show me
>> the merchandise and I'll figure out what the tradeoffs are on my own".

> But don't you think it is an appropriate question? At least to get in
> the ballpark?  ...
[quoted text clipped - 5 lines]
> Sweden so far). Size and price are common to narrow down the number of
> choices.

Car dealers, clothing stores, flower shops - all routinely ask me "how much do
I want to spend?"

Signature

Lew

Lasse Reichstein Nielsen - 20 Nov 2007 17:30 GMT
> Car dealers, clothing stores, flower shops - all routinely ask me "how
> much do I want to spend?"

... and this is where you lie! :)
/L
Signature

Lasse Reichstein Nielsen  -  lrn@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
 'Faith without judgement merely degrades the spirit divine.'

Lew - 21 Nov 2007 01:50 GMT
>> Car dealers, clothing stores, flower shops - all routinely ask me "how
>> much do I want to spend?"
>
> .... and this is where you lie! :)

Which way?  Does one claim to be a big spender or a skinflint?

Signature

Lew

Lasse Reichstein Nielsen - 21 Nov 2007 05:59 GMT
>>> Car dealers, clothing stores, flower shops - all routinely ask me "how
>>> much do I want to spend?"
>> .... and this is where you lie! :)
>
> Which way?  Does one claim to be a big spender or a skinflint?

OK, I guess it's mainly car dealers where you don't want to tell them
your limit. Car shopping is a negotiation, and it's harder to press
the price if you have already told them what you are willing to
spend. I'd say something at least 20% lower than my real limit.
I.e., neither a big spender or a skinflint, but someone more restricted
in your spending than you actually are.
But then again, I'm not a great negotiatior, so don't necessarily
trust my opinion :)

Flower shops are ok.

/L
Signature

Lasse Reichstein Nielsen  -  lrn@hotpop.com
DHTML Death Colors: <URL:http://www.infimum.dk/HTML/rasterTriangleDOM.html>
 'Faith without judgement merely degrades the spirit divine.'

Eric Sosman - 20 Nov 2007 13:37 GMT
>>> [...]
>>> 1. The problem is fully specified. Text compression is a known
[quoted text clipped - 15 lines]
> chars to 2-byte sequences and compress, rather than run the UTF-8
> compressor. An extra step, but worth a try.

    Might it instead be the removal of a step?  You've also
mentioned that the data are "streamed from an external source;"
in what form do they arrive?  Unless you're using something
like RMI they probably don't arrive as full-fledged String
objects, but are converted to Strings from some more primitive
form -- like, perhaps, streams of bytes?

>>> 5. Asking "how much compression I want" is just stupid.
>>
[quoted text clipped - 6 lines]
> ask "how much do you want to pay?" The right answer is, "show me the
> merchandise and I'll figure out what the tradeoffs are on my own".

    ... and to push the analogy perhaps just a little bit too
far, the kindly clerk dumps the 600-page inventory list in your
lap and walks away.  That is, a search is more efficient if the
searcher is an active participant instead of a filter-feeder.

Signature

Eric Sosman
esosman@ieee-dot-org.invalid

Roedy Green - 21 Nov 2007 03:41 GMT
On Mon, 19 Nov 2007 23:15:12 -0500, Eric Sosman
<esosman@ieee-dot-org.invalid> wrote, quoted or indirectly quoted
someone who said :

>     Well, you asked about compression speed.  Other things
>being equal, faster compressors compress less well and "looser"
>compressors compress faster, so the question of "how much" must
>eventually arise when you weigh alternatives.

The proper answer would have been
use X if you want fast but not so hot compression
use Y if you want slow but deep compression.

The various answers seemed to be disguising the fact nobody had any
algorithms other that the minor modification of GZIP to offer. It gave
me the impression there were one-upmanship games in play.  I can see
why Chris felt so pissed to be jerked around that way.

Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Eric Sosman - 21 Nov 2007 13:54 GMT
> On Mon, 19 Nov 2007 23:15:12 -0500, Eric Sosman
> <esosman@ieee-dot-org.invalid> wrote, quoted or indirectly quoted
[quoted text clipped - 8 lines]
> use X if you want fast but not so hot compression
> use Y if you want slow but deep compression.

    Quite aside from the point that enumerating the set of
possible compression techniques may be significantly harder than
counting to two, I don't think that's the "proper" answer at all.
The O.P. does not have a compression problem, but a systemic
problem -- even in the original posting he mentioned the custom-
written UTF-8 conversion as part of the bottleneck.  The "proper"
answer is to take a step back and get a wider view of the context.
Global optimization, not peephole optimization.

    If someone says he's spending a hundred dollars a week putting
motor oil in his car and asks for advice on how to get inexpensive
oil, the proper answer is not to help him shave a penny here and
a penny there but to find out what's wrong with the car!

> The various answers seemed to be disguising the fact nobody had any
> algorithms other that the minor modification of GZIP to offer. It gave
> me the impression there were one-upmanship games in play.  I can see
> why Chris felt so pissed to be jerked around that way.

    I'll admit to directing some sarcasm at Chris that perhaps
an exceptionally thin-skinned person might have seen as ridicule,
but I deny jerking him around.

Signature

Eric Sosman
esosman@ieee-dot-org.invalid

Lew - 21 Nov 2007 14:02 GMT
>> The various answers seemed to be disguising the fact nobody had any
>> algorithms other that the minor modification of GZIP to offer. It gave
[quoted text clipped - 4 lines]
> an exceptionally thin-skinned person might have seen as ridicule,
> but I deny jerking him around.

There seems to be a nawful lot of thin-skinnedness around lately.  Every FAQ
I've read, every essay on how best to use it, points out (almost pridefully)
that Usenet discussions tend toward lurid prose, even (Gods forfend!) an
occasional touch of sarcasm.  People were warned (or would have been, if
they'd RTFM).

So people: Get over it.  Get on with it.  Did the information help?  Good.
Did it not?  Ask more questions.  Did you feel offended by some turn of
rhetoric?  Awwwwww.

Signature

Lew

Chris - 21 Nov 2007 23:33 GMT
> On Mon, 19 Nov 2007 23:15:12 -0500, Eric Sosman
> <esosman@ieee-dot-org.invalid> wrote, quoted or indirectly quoted
[quoted text clipped - 13 lines]
> me the impression there were one-upmanship games in play.  I can see
> why Chris felt so pissed to be jerked around that way.

Thank you, thank you. I could not have said it better.

I've been thinking for some time about posting on this topic. The
problem is that many have a tendency to offer broad programming advice,
rather than deal with the poster's question directly.

The problem with this (aside from the irritation factor) is that it
tends to crowd out substantive advice. When a thread has a number of
answers, many participants tend to skip it, assuming that the question
has been answered.

Whenever a poster asks a "How do I make X faster?" question, the
non-substantive answers come in droves. Usually they take the form "It
depends" (without any discussion of what it depends on), or "Run a
profiler/benchmark it first" (without any knowledge about whether the
poster has already done that), or just "You don't need to make it fast,
just make it work first and optimize later" (which assumes that the
poster is so foolish as to have asked the question without first having
seen slow behavior in his program).

These sorts of responses are condescending, and really very damaging
when they crowd out real answers. I'm sure most of us in the forum have
real work that needs to get done, real deadlines, and our income or jobs
depend on the insights we get in this and other forums.

We really ought to apply a bit of social pressure to get the egotists
and trolls to stuff it.
Eric Sosman - 22 Nov 2007 02:09 GMT
> [...]
> We really ought to apply a bit of social pressure to get the egotists
> and trolls to stuff it.

    Good idea.  Let us also apply some pressure to those who
ask strangers for help and then complain when they receive it.

Signature

Eric Sosman
esosman@ieee-dot-org.invalid

Roedy Green - 22 Nov 2007 07:05 GMT
On Wed, 21 Nov 2007 21:09:13 -0500, Eric Sosman
<esosman@ieee-dot-org.invalid> wrote, quoted or indirectly quoted
someone who said :

>     Good idea.  Let us also apply some pressure to those who
>ask strangers for help and then complain when they receive it.

People tend to be much more candid on the net than they would be in
person.  That's just the way it is.  There is no reason to read a
specific-to-you insult into it.

The late Ken Keyes put it this way:
"You add suffering to the world just as much when you take offense as
when you give offense."

On the other hand, I think several answerers routinely take a rather
sadistic joy is putting down newbies and questioners, playing a sort
of "Mother May I" game to humiliate them into begging in precisely the
correct form, then giving them only a crumb.

This whole compression thing with Chris smells of such a tease because
nobody appeared to even know of a potential algorithm.

I think it quite reasonable to request someone produce an SSCCE to
help diagnose a problem, but it is not OK to chastise someone new to
the group for not already having done so.

Imagine a Monty Pythonesque sketch along these lines:

customer: I would like a beer?

barkeep: would that be a chilled or room temperature?

customer: chilled

barkeep: you stupid sod. Wrong answer.  Nobody whose anyone drinks
chilled beer. Where do you think you are, Texas?

customer: ok, room temperature.

barkeep: that's better.  What brand?

customer: Rickard's Red?

barkeep: don't have it.  

customer: Guinness, surely you have a Guinness ale.

barkeep: nope.  Try again.

... etc. etc.

customer: what would you recommend?

barkeep: It depends. What's your astrological sign?

... etc. etc.

barkeep: what a silly twit you are!  We don't have a liquor licence.
You should have read the fine print on the back of the napkins. We
don't serve any kind of alcohol here, you stupid, naive, newbie,
drunken lout. What do you think this is, a pub? Get out before I call
the Raging Grannies.

(not to be taken too literally)
Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Andrew Thompson - 22 Nov 2007 10:34 GMT
...
>I think it quite reasonable to request someone produce an SSCCE to
>help diagnose a problem, but it is not OK to chastise someone new to
>the group for not already having done so.

As someone that has a Google alert for the term 'SSCCE',
I can assure you - that does not happen.

Signature

Andrew Thompson
http://www.physci.org/

Andrew Thompson - 22 Nov 2007 10:49 GMT
>...
>>I think it quite reasonable to request someone produce an SSCCE to
[quoted text clipped - 3 lines]
>As someone that has a Google alert for the term 'SSCCE',
>I can assure you - that does not happen.

Or at least, it does not happen on *usenet*, and if I
see any so much as questionable references to
produce an SSCCE, I will usually reply and say so.

I *have* seen some rather odd references to SSCCEs on
*Sun* forums, but I am not in a position to either get regular
updates to the term, nor respond through those (painfully
slow) forums.

I do not moniter any other 'web based' forums and am
in no position to comment on them.

Signature

Andrew Thompson
http://www.physci.org/

Lew - 22 Nov 2007 15:03 GMT
>> [...]
>> We really ought to apply a bit of social pressure to get the egotists
>> and trolls to stuff it.
>
>     Good idea.  Let us also apply some pressure to those who
> ask strangers for help and then complain when they receive it.

It is treatment of Usenet as one's personal help desk that is the height of
arrogance.

This is a discussion group.  Chris seems to think that we aren't allowed to
discuss his topics, but have to jump to his finger snap to answer his problem,
and only that, and that any other reaction is egotism and trollishness.  Well,
la-di-effing-dah.

If you offer a topic here, it's for discussion.  If we didn't serve Your
Precious Highness exactly as desired, that's the risk with voluntary
participants who ourselves are trying to learn.

So get off your high horse, Chris.

Signature

Lew

Robert Klemme - 27 Nov 2007 19:49 GMT
> Whenever a poster asks a "How do I make X faster?" question, the
> non-substantive answers come in droves. Usually they take the form "It
> depends" (without any discussion of what it depends on),

I find this one a bit odd.  When I do not get enough information to
provide a reasonable answer, I rather say "it depends" and ask for more
details about the problem to solve.  Sometimes there are a few "depends"
that can be pointed at and sometimes the problem is so underspecified
that it's much more reasonable to poke for more information instead of
starting to list all the alternatives.

The alternative might be to just not respond.  But that way an OP won't
know whether people do not answer because they have no idea or whether
they need more input.  Sometimes one is so absorbed from the problem at
hand that one does not realize that a newsgroup posting contained too
little information.  In those cases it can be helpful to be asked to
provide more input.

Kind regards

    robert
George Neuner - 20 Nov 2007 17:58 GMT
>2. Text means text, not DNA. Written words.
>
[quoted text clipped - 5 lines]
>4. You don't need to know about the environment or the rest of the app,
>because I'm not asking for a damn consultant.

There's no need to get nasty. Just in general you'll get better
answers when people have a general idea of where you're headed.
Storing gigabytes of "text" is not a typical problem other than for
search engines, data mining, or DNA research.  

>5. Asking "how much compression I want" is just stupid.

No, this question is FAR from stupid.  For all you knew, the
alternatives to "UTF-8 + zip" may have been much worse for your
application.  

This forum is open to all and people frequently tackle things that are
way beyond their capabilities.  You are now claiming to have the
ability to evaluate suggestions yourself, but we had no way of knowing
that until now - you gave us no clues as to your own capabilities and
precious little information about what you are attempting.

>In short, a narrow question is an invitation for a narrow answer, not an
>invitation for you to tell me how you think I should write this app.

A narrow question is frequently a sign of a newbie in over his head.
Beyond that, answers to questions frequently create new questions ...
if not from the original poster, then from a reader seeking more
information about a response.  

If you're looking for a simple answer, go find yourself a programming
oracle ... on Usenet any halfway interesting subject (of which
compression is one) is likely to invite a debate.

George
--
for email reply remove "/" from address
Roedy Green - 21 Nov 2007 03:44 GMT
On Tue, 20 Nov 2007 12:58:38 -0500, George Neuner
<gneuner2/@/comcast.net> wrote, quoted or indirectly quoted someone
who said :

>Storing gigabytes of "text" is not a typical problem other than for
>search engines, data mining, or DNA research.  

The more you know about the text the more options you have for
compression.  For example if the text used some limited character set
or some limited word vocabulary specialised compression is possible.

"Text" can vary from instrument readings in ASCII to armoured
encrypted data.
Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Roedy Green - 19 Nov 2007 11:41 GMT
On Mon, 19 Nov 2007 01:51:11 -0500, George Neuner
<gneuner2/@/comcast.net> wrote, quoted or indirectly quoted someone
who said :

>>> Multi-terabyte sized, split across multiple machines. On a single
>>> machine, generally not more than a few hundred Gb. One or two disks per
>>> machine, SATA, no RAID.
>
>This sounds like it might be DNA sequences.

If they are, convert each letter to 3 bits, then pack them 21 to a
long, wasting the sign bit.
Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

rossum - 19 Nov 2007 16:38 GMT
>On Mon, 19 Nov 2007 01:51:11 -0500, George Neuner
><gneuner2/@/comcast.net> wrote, quoted or indirectly quoted someone
[quoted text clipped - 7 lines]
>
>If they are, convert each letter to 3 bits,
Two bits, there are only four letters.

rossum

>then pack them 21 to a
>long, wasting the sign bit.
Roedy Green - 21 Nov 2007 03:49 GMT
>>If they are, convert each letter to 3 bits,
>Two bits, there are only four letters.

I presume you would have some meta information embedded as well, e.g.
names and locations of genes.  You need an extra bit to escape. You
could also handle meta information by encoding lengths in front of
2-bit encoded nucleotide sequences.

Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Eric Sosman - 19 Nov 2007 14:44 GMT
>>>> Bonus question for OP: what is the size of data sets and how are they
>>>> used?  Especially, where are they stored?
[quoted text clipped - 21 lines]
> compression is typically very good (80+%), so it makes sense to not
> worry about it and just pick a convenient sized buffer to work with.

    If you're right, the alphabet is so small that I'd question
the need for the UTF-8 conversion.  DEFLATE will quickly learn
that every other byte is a zero, and will compress them very well.

Signature

Eric Sosman
esosman@ieee-dot-org.invalid

Chris - 18 Nov 2007 22:28 GMT
>> What's the fastest way to compress/decompress text?
>
>     If you're really interested in "the fastest way" to the
> exclusion of all other concerns, then don't compress at all.
> Bingo!  Problem solved!

Um, actually no. When you're dealing with really large datasets it's
generally faster to compress than not to compress. The reason is that
compressed data requires less disk I/O.

Not always, of course; really processor-intensive compression schemes
can make you processor-bound. But in general, that's the case.

Yup, we've benchmarked it.

>     You might be happier with a compression scheme that did
> a little better at reducing the size of the data, but now you
[quoted text clipped - 3 lines]
> method but emitted fifteen percent more data, would you take it
> or reject it?

Yes, but I don't think that knowing that is essential to answering the
question. The question is really about finding better tradeoffs than we
can get with char conversion + zip conversion.
Robert Klemme - 21 Nov 2007 20:57 GMT
>>> What's the fastest way to compress/decompress text?
>>
[quoted text clipped - 22 lines]
> question. The question is really about finding better tradeoffs than we
> can get with char conversion + zip conversion.

If you have strings already then serialization is one form to convert
them into binary data (which can be compressed with Java's standard
lib).  If you can get rid of UTF-8 compression this might be an alternative.

    robert
Roedy Green - 18 Nov 2007 20:41 GMT
>What's the fastest way to compress/decompress text?

You can try GZIP.  See http://mindprod.com/applet/fileio.html
for sample code.  It is simpler. It might be faster.

Note that zip has a compression number where you can trade off speed
for compression.

If you have a limited vocabulary, you might try just looking up words
and converting to 16-bit index numbers.

Each number is a word + a trailing space.

see http://mindprod.com/project/supercompressor.html
Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Arne Vajhøj - 18 Nov 2007 21:25 GMT
> On Sun, 18 Nov 2007 13:44:44 -0600, Chris <spam_me_not@goaway.com>
>> What's the fastest way to compress/decompress text?

Not quoted:
#We're currently converting char arrays to byte arrays using our own
#UTF-8 conversion code, and then compressing the bytes using
#java.util.zip.

> You can try GZIP.  See http://mindprod.com/applet/fileio.html
> for sample code.  It is simpler. It might be faster.

ZIP and GZIP uses the same basic algorithm - the difference
is just in the packaging where zip supports multiple files and
store file meta data while gzip does not.

Arne
Mark Rafn - 19 Nov 2007 03:54 GMT
>What's the fastest way to compress/decompress text?

As others have said, this is going to vary a lot based on exact requirements.

>We're doing that over really large datasets in our app. We're currently
>converting char arrays to byte arrays using our own UTF-8 conversion
>code, and then compressing the bytes using java.util.zip. The code is
>pretty old.
>I don't like this two-step process, and the profiler shows that this is
>a bottleneck in our app.

I'd start by benchmarking standard Java encoding and compression against your
two-step process.  This STILL buffers stuff internally, so I wouldn't expect
huge gains unless your home-grown code is broken somehow.

Writer w = new OutputStreamWriter(new GZipOutputStream(real_out), "UTF-8");

This will at least get you a baseline, and it'll be fairly easy to use other
DeflaterOutputStream implementations.

>Is anyone aware of any code that compresses chars directly? Perhaps a
>third-party library that does it faster?

Keep in mind that UTF-8 is actually a compression for many datasets.  It's
mildly expensive, but it's also a 50% size reduction from the 16-bit internal
representation if you don't have many non-ascii characters.

It's certainly worth benchmarking UTF-16 encoding on your data.  This should
be much cheaper (AFAIK, all common JVMs use UTF-16 as their internal storage,
so UTF-16 encoding is a no-op).  But it'll give you a lot more bytes to
compress, which may or may not wash out in the compression mechanism and
copying from buffer to buffer.
--
Mark Rafn    dagon@dagon.net    <http://www.dagon.net/>
Roedy Green - 19 Nov 2007 08:50 GMT
>This will at least get you a baseline, and it'll be fairly easy to use other
>DeflaterOutputStream implementations.

Google points out:

http://teatrove.sourceforge.net/javadoc/com/go/trove/util/DeflaterOutputStream.html

Signature

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com



Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.