Java Forum / General / December 2005
compression API available in Java & C++?
Monique Y. Mudama - 06 Dec 2005 23:58 GMT Apologies if this is slightly off-topic ...
I'm looking for a compression algorithm that works better than gzip for small (12 bytes or so) chunks of data. Ideally there would be a free implementation for Java and C++ that we could use for commercial purposes.
Specifically, we are streaming data from a C++ server to a Java applet and would love to use as little bandwidth as possible.
We're not using any form of compression right now, but the belief is that gzip works in blocks, so small chunks of data don't get much benefit. (Anyone disagree?)
 Signature monique
Ask smart questions, get good answers: http://www.catb.org/~esr/faqs/smart-questions.html
Chris Smith - 07 Dec 2005 00:13 GMT > I'm looking for a compression algorithm that works better than gzip > for small (12 bytes or so) chunks of data. Ideally there would be a [quoted text clipped - 3 lines] > Specifically, we are streaming data from a C++ server to a Java applet > and would love to use as little bandwidth as possible. Compressing 12 bytes? That's a tough one. The concern is that the overhard added by any compression algorithm for something like the dictionary is going to be more than 12 bytes in length. I'd be considering the following options:
1. Drop it. There may be very little that you can do.
2. Run length encoding. If you expect this be a reliable hit, then a simple RFE scheme could be devised that only adds a byte for the uncompressable case and helps a little with the compressable case.
3. Out-of-stream dictionary. If you can devise a dictionary of pre- known patterns that are likely to occur frequently in the data, then you could pre-build that dictionary and devise a compression scheme that asume them.
None of the three cases involve using compression libraries, though. You'd have to build this stuff yourself.
 Signature www.designacourse.com The Easiest Way To Train Anyone... Anywhere.
Chris Smith - Lead Software Developer/Technical Trainer MindIQ Corporation
Monique Y. Mudama - 07 Dec 2005 00:39 GMT >> I'm looking for a compression algorithm that works better than gzip >> for small (12 bytes or so) chunks of data. Ideally there would be [quoted text clipped - 3 lines] >> Specifically, we are streaming data from a C++ server to a Java >> applet and would love to use as little bandwidth as possible. Thanks for the response, Chris!
> Compressing 12 bytes? That's a tough one. The concern is that the > overhard added by any compression algorithm for something like the > dictionary is going to be more than 12 bytes in length. Exactly right.
> I'd be considering the following options: > > 1. Drop it. There may be very little that you can do. Yeah, I'm thinking it may be more trouble than it's worth .. but if it *is* possible, it would be a help.
> 2. Run length encoding. If you expect this be a reliable hit, then a > simple RFE scheme could be devised that only adds a byte for the > uncompressable case and helps a little with the compressable case. Was RFE a typo for RLE? I hadn't heard of RLE; I will google around and see what comes up.
> 3. Out-of-stream dictionary. If you can devise a dictionary of pre- > known patterns that are likely to occur frequently in the data, then > you could pre-build that dictionary and devise a compression scheme > that asume them. I would have to think about this. I'm pretty sure the data is only ASCII characters, and of those only the common ones that show up on a US keyboard; is that predictable enough for a dictionary approach?
> None of the three cases involve using compression libraries, though. > You'd have to build this stuff yourself. It may be worth it if the compression is good enough. I haven't yet figured out what "good enough" would mean for this, though.
Thank you for all your suggestions.
 Signature monique
Ask smart questions, get good answers: http://www.catb.org/~esr/faqs/smart-questions.html
Monique Y. Mudama - 07 Dec 2005 01:09 GMT >> 2. Run length encoding. If you expect this be a reliable hit, then >> a simple RFE scheme could be devised that only adds a byte for the [quoted text clipped - 11 lines] > ASCII characters, and of those only the common ones that show up on > a US keyboard; is that predictable enough for a dictionary approach? The more I think about it, the more I think RLE + dictionary may be right for us. For some of the data we're streaming, up to half of the chunk of data is a single character repeated. And we could possibly use a few bits to describe each character we use. I'm not sure the character patterns are predictable enough to describe multiple characters at a time, though.
 Signature monique
Ask smart questions, get good answers: http://www.catb.org/~esr/faqs/smart-questions.html
Bjorn Abelli - 07 Dec 2005 01:27 GMT "Monique Y. Mudama" wrote...
>> 3. Out-of-stream dictionary. If you can devise a dictionary >> of pre-known patterns that are likely to occur frequently in [quoted text clipped - 5 lines] > show up on a US keyboard; is that predictable enough for a > dictionary approach? If the characters are somewhat evenly distributed, I don't think you could gain so much by what I guess Chris meant, as it would mean that the number of "patterns" would be just about the number of keys on the keyboard, hence almost as many as can fit into a byte itself...
Although, there could be some more "compression" made if you're *really* sure that it's only ASCII.
Then you can simply drop the insignificant bit (make each character 7-bits) and pack them into 11 bytes instead of 12. But then you've only gained one byte.
If you can investigate what the data really is comprised of, you could probably make an even better chart to map a kind of dictionary.
Let's say that the data sent is only alphanumerical characters (0-9, a-z and A-Z), which "could" be the case, then you can make your own schema for mapping each character to 6 bits instead of 8. After packing those 12 characters in that schema, you would end up with sending 9 bytes instead of 12.
There would actually be room to map additionally 11 characters... ;-)
// Bjorn A
Monique Y. Mudama - 07 Dec 2005 05:31 GMT > "Monique Y. Mudama" wrote... > [quoted text clipped - 13 lines] > on the keyboard, hence almost as many as can fit into a byte > itself... Yeah, that's kind of what I came to realize earlier tonight ...
> Although, there could be some more "compression" made if you're > *really* sure that it's only ASCII. [quoted text clipped - 15 lines] > There would actually be room to map additionally 11 characters... > ;-) We also have some punctuation to deal with, so we'd probably end up needing the full 6 bits, or maybe even all the way to seven. Bleh!
Thanks to you and everyone else who has made some suggestions. I am going to think about it. It may be that our data just isn't well-suited to compression; maybe I can come up with ways to reorganize the data for compression, but I don't think that would be well received because we already have apps coded to the existing formats.
 Signature monique
Ask smart questions, get good answers: http://www.catb.org/~esr/faqs/smart-questions.html
Mike Schilling - 07 Dec 2005 01:34 GMT > Apologies if this is slightly off-topic ... > > I'm looking for a compression algorithm that works better than gzip > for small (12 bytes or so) chunks of data. Ideally there would be a > free implementation for Java and C++ that we could use for commercial > purposes. Consider the communication overhead of the protocol you're using. You might accomplish more packing multiple chunks into a message than compressing the chunks.
Monique Y. Mudama - 07 Dec 2005 05:28 GMT >> Apologies if this is slightly off-topic ... >> [quoted text clipped - 6 lines] > You might accomplish more packing multiple chunks into a message > than compressing the chunks. The data is live; each chunk needs to be transmitted as soon as the server can do so. Sorry if I forgot to mention a streaming need in my description.
 Signature monique
Ask smart questions, get good answers: http://www.catb.org/~esr/faqs/smart-questions.html
Mike Schilling - 07 Dec 2005 07:22 GMT >>> Apologies if this is slightly off-topic ... >>> [quoted text clipped - 10 lines] > server can do so. Sorry if I forgot to mention a streaming need in > my description. What's the per-packet overhead, then? If it's much more than 12 bytes, you're not going to get much out of compressing the data.
Thomas Hawtin - 07 Dec 2005 18:23 GMT > Consider the communication overhead of the protocol you're using. You might > accomplish more packing multiple chunks into a message than compressing the > chunks. Not just overhead, but minimum packet lengths. ATM (for instance ADSL) packets always have exactly 48-bytes payload (although I don't know how IP is mapped to ATM). Ethernet has a minimum packet size in or for collision detection to work (machines at extremes of a network have to detect collisions while still sending the packet).
(48 bytes may seem a peculiar choice. IIRC, the story is: Europeans wanted a 64-byte packet for better efficiency. Americans wanted 32 bytes as the faster rate of packets would sufficiently low latency that echo suppression logic would not be necessary. 48 bytes was the compromise. It gives poor efficiency and does not remove the need for echo suppression logic.)
Tom Hawtin
 Signature Unemployed English Java programmer http://jroller.com/page/tackline/
Roedy Green - 07 Dec 2005 01:52 GMT On Tue, 6 Dec 2005 16:58:05 -0700, "Monique Y. Mudama" <spam@bounceswoosh.org> wrote, quoted or indirectly quoted someone who said :
>I'm looking for a compression algorithm that works better than gzip >for small (12 bytes or so) chunks of data. Ideally there would be a [quoted text clipped - 7 lines] >that gzip works in blocks, so small chunks of data don't get much >benefit. (Anyone disagree?) the key to this in getting a big sample to compress and studying it and seeing that both ends have access to the same summary info. You won't do much with totally self contained messages.
What do you know about the data? Every additional fact can be exploited in designing custom compression.
It the ideal case, your 12 bytes are one of 100 phrases (commands) and you can replace them with a 1 byte int indexing the message.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Monique Y. Mudama - 07 Dec 2005 05:27 GMT > the key to this in getting a big sample to compress and studying it > and seeing that both ends have access to the same summary info. You > won't do much with totally self contained messages. > > What do you know about the data? Every additional fact can be > exploited in designing custom compression. I think I need to spend more time considering exactly this question. There are actually several "types" of data that can be coming from this source. It may be that to compress effectively, we would have to use a different compression technique for each "type."
> It the ideal case, your 12 bytes are one of 100 phrases (commands) > and you can replace them with a 1 byte int indexing the message. Sadly, no. It's data from an external source, and ...
I recognize that I'm not being very helpful, because I don't want to reveal the full nature of the data I'm dealing with. I also recognize that this means that people have to suggest general approaches rather than being able to tell me exactly what to do, and I apologize for the frustration that may cause.
Anyway, any amount of optimizing of the kind you describe has already been done. I'm looking at ways of compressing data that's already stripped down to codes, etc. But we're using ASCII representations, which could possibly be represented more efficiently in bits. I talked about it with my team lead last night, and we came to the conclusion we'd probably only squeeze it down to 6 bits instead of 8, which probably isn't enough gain.
We have a new team member, and it looks like his first project will be to play with this data and see if he can come up with a useful compression scheme. So I'm still thinking about it, but I'm not on the hook, so to speak.
 Signature monique
Ask smart questions, get good answers: http://www.catb.org/~esr/faqs/smart-questions.html
Knute Johnson - 07 Dec 2005 05:38 GMT > Apologies if this is slightly off-topic ... > [quoted text clipped - 9 lines] > that gzip works in blocks, so small chunks of data don't get much > benefit. (Anyone disagree?) How much data are you actually sending at a time? If it is only 12 bytes and you aren't sending them fast enough to get a lot of them into a TCP packet then compression will gain you nothing. If you are then GZIP ought to work great.
 Signature Knute Johnson email s/nospam/knute/
Hiran Chaudhuri - 07 Dec 2005 10:43 GMT >> I'm looking for a compression algorithm that works better than gzip >> for small (12 bytes or so) chunks of data. Ideally there would be a [quoted text clipped - 4 lines] > packet then compression will gain you nothing. If you are then GZIP ought > to work great. From his other post it looks like they are already discussing the bits representation of their messages. Any compression algorithm can only reduce redundancy, repetitions and patterns. But if their information set is reduced to contain no such structures any more no compression algorithm can squeeze the data any more. Well, you might call it custom compression if you like.
Hiran
Chris Uppal - 07 Dec 2005 13:12 GMT > I'm looking for a compression algorithm that works better than gzip > for small (12 bytes or so) chunks of data. Ideally there would be a > free implementation for Java and C++ that we could use for commercial > purposes. No general-purpose compression scheme is going to be very much help for such short strings. They work by learning from the data, and 12 bytes doesn't give enough input for the compressor to learn much, let alone start using what it has learned.
So you /have/ to use a custom compression scheme. Adding to the suggestions that have already been made:
The gzip compression scheme (as implemented in the Java API) can be pre-trained with representative data (which is known to both the reading and writing applications and therefore does not need to be transmitted). You can easily test how well that would work for you by assembling a file consisting of (say) 8K of real, representative, data and another which is exactly the same except that it has one more code tacked onto the end. The difference in size between to the two files when compressed with gzip -9 will give an indication of the kind of compression available via this approach. I suspect it won't work well, but you never know and it's easy to try out.
Otherwise you will almost certainly have to look into Huffman compression or even (if you are sufficiently desperate) Arithmetic compression (much slower and harder to implement).
One thing to consider is whether the data has any structure. For instance (this is a real example from a previous job) if you desperately need to compress URLs such as found on the web: http://java.sun.com/index.html you can divide them up into separate "regions": http:// java.sun .com/ index .html each of which has distinct frequency characteristics. The first region, the "scheme" can probably be represented in 1 bit in the majority of cases. You are looking at a population that is: http:// -- the overwhelming majority https:// -- quite common ftp:// -- moderately rare <everything else> -- rare So you can represent http:// as one bit, probably use three bits for https:// and ftp://, and more bits than that for other cases. Similarly the ".com" suffix can be recognised as the most common top-level domain, and so can be compressed to one or two bits. The .htm and .html suffixes are common too. You can use the Huffman algorithm to assign near optimal bit-patterns to the various cases. OTOH the 'java.sun' bit is going to be much more "random" so the best you can hope for is a normal text compression scheme, such as character-by-character Huffman compression based of the frequency of ASCII characters in domain names (less the top-level segment).
BTW, if this data is the result of encoding some compound information, then look at the information itself rather than the strings produced by your current code scheme. I.e. don't think of the task as compressing the 12-bytes of data, but of compressing the N-bytes of information that is (currently) expressed by the 12-bytes.
-- chris
mr.vaibhavjain@gmail.com - 07 Dec 2005 15:14 GMT Hello Monique,
If your data is stream based than you can try out Huffmans Compression algorithm. Its an algorithm that converts fixed length blocks to variable length keys that can be used to look up the actual data. The implementation of this algorithm may be trivial but requires some great deal of bit shifts. You can find more info aboute Huffmans Compression algorithm from the web. In your case is RLE fails or does not satisfies you needs than Huffmans compression is the best bet. You can achive a good compression ratio with it even for data blocks that are distributed and hence do not get compressed much in RLE.
But please bear in mind that you should avoid third party libraries as they might add some overhead to the data being transmitted. Instead I suggest you to approach your problem in following phase :
1. Analyze the ASCII data that is being transmitted and prepare a table (Will be of 256 entries in case of ASCII) for each ASCII charecter and it corrosponding frequency of occurence in your test analysis.
2. Mannually prepare the Huffman Tree from the obtained data. Hence you will find the optimal encoding scheme for you data set. That means that for charecter occuring most will get a shorter bit string than the charecter that does occurs that repeatedly.
3. Implement this encoding scheme into your Tramitter / Reciever application. You can very well implement FilterInputstream and FilterOutputStream interfaces into a class that will decorate your Socket I/O stream.
I guess that above scheme will do your job as in any data some charecters are more frequent than other.
---- VJ
Eric Sosman - 07 Dec 2005 18:01 GMT Monique Y. Mudama wrote On 12/06/05 18:58,:
> Apologies if this is slightly off-topic ... > [quoted text clipped - 9 lines] > that gzip works in blocks, so small chunks of data don't get much > benefit. (Anyone disagree?) Thought #1: If you're "streaming" the data, can you treat the entire stream as the compressible block instead of compressing each mini-block separately?
Thought #2: Compression introduces delay, not only by consuming CPU cycles but from the very fact that it absorbs more bits than it emits. E.g., if your super-efficient compressor squeezes twelve bytes down to four bits, it needs to swallow two twelve-byte packets before it can transmit one compressed byte. If you're "streaming" the data, this might not be acceptable.
Thought #3: If protocol overhead already amounts to the equivalent of twelve bytes, no compressor can possibly do better than halve your bandwidth. If protocol overhead is more like sixty bytes, compression cannot possibly give you more than a one-sixth reduction.
Thought #4: My news server carries three Usenet groups with "compression" in their titles ...
 Signature Eric.Sosman@sun.com
Monique Y. Mudama - 07 Dec 2005 19:03 GMT On 2005-12-06, Monique Y. Mudama asked a question ...
Thanks all for your info and ideas. It's now clear to me that I didn't even know how to analyze my data, and your feedback has helped me figure out how I will need to reframe the question. I'm going to try to get a better understanding of my data, rather than trying to answer questions piecemeal and frustrating everyone.
 Signature monique
Ask smart questions, get good answers: http://www.catb.org/~esr/faqs/smart-questions.html
Roedy Green - 07 Dec 2005 20:13 GMT On Tue, 6 Dec 2005 16:58:05 -0700, "Monique Y. Mudama" <spam@bounceswoosh.org> wrote, quoted or indirectly quoted someone who said :
>I'm looking for a compression algorithm that works better than gzip >for small (12 bytes or so) chunks of data. Ideally there would be a >free implementation for Java and C++ that we could use for commercial >purposes. Compression will buy you little if every chunk goes out in its own packet. The packet wrapper overhead will overwhelm the payload. See http://mindprod.com/jgloss/tcpip.html to see what I am talking about.
However, if you your packet send delay is sufficient to say put 2 chunks most of the time in a packet you will halve the packet overhead.
My suggestion then is to see if there is a way to tweak the send delay to find an the optimal value for acceptable latency and fast transmission. This will likely prove more fruitful than working on compression.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Mike Schilling - 08 Dec 2005 16:01 GMT > On Tue, 6 Dec 2005 16:58:05 -0700, "Monique Y. Mudama" > <spam@bounceswoosh.org> wrote, quoted or indirectly quoted someone who [quoted text clipped - 18 lines] > transmission. This will likely prove more fruitful than working on > compression. Roedy makes an excellent point.
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|