Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
HomeAnnouncementsWhite Papers
Discussion GroupsFirst AidDatabasesJavaBeansGUIJava 3DVirtual MachineCORBASecurityToolsGeneral
Java DirectoryOpen Source ProjectsSample Book ChaptersUser GroupsWeb Resources
Related Topics
Databases.NETMore Topics ...

Java Forum / General / December 2005

Tip: Looking for answers? Try searching our database.

compression API available in Java & C++?

Thread view: 
Monique Y. Mudama - 06 Dec 2005 23:58 GMT
Apologies if this is slightly off-topic ...

I'm looking for a compression algorithm that works better than gzip
for small (12 bytes or so) chunks of data.  Ideally there would be a
free implementation for Java and C++ that we could use for commercial
purposes.

Specifically, we are streaming data from a C++ server to a Java applet
and would love to use as little bandwidth as possible.

We're not using any form of compression right now, but the belief is
that gzip works in blocks, so small chunks of data don't get much
benefit.  (Anyone disagree?)

Signature

monique

Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html

Chris Smith - 07 Dec 2005 00:13 GMT
> I'm looking for a compression algorithm that works better than gzip
> for small (12 bytes or so) chunks of data.  Ideally there would be a
[quoted text clipped - 3 lines]
> Specifically, we are streaming data from a C++ server to a Java applet
> and would love to use as little bandwidth as possible.

Compressing 12 bytes?  That's a tough one.  The concern is that the
overhard added by any compression algorithm for something like the
dictionary is going to be more than 12 bytes in length.  I'd be
considering the following options:

1. Drop it.  There may be very little that you can do.

2. Run length encoding.  If you expect this be a reliable hit, then a
simple RFE scheme could be devised that only adds a byte for the
uncompressable case and helps a little with the compressable case.

3. Out-of-stream dictionary.  If you can devise a dictionary of pre-
known patterns that are likely to occur frequently in the data, then you
could pre-build that dictionary and devise a compression scheme that
asume them.

None of the three cases involve using compression libraries, though.  
You'd have to build this stuff yourself.

Signature

www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation

Monique Y. Mudama - 07 Dec 2005 00:39 GMT
>> I'm looking for a compression algorithm that works better than gzip
>> for small (12 bytes or so) chunks of data.  Ideally there would be
[quoted text clipped - 3 lines]
>> Specifically, we are streaming data from a C++ server to a Java
>> applet and would love to use as little bandwidth as possible.

Thanks for the response, Chris!

> Compressing 12 bytes?  That's a tough one.  The concern is that the
> overhard added by any compression algorithm for something like the
> dictionary is going to be more than 12 bytes in length.  

Exactly right.

> I'd be considering the following options:
>
> 1. Drop it.  There may be very little that you can do.

Yeah, I'm thinking it may be more trouble than it's worth .. but if it
*is* possible, it would be a help.

> 2. Run length encoding.  If you expect this be a reliable hit, then a
> simple RFE scheme could be devised that only adds a byte for the
> uncompressable case and helps a little with the compressable case.

Was RFE a typo for RLE?  I hadn't heard of RLE; I will google around
and see what comes up.

> 3. Out-of-stream dictionary.  If you can devise a dictionary of pre-
> known patterns that are likely to occur frequently in the data, then
> you could pre-build that dictionary and devise a compression scheme
> that asume them.

I would have to think about this.  I'm pretty sure the data is only
ASCII characters, and of those only the common ones that show up on a
US keyboard; is that predictable enough for a dictionary approach?

> None of the three cases involve using compression libraries, though.
> You'd have to build this stuff yourself.

It may be worth it if the compression is good enough.  I haven't yet
figured out what "good enough" would mean for this, though.

Thank you for all your suggestions.

Signature

monique

Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html

Monique Y. Mudama - 07 Dec 2005 01:09 GMT
>> 2. Run length encoding.  If you expect this be a reliable hit, then
>> a simple RFE scheme could be devised that only adds a byte for the
[quoted text clipped - 11 lines]
> ASCII characters, and of those only the common ones that show up on
> a US keyboard; is that predictable enough for a dictionary approach?

The more I think about it, the more I think RLE + dictionary may be
right for us.  For some of the data we're streaming, up to half of the
chunk of data is a single character repeated.  And we could possibly
use a few bits to describe each character we use.  I'm not sure the
character patterns are predictable enough to describe multiple
characters at a time, though.

Signature

monique

Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html

Bjorn Abelli - 07 Dec 2005 01:27 GMT
"Monique Y. Mudama" wrote...

>> 3. Out-of-stream dictionary.  If you can devise a dictionary
>> of pre-known patterns that are likely to occur frequently in
[quoted text clipped - 5 lines]
> show up on a US keyboard; is that predictable enough for a
> dictionary approach?

If the characters are somewhat evenly distributed, I don't think you could
gain so much by what I guess Chris meant, as it would mean that the number
of "patterns" would be just about the number of keys on the keyboard, hence
almost as many as can fit into a byte itself...

Although, there could be some more "compression" made if you're *really*
sure that it's only ASCII.

Then you can simply drop the insignificant bit (make each character 7-bits)
and pack them into 11 bytes instead of 12. But then you've only gained one
byte.

If you can investigate what the data really is comprised of, you could
probably make an even better chart to map a kind of dictionary.

Let's say that the data sent is only alphanumerical characters (0-9, a-z and
A-Z), which "could" be the case, then you can make your own schema for
mapping each character to 6 bits instead of 8. After packing those 12
characters in that schema, you would end up with sending 9 bytes instead of
12.

There would actually be room to map additionally 11 characters... ;-)

// Bjorn A
Monique Y. Mudama - 07 Dec 2005 05:31 GMT
> "Monique Y. Mudama" wrote...
>
[quoted text clipped - 13 lines]
> on the keyboard, hence almost as many as can fit into a byte
> itself...

Yeah, that's kind of what I came to realize earlier tonight ...

> Although, there could be some more "compression" made if you're
> *really* sure that it's only ASCII.
[quoted text clipped - 15 lines]
> There would actually be room to map additionally 11 characters...
> ;-)

We also have some punctuation to deal with, so we'd probably end up
needing the full 6 bits, or maybe even all the way to seven.  Bleh!

Thanks to you and everyone else who has made some suggestions.  I am
going to think about it.  It may be that our data just isn't
well-suited to compression; maybe I can come up with ways to
reorganize the data for compression, but I don't think that would be
well received because we already have apps coded to the existing
formats.

Signature

monique

Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html

Mike Schilling - 07 Dec 2005 01:34 GMT
> Apologies if this is slightly off-topic ...
>
> I'm looking for a compression algorithm that works better than gzip
> for small (12 bytes or so) chunks of data.  Ideally there would be a
> free implementation for Java and C++ that we could use for commercial
> purposes.

Consider the communication overhead of the protocol you're using.  You might
accomplish more packing multiple chunks into a message than compressing the
chunks.
Monique Y. Mudama - 07 Dec 2005 05:28 GMT
>> Apologies if this is slightly off-topic ...
>>
[quoted text clipped - 6 lines]
> You might accomplish more packing multiple chunks into a message
> than compressing the chunks.

The data is live; each chunk needs to be transmitted as soon as the
server can do so.  Sorry if I forgot to mention a streaming need in
my description.

Signature

monique

Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html

Mike Schilling - 07 Dec 2005 07:22 GMT
>>> Apologies if this is slightly off-topic ...
>>>
[quoted text clipped - 10 lines]
> server can do so.  Sorry if I forgot to mention a streaming need in
> my description.

What's the per-packet overhead, then?  If it's much more than 12 bytes,
you're not going to get much out of compressing the data.
Thomas Hawtin - 07 Dec 2005 18:23 GMT
> Consider the communication overhead of the protocol you're using.  You might
> accomplish more packing multiple chunks into a message than compressing the
> chunks.

Not just overhead, but minimum packet lengths. ATM (for instance ADSL)
packets always have exactly 48-bytes payload (although I don't know how
IP is mapped to ATM). Ethernet has a minimum packet size in or for
collision detection to work (machines at extremes of a network have to
detect collisions while still sending the packet).

(48 bytes may seem a peculiar choice. IIRC, the story is: Europeans
wanted a 64-byte packet for better efficiency. Americans wanted 32 bytes
as the faster rate of packets would sufficiently low latency that echo
suppression logic would not be necessary. 48 bytes was the compromise.
It gives poor efficiency and does not remove the need for echo
suppression logic.)

Tom Hawtin
Signature

Unemployed English Java programmer
http://jroller.com/page/tackline/

Roedy Green - 07 Dec 2005 01:52 GMT
On Tue, 6 Dec 2005 16:58:05 -0700, "Monique Y. Mudama"
<spam@bounceswoosh.org> wrote, quoted or indirectly quoted someone who
said :

>I'm looking for a compression algorithm that works better than gzip
>for small (12 bytes or so) chunks of data.  Ideally there would be a
[quoted text clipped - 7 lines]
>that gzip works in blocks, so small chunks of data don't get much
>benefit.  (Anyone disagree?)

the key to this in getting a big sample to compress and studying it
and seeing that both ends have access to the same summary info.  You
won't  do much with totally self contained messages.

What do you know about the data?  Every additional fact can be
exploited in designing custom compression.

It the ideal case, your 12 bytes are one of 100 phrases (commands) and
you can replace them with a 1 byte int indexing the message.
Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Monique Y. Mudama - 07 Dec 2005 05:27 GMT
> the key to this in getting a big sample to compress and studying it
> and seeing that both ends have access to the same summary info.  You
> won't  do much with totally self contained messages.
>
> What do you know about the data?  Every additional fact can be
> exploited in designing custom compression.

I think I need to spend more time considering exactly this question.
There are actually several "types" of data that can be coming from
this source.  It may be that to compress effectively, we would have to
use a different compression technique for each "type."

> It the ideal case, your 12 bytes are one of 100 phrases (commands)
> and you can replace them with a 1 byte int indexing the message.

Sadly, no.  It's data from an external source, and ...

I recognize that I'm not being very helpful, because I don't want to
reveal the full nature of the data I'm dealing with.  I also recognize
that this means that people have to suggest general approaches rather
than being able to tell me exactly what to do, and I apologize for the
frustration that may cause.

Anyway, any amount of optimizing of the kind you describe has already
been done.  I'm looking at ways of compressing data that's already
stripped down to codes, etc.  But we're using ASCII representations,
which could possibly be represented more efficiently in bits.  I
talked about it with my team lead last night, and we came to the
conclusion we'd probably only squeeze it down to 6 bits instead of 8,
which probably isn't enough gain.

We have a new team member, and it looks like his first project will be
to play with this data and see if he can come up with a useful
compression scheme.  So I'm still thinking about it, but I'm not on
the hook, so to speak.

Signature

monique

Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html

Knute Johnson - 07 Dec 2005 05:38 GMT
> Apologies if this is slightly off-topic ...
>
[quoted text clipped - 9 lines]
> that gzip works in blocks, so small chunks of data don't get much
> benefit.  (Anyone disagree?)

How much data are you actually sending at a time?  If it is only 12
bytes and you aren't sending them fast enough to get a lot of them into
a TCP packet then compression will gain you nothing.  If you are then
GZIP ought to work great.

Signature

Knute Johnson
email s/nospam/knute/

Hiran Chaudhuri - 07 Dec 2005 10:43 GMT
>> I'm looking for a compression algorithm that works better than gzip
>> for small (12 bytes or so) chunks of data.  Ideally there would be a
[quoted text clipped - 4 lines]
> packet then compression will gain you nothing.  If you are then GZIP ought
> to work great.

From his other post it looks like they are already discussing the bits
representation of their messages. Any compression algorithm can only reduce
redundancy, repetitions and patterns. But if their information set is
reduced to contain no such structures any more no compression algorithm can
squeeze the data any more. Well, you might call it custom compression if you
like.

Hiran
Chris Uppal - 07 Dec 2005 13:12 GMT
> I'm looking for a compression algorithm that works better than gzip
> for small (12 bytes or so) chunks of data.  Ideally there would be a
> free implementation for Java and C++ that we could use for commercial
> purposes.

No general-purpose compression scheme is going to be very much help for such
short strings.  They work by learning from the data, and 12 bytes doesn't give
enough input for the compressor to learn much, let alone start using what it
has learned.

So you /have/ to use a custom compression scheme.  Adding to the suggestions
that have already been made:

The gzip compression scheme (as implemented in the Java API) can be pre-trained
with representative data (which is known to both the reading and writing
applications and therefore does not need to be transmitted).  You can easily
test how well that would work for you by assembling a file consisting of (say)
8K of real, representative, data and another which is exactly the same except
that it has one more code tacked onto the end.  The difference in size between
to the two files when compressed with gzip -9 will give an indication of the
kind of compression available via this approach.  I suspect it won't work well,
but you never know and it's easy to try out.

Otherwise you will almost certainly have to look into Huffman compression or
even (if you are sufficiently desperate) Arithmetic compression (much slower
and harder to implement).

One thing to consider is whether the data has any structure.  For instance
(this is a real example from a previous job) if you desperately need to
compress URLs such as found on the web:
   http://java.sun.com/index.html
you can divide them up into separate "regions":
   http://
   java.sun
   .com/
   index
   .html
each of which has distinct frequency characteristics.  The first region, the
"scheme" can probably be represented in 1 bit in the majority of cases.  You
are looking at a population that is:
   http://                  -- the overwhelming majority
   https://                 -- quite common
   ftp://                     -- moderately rare
   <everything else>    -- rare
So you can represent http:// as one bit, probably use three bits for https://
and ftp://, and more bits than that for other cases.  Similarly the ".com"
suffix can be recognised as the most common top-level domain, and so can be
compressed to one or two bits.  The .htm and .html suffixes are common too.
You can use the Huffman algorithm to assign near optimal bit-patterns to the
various cases.  OTOH the 'java.sun' bit is going to be much more "random" so
the best you can hope for is a normal text compression scheme, such as
character-by-character Huffman compression based of the frequency of ASCII
characters in domain names (less the top-level segment).

BTW, if this data is the result of encoding some compound information, then
look at the information itself rather than the strings produced by your current
code scheme.  I.e. don't think of the task as compressing the 12-bytes of data,
but of compressing the N-bytes of information that is (currently) expressed by
the 12-bytes.

   -- chris
mr.vaibhavjain@gmail.com - 07 Dec 2005 15:14 GMT
Hello Monique,

If your data is stream based than you can try out Huffmans Compression
algorithm. Its an algorithm that converts fixed length blocks to
variable length keys that can be used to look up the actual data. The
implementation of this algorithm may be trivial but requires some great
deal of bit shifts. You can find more info aboute Huffmans Compression
algorithm from the web. In your case is RLE fails or does not satisfies
you needs than Huffmans compression is the best bet. You can achive a
good compression ratio with it even for data blocks that are
distributed  and hence do not get compressed much in RLE.

But please bear in mind that you should avoid third party libraries as
they might add some overhead to the data being transmitted. Instead I
suggest you to approach your problem in following phase :

1. Analyze the ASCII data that is being transmitted and prepare a table
(Will be of 256 entries in case of ASCII) for each ASCII charecter and
it corrosponding frequency of occurence in your test analysis.

2. Mannually prepare the Huffman Tree from the obtained data. Hence you
will find the optimal encoding scheme for you data set. That means that
for charecter occuring most will get a shorter bit string than the
charecter that does occurs that repeatedly.

3. Implement this encoding scheme into your Tramitter / Reciever
application. You can very well implement FilterInputstream and
FilterOutputStream interfaces  into a class that will decorate your
Socket I/O stream.

I guess that above scheme will do your job as in any data some
charecters are more frequent than other.

----
VJ
Eric Sosman - 07 Dec 2005 18:01 GMT
Monique Y. Mudama wrote On 12/06/05 18:58,:
> Apologies if this is slightly off-topic ...
>
[quoted text clipped - 9 lines]
> that gzip works in blocks, so small chunks of data don't get much
> benefit.  (Anyone disagree?)

   Thought #1: If you're "streaming" the data, can you
treat the entire stream as the compressible block instead
of compressing each mini-block separately?

   Thought #2: Compression introduces delay, not only by
consuming CPU cycles but from the very fact that it absorbs
more bits than it emits.  E.g., if your super-efficient
compressor squeezes twelve bytes down to four bits, it
needs to swallow two twelve-byte packets before it can
transmit one compressed byte.  If you're "streaming" the
data, this might not be acceptable.

   Thought #3: If protocol overhead already amounts to
the equivalent of twelve bytes, no compressor can possibly
do better than halve your bandwidth.  If protocol overhead
is more like sixty bytes, compression cannot possibly give
you more than a one-sixth reduction.

   Thought #4: My news server carries three Usenet groups
with "compression" in their titles ...

Signature

Eric.Sosman@sun.com

Monique Y. Mudama - 07 Dec 2005 19:03 GMT
On 2005-12-06, Monique Y. Mudama asked a question ...

Thanks all for your info and ideas.  It's now clear to me that I didn't
even know how to analyze my data, and your feedback has helped me figure
out how I will need to reframe the question.  I'm going to try to get a
better understanding of my data, rather than trying to answer questions
piecemeal and frustrating everyone.

Signature

monique

Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html

Roedy Green - 07 Dec 2005 20:13 GMT
On Tue, 6 Dec 2005 16:58:05 -0700, "Monique Y. Mudama"
<spam@bounceswoosh.org> wrote, quoted or indirectly quoted someone who
said :

>I'm looking for a compression algorithm that works better than gzip
>for small (12 bytes or so) chunks of data.  Ideally there would be a
>free implementation for Java and C++ that we could use for commercial
>purposes.

Compression will buy you little if every chunk goes out in its own
packet. The packet wrapper overhead will overwhelm the payload.
See http://mindprod.com/jgloss/tcpip.html to see what I am talking
about.

However, if you your packet send delay is sufficient to say put 2
chunks most of the time in a packet you will halve the packet
overhead.

My suggestion then is to see if there is a way to tweak the send delay
to find an the optimal value for acceptable latency and fast
transmission. This will likely prove more fruitful than working on
compression.

Signature

Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Mike Schilling - 08 Dec 2005 16:01 GMT
> On Tue, 6 Dec 2005 16:58:05 -0700, "Monique Y. Mudama"
> <spam@bounceswoosh.org> wrote, quoted or indirectly quoted someone who
[quoted text clipped - 18 lines]
> transmission. This will likely prove more fruitful than working on
> compression.

Roedy makes an excellent point.


Free Magazines

Get these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...

Oracle MagazineNetwork ComputingComputer WorldBio-IT WorldeWeekInformation WeekInfosecurity
 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2008 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.