Java Forum / Tools / March 2006
Compression Utilities
Roedy Green - 27 Feb 2006 07:28 GMT
I have benchmarked a number of compression utilities and have posted the results at http://mindprod.com/jgloss/compressionutilities.html
The bottom line is 7-zip is the clear champ for both ease of use and maximum compression. WinZip has the best speed.
Microsoft is preposterously inept. Their simple uncompressed copy takes considerably longer than others' compression, and their compression barely makes a dent in the size. Further, their properties dialog to display file sizes keeps giving the wrong results. It erroneously caches results for other files in Win2K.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Richard F.L.R.Snashall - 27 Feb 2006 08:00 GMT
> I have benchmarked a number of compression utilities and have posted
> the results at http://mindprod.com/jgloss/compressionutilities.html
[quoted text clipped - 3 lines]
>
> Microsoft is preposterously inept.. Their simple uncompressed copy
While I love to hear this of m$, is this a question of I/O time rather than compression time? I ran a study a while back using gzip and compress (on Unix). In the end, most of the time was used reading and writing to disk. How are you removing this time from the test?
> takes considerably longer than other's compression, and their
> compression barely makes a dent in the size. Further their properties
> dialog to display file sizes keeps giving the wrong results. It
> erroneously caches results for other files in Win2K.
Roedy Green - 27 Feb 2006 12:54 GMT
On Mon, 27 Feb 2006 03:00:53 -0500, "Richard F.L.R.Snashall" <rflrs@notnotrcn.com> wrote, quoted or indirectly quoted someone who said :
>While I love to hear this of m$, is this a question of I/O time
>rather than compression time? I ran a study a while back using
>gzip and compress (on Unix). In the end, most of the time was used
>reading and writing to disk. How are you removing this time from
>the test?
Why would you remove it? If you do your I/O in dainty chunks it takes a lot longer. Managing I/O is part of the skill of writing a good archiver.
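The point about "dainty chunks" can be made concrete with a small sketch. It is only an illustration, not code from any of the benchmarked archivers; the class name ChunkedCopy and its copy method are made up for this example. It copies a stream with a caller-chosen buffer size and reports how many read calls it took.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical helper, named for this example only.
final class ChunkedCopy {
    /**
     * Copies everything from in to out using a buffer of chunkSize bytes.
     * Returns the number of read calls that delivered data.
     */
    static int copy(InputStream in, OutputStream out, int chunkSize) {
        byte[] buffer = new byte[chunkSize];
        int reads = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                reads++;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return reads;
    }
}
```

For a 1 MiB in-memory source, a 64-byte buffer needs 16384 read calls while a 64 KiB buffer needs 16. On an unbuffered file stream each of those calls goes to the operating system, which is where small chunks start to cost time.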
Granted, you would expect a non-compressing archiver to beat a straight copy, because when you create an archive entry there is no need to flush it to disk right away, the way is traditional in an O/S close. I don't think Windows guarantees, though, that I/O is physically complete on close. IIRC it has the option of a delayed write.
I think it was WinRAR that wasted time by using a single thread. It purely reads for a while (wasting CPU time), then purely thinks for a while (wasting I/O time).
Pack2000 is specially for class files, not even resources, so it would be a bit cruel to hold it up to ridicule on general compression. It might be reasonable, though, to test jar.exe or some simple ZipOutputStream utility.
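As a rough idea of what a "simple ZipOutputStream utility" might look like, here is a minimal sketch using java.util.zip. The class name TinyZip and both method names are invented for illustration, and it works on in-memory byte arrays rather than files to keep it short; a utility fit for benchmarking would read and write real files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, named for this example only.
final class TinyZip {
    /** Packs one named entry into an in-memory ZIP archive at the highest Deflate level. */
    static byte[] zip(String entryName, byte[] content) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(sink)) {
            zos.setLevel(Deflater.BEST_COMPRESSION);
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(content);
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Returns the uncompressed bytes of the first entry in the archive. */
    static byte[] unzipFirst(byte[] archive) {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(archive))) {
            if (zis.getNextEntry() == null) {
                throw new IllegalArgumentException("archive has no entries");
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = zis.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```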
Oliver Wong - 27 Feb 2006 16:46 GMT
> On Mon, 27 Feb 2006 03:00:53 -0500, "Richard F.L.R.Snashall"
> <rflrs@notnotrcn.com> wrote, quoted or indirectly quoted someone who
[quoted text clipped - 9 lines]
> a lot longer.Managing i/o is part of the skill of writing a good
> archiver.
It might be interesting for people who want to develop a file transfer utility. They would be interested in the speed and ratio of compression, but not so much on the disk access time, because they won't be writing to disk; rather, they'll be writing to a socket across a network.
- Oliver
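For the file-transfer case described above, a minimal sketch might wrap whatever OutputStream the connection provides (for instance the one returned by Socket.getOutputStream()) in a GZIPOutputStream. The class CompressedSender and its methods are hypothetical names for this illustration, and gzip is just one format that ships with the JDK, not a recommendation from the benchmarks.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, named for this example only.
final class CompressedSender {
    /**
     * Streams source to destination in gzip format. The destination is
     * flushed but left open, so the caller decides when to close the socket.
     */
    static void send(InputStream source, OutputStream destination) {
        try {
            GZIPOutputStream gzip = new GZIPOutputStream(destination);
            byte[] buffer = new byte[8192];
            int n;
            while ((n = source.read(buffer)) != -1) {
                gzip.write(buffer, 0, n);
            }
            gzip.finish();        // writes the gzip trailer without closing destination
            destination.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a complete gzip stream and returns the decompressed bytes. */
    static byte[] receive(InputStream compressed) {
        try (GZIPInputStream gzip = new GZIPInputStream(compressed)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because nothing here touches the disk, timing send against an in-memory destination gives a figure closer to pure compression cost.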
Roedy Green - 27 Feb 2006 19:02 GMT
> It might be interesting for people who want to develop a file transfer
>utility. They would be interested in the speed and ratio of compression, but
>not so much on the disk access time, because they won't be writing to disk;
>rather, they'll be writing to a socket across a network.
There is one called BZip2 that comes with source. The install instructions looked too complicated, so I passed on it.
I think the key will be to develop different sorts of compactors, or preconditioners, for different sorts of file, e.g. class, html, text, jpg, png, gif, au, wav, xml.
I think too that when you install software, both ends should get a common dictionary for the languages they communicate in, for the compacting algorithms to use, and possibly even some aux dictionaries for people who keep communicating with each other.
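The JDK already offers a small-scale version of the shared-dictionary idea: java.util.zip.Deflater and Inflater accept a preset dictionary that both ends must already hold. The sketch below is only an illustration of that mechanism; the class name DictionaryDeflate is made up, and the dictionary content is whatever the two parties agree on in advance.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, named for this example only.
final class DictionaryDeflate {
    /** Compresses input; if dictionary is non-null it is used as a preset dictionary. */
    static byte[] compress(byte[] input, byte[] dictionary) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        if (dictionary != null) {
            deflater.setDictionary(dictionary);
        }
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses, supplying the shared dictionary when the stream asks for one. */
    static byte[] decompress(byte[] compressed, byte[] dictionary) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0) {
                    if (inflater.needsDictionary()) {
                        if (dictionary == null) {
                            throw new IllegalStateException("stream requires a dictionary");
                        }
                        inflater.setDictionary(dictionary);
                    } else if (inflater.needsInput()) {
                        throw new IllegalStateException("compressed data is truncated");
                    }
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

The gain is largest on short messages that share boilerplate with the dictionary, since a short message gives the compressor little of its own history to refer back to.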
For example to compact jars you could extract the strings from classes and resources and sort them alphabetically and then hand that to the compressor.
You might also write "chunker" plugins for new file formats to help the compressor break the file up at logical places for optimal searching for repeating strings, or for doing compression by creating deltas.
See http://mindprod.com/projects/deltacreator.html
The other possibility is tidying compactors, e.g. ones that remove whitespace from HTML or XML and put it back on the other end, but not necessarily in exactly the same spots.
Similarly, a compressor might notice that a gif file only used 200 colours, so it could reduce the colour depth. It should apply only transforms that do not degrade the image, unless you specifically tell it that is OK.
neznam neznam - 01 Mar 2006 15:48 GMT
> While I love to hear this of m$, is this a question of I/O time
> rather than compression time? I ran a study a while back using
> gzip and compress (on Unix). In the end, most of the time was used
> reading and writing to disk. How are you removing this time from
> the test?
use RAM disk ;-)
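In Java the same effect as a RAM disk can be approximated by compressing a byte array that is already in memory and timing only that step. This is a rough sketch under that assumption; InMemoryTiming and Result are made-up names, and a single System.nanoTime measurement is noisy (JIT warm-up, garbage collection), so a serious benchmark would repeat the run several times.

```java
import java.util.zip.Deflater;

// Hypothetical helper, named for this example only.
final class InMemoryTiming {
    /** Outcome of one timed run. */
    static final class Result {
        final long compressedBytes;
        final long nanos;

        Result(long compressedBytes, long nanos) {
            this.compressedBytes = compressedBytes;
            this.nanos = nanos;
        }
    }

    /** Deflates input entirely in memory and times only the compression loop. */
    static Result timeDeflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] scratch = new byte[64 * 1024]; // output is discarded, only counted
        long total = 0;
        long start = System.nanoTime();
        while (!deflater.finished()) {
            total += deflater.deflate(scratch);
        }
        long elapsed = System.nanoTime() - start;
        deflater.end();
        return new Result(total, elapsed);
    }
}
```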
Oliver Wong - 27 Feb 2006 16:59 GMT
>I have benchmarked a number of compression utilities and have posted
> the results at http://mindprod.com/jgloss/compressionutilities.html
You mention that it's important that you compress to a format that the recipient can decompress from; have you considered adding to the benchmarks compressions where you create self-extracting archives? E.g. Winzip and WinRar offer to compress to a .EXE instead of .ZIP and .RAR files.
Obviously, the resulting self-extracting archives would probably be platform specific, but this might make sense if what you were archiving were the distribution for a platform specific program anyway.
> Microsoft is preposterously inept.. [...] their
> compression barely makes a dent in the size.
To be fair, the intent of their "compact" utility is to allow the files to be used without an explicit decompression step, which (AFAIK) none of the other formats (.ZIP, .RAR, .7z, etc) allow. I believe they use something really simple like run-length encoding, to allow for fast decompression, random seeking within the file, and other stuff that one would typically want to do with the uncompressed contents of a file, that might be prohibitively expensive or difficult to do with the compression schemes used by the other formats.
I have some images of CDs on my harddrive which I mount using a CD drive emulator. The "useful contents" of the CDs are relatively small (100MB), but they contain padding files of sizes around 600MB which just contain the byte 0x00 over and over again; the reason for this is to place the useful content near the outer edge of the CD, thus allowing for faster data reads (because when the CD spins at a constant angular velocity, the drive can read from the outer edge faster than the inner edge).
This trick doesn't do anything for when the CD is stored as an image on my harddrive though, so the file is 600MB bigger than it needs to be. If I use the "compact" utility, it does RLE on the padding file to reduce it to just a few kilobytes, and so the image file is down to a more reasonable 100MB size.
- Oliver
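To show why run-length encoding does so well on a padding file full of 0x00, here is a minimal sketch of a byte-oriented RLE. It is plain RLE as the post describes it, not a reproduction of what the "compact" utility does internally; the class RunLength and its (count, value) pair format are assumptions made for this example.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, named for this example only.
final class RunLength {
    /** Encodes input as (count, value) byte pairs, with count between 1 and 255. */
    static byte[] encode(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < input.length) {
            byte value = input[i];
            int run = 1;
            while (i + run < input.length && input[i + run] == value && run < 255) {
                run++;
            }
            out.write(run);
            out.write(value);
            i += run;
        }
        return out.toByteArray();
    }

    /** Expands (count, value) pairs back into the original bytes. */
    static byte[] decode(byte[] encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i + 1 < encoded.length; i += 2) {
            int run = encoded[i] & 0xFF;
            for (int k = 0; k < run; k++) {
                out.write(encoded[i + 1]);
            }
        }
        return out.toByteArray();
    }
}
```

With this pair format, 600,000 zero bytes encode to 2353 pairs, or 4706 bytes. Data with no repeated bytes doubles in size, which is why RLE only suits inputs like the padding file.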
Roedy Green - 27 Feb 2006 19:15 GMT
> You mention that it's important that you compress to a format that the
>recipient can decompress from; have you considered adding to the benchmarks
>compressions where you create self-extracting archives?
Running benchmarks is as exciting as watching paint dry. I have had my fill for a while with Signum and the compressors. If you want to run some, I would be happy to format and post the results at http://mindprod.com/jgloss/compressionutilities.html
The reason I did that last batch is I was so under the weather I was not up to watching videos. I needed a task that required only IQ 50.
Self extracting is really a nutty idea, because the whole reason you do this is to minimise download time. You end up downloading the code for the extract over and over.
What should be done instead is a special extension set up for self-extracting files that install software. When you install the compression software, it should set up the association. When such a file arrives and you "execute" it, it behaves just like a conventional self-extractor without the overhead.
The Jet people have invented a nice self-extracting scheme. They have such huge downloads that the overhead of the decompressor is negligible. It has a traditional install dialog with icon and splash and a query for where to install. It can also set up associations as a side effect. It also has a delta scheme. You can create downloads to take you from version N to N+1 that contain just the differences. It figures out what is needed. The big saving is no runtime in the incremental downloads, which is 16 MB.
I wish install people would get clever and stop bundling run times with the application, and instead have the installer sniff around to see if it is already installed, and if not arrange to get it installed from the run-time vendor's site. The benefit to the vendor would be that they could monitor that people were living up to their run-time license agreements.
You really should bundle the runtime/JVM only for CD distribution.
Roedy Green - 27 Feb 2006 19:50 GMT
On Mon, 27 Feb 2006 19:15:17 GMT, Roedy Green <my_email_is_posted_on_my_website@munged.invalid> wrote, quoted or indirectly quoted someone who said :
>I wish install people would get clever and stop bundling run times
>with the application and install have an the installer sniff around to
>see if it already installed, and if not arrange to get it installed
>from the run-time vendor's site. The benefit to the vendor would be
>they could monitor that people were living up to their run-time
>license agreements.
I go on at length about how an installer should work at
http://mindprod.com/jgloss/installer.html
http://mindprod.com/projects/installer.html
http://mindprod.com/projects/sanitychecker.html