Java Forum / General / December 2005
Compress int array, again
Kevin - 26 Dec 2005 03:54 GMT (I searched through this group, but found little about this.)
In my program, there are a lot of arrays like: int[]. They are usually 100 - 1000 in length, with the values of each entry be a normal integer. These int[] need to be stored in memory for fast acecss by some other codes.
I heard (actually, our manager wants it) that they are ways to save some memory by storing these int[] as "compressed". Any one could point me some example codes of it?
I wish is it "loseless", and it has a range similar to the int (does not require the exact range as int for our applications).
Thanks a lot and happy holiday!~
Roedy Green - 26 Dec 2005 07:06 GMT >I wish is it "loseless", and it has a range similar to the int (does >not require the exact range as int for our applications) you could serialise the array, and GZIP it. The will look for duplicates and replace when with a common token.
See http://mindprod.com/applets/fileio.html for how.
Tell it you want to serialise objects to a byte array with compression. The catch is you can't use it in compressed form. You have to fluff it up again first.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Alun Harford - 26 Dec 2005 09:36 GMT > (I searched through this group, but found little about this.) > [quoted text clipped - 9 lines] > I wish is it "loseless", and it has a range similar to the int (does > not require the exact range as int for our applications). Well I guess you could Huffman code each array (or maybe a better way) - byte-by-byte - and keep an index that tells you where element x starts (where x is a multiple of some integer n), but I doubt a small array of ints is going to compress very well (unless you frequently use the same or similar integers I guess).
Alun Harford
Knute Johnson - 26 Dec 2005 17:04 GMT > (I searched through this group, but found little about this.) > [quoted text clipped - 11 lines] > > Thanks a lot and happy holiday!~ I would think that any compression scheme is going to take longer to compress and uncompress than just reading the full size integers. I don't think that Java will run on any 16 bit processors so an integer access is just one hardware instruction. I think your manager is just wasting his and your time.
 Signature Knute Johnson email s/nospam/knute/
Kevin - 26 Dec 2005 21:13 GMT Thanks a lot for the replies.
Since I need to keep the int[] in memory, so I think we can not gzip it out to disk.
I did some test using a naive way:
For each int[] array, I convert each int to 4 bytes, so it will have a byte[] of 4 times the length.
I compress this byte[] using Java's Deflater class, and store the compressed byte array in memory.
For my data, before compress, the int[]s take about 170M memory, after compress, the byte[]s take about 52M memory. The compression time is a killer here, which takes about 4-5 minutes. But since this can be considered as a "preprocess" of my data and done offline, so this does not seem to be a major issue for my application at hand.
I am not sure about the decompress time at run time though. But since my code does not need to access all the int[]s at one time (it only access 10-50 of them at a time), so maybe the run time will be acceptable I guess.
Of course, we have to decompress each int[] array in order to access even one entry in it. But since my code uses the int[] as a whole each time, so this does not seem to be a problem here.
I just wonder whether my above way is the best method we can get?
Roedy Green - 26 Dec 2005 21:37 GMT >Since I need to keep the int[] in memory, so I think we can not gzip it >out to disk. You can GZip to to ByteArrayOutputStream and then to a byte array. You need not do any disk I/O.
But as I said, getting something out is a very expensive operation.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Knute Johnson - 27 Dec 2005 04:34 GMT > Thanks a lot for the replies. > [quoted text clipped - 25 lines] > > I just wonder whether my above way is the best method we can get? Does your data use the full range of an int? Could you store your data in a short?
If your average array size is 550 ints and you have 170mb of uncompressed arrays then that means you have approximately 77,272 arrays?
If it takes 4 to 5 minutes to compress your data, it will take that long to uncompress. Even if the uncompress is spread out over time the net result is the same.
The fastest access will be an int[] in memory. The second fastest will probably be off of a hard disk with good buffering. Most hard drives have multiple megabyte buffers these days. The other thing you might look at is MappedByteBuffer.
 Signature Knute Johnson email s/nospam/knute/
Erik Andreas Brandstadmoen - 26 Dec 2005 22:25 GMT > In my program, there are a lot of arrays like: int[]. They are usually > 100 - 1000 in length, with the values of each entry be a normal [quoted text clipped - 4 lines] > some memory by storing these int[] as "compressed". Any one could point > me some example codes of it? I'm sorry, but mu curiosity just has to come forward: What on earth are you doing with this, and have you investigated whether some of Java's built-in Collections can do some of the work for you?
There might of course be very good reasons for doing what you do, but I'm just curious. And in answer to your question, of course you can store things in different ways (i.e. compressed) in memory, but I'm wondering if it's a better idea to keep what you don't need on disk, and read it when you need it, because reading from disk is not necessarily slower than unzipping your in-memory data structures. Depending on hardware, of course (or maybe not? I'm not very good at compression theory).
Independent of which solution you choose, of course you will always trade reduced memory usage for reduced performance as well.
Regards, Erik Brandstadmoen
Kevin - 27 Dec 2005 05:11 GMT Thanks again.
Yes, the data has many many int[] arrays. Basically, each array records counts of a data structure similar to a "data cube" in data mining.
As to the application, it is an interactive GUI based one. Each int[] array basically corresponds to a data matrix to be manipulated and displayed on screen. As for the application, it requires that when the user clicks the mouse, the software displays the visualization immediately. As to this reason, it is not possible to read in the required int[] array from disk on the fly*. Since one visualization screen usually only needs 1 to 100 int[] arrays, which means, though we can have many many int[] arrays, at each time, we only need to access 1-100 of them (kind of random, since we never now which ones the user wants to use next). I think uncompress 1-100 (most time only 1-5) array on the fly should be OK in term of speed. So I choose to store the compressed ones in memory, and do the uncompress only to those necessary ones on the fly. (note: the application only access these int[] arrays as read-only. No change will be done on them)
*There is anyother way though I think: to segment these int[] arrays into many files, and only read in one file on the fly should be fast enough also (read one big file is slow, will take about 8 seconds as I tested). But since the user may not want to see many files for one project, so this is less attractive.
Also, though 170M data in memory is not that big as we have 1G memory on the machines, but other parts of the application will use large amount of memory also. Also, later, it is possible to have some more data (more int[] arrays). So it is better to reduce their memory usage somehow.
(To use 64 bit java is another way, but we do not have it for this project)
Knute Johnson - 27 Dec 2005 06:17 GMT > Thanks again. > [quoted text clipped - 7 lines] > immediately. As to this reason, it is not possible to read in the > required int[] array from disk on the fly*. I think you are seriously underestimating the performance of disk I/O. I wrote a test program to read an array of ints between 100 and 1000 bytes long out of a file that is 170mb of ints. Even with my slow computer it averages 10ms or less per array read. I don't think 10ms would be too long to wait.
import java.io.*; import java.util.*;
public class test10 { public static void main(String[] args) throws Exception { /* creates 170mb file of ints RandomAccessFile file = new RandomAccessFile("ran.dat","rw"); for (int i=0; i<42500000; i++) file.writeInt(i); file.close(); */ int iterations = 10000; Random random = new Random(System.currentTimeMillis()); RandomAccessFile file = new RandomAccessFile("ran.dat","r"); long start = System.currentTimeMillis(); for (int i=0; i<iterations; i++) { int arrayLength = random.nextInt(900) + 100; int[] buf = new int[arrayLength]; int where = random.nextInt(42500000); file.seek(where * 4); try { for (int j=0; j<arrayLength; j++) buf[j] = file.readInt(); } catch (EOFException eofe) { System.out.println(eofe); } } file.close(); long end = System.currentTimeMillis(); System.out.println((end-start)/1.0/iterations);
/* this is slightly faster than above int iterations = 10000; Random random = new Random(System.currentTimeMillis()); long start = System.currentTimeMillis(); for (int i=0; i<iterations; i++) { FileInputStream fis = new FileInputStream("ran.dat"); BufferedInputStream bis = new BufferedInputStream(fis); DataInputStream dis = new DataInputStream(bis); int arrayLength = random.nextInt(900) + 100; int[] buf = new int[arrayLength]; int where = random.nextInt(42500000); dis.skip(where * 4); try { for (int j=0; j<arrayLength; j++) buf[j] = dis.readInt(); } catch (EOFException eofe) { System.out.println(eofe); } dis.close(); } long end = System.currentTimeMillis(); System.out.println((end-start)/1.0/iterations); */ } }
 Signature Knute Johnson email s/nospam/knute/
Roedy Green - 27 Dec 2005 07:13 GMT On Mon, 26 Dec 2005 22:17:39 -0800, Knute Johnson <nospam@ljr-2.frazmtn.com> wrote, quoted or indirectly quoted someone who said :
>I think you are seriously underestimating the performance of disk I/O. >I wrote a test program to read an array of ints between 100 and 1000 >bytes long out of a file that is 170mb of ints. You want to arrange things so you can do the read in one physical i/o. You might store your arrays one after the other in a random access file with keeping in RAM where each array starts. If you need a new array, just tack it on the end.
See http://mindprod.com/projects/hermitcrab.html for how to do it if you require the ability to update and change the length of the arrays on the fly.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Kevin - 27 Dec 2005 21:31 GMT Thanks so much. I think I really under the impression of java has slow IO from the old days. Now, your post gives me a nice hint on this. Thank you, and now I know what I can do with my code.
Alun Harford - 28 Dec 2005 11:22 GMT > Thanks again. > [quoted text clipped - 28 lines] > data (more int[] arrays). So it is better to reduce their memory usage > somehow. Why not read the data in from disk (you should be able to know where in the file the int[] that you want is located, so you don't need to read the whole (large) file) and leave soft references to the arrays that you've already read in?
Alun Harford
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|