Java Forum / General / November 2005
Bit IO, digests, checksums
Igor Planinc - 15 Nov 2005 15:32 GMT Developing a graph (read: vertex-edge thingy) application I have some decisions to make and I'd appreciate some thoughts on a couple of ... issues.
I'd like to store (into SQL) generated graphs, including their "properties" like colourings, various cycles etc. Graphs are simple (no multiple edges nor loops) and undirected and therefore their structural data can be stored in (halfs of) adjacency matrices (or adjacency lists) with 0/1 values. I would also like those graphs to be quickly retrievable (utilizing an SQL index). As various SQL servers pose restrictions onto sizes of "indexable" fields (MySQL, for example, had 255 char limitation not long ago) I was thinking of computing graph-structure digests and make SQL indices on those.
Now, I'd like to calculate digests using a DigestOutputStream (constructor: DigestOutputStream(OutputStream stream, MessageDigest digest)), so I need: 1. Something like BitOutputStream (to pass as s first argument into above constructor) which implements, say, writeBit(int) or write(boolean). I'd also use such a stream for "serializing" graphs' structural data. I'd also need BitInputStream for "deserializing". I found JFAC's implementation but haven't tested it, yet. Also: are there any other available (noncommercial) implementations out there? Any thoughts on possible (bit-wise) endian issues? 2. As safe (collision-wise) as possible implementation of MessageDigest limited to 255 bytes (or 127, if I decide to store hex notation) computable reasonably fast. And of course implemented in Java, preferably Sun spi in their jre. Does anyone have any experience with those?
I'd appreciate any insight.
Igor Planinc - 15 Nov 2005 15:35 GMT > found JFAC's implementation Make that JFLAC ( http://jflac.sourceforge.net/ ).
Roedy Green - 15 Nov 2005 17:52 GMT >I'd appreciate any insight. Let's say you could index a node with a Java.util.BitSet indicating all the other nodes it is connected to. (I don't think JDBC has any such feature.)
How are you going to generate this key on lookup? You could add it as a BLOB, but the data are useless as a key. You key on node number.
I think what you must do is store edges with start and finish node numbers. Those SQL can find. JSut make sure SQL builds an index on both the start and finish node numbers.
Another approach is use a third party package that has solved these problems already. See http://mindprod.com/jgloss/graph.html though I suspect many won't deal with graphs so big they need SQL.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Igor Planinc - 16 Nov 2005 10:38 GMT >>I'd appreciate any insight. > [quoted text clipped - 3 lines] > > How are you going to generate this key on lookup? Like I said: by generating a digest of graphs structural data. I'm using boolean[][] as an adjacency matrix representation (booleans are "enough" since the graphs I'm dealing with are simple and undirected; the structure itself is not square, it's triangular). So, I was thinking of flushing that boolean[][] down some BitOutputStream, generating a digest with DigestOutputStream. I was thinking of making a hex representation (a String of 0,1,...,f) of that digest so that I can make an efficient and robust SQL index.
> You could add it as > a BLOB, but the data are useless as a key. You key on node number. [quoted text clipped - 6 lines] > problems already. See http://mindprod.com/jgloss/graph.html > though I suspect many won't deal with graphs so big they need SQL. My graphs won't be really big (like millions of edges) therefore a storage of a single graph would be a problem (there will be many of them, though, so that could represent ... an issue ;-) ). They will be bigger than 255 chars, though. My problem is this: I want to compute various graph "properties" like colourings, various polynomials etc. in advance and store them for quick retrieval later. So, given a graph, I'd like to have those properties readily available in SQL.
Igor Planinc - 16 Nov 2005 11:54 GMT > My graphs won't be really big (like millions of edges) therefore a > storage of a single graph would be a problem I meant: _wouldn't_ be a problem. (I blame the keyboard ;-) )
Chris Uppal - 16 Nov 2005 12:14 GMT > Like I said: by generating a digest of graphs structural data. I'm using > boolean[][] as an adjacency matrix representation (booleans are "enough" [quoted text clipped - 4 lines] > String of 0,1,...,f) of that digest so that I can make an efficient and > robust SQL index.
> My graphs won't be really big (like millions of edges) therefore a > storage of a single graph would be a problem (there will be many of them, > though, so that could represent ... an issue ;-) ). They will be bigger > than 255 chars, though. So you are working with graphs with more than about 64 nodes, if I understand you correctly. And are converting the adjacency triangle into a bitmap of around 2K bits or more, corresponding to >256 bytes of binary data. I'm assuming that you really mean that you will use the digest as a binary object (in Java and in the DB) -- if not then by doing so you can cut the length of the "string" down by a factor of 16 since a hex string only represents 16 bits per byte which is very wasteful (assuming ASCII, if the string in the db is Unicode, then it's even worse).
If you are using a digest algorithm which approximates to randomness (e.g. crypto-quality) then the choice of digest would depend mostly on how many graphs you need to store digests for. A digest producing an N-bit value would be expected to produce 1 collision if used to "index" sqrt(2**N) graphs. The various crypto-hashes produce different sized digests, so that may help guide your choice. (My guess is that performance probably wouldn't be an issue since (a) the time taken to hash a short "string" is probably a lot less than the time take to insert/find it in a DB, and (b) you don't need to use a "high spec" algorithm but can be fairly safe with a (cryptographically)not-very-strong-but-fast algorithm.)
One other thought is that unless your graphs are quite unusually dense (with a 50-50 chance of /any/ pair of nodes being connected) the 2K bits will not be "random", but will be quite compressible. I'd think about using run-length encoding of some kind, or maybe Huffman encoding (presumably the bit patterns: 00000000 00000001 00000010 ... 10000000 will occur much more often than, say, 11111111). Of possibly even use a Deflator (from java.utils.zip), though I suspect that might be a long way short of ideal for this application). Depending on the level of compression that you can achieve (if any) you might be able to get away without using a digest at all.
BTW, your scheme seems to require that you have a way of unambiguously ordering the nodes. I don't see how you do that myself (in the abstract), but if your application does have that property (or a weak form of it), then you might be able to leverage that to improve the compression.
-- chris
Igor Planinc - 16 Nov 2005 15:15 GMT >>Like I said: by generating a digest of graphs structural data. I'm using >>boolean[][] as an adjacency matrix representation (booleans are "enough" [quoted text clipped - 11 lines] > > So you are working with graphs with more than about 64 nodes, if I understand I'm, that is my application, is currently in a developing stage. That means, among other things, that I'm using relatively small graphs. But I'd like to "extend" as far as possible. Someday.
> you correctly. And are converting the adjacency triangle into a bitmap of > around 2K bits or more, corresponding to >256 bytes of binary data. I'm > assuming that you really mean that you will use the digest as a binary object > (in Java and in the DB) -- I'm using digest(s) (that is I _plan_ to use a digest) only for SQL retrieval. Large graphs may require longer keys/values than indexable SQL fields allow. Furthermore, unlimited length indexable columns would result in "impossibly" large indices (which I'm trying to avoid).
It seems like I don't know how to describe the (intended) usage right, still, let me try once more. Here's a scenario:
1. For a given graph, I compute/calculate/generate/whatever various graph "properties" (colourings, cycles, polynomials, etc.). Since such computations tend to be very time consuming, I don't want to do repeat them over and over, but rather store the data (into SQL database) for further usage. 2. I want to be able to retrieve the above mentioned data fast and that means that I have to make a proper SQL index on proper fields/columns. "Indexable" fields/columns can't be arbitrarily large so I need a "key" with limited (maybe fixed) length and that's where digests come in. 3. On another occasion, given some graph, I digest it and use that digest to retrieve the above mentioned data, saving myself the hassle of repeting the calculation.
> if not then by doing so you can cut the length of > the "string" down by a factor of 16 since a hex string only represents 16 bits [quoted text clipped - 23 lines] > will occur much more often than, say, 11111111). Of possibly even use a > Deflator (from java.utils.zip), That's an option, yes. But I guess not so much for the digest, but rather for other fields (structural data) and "properties". Some SQL servers, IIRC, compress binary (blob) content automatically.
> though I suspect that might be a long way short > of ideal for this application). Depending on the level of compression that you [quoted text clipped - 7 lines] > > -- chris Roedy Green - 16 Nov 2005 12:36 GMT >> How are you going to generate this key on lookup? > >Like I said: by generating a digest of graphs structural data. That's how you generate the key to build. How do you generate that key on lookup? to find the node you want?
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Igor Planinc - 16 Nov 2005 15:29 GMT >>>How are you going to generate this key on lookup? >> >>Like I said: by generating a digest of graphs structural data. > > That's how you generate the key to build. How do you generate that key > on lookup? to find the node you want? I guess I don't understand you correctly. By loopkup, do you mean an SQL query? By node, do you mean an SQL record?
As I explained in another post, the scenario is this:
1. For a given graph I perform some time consuming calculations and store the results of those calculations (along with the graph data) into database. 2. Then, at another occasion, given some graph, hoping that I don't have to perform those calculations again, I go look into the database if I performed them at some previous occasion (in 1.).
So, a key is a graph (more precisely, its structure: number of vertices and how those vertices are interconnected), or rather it's short equivalent: digest. How I feed the structure to a DigestOutputStream is not really all that important, provided that I do it identically each time.
Roedy Green - 16 Nov 2005 15:59 GMT >I guess I don't understand you correctly. By loopkup, do you mean an SQL query? >By node, do you mean an SQL record? The word key has a special meaning in computer science.. Perhaps you are using it to mean something else.
It means a field in an SQL record by which you can rapidly look up things. You hand SQL the key, and it finds the corresponding record. It is an indexed field.
The point I have been trying to get across my last few posts is that while you can compute the bitset and put it in as a field in a SQL record for a node it is USELESS as a key.
There is no way I could compose the key out of thin air to look up the node. I would have to find the node first to compute the key.
If you want a unique id for the node, just number them sequentially. That makes a good key - compact and quick. You can also use it in other records to reference the node.
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Igor Planinc - 16 Nov 2005 17:21 GMT >>I guess I don't understand you correctly. By loopkup, do you mean an SQL query? >>By node, do you mean an SQL record? [quoted text clipped - 5 lines] > things. You hand SQL the key, and it finds the corresponding record. > It is an indexed field. So we do understand each other in this part.
> The point I have been trying to get across my last few posts is that > while you can compute the bitset and put it in as a field in a SQL > record for a node it is USELESS as a key. This part I don't understand.
> There is no way I could compose the key out of thin air to look up the Not out of thin air. From a given graph!
(writing this ad hoc so bear with me)
CREATE TABLE GRAPH_DATA (
DIGEST VARCHAR(255) NOT NULL,
STRUCTURE BLOB NOT NULL,
THIS_COLOURING BLOB, THAT_COLOURING BLOB, CHROMATIC_NUMBER INTEGER, EDGE_CHROMATIC_NUMBER INTEGER, CIRCUMFERENCE INTEGER, DIAMETER INTEGER, THICKNESS INTEGER, MINIMUM_VERTEX_COVER BLOB, AUTOMORPHISMS BLOB, HAMILTONIAN_CYCLE BLOB, EULERIAN_CYCLE BLOB, ETC BLOB,
UNIQUE INDEX DIG_INDEX (DIGEST) )
So, given a graph G = (V,E), compute a structural digest and (some of the) properties and put all that into database with something like
INSERT INTO GRAPH_DATA VALUES (?, ?, ..., ?)
Then, at another time, _given_some_graph_, compute a digest and hopefully get the rest from the database by
SELECT STRUCTURE, BLAH1, BLAH2, BLAH3 FROM GRAPH_DATA WHERE DIGEST=?
and perhaps (to be on the safe side collision-wise) test the two structures (the one of a graph in hand and the one retrieved from database) for equality.
> node. I would have to find the node first to compute the key. > If you want a unique id for the node, just number them sequentially. > That makes a good key - compact and quick. You can also use it in > other records to reference the node. Adding ID INTEGER NOT NULL and INDEX ID_INDEX (ID), but that's beside the point.
Roedy Green - 16 Nov 2005 20:24 GMT >> There is no way I could compose the key out of thin air to look up the > >Not out of thin air. From a given graph! but that is a nutty thing to do. It is like using a person's entire genealogical history back 12 generations as their common name. Every time you want to say "hello" you have to construct this giant family tree. Just give them an number, like 7 . What is even nuttier is you need some simpler name to even begin thinking about constructing the complex name. So why not use that name?
 Signature Canadian Mind Products, Roedy Green. http://mindprod.com Java custom programming, consulting and coaching.
Igor Planinc - 17 Nov 2005 08:01 GMT >>>There is no way I could compose the key out of thin air to look up the >> [quoted text clipped - 8 lines] > thinking about constructing the complex name. So why not use that > name? But then I'd have to enumerate all the graphs I'd ever need "propertized". I also plan for a distributed application, so a graph would have to have a universally unique id. Now that I think of it, it'd also be wise to have a digest id, so that I can possibly use different digests or choose another one later on if I find out the "current" one proves to be inadequate.
Anyway, the whole thing is a pita. When it comes to such decisions I tend to be indecisive. Because I know what it means when you have to change fundamental things in mid-development, let alone in a "finished" product.
Anyway, thank you for your thoughts.
Andrey Kuznetsov - 15 Nov 2005 20:03 GMT > I need: > 1. Something like BitOutputStream (to pass as s first argument into above > constructor) which implements, say, writeBit(int) or write(boolean). see http://uio.imagero.com and http://uio.imagero.com/doc/com/imagero/uio/io/BitOutputStream.html
 Signature Andrey Kuznetsov http://uio.imagero.com Unified I/O for Java http://reader.imagero.com Java image reader http://jgui.imagero.com Java GUI components and utilities
Igor Planinc - 17 Nov 2005 08:06 GMT >>I need: >>1. Something like BitOutputStream (to pass as s first argument into above >>constructor) which implements, say, writeBit(int) or write(boolean). > > see http://uio.imagero.com > and http://uio.imagero.com/doc/com/imagero/uio/io/BitOutputStream.html Thanks. DLed it, tested it, it meets all my needs. I especially like the BSD license. Great job, Andrey!
Andrey Kuznetsov - 18 Nov 2005 22:01 GMT >>>I need: >>>1. Something like BitOutputStream (to pass as s first argument into above [quoted text clipped - 5 lines] > Thanks. DLed it, tested it, it meets all my needs. I especially like the > BSD license. Great job, Andrey! Then don't hesitate to vote for uio (category best java library): http://207.178.67.98/java/readerschoice2004/frameindex.cfm
Cheers
Andrey
 Signature Andrey Kuznetsov http://uio.imagero.com Unified I/O for Java http://reader.imagero.com Java image reader http://jgui.imagero.com Java GUI components and utilities
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|