> Yes, certainly. You could, for example, partition your data set into
> smaller chunks, say x=0...999, 1000..1999, 2000..2999 etc. and have a
> number of threads which each processes one chunk. You only need to make
> sure that you only have one thread that will generate the output XML.
> Hi Robert,
> Thank you so much for your detailed reply. I am so glad I found
[quoted text clipped - 30 lines]
> ascending order of start position, the indexes in both should be
> identical.
I am not sure what you mean by "indexes". I certainly cannot see the
termin "index" mentioned above.
> So, I cannot sort the entire DB that way. It should only be sorted for
> one genome at a time.
I don't understand. You said, you wanted to match records based on
identical value pairs (start, end). You can (in a relational database
and also in Java code) easily sort by two criteria.
> 2. Database products:
> Do you think there is a way to link my Oracle DB and the file input
> directly to match the records?
Oracle can work with files, this is called "external tables". Please
have a look at the docs at http://tahiti.oracle.com e.g.
http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/schema.htm#sthref777
That's probably what I'd do if the format of your data can be easily
made available via Oracle. Note though that the file needs to be on the
server IIRC.
> Since I have a key and set of values for each entry in both DB's,
> should I use a Map instead of a List? By using a TreeMap, I can keep
> the data sorted on addition and so I'm just left with the mapping of
> coordinates in the corresponding indexes.
Since you only want to compare according to equality of your two values
a HashMap is more efficient.
> Will this be faster or is there a better way to do it like using a
> SortedArrayList?
Don't even think about using lists for this. That's too slow. You
probably should get yourself a book about data structures and algorithms
to get a solid foundation.
> By "pair type", do you mean an object with two ints and a String value
> as members?
... and with properly implemented equals() and hashCode() methods, exactly.
> Is HashSet faster than any other datastructure for this purpose? Do
> you think this will be the best one for this?
Yes.
> 3. Volume of data:
> It is not too large as I exaggerated and is only around 3,48000
> records in my local DB. The external DB has much more data than this
Did you mean 348,000 records? That's easily handled in memory.
> which I do not need to deal with since I am only looking for a match
> on all records in my local DB.
Then it's easy: create your class (see above), read in your local DB,
store it in a HashSet or HashMap, then query the DB and for each row
create a temporary instance of the class and look it up in the map / set.
>> Alternative: query both databases ordered by both coordinates and
>> compare records individually. You can then decide which ResultSet you
[quoted text clipped - 3 lines]
> comparing done by first sorting by start positions and then comparing
> corresponding index values?
I don't understand why you bring up these indexes. What indexes?
No, select from the DB with an ORDER BY with two columns. Same for the
local store. Then iterate through the ResultSet and your local data and
compare. If the first ordered column in the local store data is less
than in the DB data, increase it until they are same or greater. If
same do the same for the second column until a match is found or the end
of one of the two sources is reached.
>> Yes, certainly. You could, for example, partition your data set into
>> smaller chunks, say x=0...999, 1000..1999, 2000..2999 etc. and have a
[quoted text clipped - 3 lines]
> I am not used to Concurrent programming till now. Do you know of any
> sample code that I can refer?
http://www.amazon.com/Concurrent-Programming-Java-Principles-Patterns/dp/0201695812
http://www.amazon.com/Java-Concurrency-Practice-Brian-Goetz/dp/0321349601
But it's probably not worthwhile in this case.
Cheers
robert
jeyanthi.m@gmail.com - 06 Mar 2008 14:48 GMT
Thanks Robert! I now have a better understanding of the problem and a
clear picture on how to proceed.
This is what I intend to do based on all your suggestions:
1. Create an external Oracle table for the external file input.
2. Read in records from my local DB in a sorted order and load them as
a pair type into a HashSet.
3. Query the external table and for each row create a temporary
instance of the class and look it up in the set. This is done by
iterating through the ResultSet and local data and comparing each.
a. If the first ordered column in the local store data is less than
in the DB data, increase it until they are same or greater.
b. If same do the same for the second column until a match is found
or the end of one of the two sources is reached.
4. I would wrap the matches in XML tags and stream it out.
Thanks a lot for all your help!!!
Robert Klemme - 06 Mar 2008 18:27 GMT
> Thanks Robert! I now have a better understanding of the problem and a
> clear picture on how to proceed.
I am not so sure. Or maybe *I* did not understand everything properly.
Your explanation below seems to deal with three data sources. From
what I read earlier I understood that you have two sources: the Oracle
DB and your local data in a file.
> This is what I intend to do based on all your suggestions:
> 1. Create an external Oracle table for the external file input.
> 2. Read in records from my local DB in a sorted order and load them as
> a pair type into a HashSet.
If you make your local DB available as external table in Oracle then you
can simply do the matching in the DB via a join. The ResultSet from the
join then contains only matching records and you can directly output these.
> 3. Query the external table and for each row create a temporary
> instance of the class and look it up in the set. This is done by
[quoted text clipped - 3 lines]
> b. If same do the same for the second column until a match is found
> or the end of one of the two sources is reached.
No. If you have read the local data in a Set or Map already you need to
only look the Oracle data up via the temp instance in the Set / Map. No
additional iterating tricks needed.
> 4. I would wrap the matches in XML tags and stream it out.
>
> Thanks a lot for all your help!!!
You're welcome.
Kind regards
robert
JThangiah - 06 Mar 2008 17:41 GMT
Thanks Robert! I now have a better understanding of the problem and a
clear picture on how to proceed.
This is what I intend to do based on all your suggestions:
1. Create an external Oracle table for the external file input.
2. Read in records from my local DB in a sorted order and load them as
a pair type into a HashSet.
3. Query the external table and for each row create a temporary
instance of the class and look it up in the set. This is done by
iterating through the ResultSet and local data and comparing each.
a. If the first ordered column in the local store data is less than
in the DB data, increase it until they are same or greater.
b. If same do the same for the second column until a match is found
or the end of one of the two sources is reached.
4. I would wrap the matches in XML tags and stream it out.
Thanks a lot for all your help!!!