Hi everybody
I always have agreed that exception shouldn't be used for flow control,
nevertheless I have incurred in a situation where breaking this
principle is probably good, for other reasons than design.
The situation is inserting a row in a table, extremely simple stuff. It
happens that the application is basically only doing this and the table
has ~10^7 records, so I have rewritten the application logic, that
would be correctly described as
ARecordObject r=...
if(aDao.find(r)!=null)
aDao.insert(r);
else
aDao.merge(r,anotherRecordObject);
in the form of
if(!aDao.insert(r))
aDao.merge(r,anotherRecordObject);
where insert catches DataIntegrityViolationException and returns true
iff an insert was made.
Now, I see that the first form is better for many reasons, but if the
latter performs faster should I really
bother?
Cheers
Francesco
lordy - 28 Jun 2006 17:07 GMT
> Hi everybody
>
[quoted text clipped - 21 lines]
> latter performs faster should I really
> bother?
For maximum performance you should push that logic into a stored
procedure on the database.
Failing that the latter method is good and relevant bits of the program
should document the 'flow'.
Have you also considered relative occurence of inserts() to merges() in
your application.
Depeneding on incomming data it may be quicker to try a merge() first.
(Presumable this would be an update that returns zero rows updated
rather than an exception)
Lordy
Rastislav Komara - 28 Jun 2006 19:54 GMT
> Hi everybody
>
[quoted text clipped - 28 lines]
> Cheers
> Francesco
It is absolutly bad solution. You cannot achieve reasonable performance
using second way. Exceptions are realy realy slow for building stack
trace. If you do not want to use SQL test over DB use other way to
determine which operation is required. I cant recomend DB stored
porcedures. This is hard to write and bound application to special
Database infrastructure lovering portability and cacheing abilities.
Chris Uppal - 29 Jun 2006 12:48 GMT
> It is absolutly bad solution. You cannot achieve reasonable performance
> using second way. Exceptions are realy realy slow for building stack
> trace.
I find it /really/ difficult to believe that the cost of building, throwing,
and catching, an exception is even remotely comparable with the cost of a round
trip to a database. (And that's assuming that a select followed by an insert
runs in approximately the same time as a single insert -- which is probably
true when taking DB caching into account, but only at a far coarser granularity
than the time taken to throw an exception.)
-- chris
villo - 29 Jun 2006 16:13 GMT
Thanks for the feedback. Just a couple of remarks:
-I know that on the average the cost of building the stacktrace and
throwing the exception
is not comparable with a round trip to the db. I have to admit that I
haven't done any serious
benchmark in my case, which I'll probably do soon. But I have the
feeling that in my case
it can be slightly different (see below)
-Chris made a good point, actually my rdbms (hsql) supports only read
uncommitted, so it is possible that I have a race condition here.
Nevertheless this doesn't represent a problem at the application level:
in case a record has been inserted between find() and merge() I simply
catch the Duplicate...Exception again.
-Select for update could be a way but I'm not sure wheter hsqldb
supports it. And I wonder if I can mimic the merge() method with bare
sql...
The reason for the first point is that due to requirements I'm forced
to use the embedded file storage hsqldb provides. Now Inserting the
data is not a problem, but search time when the table size is in the
order of 10^7 tend to be unacceptable for the current scenario, where
the objective is to write as fast as I can.
Just to give an overview of what I'm doing: I have a thread pool making
axfr requests and writing some information to a single table. Each
record contains basically an IP and the ccTld it belongs to. Laterwards
the collected data will be merged to a separate database to make some
data analysis. This will be done in batches.
So to summarize I guess I will have to make some real benchmarks to see
which approach performs better; I expect that as long as the db is
small the first will be faster, while the second should catch up as the
size increases...Which is the pointcut it is totally unclear right
now:(
cheers
Francesco
Chris Uppal - 30 Jun 2006 11:24 GMT
> The reason for the first point is that due to requirements I'm forced
> to use the embedded file storage hsqldb provides. Now Inserting the
> data is not a problem, but search time when the table size is in the
> order of 10^7 tend to be unacceptable for the current scenario, where
> the objective is to write as fast as I can.
Note that inserting into a table requires the logical equivalent of a search
anyway, since rows are unique. If you are seeing a significant difference
between the time taken to do an insert (of a record you know is not already
present), and the time take to search for that record (and not find it), then
it sounds as if something is very wrong -- and presumably fixable ;-)
You might get better performance by indexing the table (only) on an otherwise
meaningless incrementing integer and/or timestamp. The idea is to make writing
this table as near as possible like writing an ordinary sequential file.
-- chris
Chris Uppal - 29 Jun 2006 12:03 GMT
> if(aDao.find(r)!=null)
> aDao.insert(r);
> else
> aDao.merge(r,anotherRecordObject);
> if(!aDao.insert(r))
> aDao.merge(r,anotherRecordObject);
> Now, I see that the first form is better for many reasons,
It isn't better unless you are running your database at such a high isolation
level that the failed find() will prevent anyone else adding a record which
/would/ have matched the conditions in the SELECT. Similarly, the find()
could succeed, but the row be removed before your merge() executed; but
forbidding that doesn't require such a high isolation level[*].
OTOH, the second form is not a lot better, since once the error has been
triggered (and /assuming/ you've managed to find a -- database dependent -- way
of recognising the scenario accurately), there is /still/ no guarantee that the
data won't change under you before your alternative code is executed, so you
need some sort of loop somewhere.
Speaking as a non-expert, I prefer the second form, despite that problem.
I'd much prefer a third form with an UPDATE_OR_INSERT atomic action -- but
SQL doesn't provide that[**] :-(
-- chris
[*] I gather from the PostgreSQL documentation, that that DB may be quite
likely to use an isolation level high enough for this, since the level needed
for the second to work (read repeatability) is apparently just a synonym for
the highest level "serializable".
[**] Unless, perhaps, one could cobble something together with select for
update -- I'm no expert, as I mentioned.