> I use a custom-made database connection pool that doesn't have any
> serious problems. If a connection in the pool leaks, which happens very
> rarely, the app is able to recover, and if all connections have leaked,
> the app will open up new connections to the DB (the app has never
> actually leaked all connections in production, just in testing).
I believe, without evidence, that you mentioned this because
some part of your mind that you're not quite conscious of yet
believes that the problem is somewhere in here.
> The app will be running fine for a day or two, without any hitch in
> response time, then all of a sudden the site will hang inexplicably.
[quoted text clipped - 3 lines]
> because I can access pages on the site that don't require a DB
> connection.
I'd want to focus first on reproducing the symptoms in a
controlled environment. The simplest way that might possibly
work is to use a tool like ab ("Apache bench", probably lurking
somewhere in your Apache distribution) to hit a page that requires
DB access many, many times (pointing at a test machine, of course),
and see if it locks up after a few days' worth of hits.
A lock-up there would be excellent. Get that, and you've got
something to test hypotheses with.
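If ab isn't handy, the same idea can be roughed out in a few lines of Python. This is only a sketch, not a replacement for ab: the names fetch_url and hammer are made up for illustration, and any URL you pass in is a placeholder for whatever test page touches the DB. It calls the page many times from a handful of threads and collects every failure (timeouts included), so a lock-up shows up as a growing failure list rather than a silent stall.

```python
import threading
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_url(url, timeout=10.0):
    """Fetch one page and return the HTTP status code.

    Raises on timeouts, refused connections, and HTTP error statuses.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def hammer(fetch, total_requests, workers):
    """Call fetch() total_requests times, spread over `workers` threads.

    Returns (ok_count, failures), where failures is a list of
    (request_number, exception) pairs.
    """
    failures = []
    lock = threading.Lock()

    def one_request(n):
        try:
            fetch()
            return True
        except Exception as exc:  # timeout, refused connection, HTTP error
            with lock:
                failures.append((n, exc))
            return False

    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(one_request, range(total_requests)))
    return sum(results), failures
```

For example, hammer(lambda: fetch_url("http://test-host.example/db-page"), total_requests=100000, workers=20) would send 100000 requests over 20 threads. The host name there is hypothetical, and it should only ever point at a test machine.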
(This sort of thing happened to me once -- I ended up figuring
out that running two instances of ab with slightly different
timings would get it to freeze up almost instantly, which just
screams "Race Condition". Once I knew what I was looking for,
forehead slapping quickly followed.)
(Without being able to reproduce the bug, I could have found
and fixed the race condition, but not had any idea whether or
not I'd fixed the problem that was actually causing the observed
freeze. It was worth the effort.)
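For what it's worth, the "two instances with slightly different timings" trick can also be tried directly against a pool, without going through the web server. A minimal sketch in Python, assuming a toy pool (ToyPool is a stand-in invented here, not anyone's real pool): two groups of threads check connections out and in with different random delays, and a checkout that times out is recorded instead of blocking forever, which turns a hang into something you can count.

```python
import queue
import random
import threading
import time


class ToyPool:
    """Hypothetical stand-in for a connection pool: tokens in a queue."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(i)

    def check_out(self, timeout):
        # Raises queue.Empty rather than blocking forever when exhausted.
        return self._free.get(timeout=timeout)

    def check_in(self, conn):
        self._free.put(conn)

    def free_count(self):
        return self._free.qsize()


def worker(pool, max_delay, iterations, stalls, rng):
    for _ in range(iterations):
        try:
            conn = pool.check_out(timeout=1.0)
        except queue.Empty:
            stalls.append(time.time())  # record when the "hang" happened
            continue
        try:
            time.sleep(rng.uniform(0, max_delay))  # pretend to run a query
        finally:
            pool.check_in(conn)  # always return the connection


def stress(pool, max_delays, threads_per_group, iterations):
    """Run one group of threads per entry in max_delays; return the stalls."""
    stalls = []
    threads = []
    for group, max_delay in enumerate(max_delays):
        for n in range(threads_per_group):
            rng = random.Random(group * 1000 + n)
            threads.append(threading.Thread(
                target=worker,
                args=(pool, max_delay, iterations, stalls, rng)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stalls
```

With a correct pool and no more threads than connections, the stalls list stays empty and every connection ends up back in the pool. The point is to swap the real pool in for ToyPool and see whether that still holds.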

Mark Jeffcoat
Austin, TX
georgesbilodeau@gmail.com - 20 Oct 2006 01:38 GMT
Thanks for the reply. I've tried sending some load at a test machine
using 2 instances of Siege (http://www.joedog.org/JoeDog/Siege).
They're hitting a page that accesses the DB. My last test ran each
instance for 2 hours, and threw WAY more load at the site than it ever
gets in production, with lots of threads constantly checking DB
connections in and out of the pool. The server handled the load
masterfully, and the DB connections held up the whole time. I would
have thought that, at some point, the problem experienced in production
would have shown up during the test. GRRR.... no such luck.
I'm currently running a 10 hour test that will go overnight, again with
two instances, with one instance using a delay of 0-2 seconds between
requests and the other 0-3. Hopefully this one will be a little (OK, a
LOT) more fruitful.
As a side note, I've had a production server's DB connections hang up
in a matter of less than an hour before, so it doesn't make sense to me
that using a longer test will be more successful (although I would love
to be proven wrong). When they did hang up in less than an hour, it was on
a particularly busy day for the site, but still not nearly as busy as a
Siege test makes it.
Anyway, thanks again.
Georges