I'm working on an architecture for a fairly large scale system
in Java. User interactions will be through JSP and servlets,
but there is also a big batch processing component with multiple
jobs that can run for minutes to hours, and some of them running
in parallel.
I'm looking for architecture suggestions for the batch job
portion.
Specifically, does anyone have opinions on running jobs as
separate threads in a single virtual machine, vs running
multiple independent JVMs?
A single JVM has the advantage of less system overhead and
easier inter-thread communication - which may or may not turn
out to be useful. But it means that if one job crashes it can
bring the others down with it. It also means that if one job
goes into a loop and has to be terminated, the operator will
have no choice but to kill everything.
Multiple independent JVMs look more robust, but Java's penchant
for swallowing memory may make this significantly less
efficient.
Anyone have experience or opinions on this?
Any other good ideas for someone doing batch processing in Java?
Anyone know of open source or commercial toolkits of special
interest for batch processing?
Thanks.
Alan
Ray in HK - 12 Jun 2005 05:33 GMT
What is the market price of RAM?
> I'm working on an architecture for a fairly large scale system
> in Java. User interactions will be through JSP and servlets,
[quoted text clipped - 30 lines]
>
> Alan
Alan Meyer - 12 Jun 2005 05:55 GMT
> What is the market price of RAM?
Thanks for the reply.
RAM can be surprisingly expensive when you buy it in multi-bank,
multi-gigabyte modules for Sun machines. But your point is well
taken. The extra cost may be justified.
Besides using less memory, a multi-threaded design may have
other processing efficiencies over a multi-tasking
design.
Maybe someone has experience in this area?
Aquila Deus - 12 Jun 2005 06:04 GMT
> > What is the market price of RAM?
>
[quoted text clipped - 9 lines]
>
> Maybe someone has experience in this area?
Ask Unix or Windows experts. There are at least three disadvantages:
1. Creating a process uses more resources than creating a thread. But
given Java's own resource needs, .... :-)
2. Context switching between processes takes longer than switching
between threads.
3. On non-SMP machines, synchronization between processes is far heavier
than between threads. On x86, all you need for a mutex between threads
is an exchange instruction, but between processes....
PS: none of the above is really important.
Lucy - 13 Jun 2005 01:16 GMT
> > > What is the market price of RAM?
> >
[quoted text clipped - 16 lines]
>
> 2. Context switching between processes takes longer than switching
> between threads.
So on a per batch basis, the percentage of time wasted is this:
(A tiny part of a fraction of a second) / (A few minutes or hours)
is approx == 0, so just forget about it.
> 3. On non-SMP machines, synchronization between processes is far heavier
> than between threads. On x86, all you need for a mutex between threads
> is an exchange instruction, but between processes....
>
> PS: none of the above is really important.
Bjorn Borud - 13 Jun 2005 01:09 GMT
["Alan Meyer" <ameyer2@yahoo.com>]
| RAM can be surprisingly expensive when you buy it in multi-bank,
| multi-gigabyte modules for Sun machines. But your point is well
| taken. The extra cost may be justified.
if we are talking about a large system and the batch processing is
easily separable from your other infrastructure, why not buy cheap
Intel or AMD based machines, fill them up with RAM and run your batch
jobs there? it might be cheaper overall.
| Besides using less memory, a multi-threaded design may have
| other processing efficiencies over a multi-tasking
| design.
|
| Maybe someone has experience in this area?
it is hard to give any sort of meaningful answer as long as I have no
idea what the nature of your batch processing tasks is :-).
-Bjørn
Aquila Deus - 12 Jun 2005 05:58 GMT
> I'm working on an architecture for a fairly large scale system
> in Java. User interactions will be through JSP and servlets,
[quoted text clipped - 26 lines]
> Anyone know of open source or commercial toolkits of special
> interest for batch processing?
Multi-JVM. If you are worried about crashes, multiple processes are the
only solution - even .NET's Application Domains cannot ensure 100%
isolation inside a process.
The memory problem can be solved later, but a single-process solution
is flawed from the beginning.
However, most developers of java application servers wouldn't agree
with me :-)
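A minimal sketch of the multi-JVM approach, assuming each job is a
class with a main method on the current classpath. ChildJvmLauncher,
the job class name, and the heap cap passed in are all hypothetical;
only ProcessBuilder and the -Xmx option are standard.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: starts one batch job in its own JVM.
class ChildJvmLauncher {

    // Builds the command line for a child JVM that reuses this JVM's
    // java binary and classpath. maxHeap is a value such as "256m".
    static List<String> buildCommand(String mainClass, String maxHeap,
                                     List<String> jobArgs) {
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBin);
        cmd.add("-Xmx" + maxHeap);          // cap the child's heap
        cmd.add("-cp");
        cmd.add(System.getProperty("java.class.path"));
        cmd.add(mainClass);
        cmd.addAll(jobArgs);
        return cmd;
    }

    // Starts the child JVM. The returned Process lets an operator wait
    // for the job or destroy it without touching any other job.
    static Process launch(String mainClass, String maxHeap,
                          List<String> jobArgs) throws IOException {
        ProcessBuilder pb =
                new ProcessBuilder(buildCommand(mainClass, maxHeap, jobArgs));
        pb.inheritIO();                     // share stdout/stderr (Java 7+)
        return pb.start();
    }
}
```

A runaway job can then be stopped with Process.destroy() on its own
handle, and the exit code from Process.waitFor() tells the scheduler
whether the job succeeded.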
Harald - 12 Jun 2005 12:10 GMT
> Specifically, does anyone have opinions on running jobs as
> separate threads in a single virtual machine, vs running
> multiple independent JVMs?
> A single JVM has the advantage of less system overhead and
[...]
> Multiple independent JVMs look more robust, but Java's penchant
[...]
By default, HotSpot does not shrink the heap until more than 70% of it
is free after a collection, so the VM can hold far more memory than
the live objects need. You can reduce this with
-XX:MaxHeapFreeRatio (see [1]). Call the GC explicitly after
dismissing any huge object to convince it to really obey the
MaxHeapFreeRatio (see [2]). This, and the added stability, may be in
favor of running independent VMs.
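To check what a given setting does in practice, a job can report how
much of its committed heap is free. This is only a sketch using the
standard Runtime API; the class name HeapFreeRatio is made up for
illustration, and the numbers depend on the collector in use.

```java
// Hypothetical diagnostic: prints the share of committed heap that is free.
class HeapFreeRatio {

    // Percentage of the committed heap that is currently unused.
    static double freePercent(long committedBytes, long freeBytes) {
        return 100.0 * freeBytes / committedBytes;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long committed = rt.totalMemory();  // heap currently reserved by the VM
        long free = rt.freeMemory();        // unused part of that heap
        System.out.printf("committed=%d KB, free=%.1f%%%n",
                committed / 1024, freePercent(committed, free));
    }
}
```

Running it with and without a lower -XX:MaxHeapFreeRatio value, after
the job has released its large objects, shows whether the heap is
actually being given back.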
On the other hand, Java's Process class is pretty poor compared to
proper process management. If you need more than trivial communication
between processes, go for threads. As for killing individual threads
that have gone crazy, jdb's remote interface may be an option, though
I never tried this myself.
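An alternative to jdb that stays inside the language is cooperative
cancellation: each job runs as a task in an executor and polls its
interrupt flag, so one job can be stopped without killing the VM. This
is a different technique from the jdb idea above, and it only works
for jobs written to check for interruption. The class name and the
placeholder work loop are assumptions for the sketch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class CancellableJobDemo {

    // Starts a looping job, cancels it, and reports whether it exited cleanly.
    static boolean runAndCancel() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        AtomicBoolean exitedCleanly = new AtomicBoolean(false);

        Future<?> handle = pool.submit(() -> {
            started.countDown();
            // The job polls its interrupt flag between units of work.
            while (!Thread.currentThread().isInterrupted()) {
                Thread.yield();          // placeholder for one unit of real work
            }
            exitedCleanly.set(true);     // cleanup would go here
        });

        try {
            started.await();             // make sure the job is really running
            handle.cancel(true);         // interrupts only this job's thread
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return exitedCleanly.get();
    }

    public static void main(String[] args) {
        System.out.println("job exited cleanly: " + runAndCancel());
    }
}
```

A job stuck in a loop that never checks the flag cannot be stopped
this way, which is one more argument for separate processes when the
job code is not trusted.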
Harald.
[1] http://java.sun.com/docs/hotspot/VMOptions.html
[2] http://www.ebi.ac.uk/Rebholz-srv/whatizit/monq-doc/monq/stuff/ConvinceGC.html

---------------------+---------------------------------------------
Harald Kirsch (@home)|
Java Text Crunching: http://www.ebi.ac.uk/Rebholz-srv/whatizit/software
Patrick May - 12 Jun 2005 12:23 GMT
> On the other hand, Java's Process class is pretty poor compared to
> proper process management. If you need more than trivial
> communication between processes, go for threads.
Another alternative is to use a JavaSpace. For grid and
autonomic computing systems it provides an elegant solution to the
problem of interprocess communication.
Regards,
Patrick
------------------------------------------------------------------------
S P Engineering, Inc. | The experts in large scale distributed OO
| systems design and implementation.
pjm@spe.com | (C++, Java, Common Lisp, Jini, CORBA, UML)
Bjorn Borud - 13 Jun 2005 01:37 GMT
[Harald <pifpafpuf@gmx.de>]
| Call the GC explicitly after dismissing any huge object to convince
| it to really obey the MaxHeapFreeRatio (see [2]).
triggering major collections blindly every second is not exactly an
optimal solution to this problem. this is indeed *very* bad advice.
I would recommend reading a bit more about how the JVM you want to use
manages its memory. if you use Sun's JVM for instance you can read:
http://java.sun.com/docs/hotspot/
I would also recommend using jvmstat and visualgc to inspect what your
JVM is doing. after using it for a while you will become more
familiar with the hotspot GC system and you will most likely be able
to spot various problems with heap sizing etc quite fast.
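jvmstat and visualgc watch the JVM from outside. For a quick look from
inside a running job, the java.lang.management API (Java 5 and later)
exposes similar counters. This is a sketch of that alternative, not a
replacement for the tools above; the class name GcSnapshot is made up.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: summarizes heap use and collector activity.
class GcSnapshot {

    static String describe() {
        StringBuilder sb = new StringBuilder();
        MemoryUsage heap =
                ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        sb.append("heap used=").append(heap.getUsed())
          .append(" committed=").append(heap.getCommitted())
          .append('\n');
        // One bean per collector; names depend on the VM and its GC settings.
        for (GarbageCollectorMXBean gc :
                ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": collections=").append(gc.getCollectionCount())
              .append(" timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Logging this at the start and end of each job gives a rough picture of
how much collection work the job caused.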
-Bjørn
HK - 14 Jun 2005 15:02 GMT
> [Harald <pifpafpuf@gmx.de>]
> |
[quoted text clipped - 3 lines]
> triggering major collections blindly every second is not exactly an
> optimal solution to this problem. this is indeed *very* bad advice.
May I kindly ask you to carefully read my posting and the
documentation of ConvinceGC before you suggest I am
a complete idiot?
The only thing that I can see that may have misled you
is a different understanding of what a "huge object" is.
For me this is one which needs more than 50% of the
allocated memory. If such an object is not needed
anymore, e.g. after a startup phase of a server,
the only way I found to really get rid of it *and* free the
memory for other processes was ConvinceGC. If you have
a better solution, I would be eager to learn about it.
> I would recommend reading a bit more about how the JVM you want to use
> manages its memory. if you use Sun's JVM for instance you can read:
>
> http://java.sun.com/docs/hotspot/
Well, guess where I learned about -XX:MaxHeapFreeRatio.
Harald.
Bjorn Borud - 14 Jun 2005 14:57 GMT
["HK" <pifpafpuf@gmx.de>]
| May I kindly ask you to carefully read my posting and the
| documentation of ConvinceGC before you suggest I am
| a complete idiot?
indeed, I was mistaken. I misread the API and thought the class was
used to call System.gc() at a given interval indefinitely (which would
be a bad idea, and indeed, bad advice).
| The only thing that I can see that may have misled you
| is a different understanding of what a "huge object" is.
that, and the fact that I misread the API. my apologies.
-Bjørn
Bjorn Borud - 13 Jun 2005 01:19 GMT
["Alan Meyer" <ameyer2@yahoo.com>]
| But it means that if one job crashes it can bring the others down
| with it. It also means that if one job goes into a loop and has to
| be terminated, the operator will have no choice but to kill
| everything.
what do you mean by "crash" here? threads dying because of unhandled
exceptions or hard errors that make the JVM die?
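to illustrate the first case: an unhandled exception ends only the
thread that threw it, and an uncaught-exception handler can record the
failure while the rest of the JVM keeps running. a small sketch, with
a made-up job name and error message:

```java
import java.util.concurrent.atomic.AtomicReference;

class ThreadFailureDemo {

    // Runs a job thread that throws, and returns what the handler captured.
    static String runFailingJob() {
        AtomicReference<String> captured = new AtomicReference<>();
        Thread job = new Thread(() -> {
            throw new IllegalStateException("bad input record");
        }, "job-1");
        // Called in the dying thread; other threads are not affected.
        job.setUncaughtExceptionHandler(
                (t, e) -> captured.set(t.getName() + ": " + e.getMessage()));
        job.start();
        try {
            job.join();                  // wait for the job thread to die
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
        return captured.get();
    }

    public static void main(String[] args) {
        System.out.println("captured: " + runFailingJob());
        System.out.println("main thread is still running");
    }
}
```

the second case is different: a JVM crash takes every thread with it,
and since all threads share one heap, a job that exhausts memory can
also cause failures in jobs that did nothing wrong.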
I'd prefer to model the batch processing APIs so that you don't really
have to make a decision before you know you have to. you just
abstract away whether the job runs in the same JVM or not. provide
implementations for running jobs in the same JVM first. if it becomes
a problem doing so or it would make sense for other reasons to move
the processing elsewhere, implement whatever is needed for sending the
batch job to a different JVM (possibly on a different machine).
the important part is to
- have proper abstractions so that later you have the freedom
to choose.
- implement the remote processing when needed, and not start by
prematurely assuming that it is required.
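a rough sketch of what such an abstraction could look like. all names
here (BatchJob, JobRunner, InJvmJobRunner, CountArgsJob) are invented
for illustration. jobs are identified by class name plus string
arguments, on the assumption that the same pair could later be passed
on a command line to another JVM without changing the job code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// A job knows nothing about where it runs; it returns an exit code.
interface BatchJob {
    int run(List<String> args) throws Exception;
}

// Callers depend only on this interface, not on threads or processes.
interface JobRunner {
    Future<Integer> submit(String jobClassName, List<String> args);
    void shutdown();
}

// First implementation: run jobs as threads in the current JVM.
class InJvmJobRunner implements JobRunner {
    private final ExecutorService pool;

    InJvmJobRunner(int maxParallelJobs) {
        this.pool = Executors.newFixedThreadPool(maxParallelJobs);
    }

    @Override
    public Future<Integer> submit(String jobClassName, List<String> args) {
        return pool.submit(() -> {
            // Instantiate the job by name, as a remote runner would have to.
            BatchJob job = (BatchJob) Class.forName(jobClassName)
                    .getDeclaredConstructor().newInstance();
            return job.run(args);
        });
    }

    @Override
    public void shutdown() {
        pool.shutdown();
    }
}

// Trivial example job: its exit code is the number of arguments.
class CountArgsJob implements BatchJob {
    @Override
    public int run(List<String> args) {
        return args.size();
    }
}

class JobRunnerDemo {
    public static void main(String[] args) throws Exception {
        JobRunner runner = new InJvmJobRunner(2);
        Future<Integer> result = runner.submit(
                CountArgsJob.class.getName(), Arrays.asList("a", "b", "c"));
        System.out.println("exit code: " + result.get());
        runner.shutdown();
    }
}
```

a second JobRunner that starts a child JVM per job could be added
later behind the same interface, and the calling code would not need
to change.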
good luck!
-Bjørn
Chris Uppal - 13 Jun 2005 07:41 GMT
> the important part is to
>
[quoted text clipped - 3 lines]
> - implement the remote processing when needed, and not start by
> prematurely assuming that it is required.
I agree with your philosophy here, but I think I'd come to the opposite
conclusion.
Presumably there is no compelling /need/ for the batch code to run in the same
JVM as the online code (the two don't need to interact directly). In that case
I'd want to start with the simple and inherently robust architecture of using
separate processes, and take the options of moving them onto separate machines
or of moving them into a shared JVM as and when it became necessary.
-- chris
Bjorn Borud - 13 Jun 2005 13:42 GMT
["Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org>]
| I agree with your philosophy here, but I think I'd come to the opposite
| conclusion.
|
| Presumably there is no compelling /need/ for the batch code to run
| in the same JVM as the online code (the two don't need to interact
| directly).
the problem is that we can't really know that based on the information
the OP has posted. if we knew exactly what problem he tried to solve
it would be easier to give recommendations.
-Bjørn
Chris Uppal - 14 Jun 2005 09:13 GMT
> the problem is that we can't really know that based on the information
> the OP has posted. if we knew exactly what problem he tried to solve
> it would be easier to give recommendations.
Agreed.
-- chris
Alan Meyer - 14 Jun 2005 15:18 GMT
> > the problem is that we can't really know that based on the information
> > the OP has posted. if we knew exactly what problem he tried to solve
[quoted text clipped - 3 lines]
>
> -- chris
The points about abstraction are well taken. I will indeed
follow the advice given and design the interfaces to the programs
so they can run as threads of one JVM, in separate JVMs, or on
separate machines with no changes to the internals of the
processing.
I can't say anything about the specifics of the problem I'm
working on because it's a commercial project and the customer
requires confidentiality from the developers.
Speaking generally, I can say that the batch processes prepare
collections of documents for publication. There are user
interactive components to the system, but the basic publication
process is non-interactive. Publishing is done on a scheduled
basis. Documents pass through a series of steps to transform
them to make them usable by end users.
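In outline, and without giving away anything specific, the series of
steps could be modelled like this. The class name and the two example
steps are placeholders, not the real transformations.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical pipeline: applies each transformation step in order.
class DocumentPipeline {
    private final List<Function<String, String>> steps;

    DocumentPipeline(List<Function<String, String>> steps) {
        this.steps = steps;
    }

    // Feeds the output of each step into the next one.
    String process(String document) {
        String current = document;
        for (Function<String, String> step : steps) {
            current = step.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        Function<String, String> trim = s -> s.trim();
        Function<String, String> upper = s -> s.toUpperCase();
        DocumentPipeline pipeline =
                new DocumentPipeline(Arrays.asList(trim, upper));
        System.out.println(pipeline.process("  draft text  "));
    }
}
```

Because each step only sees a document and returns a document, a step
can run in a thread, in another JVM, or on another machine without
changes to its internals.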
Thanks.
Alan
Alan Meyer - 13 Jun 2005 17:18 GMT
> ...
> I'm looking for architecture suggestions for the batch job
> portion.
> ...
Thank you all for your ideas and suggestions. I will follow
up on them.
If anyone else has more ideas, please chime in. I will continue
to follow the thread.
Alan