As part of my research I have built a large distributed Java system
for calculations.
The system is composed of a single "Sender" which sends jobs to a farm
of "Calculators" running on about 1000 different host machines. The
Calculators send their results to a single "Receiver". Jobs can be
easy, e.g. 1+1, or hard, e.g. something like log(1/[cos(ln3)]). As a
result, some jobs finish in seconds while others may take an hour.
I have implemented this system and it works fine. But I am sure there
is room for improvement and optimisation.
It needs to be:
more robust, e.g. I need good monitoring to deal with hosts crashing
or dying while performing a calculation. How do I detect this? Shall I
use the old heartbeat method? I don't think I can risk too many
messages on the network.
more scalable, e.g. what if I increase the number of hosts? Can RMI
handle it, or is there an alternative?
more adaptive to the type of job, etc.
Any other ideas or links to more info are welcome.
Thanks.
David N. Welton - 01 Nov 2005 13:56 GMT
> I have implemented this system and it works fine. But I am sure there
> is room for improvement and optimisation.
[quoted text clipped - 9 lines]
>
> Any other ideas or links to more info are welcome.
You might have a look at how the Erlang folks are doing things - there
are some good ideas there for reliable, distributed systems.
http://www.erlang.org/
Ciao,

Signature
David N. Welton
- http://www.dedasys.com/davidw/
Linux, Open Source Consulting
- http://www.dedasys.com/
Roedy Green - 01 Nov 2005 14:26 GMT
On 1 Nov 2005 04:49:44 -0800, "Wildfire_Heat"
<wildfire_heat@yahoo.com> wrote, quoted or indirectly quoted someone
who said :
>It needs to be:
>more robust. E.g. I need good monitoring to deal with hosts crashing
>and dying while performing a calculation. How do I detect this? Shall I
>use the old heartbeat method?
If you have many machines doing the same task, you can presume, say,
that at most 10% of them will crash. So you wait for 90% of the
results to be in, then wait a further 10% of the time elapsed so far,
and only then send out the posse for the remainder -- an "are you
still alive" UDP packet.
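
The rule above can be sketched as a small bookkeeping class. This is only an illustration under stated assumptions: the class name StragglerDetector and its methods are hypothetical, the 90% and 10% figures are the ones suggested above, and sending the actual UDP probe is left to the caller.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: decides when silent hosts should be probed. */
final class StragglerDetector {
    private final Set<String> pending = new HashSet<>();
    private final int totalHosts;
    private final long startMillis;
    // Time at which 90% of the results had arrived; -1 until that happens.
    private long thresholdReachedMillis = -1;

    StragglerDetector(List<String> hosts, long startMillis) {
        this.pending.addAll(hosts);
        this.totalHosts = pending.size();
        this.startMillis = startMillis;
    }

    /** Record that a host has delivered its result. */
    void resultReceived(String host, long nowMillis) {
        pending.remove(host);
        int done = totalHosts - pending.size();
        // Integer form of done / totalHosts >= 0.9
        if (thresholdReachedMillis < 0 && done * 10 >= totalHosts * 9) {
            thresholdReachedMillis = nowMillis;
        }
    }

    /** Hosts that should get an "are you still alive" probe, or none if it is too early. */
    List<String> hostsToProbe(long nowMillis) {
        if (thresholdReachedMillis < 0) {
            return new ArrayList<>();
        }
        long elapsed = thresholdReachedMillis - startMillis;
        long deadline = thresholdReachedMillis + elapsed / 10; // grace period: 10% of elapsed time
        if (nowMillis < deadline) {
            return new ArrayList<>();
        }
        return new ArrayList<>(pending);
    }
}
```

Nothing is sent while hosts are behaving, so the network cost stays near zero until the tail end of a batch. The heuristic presumes the jobs in a batch take roughly the same time; with a mix of second-long and hour-long jobs it would probably have to be applied per job class.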

Signature
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
Patrick May - 01 Nov 2005 15:44 GMT
> On 1 Nov 2005 04:49:44 -0800, "Wildfire_Heat"
> <wildfire_heat@yahoo.com> wrote, quoted or indirectly quoted someone
[quoted text clipped - 10 lines]
> you send out the posse for the remainder -- an "are you still alive"
> UDP packet.
No need to reinvent the wheel. Jini's leasing mechanism provides
this kind of resiliency out of the box. See http://www.jini.org for
more details.
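
For readers unfamiliar with leasing, the idea can be sketched in plain Java. This is not the Jini API: LeaseTable and its methods are hypothetical names used only to show the principle that a host holds a time-limited lease, must renew it periodically, and is presumed dead once the lease lapses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical lease bookkeeping; illustrates the idea, not Jini's implementation. */
final class LeaseTable {
    private final Map<String, Long> expiryByHost = new HashMap<>();
    private final long leaseMillis;

    LeaseTable(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /** Grant a lease to a host, or extend the one it already holds. */
    void renew(String host, long nowMillis) {
        expiryByHost.put(host, nowMillis + leaseMillis);
    }

    /** Remove and return every host whose lease has lapsed. */
    List<String> reapExpired(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = expiryByHost.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (entry.getValue() <= nowMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

The lease length sets the trade-off raised in the original question: a longer lease means fewer renewal messages on the network but slower detection of a dead host.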
Regards,
Patrick
------------------------------------------------------------------------
S P Engineering, Inc. | The experts in large scale distributed OO
| systems design and implementation.
pjm@spe.com | (C++, Java, Common Lisp, Jini, CORBA, UML)
Patrick May - 01 Nov 2005 15:42 GMT
> As part of my research I have created this huge java distributed
> system for calculations.
[quoted text clipped - 14 lines]
> Shall I use the old heartbeat method? I don't think I can risk too
> many messages on the network.
This is a classic example of an application that could benefit
from Jini technology and JavaSpaces in particular. The basic idea is
known as the Master/Worker Pattern. Googling quickly turns up a
number of hits, including this one:
http://today.java.net/pub/a/today/2005/04/21/farm.html?page=last&x-maxdepth=0
Robustness is provided through a combination of transactional
interaction with the JavaSpace and Jini's leasing mechanism
(http://www.jini.org/Newsletter/DesignCorner/jini_intro_jun05.html).
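
The shape of the Master/Worker Pattern can be shown with nothing but java.util.concurrent. In this sketch two in-memory queues stand in for the JavaSpace and threads stand in for the Calculator hosts; the class MasterWorkerSketch, its Entry type, and the square-root "job" are all made up for illustration, and the transactions and leases that give a real JavaSpace its robustness are not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical single-JVM illustration of master/worker; queues stand in for the space. */
final class MasterWorkerSketch {
    /** A job or a result; the id ties a result back to its job. */
    static final class Entry {
        final int id;
        final double value;
        Entry(int id, double value) {
            this.id = id;
            this.value = value;
        }
    }

    // Sentinel telling a worker to shut down.
    private static final Entry STOP = new Entry(-1, 0.0);

    static double[] run(double[] inputs, int workers) throws InterruptedException {
        BlockingQueue<Entry> jobs = new LinkedBlockingQueue<>();
        BlockingQueue<Entry> results = new LinkedBlockingQueue<>();
        List<Thread> threads = new ArrayList<>();

        for (int w = 0; w < workers; w++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        Entry job = jobs.take();       // workers pull; the master never pushes to a host
                        if (job == STOP) {
                            return;
                        }
                        results.put(new Entry(job.id, Math.sqrt(job.value)));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            threads.add(t);
        }

        // Master: write all jobs, then one STOP per worker.
        for (int i = 0; i < inputs.length; i++) {
            jobs.put(new Entry(i, inputs[i]));
        }
        for (int w = 0; w < workers; w++) {
            jobs.put(STOP);
        }

        // Master: collect one result per job, in whatever order they finish.
        double[] out = new double[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            Entry r = results.take();
            out[r.id] = r.value;
        }
        for (Thread t : threads) {
            t.join();
        }
        return out;
    }
}
```

Because workers pull jobs at their own pace, a host stuck on an hour-long job simply takes fewer of them, which also addresses the "adaptive to the type of job" requirement. In a real JavaSpace the take would be done under a transaction, so a job taken by a worker that then dies reappears in the space for another worker.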
> more scalable E.g. what if I increase the number of hosts. Can RMI
> handle it or is there an alternative? more adaptive to the type of
> job etc
I strongly recommend that you visit http://www.jini.org and give
Jini a try. It is perfectly suited to your requirements.
Regards,
Patrick
------------------------------------------------------------------------
S P Engineering, Inc. | The experts in large scale distributed OO
| systems design and implementation.
pjm@spe.com | (C++, Java, Common Lisp, Jini, CORBA, UML)
Roedy Green - 02 Nov 2005 03:06 GMT
> I strongly recommend that you visit http://www.jini.org and give
>Jini a try. It is perfectly suited to your requirements.
It depends on what you perceive as his requirements. Does he want to
solve the problem, or is he trying to learn about the nuts and bolts
under the hood by constructing something himself from the ground up?
In either case he should at least examine Jini to get a rough idea of
how they solved the problem.

Signature
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.