Java Forum / General / August 2006
stale objects in collections
Timo Nentwig - 21 Aug 2006 18:38 GMT Hi!
I'm not entirely sure whether the Set needs to be synchronized. I think yes, but would like to ask people here anyway, pseudo-code:
class Test{ final Set set = Collections.synchronizedSet(new HashSet()):
class MyThread extends Thread{ void run(){ while(foo) { // set is either written to read from never both set.put(someObject): } } }
public main(){ Thread t = new Thread[10]; for( int i = 0; i < t.length... ) (t[i] = new MyThread()).start();
for( int i = 0; i < t.length... ) t[i].join();
// this thread may (not) see stale objects in the collection // without synchronization (?) dump(set); } }
Thanks a lot Timo
Eric Sosman - 21 Aug 2006 20:01 GMT Timo Nentwig wrote On 08/21/06 13:38,:
> Hi! > [quoted text clipped - 26 lines] > } > } Hard to be sure of your intent, because the sample code isn't really Java but a sort of Java-ish patois. But if there's only supposed to be one Set shared by the whole bunch of MyThreads, then yes: The Set needs synchronization because multiple threads are calling its methods "simultaneously." The synchronizedSet() wrapper provides all the synchronization you need at the level of individual method calls, but you need additional protection if you want to make a sequence of method calls "atomic:"
// WRONG if (set.isEmpty()) { // // "set" can change here // set.add("Elvis"); }
// RIGHT synchronized(set) { if (set.isEmpty()) { set.add("Elvis"); } }
There's no problem with staleness at the end of main() because [1] all the competing threads have been joined and thus can no longer interfere with the Set, and [2] the join() call itself is a synchronization point for the purposes of things like memory visibility.
 Signature Eric.Sosman@sun.com
Timo Nentwig - 21 Aug 2006 20:15 GMT > additional protection if you want to make a sequence of method > calls "atomic:" [quoted text clipped - 13 lines] > } > } Why isEmpty()? Why should I start a bunch of threads and only one can actually write to the collections? We are proably talking at cross-purposes here...
What I acutally do is starting multiple threads running the same SQL statement (with different parameters) against a database and storing the result in a Set. So, no duplicates, no Set.get(), no nothing. Simply run query and put result in a Set. So, as I understand your comment regarding join() below, I wouldn't need to synchronize the Set (?).
> thus can no longer interfere with the Set, and [2] the join() > call itself is a synchronization point for the purposes of > things like memory visibility. Eric Sosman - 21 Aug 2006 22:41 GMT Timo Nentwig wrote On 08/21/06 15:15,:
>>additional protection if you want to make a sequence of method >>calls "atomic:" [quoted text clipped - 17 lines] > actually write to the collections? We are proably talking at > cross-purposes here... It was just an illustration of a situation where the synchronization provided by synchronizedSet() is not enough. In the WRONG example, isEmpty() is atomic and add() is atomic, but the combination is not; the set is unlocked in between the two operations, and the state of the world can change in that interval. The RIGHT example avoids this by synchronizing during the entire compound operation, leaving no "window of opportunity" for interference. The synchronized(set) {...} is necessary in RIGHT, even though the set's methods are already synchronized by the synchronizedSet() machinery.
I'll admit the isEmpty()/add() example is perhaps not too realistic, but that's not the essential point.
> What I acutally do is starting multiple threads running the same SQL > statement (with different parameters) against a database and storing > the result in a Set. So, no duplicates, no Set.get(), no nothing. > Simply run query and put result in a Set. So, as I understand your > comment regarding join() below, I wouldn't need to synchronize the Set > (?). You need to synchronize the different threads' accesses to the set, or chaos will ensue. After you've joined all those threads they are no longer running, hence they can no longer be mucking around with the set. So no: Once the interfering threads are out of the picture, no synchronization is needed.
Have you considered doing things just a little differently? Why not let each thread put its results in its own private set, and then combine them when the threads are all finished? The threads don't need to synchronize their accesses to the sets (because each thread manipulates only its own set, and doesn't interfere with any other thread's set). And after the worker threads have been joined, the main thread can access their sets without synchronization because (as before) the threads are no longer around to create interference. Rough sketch:
class MyThread extends Thread { private Set mySet = new HashSet(); public void run() { // fill mySet with results } public Set getSet() { return mySet; } }
class Driver { public static void main(String[] unused) { MyThread[] t = new MyThread[10]; for (int i = 0; i < t.length; ++i) { t[i] = new MyThread(); t[i].start(); } Set bigSet = new HashSet(); for (int i = 0; i < t.length; ++i) { t[i].join(); bigSet.addAll(t[i].getSet()); t[i] = null; // optional } // bigSet now holds combined results } }
 Signature Eric.Sosman@sun.com
Timo Nentwig - 22 Aug 2006 08:44 GMT > Timo Nentwig wrote On 08/21/06 15:15,: > In the WRONG example, isEmpty() is atomic and add() is atomic, > but the combination is not; the set is unlocked in between That's clear...
> You need to synchronize the different threads' accesses to > the set, or chaos will ensue. After you've joined all those Why? So, the answer to my initial question whether a collection can contain stale data if multiple threads write to a non-synchronized collection is: yes?
> Have you considered doing things just a little differently? > Why not let each thread put its results in its own private set, > and then combine them when the threads are all finished? The Well, sounds good despite I don't like copying data twice of not neccessary. Anyway I want to *understand* why the collections must be synchronized (as I thought).
Timo Nentwig - 22 Aug 2006 10:24 GMT > > You need to synchronize the different threads' accesses to > > the set, or chaos will ensue. After you've joined all those > > Why? So, the answer to my initial question whether a collection can According to Concurrency in Practice immutable objects are always threads safe. So, a immutable collection can be accessed (read) by multi threads.
Now I wonder whether the same applies for multi threads writing (and not reading) to a collection. If Thread.join() does sync IMO the answer is yes (?).
Eric Sosman - 22 Aug 2006 13:14 GMT >>> You need to synchronize the different threads' accesses to >>>the set, or chaos will ensue. After you've joined all those [quoted text clipped - 4 lines] > threads safe. So, a immutable collection can be accessed (read) by > multi threads. Right.
> Now I wonder whether the same applies for multi threads writing (and > not reading) to a collection. If Thread.join() does sync IMO the answer > is yes (?). "Immutable" is the combination of "mutable" (meaning "changeable" or "capable of change") with a negating prefix. When we say that something is "immutable" we mean that it is "not mutable," that is, that it does not or cannot change.
Adding elements to a collection changes the collection; removing elements from a collection also changes the collection. Something that changes is not "immutable," so if you are adding and removing things the the /Concurrency in Practice/ statement does not apply.
Perhaps you think the operation of adding an element to a collection is in some sense "write-only," and that this makes it safe to use without synchronization? (I'm imagining an argument something like: "The trouble with a mutable object is that different threads can see the object's internals while they are in an inconsistent state. But since `write-only' operations don't report the state, it doesn't matter.") Is that anything like what you're thinking?
The argument is wrong. Trivially, the add() method for a Set does in fact report some of the Set's state, namely, whether the element was already present in the Set before the add(). More deeply, the add() method must inspect and modify the Set's internals to insert the new element. The details of just what those internals look like depend on what kind of Set you're using, but they will all involve reading the state (or parts of it), making some decisions and computations, and storing new values in state variables. It may seem "from the outside" that add() is write-only (if you decide to ignore the returned value), but in fact it cannot be so. (Consider: even `++x' is not write-only, because it must retrieve the previous value in order to calculate the new one.)
You need synchronization whenever a mutable object is (or can be) manipulated by multiple threads "simultaneously."
You can skip the synchronization if only one thread at a time can access the mutable object. A very common special case of this is when only one thread has a reference to the object; you need no synchronization because no other thread can interfere. Another case arises in your scenario: There's an initial phase during which many threads pound on the collection, and a final phase when only one thread works with it. You need synchronization in the first phase (to keep all those threads from trampling on each other), but you do not need it in the second (because there's only one thread).
You do not need synchronization for immutable objects, even if many threads access them "simultaneously."
 Signature Eric Sosman esosman@acm-dot-org.invalid
Timo Nentwig - 22 Aug 2006 13:42 GMT > Perhaps you think the operation of adding an element to a > collection is in some sense "write-only," and that this makes it [quoted text clipped - 3 lines] > state. But since `write-only' operations don't report the state, it > doesn't matter.") Is that anything like what you're thinking? Yes.
> The argument is wrong. Trivially, the add() method for a Set Okay, I think (hope :) I got it now. Thanks for your and Chris' comprehensive explanation.
Chris Uppal - 22 Aug 2006 10:27 GMT > > You need to synchronize the different threads' accesses to > > the set, or chaos will ensue. After you've joined all those [quoted text clipped - 3 lines] > data if multiple threads write to a non-synchronized collection is: > yes? The problem is not "stale data", but that the Set implementation may /break/ if you modify it from two threads at the same time without synchronisation. There are other issues too, about visibility of modifications made by one thread to code running on another thread.
The Set may (in fact will) depend on certain invariants being maintained in its internal representation of the data. If two or more threads modify its state simultaneously, then there is a very real risk that those invariants will be broken, and so the Set will stop working: go into an infinite loop, return incorrect result, or start throwing exceptions.
So there are three levels at which synchronisation may be needed.
1) At the level of individual accesses to the Set -- it's own methods. This is non-optional because the Set will break if you don't protect it somehow.
2) At the level of ensuring that changes made by one thread are visible by others. This is also non-optional according to the JLS, but may not actually be necessary on some specific JVM/hardware combinations. Again, synchronisation at the level of individual Set methods is sufficient.
3) At the level of the logical meaning of the data contained in the Set. It's hard to invent a plausible example of this using Sets (it's trivially easy with other Collections), but say your application expected that the Set always contained an even number of elements -- there is no way you can ensure that just by synchronising the Set's own methods. You have to synchronise at a coarser level, so that two elements are always added to the Set in a single synchronised block.
-- chris
Patricia Shanahan - 22 Aug 2006 16:16 GMT >> Timo Nentwig wrote On 08/21/06 15:15,: >> In the WRONG example, isEmpty() is atomic and add() is atomic, [quoted text clipped - 17 lines] > neccessary. Anyway I want to *understand* why the collections must be > synchronized (as I thought). You may be able to avoid the second copy, depending on how the result set is used. For example, if it is only accessed through an iterator, you could write your own iterator that goes through each per-thread set in turn.
However, I would only do things like that after trying the simple one-set approach, with synchronization, and finding a real bottleneck.
Patricia
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|