Java Forum / General / July 2006
Lock-free binary trees
Joe Seigh - 02 Jul 2006 17:05 GMT Well, c.p.t is getting boring and slashdot has become the daytime tv of the internet so I suppose I will post some code I did prototyping lock-free binary trees since it's been long enough since I've done anything on it and I still haven't figured out a use for it. Also posting it might be a way to have fun with Sun Microsystems Research again later on (inside joke. I won't explain here)
It's actually just a partial prototype which I wrote to see what the code would sort of look like. It doesn't have any of the base methods that java and STL collections have that are really composite collections including list/array collections. Basically your tree collection is really a tree collection and a list collection side by side under the covers.
Some of the stuff in it like the height stuff isn't actually used since I haven figured out whether to do height balanced or weight balanced trees and it's not complete or correct. It's just there to see what it would look like and where it might go.
It's only reader lock-free. The write side has to use conventional locking or synchronization. Now for the basic techniques.
Adding a node is straight forward. Just insert it as a leaf node. I'm assuming JSR-133 volatile semantics has fixed the memory visibility problems.
Removing a nodes is also pretty straight forward using PCOW (partial copy on write) replacing part of the subtree of the removed node. See my previous posting on this http://groups.google.com/groups?selm=opsock48geqm36vk%40grunion for details.
Balancing a tree is a little tricky. If you move nodes from one subtree to another subtree, there's a risk that any concurrent lookup might miss finding the node as it's move out from underneath them. It's an error for a lookup to not find a node that is in the tree collection. The trick is to have a rotate method that adds the node to be moved to the destination subtree before it removes the node from the source subtree. That way the new location of the node will be atomically visible when the old location is removed. I only implemented rotateRight in this prototype. I also didn't implement the heuristics of the balancing yet either. You obviously don't want to do too many overlapping PCOW operations. The current algorithms for red-black trees or AVL trees probably won't be too useful as is since they weren't written with atomic operations or PCOW in mind.
Source code in a followup post. I am deleting some debug methods which aren't directly applicable. Hopefully I didn't delete too much.
 Signature Joe Seigh
When you get lemons, you make lemonade. When you get hardware, you make software.
Joe Seigh - 02 Jul 2006 17:12 GMT Partial and probably not correct prototype below
import java.util.*;
public class LFTreeSet { class Sequence { int value; synchronized int next() { value++; return value; } } Sequence sequence = new Sequence(); class Node { Object value; int height; // tree height int lheight; // left subtree height int rheight; // right subtree height volatile Node ltree; // left subtree volatile Node rtree; // right subtree int unique; // unique node id; int cost; // node cost int weight; // node weight void init(Object o, Node lnode, Node rnode) { value = o; if (lnode != null) { ltree = lnode; lheight = lnode.height; } else { ltree = null; lheight = 0; } rtree = rnode; if (rnode != null) { rtree = rnode; rheight = rnode.height; } else { rtree = null; rheight = 0; } height = maximum(lheight, rheight) + 1; unique = sequence.next(); // assign unique sequence number } Node(Object o) { init(o, null, null); } Node(Object o, Node lnode, Node rnode) { init(o, lnode, rnode); } }
Comparator comparator; // comparator volatile Node root; // tree root node //========================================================================= // Private Methods //========================================================================= //------------------------------------------------------------------------- // maximum -- //------------------------------------------------------------------------- static int maximum(int x, int y) { if (x > y) return x; else return y; } //------------------------------------------------------------------------- // addNode -- // // returns: // subtree height or // 0 if tree already contains node // //------------------------------------------------------------------------- Node addNode(Node z, Object o) throws ClassCastException { Node p; int rc; if (z == null) { comparator.compare(o, o); // force ClassCastException if wrong class return new Node(o); } rc = comparator.compare(o, z.value); if (rc < 0) { z.ltree = addNode(z.ltree, o); z.lheight = z.ltree.height; } else if (rc > 0) { z.rtree = addNode(z.rtree, o); z.rheight = z.rtree.height; } // calculate height z.height = maximum(z.lheight, z.rheight) + 1;
return z; }
//------------------------------------------------------------------------- // maxNode -- // //------------------------------------------------------------------------- static Node maxNode(Node z) { while (z.rtree != null) z = z.rtree; return z; } //------------------------------------------------------------------------- // pcloneMinusMaxNode -- partial copy // // static won't work? //------------------------------------------------------------------------- Node pcloneMinusMaxNode(Node z) { Node p; if (z.rtree == null) return z.ltree; else return new Node(z.value, z.ltree, pcloneMinusMaxNode(z.rtree)); } //------------------------------------------------------------------------- // rotateRight -- // //------------------------------------------------------------------------- Node rotateRight(Node p) { if (p == null) return null; if (p.ltree == null) return p; if (p.rtree == null) p.rtree = new Node(p.value); else addNode(p.rtree, p.value); return new Node(maxNode(p.ltree).value, pcloneMinusMaxNode(p.ltree), p.rtree); } //========================================================================= // Public Constructors and Methods //========================================================================= //------------------------------------------------------------------------- // LFTreeSet -- ctor // // //------------------------------------------------------------------------- public LFTreeSet(Comparator comparator) { this.comparator = comparator; root = null; } //------------------------------------------------------------------------- // add -- // //------------------------------------------------------------------------- public boolean add(Object o) throws ClassCastException { if (contains(o)) { return false; } else { root = addNode(root, o); return true; } } //------------------------------------------------------------------------- // remove -- // //------------------------------------------------------------------------- public boolean remove(Object o) throws ClassCastException { Node p, p2, p3; int n; // find parent node p2 = null; p = root; while (p != null && ((n = comparator.compare(o, p.value)) != 0)) { p2 = p; if (n < 0) p = p.ltree; else p = p.rtree; } if (p == null) return false; // not found // leaf node if (p.ltree == null && p.rtree == null) p3 = null; // single subtree else if (p.ltree == null) p3 = p.rtree; // single subtree else if (p.rtree == null) p3 = p.ltree; // new subtree with root node replaced by max node from left subtree else p3 = new Node(maxNode(p.ltree).value, pcloneMinusMaxNode(p.ltree), p.rtree); // // store subtree // if (p2 == null) root = p3; else if (p2.ltree == p) { p2.ltree = p3; p2.lheight = p3.height; p2.height = maximum(p2.lheight, p2.rheight) + 1; } else { p2.rtree = p3; p2.rheight = p3.height; p2.height = maximum(p2.lheight, p2.rheight) + 1; } return true; } //------------------------------------------------------------------------- // contains -- // //------------------------------------------------------------------------- public boolean contains(Object o) throws ClassCastException { Node p; int n; p = root; while (p != null && ((n = comparator.compare(o, p.value)) != 0)) { if (n < 0)p = p.ltree; else p = p.rtree; } return (p != null); } //------------------------------------------------------------------------- // comparator -- // //------------------------------------------------------------------------- public Comparator comparator() { return comparator; } //========================================================================= // Debug Methods //========================================================================= //------------------------------------------------------------------------- //------------------------------------------------------------------------- int nodeCost(Node p) { if (p == null) return 0; return p.cost = ((nodeCost(p.ltree) + nodeCost(p.rtree))*2 + 1); } //------------------------------------------------------------------------- //------------------------------------------------------------------------- int nodeWeight(Node p) { if (p == null) return 0; return p.weight = (nodeWeight(p.ltree) + nodeWeight(p.rtree) + 1); } //------------------------------------------------------------------------- //------------------------------------------------------------------------- public void rotateTreeRight() { root = rotateRight(root); } }
//--//
 Signature Joe Seigh
When you get lemons, you make lemonade. When you get hardware, you make software.
Chris Uppal - 03 Jul 2006 10:09 GMT > It's only reader lock-free. The write side has to use conventional > locking or synchronization. Now for the basic techniques. > > Adding a node is straight forward. Just insert it as a leaf node. I'm > assuming JSR-133 volatile semantics has fixed the memory visibility > problems. Rather a dangerous assumption. As I read the code, you need more synchronisation than you have (almost none). There doesn't seem to be anything to ensure that the height values are propagated correctly (I saw your comment that they aren't used), nor the costs and weights, nor the "value" of the node, nor the memory pointed /to/ by value.
Incidentally, the stuff in java.lang.concurrent.atomic might be useful if you don't already know about it.
-- chris
Joe Seigh - 03 Jul 2006 12:29 GMT >>It's only reader lock-free. The write side has to use conventional >>locking or synchronization. Now for the basic techniques. [quoted text clipped - 8 lines] > that they aren't used), nor the costs and weights, nor the "value" of the node, > nor the memory pointed /to/ by value. The height/weight stuff was was only partly implemented. I didn't take it out as I didn't want to do any major editting. Just ignore it.
AFAIK, JSR-133 added acquire and release semantics to the loads and stores of volatile variables respectively.
There isn't any synchronization. Read access doesn't require it and for write access it needs to be explicity added if required. E.g., if you have a single writer then you don't need it.
Not requiring synchronization is sort of the whole point of lock-free. You do know what lock-free is and how it works, don't you?
> Incidentally, the stuff in java.lang.concurrent.atomic might be useful if you > don't already know about it. From the java.util.concurrent.atomic description
The memory effects for accesses and updates of atomics generally follow the rules for volatiles:
* get has the memory effects of reading a volatile variable. * set has the memory effects of writing (assigning) a volatile variable. * weakCompareAndSet atomically reads and conditionally writes a variable, is ordered with respect to other memory operations on that variable, but otherwise acts as an ordinary non-volatile memory operation. * compareAndSet and all other read-and-update operations such as getAndIncrement have the memory effects of both reading and writing volatile variables.
I don't need compareAndSet so volatile is sufficient.
You might want to take a look at the source for the lock-free classes in java.util.concurrent.
 Signature Joe Seigh
When you get lemons, you make lemonade. When you get hardware, you make software.
Chris Uppal - 04 Jul 2006 10:48 GMT [me:]
> > Rather a dangerous assumption. As I read the code, you need more > > synchronisation than you have (almost none). There doesn't seem to be [quoted text clipped - 10 lines] > AFAIK, JSR-133 added acquire and release semantics to the loads and stores > of volatile variables respectively. I believe so too, but that JSR is no longer relevant -- this stuff is now part of the language spec. See JLS3 for details (which I confess I haven't followed).
But -- as far as I know -- reads /though/ a volatile variable are not transitively given the "special" quality that makes volatile not-entirely-useless for, say, integer-valued variables.
For instance, given:
class A { volatile int[] array; // NB: not final A() { array = new int[4]; for (int i = 0; i < array.length; i++) array[i] = i; }
int get(int i) { return array[i]; } }
and in one thread:
someA = new A();
and later, but in a different thread;
int x = someA.get(2);
then that can return either 0 or 2. All that the 'volatile' qualifier does is to make it impossible (in this case) for get() to throw an NPE. The value of "array" itself is passed correctly from thread to thread, but that (unlike synchronisation) doesn't imply that memory transitively reachable via that value is also resynchronised.
If I'm misreading the spec in this, then I'd appreciate correction and a reference (from anyone).
-- chris
Chris Thomasson - 07 Jul 2006 19:34 GMT >> It's only reader lock-free. The write side has to use conventional >> locking or synchronization. Now for the basic techniques. [quoted text clipped - 5 lines] > Rather a dangerous assumption. As I read the code, you need more > synchronisation than you have (almost none). I disagree. I would argue that the synchronization properties are perhaps, too *strong...
He is relying on implicit release semantics for writer threads. A release barrier is used within the writer threads critical section AFTER you initialize the new nodes links/ect... and BEFORE you store a pointer to the new node into a location that is assessable to reader threads.
AFAIK, Java volatiles have release semantics for stores:
JavaVolatileStore(...) { membar #LoadStore|#StoreStore /* Perform the actual store */ }
This will work for lock-free reader patterns in general:
https://coolthreads.dev.java.net/servlets/ProjectForumMessageView?forumID=1797&m essageID=11068
* Very Strong Sync Properties
"Iteration" over lock-free read data-structures is an expensive process in Java; volatile qualifier seems to force a release barrier for every store and, AFAIK, force an acquire barrier for every load. Every node visited during a lock-free iteration would be executing full-blown memory barriers, when clearly, simple dependant load ordering is all that is required...
Then you have to factor in the hefty garbage collector that is used in Java... It will be generating a lot more overhead under load (thread suspension/stack introspection, ect...) than a lock-free proxy garbage collector would...
Chris Uppal - 08 Jul 2006 12:17 GMT > I disagree. I would argue that the synchronization properties are perhaps, > too *strong... [quoted text clipped - 5 lines] > new node into a location that is assessable to reader > threads. I don't know what a "release barrier" is. I see that this is x-posted to c.p.threads, where presumably the term is well known, but I'm reading in c.l.j.p where the term has no meaning (unless we want to talk about a specific implementation of the JVM spec, or provide an explanation in terms of the JVMs semantics). Anyway...
It's not clear to me that the synchronised operation included in a Node's constructor is intended to be part of the "real" code, or is just another bit of left-over "this is incorrect, but ignore it" code like the cost and weight fields (note that the 'sequence' field itself is not hread-safe). Also it's not clear whether the use of a synchronised method is intended to invoke the full synchronisation semantics, or is just an implementation detail -- something that might better be implemented (without synchronisation) by using a java.util.concurrent.atomic.AtomicInteger.
Still, assuming all that, I agree that the synchronised Sequence.next() does, after all, provide the necessary happens-before relationship between assignment to the Nodes value field, and uses of it by the Comparator. (And assuming that the value objects don't change significantly once added, but that seems fair even without threading ;-)
-- chris
Thomas Hawtin - 08 Jul 2006 16:15 GMT > I don't know what a "release barrier" is. I see that this is x-posted to I think he means "write barrier". In the new Java Memory Model, the semantics for releasing a lock are not that strong.
> It's not clear to me that the synchronised operation included in a Node's > constructor is intended to be part of the "real" code, or is just another bit [quoted text clipped - 4 lines] > something that might better be implemented (without synchronisation) by using a > java.util.concurrent.atomic.AtomicInteger. As there is only a single writer thread, the synchronisation does nothing and can therefore be removed by the JVM.
> Still, assuming all that, I agree that the synchronised Sequence.next() does, > after all, provide the necessary happens-before relationship between assignment > to the Nodes value field, and uses of it by the Comparator. (And assuming that > the value objects don't change significantly once added, but that seems fair > even without threading ;-) Giving the code a quick once over, it looks as if the volatiles ensure the bits of the code that are supposed to work to what they are supposed to.
I don't know how well it will perform. Reading a volatile is apparently very cheap. The main cost may be chucking away register values, but that wont matter after the first read. If writing is really infrequent, then an immutable tree could be used, with a single volatile holding it (memory allocation is cheap).
Tom Hawtin
 Signature Unemployed English Java programmer http://jroller.com/page/tackline/
Joe Seigh - 08 Jul 2006 16:21 GMT >>I disagree. I would argue that the synchronization properties are perhaps, >>too *strong... [quoted text clipped - 11 lines] > implementation of the JVM spec, or provide an explanation in terms of the JVMs > semantics). Anyway... http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile
might answer some questions.
 Signature Joe Seigh
When you get lemons, you make lemonade. When you get hardware, you make software.
gottlobfrege@gmail.com - 10 Jul 2006 12:57 GMT > Well, c.p.t is getting boring and slashdot has become the daytime tv of the internet > so I suppose I will post some code I did prototyping lock-free binary trees since it's > been long enough since I've done anything on it and I still haven't figured out a use for it. ...
> It's only reader lock-free. The write side has to use conventional locking or synchronization. ....
> Balancing a tree is a little tricky. ...
> -- > Joe Seigh Why not do a skip-list instead? A skiplist - has the same performance characteristics as a tree (O(logN) lookup, etc) - is easier to code - can be both reader and writer lock free
Tony
Chris Thomasson - 10 Jul 2006 22:44 GMT >> Well, c.p.t is getting boring and slashdot has become the daytime tv of >> the internet [quoted text clipped - 14 lines] >> -- >> Joe Seigh [...]
> - can be both reader and writer lock free Is there a lock-free skip-list writer algorithm that has less overhead than an uncontented mutex? I don't ever remember seeing one that used fewer than two atomic operations and/or memory barriers. The basic point that I getting at here is that if an uncontended lock-free writer algorithm has two or more atomic operations and/or memory barriers, the overhead will be equal to or greater than an uncontended mutex! For instance, the well known lock-free queue by Michael and Scott forces producer threads to execute at least two, or three under contention, CAS /w StoreLoad or LoadStore operations; uncontended fast-pathed mutex access will perform equal to or better than their enqueue operations. Therefore, IMHO using a hashed locking scheme for the writer side of a lock-free reader pattern is usually ideal...
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/e0c011 baf08844c4/3ca11e0c3dcf762c?q=multi-mutex&rnum=1#3ca11e0c3dcf762c
Any thoughts?
gottlobfrege@gmail.com - 11 Jul 2006 17:47 GMT > Is there a lock-free skip-list writer algorithm that has less overhead than > an uncontented mutex? I don't ever remember seeing one that used fewer than [quoted text clipped - 11 lines] > > Any thoughts? I haven't looked at the skip lists very closely. One day it dawned on me that lock-free skip lists would be possible, and so I searched the internet and found they already existed (only in the last few years). So I was "too late" and didn't bother looking at them much further.
Anyhow, it's the *contended* mutex path that takes longer :-). If you don't expect many writers, then yeah, the extra work probably isn't worth it. But if you need multiple writers, maybe those extra barriers are still faster than waiting for a writer? Also, by the same logic, an uncontended mutex on reading might not be too high a price to pay (even if it was slower than lock-free, at least the code will be simpler), but then we wouldn't be talking about lock-free at all, would we?
And remember, it is not just the number of atomics/barriers, it is the number MORE than what you had to do anyhow for the sake of the lock-free readers.
Tony
Chris Thomasson - 12 Jul 2006 00:20 GMT >> [...] >> Therefore, IMHO using a hashed locking scheme for [quoted text clipped - 3 lines] >> >> Any thoughts? [...]
> Anyhow, it's the *contended* mutex path that takes longer :-). Well, yeah... However, you kind of have to take adaptive mutexs into account; a contended case under moderate load can usually still hit a fast-path...
> If you don't expect many writers, then yeah, the extra work probably isn't > worth it. But if you need multiple writers, maybe those extra barriers > are still faster than waiting for a writer? Maybe... You have to know if the contended case for a locking scheme based on adaptive mutexs was found to frequently park writer threads. A well designed locking scheme can be "contention-less"... Think of the Solaris memory allocator...
> Also, by the same logic, an uncontended mutex on reading might not be > too high a price to pay I have to disagree here. Lock-free reader patterns are all about "completely removing the need for atomic-ops/membars". IMO, its very hard to try to compare that to mutexs, or "any other" algorithm that uses atomics/membars...
http://groups.google.com/group/comp.programming.threads/msg/ec103670e323111a
http://groups.google.com/group/comp.programming.threads/msg/fdc665e616176dce
For instance, I know that the StoreLoad barrier in SMR can "drastically reduce" per-thread reads-per-second. This is why RCU-SMR and vZOOM was developed...
> (even if it was slower than lock-free, at least > the code will be simpler), but then we wouldn't be talking about > lock-free at all, would we? ;)
> And remember, it is not just the number of atomics/barriers, it is the > number MORE than what you had to do anyhow for the sake of the > lock-free readers. The general idea for lock-free reader patterns is to completely eliminate all of the atomics/barriers for read-only accesses:
http://groups.google.com/group/comp.programming.threads/msg/5c0dbc2ab7fc46c3
http://groups.google.com/group/comp.programming.threads/msg/e0c599d20481fab8
Joe Seigh and I have both developed user-space solutions' for this kind of stuff...
gottlobfrege@gmail.com - 12 Jul 2006 22:36 GMT > The general idea for lock-free reader patterns is to completely eliminate > all of the atomics/barriers for read-only accesses: So let's say I'm at least(!) one step behind on all this stuff - how do you ensure the read thread sees up to date data without a read barrier?
Tony
Joe Seigh - 12 Jul 2006 23:10 GMT >>The general idea for lock-free reader patterns is to completely eliminate >>all of the atomics/barriers for read-only accesses: > > So let's say I'm at least(!) one step behind on all this stuff - how do > you ensure the read thread sees up to date data without a read barrier? Dependent load ordering for the most part. You get it for free on most platforms. Or more accurately, you can't avoid it. The other part of avoiding it on whatever you synchronize the read access critical sections with. VZOOM avoids it if I undertand it correctly. RCU avoids it. RCU+SMR avoids it. And some GC implementations avoid it.
 Signature Joe Seigh
When you get lemons, you make lemonade. When you get hardware, you make software.
Chris Thomasson - 13 Jul 2006 00:02 GMT >> The general idea for lock-free reader patterns is to completely eliminate >> all of the atomics/barriers for read-only accesses: > > So let's say I'm at least(!) one step behind on all this stuff - how do > you ensure the read thread sees up to date data without a read barrier? Dependant loads:
<psuedo>
#if ! defined (ArchAlpha) #define dependant_load(x) (*x) #else dependant_load(x) { l = *x; rmb(); return l; } #endif
void* vzAtomicLoadPtr_mbDepends(void* volatile* d) { return dependant_load(d); }
Chris Thomasson - 13 Jul 2006 00:14 GMT [...]
> void* vzAtomicLoadPtr_mbDepends(void* volatile* d) { > return dependant_load(d); > } DOH. I should of omitted volatile:
void* vzAtomicLoadPtr_mbDepends(void** d) { return dependant_load(d); }
Some C/C++ compilers actually inject full-blown acquire/release barriers on volatile access...!
<volatile rant>
I think a Microsoft compiler for Itanium does this sort of non-sense...
Volatile should be defined as an optimization hint to the compiler ONLY!
It should have absolutely NO effect on any hardware ordering!
</volatile rant>
gottlobfrege@gmail.com - 13 Jul 2006 02:02 GMT > [...] > [quoted text clipped - 10 lines] > Some C/C++ compilers actually inject full-blown acquire/release barriers on > volatile access...! I think this is an option in xcode/gcc as well. I know they have an option for thread safe function-local statics.
> <volatile rant> > [quoted text clipped - 5 lines] > > </volatile rant> David Hopwood - 13 Jul 2006 17:24 GMT > [...] > [quoted text clipped - 7 lines] > return dependant_load(d); > } In that case, how do you know that the compiler isn't optimizing away the call (which it should, assuming we're not on an Alpha machine), and then reordering the load relative to things that it shouldn't be reordered relative to?
Answer: you don't.
 Signature David Hopwood <david.nospam.hopwood@blueyonder.co.uk>
Chris Thomasson - 13 Jul 2006 17:45 GMT >> [...] >> [quoted text clipped - 14 lines] > > Answer: you don't. Well, the call would ideally be an external assembled function:
http://groups.google.com/group/comp.programming.threads/msg/423df394a0370fa6
David Hopwood - 13 Jul 2006 21:49 GMT >>>[...] >>> [quoted text clipped - 18 lines] > > http://groups.google.com/group/comp.programming.threads/msg/423df394a0370fa6 The problem with this is that you're still relying on the assumption that your compiler does not perform certain kinds of valid optimization. This is already false for some compilers (e.g. llvm-gcc), and will probably be false for future main-line versions of gcc that will do link-time optimization.
<http://llvm.org> <http://gcc.gnu.org/ml/gcc/2005-11/msg00888.html>
(Although the specific proposal for integrating LLVM into gcc has not been agreed and seems to be at least temporarily stalled, there was concensus on the general idea of supporting link-time inter-procedural optimizations in gcc.)
Using inline asm with a "clobbers memory" directive before and after the load should solve this particular problem. However, I am skeptical of reliance on dependent load ordering in general. I've never seen any rigorous definition of it (as a property of a memory model, rather than in discussions of processor optimizations), or any explicit statement in the architecture manuals of commonly used processor lines saying that it is supported. In fact, I wonder if the idea that there is an implicit commitment for future versions of these processors to guarantee dependent load ordering is not just wishful thinking.
 Signature David Hopwood <david.nospam.hopwood@blueyonder.co.uk>
Joe Seigh - 14 Jul 2006 00:06 GMT >>Well, the call would ideally be an external assembled function: >> [quoted text clipped - 11 lines] > and seems to be at least temporarily stalled, there was concensus on the general > idea of supporting link-time inter-procedural optimizations in gcc.) I suppose you could change gcc so as to break any threading api, except Posix of course. But then you'd have to go into the witless protection program to protect you from the wrath of Linus and all the other Linux kernel developers.
:)
 Signature Joe Seigh
When you get lemons, you make lemonade. When you get hardware, you make software.
David Hopwood - 14 Jul 2006 01:40 GMT >>> Well, the call would ideally be an external assembled function: >>> [quoted text clipped - 19 lines] > kernel developers. > :) <http://gcc.gnu.org/onlinedocs/gcc-3.2.3/gcc/i386-and-x86-64-Options.html> # These -m switches are supported in addition to the above on AMD x86-64 # processors in 64-bit environments. [...] # -mcmodel=kernel # Generate code for the kernel code model. The kernel runs in the # negative 2 GB of the address space. This model has to be used for # Linux kernel code.
<http://gcc.gnu.org/onlinedocs/gcc/S_002f390-and-zSeries-Options.html> # [...] In order to build a linux kernel use -msoft-float.
<http://gcc.gnu.org/ml/gcc-patches/2004-01/msg01619.html> # > People run more and more into problems with kernel modules on x86-64 # > and the new gcc because they don't specify the -mno-red-zone parameter. # > With the old gcc 3.2 they got away, but with the new one it breaks.
<http://www.faqs.org/docs/kernel/x204.html> # Kernel modules need to be compiled with certain gcc options to make # them work. [...] # * -O2: The kernel makes extensive use of inline functions, so modules # must be compiled with the optimization flag turned on. Without # optimization, some of the assembler macros calls will be mistaken # by the compiler for function calls. This will cause loading the # module to fail, since insmod won't find those functions in the kernel.
<http://www.luv.asn.au/overheads/compile.html> # From /usr/doc/gcc/README.Debian.gz in gcc-2.95.2-0pre2:
[an old version, but I found it amusing considering the previous quote:]
# As of gcc-2.95, optimisation at level 2 (-O2) and higher includes an # optimisation which breaks C code that does not adhere to the C standard. # Such code occurs in the Linux kernel, and probably other places.
Anyway, the point is that the Linux kernel is a law unto itself, which may or may not work with any particular combination of gcc version and options (it certainly won't work with other compilers, except maybe icc with a patch).
If you are writing a general purpose library or application, you want it to be less compiler-version/option-dependent than that; preferably, you want something that works for a documented reason, rather than just by coincidence of the compiler not doing some optimizations. As I said earlier,
Using inline asm with a "clobbers memory" directive before and after the load should solve this particular problem. However, I am skeptical of reliance on dependent load ordering in general. I've never seen any rigorous definition of it (as a property of a memory model, rather than in discussions of processor optimizations), or any explicit statement in the architecture manuals of commonly used processor lines saying that it is supported. In fact, I wonder if the idea that there is an implicit commitment for future versions of these processors to guarantee dependent load ordering is not just wishful thinking.
 Signature David Hopwood <david.nospam.hopwood@blueyonder.co.uk>
Joe Seigh - 14 Jul 2006 11:12 GMT ...
> Anyway, the point is that the Linux kernel is a law unto itself, which may > or may not work with any particular combination of gcc version and options [quoted text clipped - 13 lines] > that there is an implicit commitment for future versions of these processors > to guarantee dependent load ordering is not just wishful thinking. Hypothetically, anything is possible. So gcc could do something extremely stupid. Especially since the current C standard does not recognise threads. However C++0x is discussing threaded support so unless there is a complete break between C and C++, something will have to be worked out.
Worst case is that gcc eliminates itself as a research tool for multi-threading and that putting support for any new threading features in C will become as tedious as the JSR process except the place where the new features came from will be a lot smaller. Think about it. The prototypes for the stuff that was put in JSR-166 came from somewhere. Some of it may have been assembler but I doubt all of it was.
The commercial compiler vendors would be happy with gcc doing that. Intel provides a library that uses lock-free algorithms. That you'd have to buy a compiler from them as well would make their day.
 Signature Joe Seigh
When you get lemons, you make lemonade. When you get hardware, you make software.
David Hopwood - 14 Jul 2006 16:29 GMT > ... > [quoted text clipped - 21 lines] > Hypothetically, anything is possible. So gcc could do something > extremely stupid. Link-time optimization is *not* extremely stupid. Relying on it not being done is arguably stupid, when there are alternatives that rely only on documented (if architecture-dependent) features.
> Especially since the current C standard does not recognise threads. > However C++0x is discussing threaded support so unless there is a complete > break between C and C++, something will have to be worked out. I have no confidence in the C++0x committee to produce anything usable. I will be pleasantly surprised if they don't make things worse.
 Signature David Hopwood <david.nospam.hopwood@blueyonder.co.uk>
Joe Seigh - 15 Jul 2006 15:58 GMT >>Hypothetically, anything is possible. So gcc could do something >>extremely stupid. > > Link-time optimization is *not* extremely stupid. Relying on it not being > done is arguably stupid, when there are alternatives that rely only on > documented (if architecture-dependent) features. Not that. I was refering to some gcc developer completely destroying gcc's ability to tolerate multi-threading since the standards would allow them to do that.
>>Especially since the current C standard does not recognise threads. >>However C++0x is discussing threaded support so unless there is a complete >>break between C and C++, something will have to be worked out. > > I have no confidence in the C++0x committee to produce anything usable. > I will be pleasantly surprised if they don't make things worse. Hell freezes over happens-before cpp-threads figures out how to define multi-threading semantics.
To slightly change the topic from my alledged claim to 100% accurately predict the future with respect to what will or will not be in C/C++, let's assume for the sake of argument that C/C++ will no longer be around. The question becomes how would you develop and test new concurrent programming facilities and constructs? With C/C++, you could prototype and compare against alternative techniques. Without C/C++, you'd have to add support for in in a language and modify the compiler for that language. And you'd test it against what? Not having the facility? You could test it against another language but that would be an apples and oranges comparison due to the number of overall differences between the languages. All you'd get would be a flame war. There must be some great flame wars out there on Java vs. Erlang vs. Occam. None of which conclusively proved anything.
 Signature Joe Seigh
When you get lemons, you make lemonade. When you get hardware, you make software.
Chris Thomasson - 15 Jul 2006 21:25 GMT >>>>[...] >>>> [quoted text clipped - 35 lines] > general > idea of supporting link-time inter-procedural optimizations in gcc.) http://groups.google.com/group/linux.kernel/msg/e10f88ada7cd04f4
Humm...
> Using inline asm with a "clobbers memory" directive before and after the > load [quoted text clipped - 7 lines] > commonly > used processor lines saying that it is supported. https://coolthreads.dev.java.net/servlets/ProjectForumMessageView?forumID=1797&m essageID=11068
- I received a request from SUN to talk to the "press" about the CoolThreads contest, and basically vZOOM techniques'/coolthreads technology/"uses" in general...
I actually "may have a chance" to bring all of this "stuff" up, and perhaps get it published in a "respectable medium"...
> In fact, I wonder if the idea > that there is an implicit commitment for future versions of these > processors > to guarantee dependent load ordering is not just wishful thinking. I will address this "issue" in detail...
Chris Thomasson - 16 Jul 2006 01:35 GMT >>>>>[...] [...]
>> In fact, I wonder if the idea >> that there is an implicit commitment for future versions of these >> processors >> to guarantee dependent load ordering is not just wishful thinking. An atomic api abstraction should provide the means for loads with #LoadLoad barrier semantics. So, if dependant loads are no longer supported, we switch to the #LoadLoad version.... Simple...
;)
gottlobfrege@gmail.com - 13 Jul 2006 02:07 GMT > Dependant loads: > [quoted text clipped - 13 lines] > return dependant_load(d); > } so for the typical case of one time init / DCLP:
if (!initted) // [1] { Lock lock; init(foo); }
// use foo:
foo->doSomething(); // [2]
you need (or get for free) a dependant load on the init flag at [1]? and/or when accessing foo, at [2]?
I thought Itanium also didn't quite do this?
Tony
Chris Thomasson - 16 Jul 2006 21:52 GMT >> Dependant loads: >> [quoted text clipped - 27 lines] > > you need (or get for free) a dependant load on the init flag at [1]? No. You need #LoadLoad after loading the init flag, and a #StoreStore before setting the flag. You have introduced a level of indirection via. the "init flag", so you need #LoadLoad... Dependent ordering is not strong enough here...
> and/or when accessing foo, at [2]? Dependent ordering applies to the load of foo, but not the load of the init flag...
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|