Java Forum / General / May 2005
Creating a unique random id
Fritz Bayer - 25 May 2005 11:01 GMT Hello,
how likely is it that the following code produces duplicate ie identical random numbers, when it gets executed on different virtual machines running on different computers?
import java.util.Random;
public class RandomNumberGenerator { Random rand = new Random();
public RandomNumberGenerator() { super(); }
public int get(int x) { return (int) rand.nextInt(x) + 1; }
public String getRandomSequence(int length) { StringBuffer stringBuffer = new StringBuffer(); for (int i = 0; i < length; i++) stringBuffer.append(get(10)); return stringBuffer.toString(); } }
public final static void main() { RandomNumberGenerator randomNumberGenerator = new RandomNumberGenerator(); System.println("How likely is it to get this number again: " + randomNumberGenerator(20)); }
Here is a test case at least for one computer it does not seem to return the same numbers:
public class TestRandomNumberGenerator extends TestCase {
public void testUniqueness() { String last =""; String current; for (int i = 0; i < 10000000; i++) { RandomNumberGenerator randomNumberGenerator = new RandomNumberGenerator(); current = randomNumberGenerator.getRandomSequence(20); assertNotSame(current, last); //System.out.println(current + " " + last); last = current; } }
public static Test suite() { return new TestSuite(TestRandomNumberGenerator.class); } }
Andrew Thompson - 25 May 2005 11:15 GMT > how likely is it that the following code produces duplicate ie > identical random numbers, What are the odds you can produce a self contained piece of code that compiles and displays the problem you discuss? <http://www.physci.org/codes/sscce.jsp>
>..when it gets executed on different virtual > machines running on different computers? [quoted text clipped - 9 lines] > super(); > } But a question. Given your class extends Object, and gets a default 'no args' constructor, what is the point of the last four lines shown above?
 Signature Andrew Thompson http://www.PhySci.org/codes/ Web & IT Help http://www.PhySci.org/ Open-source software suite http://www.1point1C.org/ Science & Technology http://www.LensEscapes.com/ Images that escape the mundane
JScoobyCed - 25 May 2005 12:08 GMT > Hello, > [quoted text clipped - 17 lines] > } > } In the piece of code above, you test only successive values, ie number [N] with [N-1] (having N>0). You should test the [N] value with all /i/ previous values, ie [N-1], [N-2], ..., [N-i]. A simple way to code it is to put each value in a HashTable:
public void testUniqueness() { HashTable ht = new HashTable(); String current; for (int i = 0; i < 10000000; i++) { RandomNumberGenerator randomNumberGenerator; randomNumberGenerator = new RandomNumberGenerator(); current = randomNumberGenerator.getRandomSequence(20); // Fails if 'current' is a duplicate assertNull(ht.put(current)); } }
This would make sure you don't have duplicate.
-- JSC
Ingo R. Homann - 25 May 2005 12:47 GMT Hi,
> how likely is it that the following code produces duplicate ie > identical random numbers, when it gets executed on different virtual [quoted text clipped - 4 lines] > return (int) rand.nextInt(x) + 1; > } I guess, the bets are 1:2.147.483.648! :-)
Ciao, Ingo
Jacob - 25 May 2005 13:06 GMT > how likely is it that the following code produces duplicate ie > identical random numbers, when it gets executed on different virtual > machines running on different computers? I use the currentTimeMillis() in order to create unique IDs. It requires that the ID generator is centralized and it needs to ensure at least one millisecond between each created ID.
Murray - 25 May 2005 14:10 GMT > > how likely is it that the following code produces duplicate ie > > identical random numbers, when it gets executed on different virtual [quoted text clipped - 4 lines] > needs to ensure at least one millisecond between each > created ID. Actually it's more that 1ms. Most operating systems / hardware have a time accuracy of significantly more than 1ms
Bob - 25 May 2005 14:10 GMT > Hello, > > how likely is it that the following code produces duplicate ie > identical random numbers, when it gets executed on different virtual > machines running on different computers? In my opinion, it's irrelevant how likely it is to occur. The fact that it can occur means that your solution is not sound.
Why not just create an object that can be accessed by all the machines, which has a static method that just increments a static counter and hands that next number out? That way, you're guaranteed to have 2^63 - 1 unique identities before you run out.
 Signature Bob
Chris Uppal - 25 May 2005 15:12 GMT > how likely is it that the following code produces duplicate ie > identical random numbers, when it gets executed on different virtual > machines running on different computers? There are two aspects to this question.
The first is theoretical. /Assuming/ that you are generating 20-digit strings completely at random, you can calculate the probability of finding duplicates. If the number of possible values of some random value is N, then if you have a pool of sqrt(N) values, the chances are that one value occurs twice (Google for the birthday paradox). In this case you have 1e20 possible random strings, so you would expect to have to create 1e10 (10 billion) strings on average before you got a duplicate.
However, that's on the assumption that the strings are completely random. Which happens not to be true in your case. For one thing, java.util.Random is a pseudo-random generator, and may have statistical properties that make duplicates more likely (I'm not mathematician enough to know). More importantly it has the designed-in quality that each instance of Random always produces the /same/ sequence of numbers given that it has started with a specific seed. As a result each instance of your RandomNumberGenerators will also produce the same sequence of numbers if the underlying Random has the same seed. So the number of possible /sequences/ of 20-digit number is determined by the number of possible seeds to Random (the space of possible numbers that they can produce is larger, because each generator can produce more than one number -- it may even be as large as 1e20, though I doubt it). In fact, Random only keeps 48 bits of its seed so the most you can have is 2**48 different RandomNumberGenerators. That in turn means that if you create 2**24 instances of RandomNumberGenerator then you'd expect to find two that were identical, and therefore generated identical sequences of numbers. 2**24 is about 17 million, which is quite a lot (depending on the application, of course), but remember that that's an /upper/ bound. The actual situation may be worse.
So, in practise the chances of finding duplicate 20-digit numbers is dominated by the chances of two machines using the same seed to initialise a RandomNumberGenerator.
In Sun's Java 1.4.2 the seed used by default (if you don't provide one yourself) is System.currentTimeMillis(). That is declared to answer a 64-bit long, but there won't really be 64 bits-worth of possible values because it is answering an absolute time, and that is likely to be similar on all the machines.
If two machines create new RandomNumberGenerators at a steady (on average) rate of one per second, and if their system clocks are synchronised closely enough that they read the same time on each machine at some point in the run (not necessarily the same point), then during the periods where they overlap, the chances are 1 in 1000 of producing an identical RandomNumberGenerator at /each/ step.
Say that you have 10 machines, all synchronised to within 1 minute, and you run the generators for an hour on all of them (creating a new instance once a second as before). For 59 out of the 60 minutes, each new RandomNumberGenerator will be created with a millisecond seed that is not more than 1 second different from that of one of the others created on each of the other 9 machines. So you will create 59 * 60 "batches" of 10 RandomNumberGenerators all with seeds that can differ by no more than 1000. The chances of all 10 from one batch being different are 1.000 * 0.999 * 0.998 * ... * 0.991 = 0.956. So each batch has e a 4.4% chance of containing duplicate RandomNumberGenerators. Since you create 59*60=3480 such batches in the hour, the chances of none of them containing duplicates is 0.956**3480. Which I make to be about 6e-69. I.e. it is essentially certain that you'll create two identical RandomNumberGenerators which will then naturally produce identical sequences of 20-digit numbers.
Of course, there's also the possibility that two /different/ RandomNumberGenerators will produce the same 20-digit number at some point, but the chances are very much lower, as described above.
In Sun's Java 1.5.0 Random uses System.nanoTime() as its default seed; that isn't an absolute time, so it is more "random". However there is no guarantee that it can actually answer all 2**64 possible values. If, say, it only answered 32-bits-worth of real information, then the number of possible random sequences would be cut down to 2**32.
And there will probably be a tendency for machines to have reset their nanosecond counters at roughly the same time (within the same month, say) so there's a higher than chance probability that two machines will answer the same number at some points in their lives. You can work out the probability in the same kind of way as I did above.
Repeating the calculation. First off there is less chance that any two machine's nanosecond clocks will overlap during the hour since they are not synchronised, but instead reflect when the machine was last reset (or something like that). For the sake of argument assume that they have all been reset in the last month. Secondly there is much less chance that any two machines with nanosecond clocks that differ by < 1 second will answer the same nanosecond time. If we assume that the nanosecond timer actually has microsecond resolution [*], then the chances are 1 in a million -- which is a lot smaller than 1 in 1000.
([*] I don't know how likely this is, but this machine's 'high performance counter' claims to click 3.6 million times a second so it may be plausible.)
With those assumptions, for each Random created (once per second per machine) there is a 1 / (31 * 24) chance that it is created during the a "second" (as seen by the nano clocks) that is covered by any specific one of the other machines during that run. So there is approximately 9 / (31*24) chance of it overlapping with any of the other 9 machines. That's about 1.2%. If it /does/ overlap, then there's a 1e-6 chance of it actually seeing the /same/ nanosecond value. That means that each Random has about 1.2e-8 chance of being identical to some other, or to put it another way, about (1-1.2e-8) chance of being unique (i.e. very near certainty) During the run we produce 10*60*60 instances of Random, so the chances of all of them being unique are about (1-1.2e-8)**36000. That's about 0.99956, or to put it the other way around there's about one chance in 2300 of finding a duplicate Random anywhere.
For comparison, if we consider 36000 genuinely random 20-digit numbers, then I make the chances of there being a duplicate about 1 in 178 billion.
So it all comes down to the seed for Random. I'd say that duplicates might be quite likely on 1.4 machines. On 1.5 machines they are probably quite unlikely, providing that the nanosecond counter has a respectably large range of possible values (as it probably does on Windows machines, I don't know about any others).
-- chris
Wibble - 26 May 2005 02:38 GMT How about
http://java.sun.com/j2se/1.5.0/docs/api/java/rmi/server/UID.html
Prefix it with a host id and away you go.
Wasn't Fritz just looking for consecutive duplicates anyway? Most pseudo random number generators wont repeat for a fairly large, known period.
I dont think java gaurantees any particular random implementation, so its usually better to choose your own. There's a pretty big tradeoff between randomness and performance.
>>how likely is it that the following code produces duplicate ie >>identical random numbers, when it gets executed on different virtual [quoted text clipped - 114 lines] > > -- chris Filip Larsen - 26 May 2005 07:54 GMT > Here is a test case at least for one computer it does not seem to > return the same numbers: [quoted text clipped - 22 lines] > } > } Your code is testing that current and last refer to same object, not that they are unequal which is probably what you want.
If you change the code to test for equality instead you will now probably find that your generator produces a lot of equal numbers in every clock tick because you are instantiating a new Random every time you generate a number and Random uses currentTimeMillis() to initiate its seed. If you fix this, for instance by making rand a static instance variable, your code will still generate the same number on two VM's if you instantiate your generators at the same clock tick. To fix that, you can use SecureRandom which makes an effort to make a unguessable (but not globally unique) seed.
However, instead of all that I would recommend that you look at java.util.UUID if you are on Java 1.5 and java.rmi.server.UID if you are on a previous version (although the UID class do not really provide the "universal" part).
Regards,
 Signature Filip Larsen
Fritz Bayer - 26 May 2005 08:07 GMT > Hello, > [quoted text clipped - 61 lines] > } > } So what's your suggestion for creating unique ids? The processes (http proxies) can't communicate with each other and are not allowed to contact a central server for generating the ids.
They run on different machines and get started at a random time by somebody. The distribution is not uniform, since most people have a preference when to start their services - for example in the early hours of the day.
For every user an id should be generated when he contacts the site. So the time when a user contacts the proxy is truely random (maybe this helps to solve the problem?). The generated id is then trasmitted and saved on a server, where all ids are kept. So thats why they should be unqiue.
Maybe by using the random time combined with the startup time of the process we can get some more unique random numbers? What is by the way with the class java.rmi.server.UID? I have read that this creates unique ids but only for 48 days. Well....
jonck - 26 May 2005 10:46 GMT > So what's your suggestion for creating unique ids? The processes (http > proxies) can't communicate with each other and are not allowed to > contact a central server for generating the ids. This statement contradicts the following, since in the above statement you say that your clients don't contact a central server...
> For every user an id should be generated when he contacts the site. So > the time when a user contacts the proxy is truely random (maybe this > helps to solve the problem?). The generated id is then trasmitted and > saved on a server, where all ids are kept. So thats why they should be > unqiue. ... and here you're saying that they _do_ contact a central server!
You're looking at your problem backwards: you want your clients to create an id and store these on the server. If you simply turn this around, let the server create the id's once the clients contact the server, your problem is solved.
Roland - 26 May 2005 12:32 GMT >>Hello, >> [quoted text clipped - 81 lines] > with the class java.rmi.server.UID? I have read that this creates > unique ids but only for 48 days. Well.... So you want to generate a globally unique ID without using some central server.
Java 5 offers a class for this: java.util.UUID. <http://java.sun.com/j2se/1.5.0/docs/api/java/util/UUID.html> This class has a static method to create a pseudo random UUID (there are other categories of UUIDs too). FWIW, the implementation of this method (randomUUID()) uses java.security.SecureRandom, which uses more bits for generation than java.util.Random (48-bit seed).
See also <http://en.wikipedia.org/wiki/Universally_Unique_Identifier>
Another interesting site I found: <http://www.random.org/> which offers a "True Random Number Service".
 Signature Regards,
Roland de Ruiter ___ ___ /__/ w_/ /__/ / \ /_/ / \
The Wogster - 26 May 2005 14:17 GMT > So what's your suggestion for creating unique ids? The processes (http > proxies) can't communicate with each other and are not allowed to [quoted text clipped - 10 lines] > saved on a server, where all ids are kept. So thats why they should be > unqiue. From what I see, your not looking for random, but rather unique values, and that is much, much easier.
If you have access to a central server -- it's debatable from what you have written here. The server can simply dole out from an incremented counter. If you do not, then assign each machine a unique ID, this could be based on a number of things, for example the IP address (if fixed), user ID number, or typed in on first use. This could also be a counter stored on a server, that is contacted by each machine ON FIRST USE, it gives each machine that contacts it for this, a new machine ID, which gets stored locally on the machine itself (I would prefer this, if possible, you know the values are guaranteed to be unique - at least as long as you have less then 2 147 483 648 machines.
Now on each machine you store an int sized counter, start at 0.
When you need a generated ID then, in a long you copy the machine ID multiplied by 4 294 967 296 then increment the machines counter and add the counter value to your long. This gives you a unique 64bit ID, that allows a very large number of machines to access it, and at one contact per second has enough capacity for over 135 years on each machine. If your counter ever does reach 2 147 483 647 then simply delete the machine ID and counter, then obtain a new machine ID and start again.
The reason why we use insane numbers like 4294967296 is that 4294967296 is 2^32, so on 32bit machines it will be faster because your dealing with full values, rather then partial values.
Fritz Bayer - 27 May 2005 07:04 GMT > > So what's your suggestion for creating unique ids? The processes (http > > proxies) can't communicate with each other and are not allowed to [quoted text clipped - 38 lines] > is 2^32, so on 32bit machines it will be faster because your dealing > with full values, rather then partial values. Hi,
it`s a requirement that the ids do not come from a server but get stored on one. Sounds stupid but that's the way it is.
I can use the ip address as a seed. And you are right, that I do not need random numbers but rather unique numbers. Since I don't have jdk 1.5 I can't use UUID unfortunaltey.
The problem with your counter is that I can't make it persistent so at the next time the program starts up it will be zero again. I need a methode which based on the IP, the startup time and the time at which a user contacts the proxy generates a unique id.
Owen Jacobson - 27 May 2005 08:15 GMT >> > So what's your suggestion for creating unique ids? The processes >> > (http proxies) can't communicate with each other and are not allowed >> > to contact a central server for generating the ids. (snip)
>> > For every user an id should be generated when he contacts the site. >> > So the time when a user contacts the proxy is truely random (maybe [quoted text clipped - 16 lines] > it`s a requirement that the ids do not come from a server but get stored > on one. Sounds stupid but that's the way it is. Don't just say "it's a requirement that this be so". *Especially* for something so apparently backwards, it's a good idea to explain why the requirement's there; sometimes, the reasons for the requirement can even reveal a solution that fits properly.
> I can use the ip address as a seed. And you are right, that I do not need > random numbers but rather unique numbers. Since I don't have jdk 1.5 I > can't use UUID unfortunaltey. For a network connection, the IP address + originating port will be unique as long as the connection lasts; for a web application this is less useful because the connection is transient, but it makes a temporarily-unique starting point, anyways.
> The problem with your counter is that I can't make it persistent so at the > next time the program starts up it will be zero again. I need a methode > which based on the IP, the startup time and the time at which a user > contacts the proxy generates a unique id. What's the actual purpose behind the randomly generated IDs? How will they be used? This is also important for creating a correct, useful solution to the whole problem.
Owen
Chris Uppal - 27 May 2005 10:51 GMT > I need a > methode which based on the IP, the startup time and the time at which > a user contacts the proxy generates a unique id. The IP address is not suitable for this purpose -- the problem is that many client machines will be on NAT-ed subnets and will likely all be using IP addresses from a small pool of "private use" addresses. The basic idea is sound, but what you need to be using is the MAC address from the network interface. I'm afraid I have no idea whether that is directly accessible via pure Java.
If you use that, the millisecond timestamp (or nano if it's available), and the date, plus perhaps any other bit of "random" data that happens to be available (the user's name ?), then you should be able to brew up an ID that is unique /provided/ that you don't have technically sophisticated users attempting to break the system by (e.g.) messing with the system clock.
Note that including all that info in a UID has fairly serious privacy implications (may even be illegal in some places for all I know). You'd be better off pushing the whole block of data through something like SHA1 or some similar (or better) crypto-quality one-way hash. Of course, doing that would re-introduce the possibility of clashes, but (unless the users are malevolent) the odds are very, very, low.
One thought, could you create UIDs on the client, and use that to set up the initial "handshake" but then supply a higher quality UID (created on the server) and hand that back to the client ? If so then, since the client-generated UIDs would be short-lived (a few seconds, minutes, hours, or days, depending on your app) the problem of clashes would presumably be eliminated (unless the users are malevolent).
-- chris
Lucy - 27 May 2005 16:07 GMT <big snip>
> it`s a requirement that the ids do not come from a server but get > stored on one. Sounds stupid but that's the way it is. [quoted text clipped - 7 lines] > methode which based on the IP, the startup time and the time at which > a user contacts the proxy generates a unique id. First you say that this number is stored on the server, then you say you can't make it persistent - make up your mind. This solution assumes that there are only two machines, but you can make it work for more. All you need to know is which machine you are running on.
Say you have two machines, A and B. Assign A to be the ODD one and B the EVEN one. Then your code just generates a random number using any technique you like. If you are on A and the number is even, just keep generating numbers till you get an odd one and use that. Similar for B. If you can store numbers between runs you can look at the last value to see if you are A or if you are B.
Free MagazinesGet these publications absolutely FREE for up to 12 months. There are no hidden fees and no obligation. Simply choose a title, complete the application form and submit it. Read more ...
|
|
|