Choosing Random Elements

| 15 Comments | No TrackBacks

Choosing random elements from a set of items is a problem that appears regularly. If you are already given the array of items, this can be implemented trivially:

import random
def random_element( a, N ):
    return a[ int( random.random() * N ) ]

Pick a number between 0 to N - 1, and return the element with that index.

Before we go on, here are “Some Minor Disclaimers”:

I’m aware that Python’s random module has some of these implemented. You are more than encouraged to use those, as the first thing you should do is if someone else did it already! The N can easily be substituted with len(a). It is there to make the parameter explicit. This article is meant for teaching and the coding style aims to make the code as clear (and language-neutral) as possible, and intentionally does not take advantage of a certain Python conciseness!

The Perl Cookbook has an optimization that is subtle, but amazingly concise:

srand;
rand($.) < 1 && ($line = $_) while <>;
# $line is the random line

Rewritten in Python, with an iterator instead of a file:

import random
def random_element_iterator( iterator ):
    N = 0

    for item in iterator:
        N += 1
        # 1/N chance
        if random.random() * N < 1:
            element = item

    return element

For the N-th element of the iterator, there is a 1/N probability of keeping that line. When I first came upon it, it worked like magic. It doesn’t have to stay that way — you can prove this with induction.

The base case N = 1 is true: the first element has 100% chance of being picked. The second element has 50% chance of being picked (and in this case, the first element is kept with 50% probability as well).

Suppose for the first N items, the probability of choosing any of the N elements is equal — 1/N. When you process the item N+1, the probability of it being chosen is 1/(N+1), which is by definition above. The probability of choosing any previous item was 1/N, and with N/(N+1) probability of staying that way - (1/N) * (N /(N+1)) = 1/(N+1). Each of the N+1 items now has 1/(N+1) of being chosen.

This was my first exposure to an online algorithm — you don’t need to know how many elements we need to account for. This is particularly practical, as in the Perl Cookbook example, it means you do not have to keep all the data in the memory to retrieve a random item — you can only need enough memory to store one item.

It turns out that this was covered in Knuth’s “The Art of Computer Programming”, which I read a few years later:

Can we pick a random element from a set of items whose cardinality we do not know?


Let’s revisit the naive algorithm. It’s as simple as returning one element based on a randomly generated index.

Now let’s take a look at the Fisher-Yates algorithm:

import random
def shuffle( a, N ):
    for i in range( N - 1, 0, -1 ):
        j = int( random.random() * i )
        a[i], a[j] = a[j], a[i]
    return a

This is the canonical way to create an unbiased shuffled array, meaning each of the possible permutation is equally likely. This is O( N ) time, which is as good as it gets, with O( N ) space. On every step of the iteration, you’re getting one random item from the remaining items. Did you make the connection? The naive algorithm can be viewed as one step of this iteration, returning the first element.

Disclaimer: It is very easy to implement Fisher-Yates incorrectly, including Sattolo’s algorithm. Double check your implementation!


Now we’re at the general problem - picking K distinct elements randomly out of a set of N items. The keyword here is distinct — if you want to choose the same item multiple times, just run the above algorithm K times. That is essentially picking K elements out of a set of N elements, with replacement. What we want is an algorithm for picking K elements out of a set of N elements, without replacement.

A simple, naive solution for choosing distinct elements is to keep track of what elements we’ve already picked:

# Assumes K <= N, or this will never terminate
import random
def random_subset_track( a, N, K ):
    used = [ False ] * N
    result = []

    for i in range( 0, K ):
        # get a random element between 0 and N.
        j = int( random.random() * N )
        while ( used[j] ):
            j = int( random.random() * N )
        result.append( a[j] )
        used[j] = True

    return result

This might scream inefficient to you, as it should. Linear in the base case, it probably will take longer. A better solution, with some foreshadowing, is to simply use the Fisher-Yates shuffle! Get one of the N! permutations, and take the first K elements:

import random
# Use the slice a[:K]
def shuffle( a, N ):
    for i in range( N - 1, 0, -1 ):
        j = int( random.random() * i )
        a[i], a[j] = a[j], a[i]
    return a

Can we do better? Yes! Wait, didn’t I say this was O( N ), and thus optimal? How can you do better? Well, it turns out it is O( N ), and thus optimal in terms of N, but you can do better if you analyze its running time in terms of K, its output — for this problem, K can be much smaller than N, and thus we can do this problem in O(K) time with a small tweak — as we briefly mentioned before, each iteration of the Fisher-Yates algorithm will net you one item that is unused. Just stop early after K iterations:

import random
def random_subset_fy( a, N, K ):
    for i in range( N - 1, N - K - 1, -1 ):
        j = int( random.random() * i )
        a[i], a[j] = a[j], a[i]
    return a[ N - K : N ]

(since we update the end of array, we want the last K items)

This is a simple example of an output-sensitive algorithm.


Don’t have the memory to store all the items at the same time? If you know the number of items, then the probability of choosing an element, assuming K more element is needed out of N elements, is simply K/N:

import random
def random_subset_p( a, N, K ):
    result = []

    for i in range( N ):
        # We want to keep this element with probability K/N
        if ( random.random() * ( N - i ) < K ):
            K -= 1
            result.append( a[i] )

    return result

We can prove this, once again, with induction.

However, this still requires us to know how big the set is. To paraphrase Knuth:

Can we pick K random elements from a set of items whose cardinality we do not know?

Yes! Similar to the previous Perl solution, for the N-th item, keep it with K / N probability. If we keep it, replace one of K-th entries at random (with 1/K probability):

import random
def random_subset( iterator, K ):
    result = []
    N = 0

    for item in iterator:
        N += 1
        if len( result ) < K:
            result.append( item )
        else:
            s = int(random.random() * N)
            if s < K:
                result[ s ] = item

    return result

We will prove this with induction.

Base case: For the first K items, you have 100% chance of keeping the current items. For the K+1-th item, this item has a K/(K+1) probability of being chosen. Each previously chosen item has the probability of 1/(K+1) probability of the new item being ignored, plus a K/(K+1) probability that the new item will be chosen, with a (K-1)/K probability that another previously chosen item will be replaced (instead of this one). This is:

Now suppose for the first N items, the algorithm is correct, that is, each previous item had a K/N chance of being kept.

When you consider the item N+1, the probability of it being kept is K/(N+1), which is by definition.

The probability of keeping the previous items was K/N. With (N-K+1)/(N+1) probability that the current item is not kept, plus the probability that the current item is kept (K/(N+1)), but other items were chosen instead ((K-1)/K).

For the N+1-th item, it has K/(N+1) probability of being kept.

This is now still O( N ) time, but with a reasonable O( K ) space. Great for when you can’t just put everything in memory!

There is a difference between this and the Fisher-Yates version — while both return K items, this algorithm does not return the K uniformly in a random order, e.g., the first element is always found in the first position, if it is chosen in the K items. If the order of the K items matter, then simply perform a Fisher-Yates shuffle on the result. (Extra Credit: There is also an O( N log K ) algorithm that preserves this property, without the memory overhead of the Fisher-Yates version, using only O( K ) space.)

And here we go, from start to finish, the best algorithms (and a wrong one) for choosing K distinct elements out of N elements, even if we don’t necessarily know what N is in the beginning!

No TrackBacks

TrackBack URL: http://propersubset.com/mt/mt-tb.cgi/5

15 Comments

Interesting post. You end up trading more CPU for less memory usage, which could be a win some scenarios. Problem that comes to mind is that you'll be wanting to do that on large lists, which means many potential calls to random(), which isn't all that cheap, meaning your CPU could go through the roof.

Also, I think you can be a bit more concise and "pythonic" with your basic python choice method ported from perl:

import random
def random_element_iterator( iterator ):
    for N, item in enumerate(iterator):
        if random.random() * N 

Unless I've missed something?

I learned Python over a few days just last week (a few days before the post), so the code is most certainly not as tight it could be, thanks for the improvement!

there is a nice function in random called choice, it will do this for free :)
random.choice(seq) to pull a single item from the sequence, or random.sample(seq, n) to pull n items from the sequence.

The correct way to get a random element from a list is to use random.choice(). Your random_element() function doesn't correctly pick a random index. 0 and N are each half as likely as other indexes. If you *really* want to pick a random list index, use random.randint(0, N). Then you don't have to worry about messing it up.

The correct way to shuffle a list is to use random.shuffle(). Then you don't have to worry if you correctly implemented the Fisher Yates shuffle.

Thanks, clearly using the random library is the way to go, but this is more for learning purposes. =)

Adam,
Your the reason I hate programmers, you think your "smart" by giving what you think is the "correct" way to do it. But if you were smart you would realize this article was about the theory involved in the algorithms, and not about knowing the python libaries. Knowing the python libraries is the kind of knowledge, that while useful is not deep real knowledge, very few programmers are smart enough to understand theory.

Huh. Brian, that's out of line. The lead 'solution' is poor code. Even if you're avoiding using random.choice() or random.shuffle(), random.randint() should be used instead of random.random() in order to generate a random integer.

Further, there's no justification for the poor api of random_element() to have a second argument. It's unnecessary and will lead to bugs. Programming is as much about writing correct and clear code as it is about understanding algorithms.

Please read the disclaimer, and hey, if you still need to feel superior over what is by far a choice in clarity (which you are free to disagree), feel free to continue to comment!

All the code is meant to be "pseudo-code" that happens to be in Python.

I love that choice algorithm (I posted on it myself a while back), but beware: random() is expensive, and often the naive version (computing the length, choosing a random index, and walking to it) is faster than calling random() N times. Generally speaking, you probably don't want to use this on in-memory data structures, but only in cases such as reading from files where walking the elements is expensive.

Thanks for the comment, I agree, it's all about the specifics of the problem, and there are subtle things to optimize for (such as the number of random() call) that you have to make a decision on what costs more!

Adam,

Your random_element() function doesn't correctly pick a random index. 0 and N are each half as likely as other indexes.

This is completely untrue.

import random
N = 100
dict = [0] * N
for i in xrange(10000000): dict[int(random.random() * N)]

Look at the results yourself, or...

Running these sums with Student's t-test (separated into two groups: first & last item and other items), I come up with a P-value of 0.46, which is far too high to reject the null hypothesis, so there is no reason to believe that the first and last item are accessed more or less than the others.

Larry,

The lead 'solution' is poor code. Even if you're avoiding using random.choice() or random.shuffle(), random.randint() should be used instead of random.random() in order to generate a random integer.

I wouldn't be so quick to say that; random.randint() is merely a call to random.randrange(). If you supply random.randrange() with a single argument N, it assumes you want an integer in the range [0,N[, and int(random.random() * N) is exactly what it returns.

Whoops, seems I mis-pasted. The line in my reply to Adam should read:

for i in xrange(10000000): dict[int(random.random() * N)] += 1

Thanks for the comments, porges! I actually missed that part of Adam's statement. Coding choices aside, all the code is tested, so it should be correct. While I agree with Stephen's statement that for better Python code, random.randint() is a better choice for clarity as to what it's doing, but perhaps not as clear as to the math (too many steps in between?):

Essentially, this line (and variants):

if random.random() * N < 1:

is really supposed to show 1/N probability, which would really be:

if random.random() < 1 / N:

and it's only one minor step. If I jumped straight to:

if random.randint( 0, N - 1 ) < 1:

I felt it would be harder to understand without adding a few lines of comments, despite the tighter code.

Ah, I just realized I addressed that comment to you instead of Stephen :)

Leave a comment

Larry on things worth learning.

About this Entry

This page contains a single entry by Larry published on April 5, 2010.

Five Algorithms You Must Know was the previous entry in this blog.

Mathematical Formula (TeX) image generator is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.

Pages