random_agg()

Earlier today a user came onto the #postgresql channel on IRC.freenode.net with an interesting challenge: he wanted a "random" aggregate, one that would return a randomly chosen value from the group.  Jan Wieck stated that it couldn't be done, but I thought there might be a way.  Oddly, the user wasn't satisfied with my solution (I'm still not clear on what he actually wanted), but I liked it a great deal, so I'm sharing it here.

If you check out the PL/R project, you'll see some of Joe Conway's innovative solutions for special aggregates like MEDIAN().  Medians can't be done as simple, classic aggregates; with a brute-force approach they require two passes over the data, which PostgreSQL can't support in an aggregate function.  However, Joe does some very creative things with storing running values in registers, which I thought could be adapted to the RANDOM_AGG() problem, and I was right.

First off, in order to hold multiple values in the state variable of the aggregate function we're going to need a composite type.  I tend to define types and functions which exist only for internal use by starting their names with an underscore, as I do here:

CREATE TYPE _random_text AS (
    runcount BIGINT,
    choice TEXT
);

Next, we're going to need two functions.  One does the calculation of the aggregate, and the other is the final function which "formats" the exit value for the query.  I'll actually start with the final function because it's easy: it just strips the final random value out of the composite and hands it to the query:

CREATE OR REPLACE FUNCTION _exit_random_text (
    _random_text )
RETURNS TEXT AS $f$
SELECT $1.choice;
$f$ LANGUAGE sql IMMUTABLE STRICT;
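
Since the final function is plain SQL, you can try it by hand on a row literal to see what it does (on any reasonably recent PostgreSQL; the values here are made up):

SELECT _exit_random_text( ROW(5, 'Marc')::_random_text );
-- returns 'Marc', i.e. the "choice" field of the composite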

The other function is conceptually more complex.  Aggregate functions iterate through the data set, in no particular row order, performing their operation once per row.  So, at any given time our aggregate function can know two things: the number of rows it's covered so far, and the value of the current row.   Based on that we can follow this logic with each row:

 n = iteration number
 v = value held by current row
 b = value held in the "buffer" of the state variable

IF random < ( 1 / n ) THEN v ELSE b

What's all that, you ask?  Simple: for each row, compare a random number (between 0 and 1) with one divided by the number of rows you've processed so far.  If the random number is less, choose the value of the current row; otherwise, keep the value held in the state variable.  Now, the user who had originally asked for this was convinced that the math couldn't work; he insisted that the chance of getting stuck with the first value was unreasonably high.  Probability theory was one of my favorite areas of math, though, so I could see that it did work.  For example, for a series of 5 values, the chance of keeping the first value is:

 1/1 * 1/2 * 2/3 * 3/4 * 4/5 = 24/120 = 1/5

The general probability equation for this calculation is:

 (n - 1)! / n!  =  1 / n
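
The same cancellation works for any position, not just the first.  Take the third of five values: it has to be picked on its own turn, and then survive the remaining two rows:

 1/3 * 3/4 * 4/5 = 12/60 = 1/5

In general, the value in position k is kept with probability 1/k * k/(k+1) * ... * (n-1)/n, which also collapses to 1/n.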

Now, 1/n is exactly the value we want; over a large enough number of runs against n values, we would expect any individual value to be chosen 1/n of the time.  So it works mathematically.  In code, it works this way:

 CREATE OR REPLACE FUNCTION _choose_random_text (
     thestate _random_text,
     newvalue TEXT )
 RETURNS _random_text AS $f$
 DECLARE result _random_text;
 BEGIN
     result.runcount := COALESCE(thestate.runcount, 0) + 1;
     IF random() < ( 1::FLOAT / result.runcount::FLOAT ) THEN
         result.choice := newvalue;
     ELSE
         result.choice := thestate.choice;
     END IF;
     RETURN result;
 END; $f$ LANGUAGE plpgsql;
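
You can watch the state function at work outside the aggregate by chaining a couple of calls by hand (the values 'Peter' and 'Marc' are just throwaway examples):

 SELECT _choose_random_text( _choose_random_text(NULL::_random_text, 'Peter'), 'Marc' );

The inner call starts from a NULL state, so its counter becomes 1 and the first value is always kept (the chance of keeping it at that point is 1/1).  The outer call then returns (2,Peter) or (2,Marc), each about half the time.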

 CREATE AGGREGATE random_agg(
     BASETYPE = text,
     SFUNC = _choose_random_text,
     STYPE = _random_text,
     FINALFUNC = _exit_random_text
 );
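
On newer PostgreSQL releases you can also declare the same aggregate with the argument type inside the parentheses instead of BASETYPE; as far as I know the two forms are equivalent:

 CREATE AGGREGATE random_agg(text) (
     SFUNC = _choose_random_text,
     STYPE = _random_text,
     FINALFUNC = _exit_random_text
 );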

Now, the question was, how does it work?  Do we actually get a random value?  And if so, how long does it take compared to the traditional approach, "SELECT value FROM table ORDER BY random() LIMIT 1"?   The answer is: surprisingly well.  Here are some examples of using it on a small table, which happens to be the regional contact list for the PostgreSQL project:

 pgpr=# \timing
 Timing is on.
 pgpr=# select name from contacts order by random() limit 1;
         name
 ---------------------
  Michalis Kamprianis
 (1 row)
 Time: 2.445 ms
 pgpr=# select name from contacts order by random() limit 1;
       name
 ----------------
  Francois Suter
 (1 row)
 Time: 0.903 ms
 pgpr=# select name from contacts order by random() limit 1;
      name
 ---------------
  Jussi Mikkola
 (1 row)
 Time: 0.712 ms
 pgpr=# select random_agg(name) from contacts;
     random_agg
 -------------------
  Anastasios Hatzis
 (1 row)
 Time: 4.014 ms
 pgpr=# select random_agg(name) from contacts;
    random_agg
 ----------------
  Dawid Kuroczko
 (1 row)
 Time: 1.037 ms
 pgpr=# select random_agg(name) from contacts;
     random_agg
 ------------------
  Dennis Bjorklund
 (1 row)
 Time: 1.117 ms

As you can see, for a small dataset (70 rows) it's a bit slower than ORDER BY random(), about 30%.  However, another user on #postgresql had a theory that on large datasets, ones larger than work_mem, random_agg() would be faster than ORDER BY random() because the aggregate does not require a sort.  Let's see if he's right.  This time the table is the downloads log for ftp.postgresql.org: about 3.6 million rows of up-to-100-character values, larger than I've set work_mem.

 pg_downloads=# explain analyze select path from clickthrus order by random() limit 1;
                                                            QUERY PLAN
 ------------------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=358037.96..358037.96 rows=1 width=54) (actual time=54360.854..54360.856 rows=1 loops=1)
    ->  Sort  (cost=358037.96..361194.77 rows=3156808 width=54) (actual time=54360.851..54360.851 rows=1 loops=1)
          Sort Key: random()
          ->  Seq Scan on clickthrus  (cost=0.00..78278.85 rows=3156808 width=54) (actual time=0.065..5376.734 rows=3068298 loops=1)
  Total runtime: 54523.907 ms
 (5 rows)
 pg_downloads=# explain analyze select random_agg(path) from clickthrus;
                                                        QUERY PLAN
 --------------------------------------------------------------------------------------------------------------------------
  Aggregate  (cost=78278.85..78278.85 rows=1 width=54) (actual time=27084.871..27084.871 rows=1 loops=1)
    ->  Seq Scan on clickthrus  (cost=0.00..75122.04 rows=3156808 width=54) (actual time=0.036..2890.440 rows=3068298 loops=1)
  Total runtime: 27084.921 ms
 (3 rows)

Looks like he was right!  Cutting out the sort step cut our query time in half.  Imagine what we could do if the aggregate were written in C!

However, on performance alone this aggregate isn't very useful.   Its main utility is the ability to do queries which would otherwise require extensive subquerying via the ORDER BY random() method.  For example, imagine that (for some reason) I wanted to select each continent from my contacts database and one random contact from that continent.  Using ORDER BY random() that would be pretty messy; with random_agg() it becomes easy and fast:

 pgpr=# select continent, random_agg(name) from contacts group by continent order by continent;
  continent |    random_agg
 -----------+-------------------
  Africa    | Anton de Wet
  Americas  | Bruce Momjian
  Asia      | Abhijit Menon-Sen
  Europe    | Aleksander Kmetec
  Oceania   | Andrej Ricnik
 (5 rows)
 Time: 1.690 ms

Much nicer than:

 pgpr=# select continent,
 pgpr-# (select name from contacts where contacts.continent = continents.continent
 pgpr-# order by random() limit 1) as name from ( select distinct continent from contacts )
 pgpr-# as continents order by continent;
  continent |       name
 -----------+------------------
  Africa    | Anton de Wet
  Americas  | Bruce Momjian
  Asia      | Samer Abukhait
  Europe    | Dennis Bjorklund
  Oceania   | Andrej Ricnik
 (5 rows)
 Time: 2.212 ms

... don't you think?

Or say I wanted to pick a random region, continent, and contact out of my table, but wanted each to be independently random and not match up.  There is no way to do this with ORDER BY random() at all except with three independent subqueries.  With random_agg() it's simple!

 pgpr=# select random_agg(continent), random_agg(region), random_agg(name) from contacts;
  random_agg | random_agg | random_agg
 ------------+------------+-------------
  Oceania    | Argentina  | Simon Riggs
 (1 row)
 Time: 1.951 ms

On the other hand, of course, random_agg() is not much use for picking a coherent random row, one where all the columns come from the same record.  For that, ORDER BY random() is better.

So here's random_agg() again, all together for your cut-and-paste pleasure.  Note that you'll need to create one for each data type you want to randomize (there's a sketch of an INTEGER version after the listing).  Have fun with it!

 BEGIN TRANSACTION;

 CREATE TYPE _random_text AS (
     runcount BIGINT,
     choice TEXT
 );

 CREATE OR REPLACE FUNCTION _choose_random_text (
     thestate _random_text,
     newvalue TEXT )
 RETURNS _random_text AS $f$
 DECLARE result _random_text;
 BEGIN
     result.runcount := COALESCE(thestate.runcount, 0) + 1;
     IF random() < ( 1::FLOAT / result.runcount::FLOAT ) THEN
         result.choice := newvalue;
     ELSE
         result.choice := thestate.choice;
     END IF;
     RETURN result;
 END; $f$ LANGUAGE plpgsql;

 CREATE OR REPLACE FUNCTION _exit_random_text (
     _random_text )
 RETURNS TEXT AS $f$
 SELECT $1.choice;
 $f$ LANGUAGE sql IMMUTABLE STRICT;

 CREATE AGGREGATE random_agg(
     BASETYPE = text,
     SFUNC = _choose_random_text,
     STYPE = _random_text,
     FINALFUNC = _exit_random_text
 );

COMMIT;
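
For example, an INTEGER version would follow exactly the same pattern; this is just a sketch (the _random_int names are my own convention, use whatever you like):

 BEGIN TRANSACTION;

 CREATE TYPE _random_int AS (
     runcount BIGINT,
     choice INTEGER
 );

 CREATE OR REPLACE FUNCTION _choose_random_int (
     thestate _random_int,
     newvalue INTEGER )
 RETURNS _random_int AS $f$
 DECLARE result _random_int;
 BEGIN
     result.runcount := COALESCE(thestate.runcount, 0) + 1;
     -- same trick as the text version: replace the buffered
     -- choice with probability 1/runcount
     IF random() < ( 1::FLOAT / result.runcount::FLOAT ) THEN
         result.choice := newvalue;
     ELSE
         result.choice := thestate.choice;
     END IF;
     RETURN result;
 END; $f$ LANGUAGE plpgsql;

 CREATE OR REPLACE FUNCTION _exit_random_int (
     _random_int )
 RETURNS INTEGER AS $f$
 SELECT $1.choice;
 $f$ LANGUAGE sql IMMUTABLE STRICT;

 -- aggregates can be overloaded by input type, so this second
 -- random_agg() lives happily alongside the text one
 CREATE AGGREGATE random_agg(
     BASETYPE = integer,
     SFUNC = _choose_random_int,
     STYPE = _random_int,
     FINALFUNC = _exit_random_int
 );

 COMMIT;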