
Drawing From a Random Distribution in SQL

by Sami Abboud | Feb 2024


From a probability density function to random samples

Photo by Moritz Kindler on Unsplash

There are different methods for updating a reinforcement learning agent’s policy at each iteration. A few weeks ago we started experimenting with replacing our current method with a Bayesian inference step. Some of the data workloads within our agent are written in SQL that is executed on GCP’s BigQuery engine. We use this stack because it provides scalable computational capabilities, ML packages and a straightforward SQL interface.

The Bayesian inference step we looked to implement uses one of the alternative parametrizations of the beta distribution, which means we need to be able to draw from a beta distribution in SQL. While working on this, I discovered that drawing from a random distribution in SQL is a topic with very few well-documented examples, so I’m writing about it here.
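As a quick refresher, this parametrization maps a mean probability and a sample-size-like count onto the usual alpha and beta. Here’s a minimal Python sketch, with values mirroring the example used later in this post:

# Mean/sample-size ("mu/nu") parametrization of the beta distribution:
# mu is the mean (a probability) and nu acts like an effective sample size.
mu = 0.674  # e.g., a contact's observed probability
nu = 72     # e.g., lifetime messages received
alpha_param = mu * nu        # alpha = mu * nu
beta_param = (1 - mu) * nu   # beta = (1 - mu) * nu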

BigQuery doesn’t have a beta distribution. In fact, it doesn’t have the capability to draw from any random distribution. So my first intuition was to take the definition of the beta distribution, write it in SQL, set the parameters using a CTE, draw a random number between 0 and 1 and compute the value of that function.

But it’s 2024, so I asked ChatGPT how it would do it:

Me: “How do you create random draws from a beta distribution in BigQuery?”

ChatGPT:

WITH beta_draws AS (
  SELECT
    (alpha - 1) / (alpha + beta - 2) AS a,
    (beta - 1) / (alpha + beta - 2) AS b
  FROM
    (SELECT 2 AS alpha, 5 AS beta) -- Set your desired alpha and beta values
)
SELECT
  (a * POW(RAND(), alpha - 1) * POW(1 - RAND(), beta - 1) + b) AS beta_draw
FROM
  beta_draws,
  UNNEST(GENERATE_ARRAY(1, 1000)) AS rand_num

Me thinking to myself: Right, so that clearly won’t work.

Do you see the problem in the code? ChatGPT draws two different x values for the presumed beta distribution PDF. I fixed this, cleaned up the query a little and sampled 1,000 values. And here’s the SQL code for doing that:

WITH raw_data AS (
  SELECT
    'user_a' AS contact_id
    , 0.674 AS probability
    , 72 AS lifetime_messages_received
), parameters AS (
  SELECT
    contact_id
    , probability * lifetime_messages_received AS alpha
    , (1.0 - probability) * lifetime_messages_received AS beta
    , RAND() AS x
  FROM raw_data
  CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000)) AS draw_id
)
SELECT
  contact_id
  , ARRAY_AGG(POW(x, alpha - 1.0) * POW(1.0 - x, beta - 1.0)) AS beta_x
FROM parameters
GROUP BY contact_id

Thank you all, that’s a wrap 🎁 See you in the next post!

WRONG! 🔴

Let’s take a trusted implementation of drawing from a beta distribution with the same parameters and compare the results. I used SciPy’s beta.rvs() in Python; here are two 100-bin histograms comparing the two drawn distributions.

from scipy.stats import beta

alpha_param = 0.674 * 72
beta_param = (1 - 0.674) * 72

scipy_beta_draw = beta.rvs(alpha_param, beta_param, size=1000)
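For reference, histograms like the ones below can be produced with matplotlib. Note that bq_beta_draws is a hypothetical placeholder for the 1,000 values exported from the BigQuery query above:

import matplotlib.pyplot as plt

# bq_beta_draws is a hypothetical placeholder: load it with the 1,000
# values returned by the BigQuery query above (e.g., from an exported CSV)
fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))
left.hist(bq_beta_draws, bins=100)
left.set_title('Naive draws using BigQuery')
right.hist(scipy_beta_draw, bins=100)
right.set_title("SciPy's beta.rvs()")
plt.show()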

(Left): Naive draws using BigQuery. (Right): Draws using SciPy’s beta.rvs()

Well, it doesn’t take a magnifying glass to realize that the distributions are different. I went back to the beta distribution’s definition and realized that it might be because the beta distribution also has a normalization constant, which depends on the gamma function, that I didn’t include in the calculation 🤦.

Problem: the gamma function does not have a closed-form expression, and BigQuery doesn’t provide an implementation that approximates it. At this point I decided to switch to Python, a language that I’m more familiar with and that would make my experimentation more efficient. The thinking was that if I nailed it down in Python, I’d be able to translate it to SQL. I would still need some way to approximate the gamma function, but one step at a time.

Let’s implement a manual draw from a beta distribution in Python, but now with the correct constant using SciPy’s gamma function:

import numpy as np
from scipy.special import gamma
from scipy.stats import uniform

alpha_param = 0.674 * 72
beta_param = (1 - 0.674) * 72

constant = gamma(alpha_param + beta_param) / (gamma(alpha_param) * gamma(beta_param))
scipy_manual_beta_draw = np.array([
    constant * pow(x, alpha_param - 1) * pow(1 - x, beta_param - 1)
    for x in uniform.rvs(size=1000)
])

Let’s examine the distribution using a 100-bin histogram again:

Naive draws using Python

The first thing we notice is that the scale is now different, but the distribution still looks like the one drawn in BigQuery.

… something is wrong… it’s time for a short walk to think 🚶

After a short walk:

What does drawing from a random distribution actually mean? What I had implemented so far was randomly sampling the beta probability density function (PDF), and it wasn’t working.

So I had to dig up some statistics classes and refresh the basics.

In short, the conclusion is that drawing from a random variable actually means sampling from the inverse cumulative distribution function (CDF), not from the probability density function (PDF) like I had been doing so far.

Of course 🤦. My probability professor, who I just learned had passed away from illness in 2020, would have encouraged me to “review the basics” at this point.
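To make that concrete before revisiting the beta case: when the inverse CDF has a closed form, inverse transform sampling is a one-liner. A minimal sketch with the exponential distribution, used here purely as an illustration:

import numpy as np

# For the exponential distribution the CDF is F(x) = 1 - exp(-lam * x),
# so the quantile function has the closed form F_inv(q) = -ln(1 - q) / lam.
# Plugging uniform draws into F_inv yields exponential draws.
lam = 2.0
u = np.random.uniform(size=1000)    # uniform draws on [0, 1)
exp_draws = -np.log(1.0 - u) / lam  # distributed as Exponential(lam)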

Ok. Let’s revisit the Python code, now drawing samples from the inverse CDF (which is also called the quantile function) of our beta distribution, and compare it to the distribution drawn using SciPy’s beta.rvs():

import numpy as np
from scipy.special import gamma
from scipy.stats import uniform, beta

alpha_param = 0.674 * 72
beta_param = (1 - 0.674) * 72
n_draws = 1000

# Use SciPy rvs for comparison
scipy_beta_draw = beta.rvs(alpha_param, beta_param, size=n_draws)

# Manual beta draw with the help of SciPy's gamma function

# We start with a discrete analogue of the beta PDF we wish to draw from.
# This is just sampling the PDF at fixed intervals, but do check out
# this review for a more in-depth treatment of the subject:
# https://jsdajournal.springeropen.com/articles/10.1186/s40488-015-0028-6

# Set the resolution for generating the discrete PDF
n_samples = 1000

# The beta distribution is supported on the range [0, 1], so we set the
# PDF min and max parameters accordingly
pdf_min = 0.0
pdf_max = 1.0

x_span = np.linspace(pdf_min, pdf_max, n_samples)
constant = gamma(alpha_param + beta_param) / (gamma(alpha_param) * gamma(beta_param))
beta_pdf = np.array([
    constant * pow(x, alpha_param - 1) * pow(1 - x, beta_param - 1)
    for x in x_span
])

# Using the discrete beta PDF, we now compute a discrete beta CDF.
# To do that, we integrate the PDF: for each point x, we sum the PDF up
# to that point and multiply by the width of each sample.
freq = 1.0 / n_samples
beta_cdf = beta_pdf.cumsum() * freq

def inv(cdf, q):
    """Return the inverse CDF (quantile function) for value q:
    the first x whose cumulative probability reaches q."""
    return x_span[np.argmin(cdf < q)]

# Finally, we can draw n_draws from the discrete inverse CDF, aka
# generate random samples from a beta distribution
manual_beta_draw = np.array([
    inv(beta_cdf, x)
    for x in uniform.rvs(size=n_draws)
])

*phew* this looks much better:

An overlay of two histograms comparing 1,000 draws using SciPy’s beta.rvs() and 1,000 manual draws
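As an aside, SciPy already exposes the quantile function directly as beta.ppf, so the same inverse-transform draw can be written in one line; a handy cross-check for the manual implementation above:

from scipy.stats import beta, uniform

alpha_param = 0.674 * 72
beta_param = (1 - 0.674) * 72

# beta.ppf is the inverse CDF (quantile function); feeding it uniform
# samples is exactly inverse transform sampling, so this should match
# both beta.rvs() and the manual draw above (up to discretization error)
ppf_beta_draw = beta.ppf(uniform.rvs(size=1000), alpha_param, beta_param)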

Now that we’ve got drawing samples from a random variable straight, it’s time to move back to SQL. For the sake of simplicity, and because BigQuery does not readily come with an implementation of the gamma function, I’m going to draw from the logistic distribution (with parameters a=0 and b=1) instead.

-- The following 3 parameters need to be adjusted based on the support of the
-- PDF of the distribution you wish to draw from. These values are set for a
-- logistic distribution with a=0 and b=1
DECLARE pdf_min INT64 DEFAULT -10;
DECLARE pdf_max INT64 DEFAULT 10;
DECLARE n_samples INT64 DEFAULT 5000;
DECLARE sampling_step FLOAT64 DEFAULT (pdf_max - pdf_min) / n_samples;

-- The number of random draws you wish to perform
DECLARE n_draws INT64 DEFAULT 1000;

WITH pdf AS (
  -- The discrete sampling of the logistic distribution PDF
  SELECT
    x
    , EXP(-x) / POW(1 + EXP(-x), 2) AS y -- a=0, b=1
  FROM UNNEST(GENERATE_ARRAY(pdf_min, pdf_max, sampling_step)) AS x
), cdf AS (
  -- The discrete CDF, obtained as a running sum over the PDF
  SELECT
    x
    , SUM(y) OVER (ORDER BY x) * (1.0 / n_samples) AS y
  FROM pdf
), random_draws AS (
  -- Random draws in the range [0, max(cdf)]
  SELECT
    RAND() * (SELECT MAX(y) FROM cdf) AS q
    , draw_id
  FROM UNNEST(GENERATE_ARRAY(1, n_draws)) AS draw_id
)
-- Calculate the inverse CDF per draw (the quantile function) by generating
-- an array of the discrete support of the distribution and returning the
-- first x at which the discrete CDF reaches the randomly drawn number
SELECT
  ARRAY_AGG(x ORDER BY x)[OFFSET(SUM(CAST(y < q AS INT64)))] AS x
FROM random_draws
JOIN cdf
  ON TRUE
GROUP BY draw_id;
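As a sanity check (an aside, not part of the pipeline), the logistic distribution has a closed-form quantile function, so the SQL draws can be compared against exact inverse-transform samples in a few lines of Python:

import numpy as np

# Logistic(a=0, b=1): CDF F(x) = 1 / (1 + exp(-(x - a) / b)), which
# inverts to the quantile function F_inv(q) = a + b * ln(q / (1 - q))
a, b = 0.0, 1.0
q = np.random.uniform(size=1000)
logistic_draws = a + b * np.log(q / (1.0 - q))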

Let’s now compare the distributions of the three sampling methods:

  1. SciPy’s logistic.rvs()
  2. Manually sampling the logistic distribution PDF in Python and drawing from its discretized inverse CDF, as in the beta example above
  3. Doing the same in SQL
An overlay of three histograms comparing 1,000 draws using SciPy’s logistic.rvs(), a manual draw in Python and a manual draw in SQL

This looks like a success to me! 💪

The SQL code above samples from the logistic distribution, but it should work for any distribution where you can get a discrete representation of the PDF by sampling it at consistent intervals!
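To wrap up, here’s what that recipe looks like as a small generic Python helper; a sketch, assuming the PDF is well behaved on the support you give it:

import numpy as np

def draw_from_pdf(pdf, pdf_min, pdf_max, n_samples=5000, n_draws=1000):
    """Draw samples from an arbitrary PDF via a discretized inverse CDF."""
    x_span = np.linspace(pdf_min, pdf_max, n_samples)
    cdf = np.cumsum(pdf(x_span))
    cdf /= cdf[-1]  # normalizing makes any missing scaling constant cancel out
    q = np.random.uniform(size=n_draws)
    # For each q, return the first x whose cumulative probability reaches q
    return x_span[np.searchsorted(cdf, q)]

# Example: the logistic PDF (a=0, b=1) sampled in the SQL above
draws = draw_from_pdf(lambda x: np.exp(-x) / (1 + np.exp(-x)) ** 2, -10, 10)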


