Guest post by Miklos Racz: A primer on exact recovery in the general stochastic block model

Foreword: this will be a series of blog posts written by our Theory Group postdoc Miklos Racz. It is essentially the transcript of a mini-course that Miki gave at the University of Washington and at the XX Brazilian School of Probability, and that I gave at the University of Nice. These notes are also posted on arXiv. I now leave the floor to Miki:

 

Community detection is a fundamental problem in many sciences, such as sociology (e.g., finding tight-knit groups in social networks), biology (e.g., detecting protein complexes), and beyond. Given its importance, a plethora of algorithms has been developed over the past few decades to detect communities. In the past few years there have also been many works studying the fundamental limits of these recovery algorithms.

This post describes the recent results of Abbe and Sandon which characterize the fundamental limits of exact recovery in the most canonical and popular probabilistic generative model, the stochastic block model (SBM).

The stochastic block model and exact recovery

The general stochastic block model is a distribution on graphs with latent community structure, and it has three parameters:

n, the number of vertices;

a probability distribution p = (p_1, \dots, p_k) that describes the relative sizes of the communities;

and \overline{Q} \in \left[0,1\right]^{k \times k}, a symmetric k \times k matrix that describes the probabilities with which two given vertices are connected, depending on which communities they belong to.

The number of communities, k, is implicit in this notation; in these notes we assume that k is a fixed constant.

A random graph from \mathrm{SBM}( n, p, \overline{Q}) is defined as follows:

  • The vertex set of the graph is V = \left\{ 1, \dots, n \right\} \equiv \left[ n \right].
  • Every vertex v \in V is independently assigned a (hidden) label \sigma_v \in \left[ k \right] from the probability distribution p on \left[k \right]. That is, \mathbb{P} \left( \sigma_v = i \right) = p_i for every i \in \left[k \right].
  • Given the labels of the vertices, each (unordered) pair of distinct vertices \left\{ u, v \right\} \subseteq V is connected independently with probability \overline{Q}_{\sigma_{u}, \sigma_{v}}.

Example [Symmetric communities]

A simple example to keep in mind is that of symmetric communities, with more edges within communities than between communities.

This is modeled by the SBM with p_i = 1/k for all i \in \left[ k \right] and \overline{Q}_{i,j} = a if i = j and \overline{Q}_{i,j} = b otherwise, with a > b > 0.

We write G \sim \mathrm{SBM}(n,p,\overline{Q}) for a graph generated according to the SBM without the hidden vertex labels revealed.
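
To make the generative model concrete, here is a minimal sketch in Python (using NumPy) of how one might sample a graph from \mathrm{SBM}(n,p,\overline{Q}); the function name sample_sbm and the adjacency-matrix representation are my own illustrative choices, not part of the original notes.

    import numpy as np

    def sample_sbm(n, p, Qbar, seed=None):
        """Sample (labels, adjacency matrix) from the general stochastic block model.

        n    : number of vertices
        p    : length-k vector of community probabilities (sums to 1)
        Qbar : symmetric k x k matrix of connection probabilities
        """
        rng = np.random.default_rng(seed)
        # Each vertex independently receives a hidden label drawn from p.
        labels = rng.choice(len(p), size=n, p=p)
        # Connect each unordered pair {u, v} independently with probability
        # Qbar[sigma_u, sigma_v]: keep the strict upper triangle, then symmetrize.
        probs = Qbar[np.ix_(labels, labels)]
        upper = np.triu(rng.random((n, n)) < probs, k=1)
        A = upper | upper.T
        return labels, A

    # Example: two symmetric communities in the logarithmic-degree regime.
    n, a, b = 1000, 9.0, 1.0
    Qbar = np.log(n) / n * np.array([[a, b], [b, a]])
    labels, A = sample_sbm(n, [0.5, 0.5], Qbar, seed=0)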

The goal of a statistical inference algorithm is to recover as many labels as possible using only the underlying graph as an observation.

There are various notions of success that are worth studying. In this post we focus on exact recovery: we aim to recover the labels of all vertices exactly with high probability (whp).

More precisely, in evaluating the success of an algorithm, the agreement of a partition with the true partition is maximized over all relabellings of the communities, since we are not interested in the specific original labelling per se, but rather the partition (community structure) it induces.

For exact recovery, all vertices in all but one community should be non-isolated (in the symmetric case this means that the graph should be connected), requiring the edge probabilities to be \Omega \left( \ln(n) / n \right).

It is thus natural to scale the edge probability matrix accordingly, i.e., to consider \mathrm{SBM} \left( n, p, \ln \left( n \right) Q / n \right), where Q \in \mathbb{R}_{+}^{k \times k}.

We also assume that the communities have linear size, i.e., that p is independent of n, and p_i \in (0,1) for all i.

From exact recovery to testing multivariate Poisson distributions

As a thought experiment, imagine that not only is the graph G given, but also all vertex labels are revealed, except for that of a given vertex v \in V. Is it possible to determine the label of v?

Understanding this question is key for understanding exact recovery, since if the error probability of this is too high, then exact recovery will not be possible. On the other hand, it turns out that in this regime it is possible to recover all but o(n) labels using an initial partial recovery algorithm. The setup of the thought experiment then becomes relevant, and if we can determine the label of v given the labels of all the other nodes with low error probability, then we can correct all errors made in the initial partial recovery algorithm, leading to exact recovery. We will come back to the connection between the thought experiment and exact recovery; for now we focus on understanding this thought experiment.

Given the labels of all vertices except v, the information we have about v is the number of nodes in each community it is connected to. In other words, we know the degree profile d(v) of v, where, for a given labelling of the graph’s vertices, the i-th component d_{i} (v) is the number of edges between v and the vertices in community i.

The distribution of the degree profile d(v) depends on the community that v belongs to. Recall that the community sizes are given by a multinomial distribution with parameters n and p, and hence the relative size of community i \in [k] concentrates on p_i. Thus if \sigma_v = j, the degree profile d(v) = (d_1(v), \dots, d_k(v)) can be approximated by independent binomials, with d_i (v) approximately distributed as

    \[\mathrm{Bin} \left( np_i, \ln(n) Q_{i,j} / n \right) ,\]

where \mathrm{Bin}(m,q) denotes the binomial distribution with m trials and success probability q. In this regime, the binomial distribution is well-approximated by a Poisson distribution of the same mean. Thus the degree profile of a vertex in community j is approximately Poisson distributed with mean

    \[\ln \left( n \right) \sum_{i \in [k]} p_i Q_{i,j} e_i ,\]

where e_i is the i-th unit vector. Defining P = \mathrm{diag}(p), this can be abbreviated as \ln \left( n \right) \left( PQ \right)_{j}, where \left( PQ \right)_{j} denotes the j-th column of the matrix PQ.

We call the quantity \left( PQ \right)_{j} the community profile of community j; this is the quantity that determines the distribution of the degree profile of vertices from a given community.
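
As a small numerical illustration (a sketch using the symmetric two-community example from above; the variable names are my own), the community profiles are simply the columns of PQ, and the Poisson approximation of a degree profile is \ln(n) times the appropriate column:

    import numpy as np

    # Symmetric two-community example: p = (1/2, 1/2), Q = [[a, b], [b, a]].
    n, a, b = 1000, 9.0, 1.0
    p_vec = np.array([0.5, 0.5])
    Q = np.array([[a, b], [b, a]])
    PQ = np.diag(p_vec) @ Q              # column j is the community profile (PQ)_j

    # Poisson approximation of the degree profile of a vertex in community 0:
    mean_profile = np.log(n) * PQ[:, 0]  # ln(n) * (a/2, b/2), approximately [31.1, 3.5] here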

Our thought experiment has thus been reduced to a Bayesian hypothesis testing problem between k multivariate Poisson distributions.

The prior on the label of v is given by p, and we get to observe the degree profile d(v), which comes from one of k multivariate Poisson distributions, which have mean \ln(n) times the community profiles \left( PQ \right)_j, j \in [k].

Testing multivariate Poisson distributions

We now turn to understanding the testing problem described above; the setup is as follows.

We consider a Bayesian hypothesis testing problem with k hypotheses.

The random variable H takes values in [k] with prior given by p, i.e., \mathbb{P} \left( H = j \right) = p_j.

We do not observe H, but instead we observe a draw from a multivariate Poisson distribution whose mean depends on the realization of H:

given H = j, the mean is \lambda(j) \in \mathbb{R}_{+}^{k}.

In short:

    \[ D \, | \, H = j \sim \mathrm{Poi} \left( \lambda(j) \right), \qquad j \in [k]. \]

In more detail:

    \[ \mathbb{P} \left( D = d \, \middle| \, H = j \right) = \mathcal{P}_{\lambda(j)} \left( d \right), \qquad d \in \mathbb{Z}_{+}^{k}, \]

where

    \[ \mathcal{P}_{\lambda(j)} \left( d \right) = \prod_{i \in [k]} \mathcal{P}_{\lambda_{i}(j)} \left( d_{i} \right) \]

and

    \[ \mathcal{P}_{\lambda_{i}(j)} \left( d_{i} \right) = \frac{\lambda_{i}(j)^{d_{i}}}{d_{i}!} e^{-\lambda_{i}(j)}. \]

Our goal is to infer the value of H from a realization of D.

The error probability is minimized by the maximum a posteriori (MAP) rule, which, upon observing D = d, selects

    \[ \mathrm{argmax}_{j \in [k]} \mathbb{P} \left( D = d \, \middle| \, H = j \right) p_j \]

as an estimate for the value of H, with ties broken arbitrarily.
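
To make the MAP rule concrete, here is a minimal sketch of the estimator for the product-Poisson model (Python with SciPy; the helper name map_estimate and the toy numbers are my own choices):

    import numpy as np
    from scipy.stats import poisson

    def map_estimate(d, lambdas, prior):
        """MAP estimate of H from an observed count vector d.

        d       : length-k vector of observed counts
        lambdas : list of k mean vectors, lambdas[j] = lambda(j)
        prior   : length-k prior vector p
        """
        # Score each hypothesis by its log-posterior (up to a constant)
        # under the product-Poisson likelihood.
        log_post = [poisson.logpmf(d, lam).sum() + np.log(pj)
                    for lam, pj in zip(lambdas, prior)]
        return int(np.argmax(log_post))

    # Toy usage with two hypotheses whose means are ln(n) times the community
    # profiles of the symmetric example: ln(n)(a/2, b/2) and ln(n)(b/2, a/2).
    n, a, b = 1000, 9.0, 1.0
    lam1 = np.log(n) * np.array([a / 2, b / 2])
    lam2 = np.log(n) * np.array([b / 2, a / 2])
    print(map_estimate(np.array([30, 4]), [lam1, lam2], [0.5, 0.5]))
    # -> 0 (the first hypothesis; Python indices are 0-based)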

Let P_e denote the error of the MAP estimator.

One can think of the MAP estimator as a tournament of k-1 pairwise comparisons of the hypotheses:

if

    \[\mathbb{P} \left( D = d \, \middle| \, H = i \right) p_i >\mathbb{P} \left( D = d \, \middle| \, H = j \right) p_j\]

then the MAP estimate is not j.

The probability that one makes an error during such a comparison is exactly

(1)   \begin{equation*} P_{e} \left( i, j \right) := \sum_{x \in \mathbb{Z}_{+}^{k}} \min \left\{ \mathbb{P} \left( D = x \, \middle| \, H = i \right) p_i , \mathbb{P} \left( D = x \, \middle| \, H = j \right) p_j \right\}. \end{equation*}

For finite k, the error of the MAP estimator is on the same order as the largest pairwise comparison error, i.e., \max_{i,j} P_{e} \left( i, j \right).

In particular, we have that

(2)   \begin{equation*} \frac{1}{k-1} \sum_{i < j} P_{e} \left( i, j \right) \leq P_{e} \leq \sum_{i < j} P_{e} \left( i, j \right). \end{equation*}

Thus we desire to understand the magnitude of the error probability P_{e} \left( i, j \right) in (1) in the particular case when the conditional distribution of D given H is a multivariate Poisson distribution with mean vector on the order of \ln \left( n \right). The following result determines this error up to first order in the exponent.

Lemma [Abbe and Sandon 2015]

For any c_{1}, c_{2} \in \left( 0, \infty \right)^{k} with c_1 \neq c_2 and p_1, p_2 > 0, we have

(3)   \begin{align*} \sum_{x \in \mathbb{Z}_{+}^{k}} \min \left\{ \mathcal{P}_{\ln \left( n \right) c_{1}} \left( x \right) p_{1}, \mathcal{P}_{\ln \left( n \right) c_{2}} \left( x \right) p_{2} \right\} &= O \left( n^{-D_{+} \left( c_1, c_2 \right) - \tfrac{\ln \ln \left( n \right)}{2 \ln \left( n \right)}} \right), \\ \sum_{x \in \mathbb{Z}_{+}^{k}} \min \left\{ \mathcal{P}_{\ln \left( n \right) c_{1}} \left( x \right) p_{1}, \mathcal{P}_{\ln \left( n \right) c_{2}} \left( x \right) p_{2} \right\} &= \Omega \left( n^{-D_{+} \left( c_1, c_2 \right) - \tfrac{k \ln \ln \left( n \right)}{2 \ln \left( n \right)}} \right). \end{align*}

where

(4)   \begin{equation*} D_{+} \left( c_{1}, c_{2} \right) = \max_{t \in \left[ 0, 1 \right]} \sum_{i \in \left[ k \right]} \left( t c_{1} \left( i \right) + \left( 1 - t \right) c_{2} \left( i \right) - c_{1} \left( i \right)^{t} c_{2} \left( i \right)^{1-t} \right). \end{equation*}

Due to connections with other, well-known measures of divergence (which we do not discuss here), Abbe and Sandon termed D_{+} the Chernoff-Hellinger divergence.
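
Since D_{+} is defined by a one-dimensional maximization over t \in [0,1], it is straightforward to evaluate numerically. The following sketch (the function name ch_divergence and the simple grid search are my own choices) computes (4) directly:

    import numpy as np

    def ch_divergence(c1, c2, grid=1001):
        """Chernoff-Hellinger divergence D_+(c1, c2) of two positive vectors,
        computed by maximizing the expression in (4) over t on a uniform grid."""
        c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
        t = np.linspace(0.0, 1.0, grid)[:, None]           # column vector of t values
        vals = (t * c1 + (1 - t) * c2 - c1 ** t * c2 ** (1 - t)).sum(axis=1)
        return vals.max()

    # Two symmetric communities with a = 9, b = 1: D_+ = (sqrt(a) - sqrt(b))^2 / 2 = 2.
    print(ch_divergence([4.5, 0.5], [0.5, 4.5]))           # -> 2.0 (up to grid resolution)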

We do not go over the proof of this statement—which we leave to the reader as a challenging exercise—but we provide some intuition in the univariate case.

Observe that

    \[\min \left\{ \mathcal{P}_{\lambda} \left( x \right), \mathcal{P}_{\mu} \left( x \right) \right\}\]

decays rapidly away from

    \[x_{\mathrm{max}} := \mathrm{argmax}_{x \in \mathbb{Z}_{+}} \min \left\{ \mathcal{P}_{\lambda} \left( x \right), \mathcal{P}_{\mu} \left( x \right) \right\} ,\]

so we can obtain a good estimate of the sum

    \[\sum_{x \in \mathbb{Z}_{+}} \min \left\{ \mathcal{P}_{\lambda} \left( x \right), \mathcal{P}_{\mu} \left( x \right) \right\}\]

by simply estimating the term

    \[\min \left\{ \mathcal{P}_{\lambda} \left( x_{\mathrm{max}} \right), \mathcal{P}_{\mu} \left( x_{\mathrm{max}} \right) \right\} .\]

Now observe that x_{\mathrm{max}} must satisfy

    \[\mathcal{P}_{\lambda} \left( x_{\mathrm{max}} \right) \approx \mathcal{P}_{\mu} \left( x_{\mathrm{max}} \right) ;\]

after some algebra this is equivalent to

    \[x_{\mathrm{max}} \approx \frac{\lambda - \mu}{\log \left( \lambda / \mu \right)} .\]

Let t^{*} denote the maximizer in the expression of D_{+} \left( \lambda, \mu \right) in (4).

By differentiating in t, we obtain that t^{*} satisfies

    \[\lambda - \mu - \log \left( \lambda / \mu \right) \cdot \lambda^{t^{*}} \mu^{1 - t^{*}} = 0 ,\]

and so

    \[\lambda^{t^{*}} \mu^{1 - t^{*}} = \frac{\lambda - \mu}{\log \left( \lambda / \mu \right)} .\]

Thus we see that

    \[x_{\mathrm{max}} \approx \lambda^{t^{*}} \mu^{1 - t^{*}} ,\]

from which, after some algebra, we get that

    \[\mathcal{P}_{\lambda} \left( x_{\mathrm{max}} \right) \approx \mathcal{P}_{\mu} \left( x_{\mathrm{max}} \right) \approx \exp \left( - D_{+} \left( \lambda, \mu \right) \right) .\]
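
For completeness, here is one way to carry out that algebra. By Stirling's formula, \log \mathcal{P}_{\lambda} \left( x \right) \approx x \log \lambda - \lambda - x \log x + x, so plugging in x_{\mathrm{max}} \approx \lambda^{t^{*}} \mu^{1 - t^{*}} and using x_{\mathrm{max}} \log \left( \lambda / \mu \right) \approx \lambda - \mu we get

    \[ \log \mathcal{P}_{\lambda} \left( x_{\mathrm{max}} \right) \approx x_{\mathrm{max}} \left( 1 - t^{*} \right) \log \left( \lambda / \mu \right) + x_{\mathrm{max}} - \lambda \approx - \left( t^{*} \lambda + \left( 1 - t^{*} \right) \mu - \lambda^{t^{*}} \mu^{1 - t^{*}} \right) = - D_{+} \left( \lambda, \mu \right) , \]

and similarly for \mathcal{P}_{\mu} \left( x_{\mathrm{max}} \right).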

The proof of (3) in the multivariate case follows along the same lines: the single term corresponding to

    \[x_{\mathrm{max}} := \mathrm{argmax}_{x \in \mathbb{Z}_{+}^{k}} \min \left\{ \mathcal{P}_{\ln \left( n \right) c_{1}} \left( x \right), \mathcal{P}_{\ln \left( n \right) c_{2}} \left( x \right) \right\}\]

gives the lower bound. For the upper bound of (3) one has to show that the other terms do not contribute much more.

Characterizing exact recoverability using CH-divergence

Our conclusion is thus that the error exponent in testing multivariate Poisson distributions is given by the explicit quantity D_{+} in (4).

The discussion above then implies that D_{+} plays an important role in the threshold for exact recovery.

In particular, it intuitively follows from the above Lemma that a necessary condition for exact recovery should be that

    \[ \min_{i,j \in [k], i \neq j} D_{+} \left( \left( PQ \right)_{i}, \left( PQ \right)_{j} \right) \geq 1. \]

Suppose on the contrary that

    \[ D_{+} \left( \left( PQ \right)_{i}, \left( PQ \right)_{j} \right) < 1 \]

for some i and j.

This implies that the error probability in the testing problem is \Omega \left( n^{\varepsilon - 1} \right) for some \varepsilon > 0 for all vertices in communities i and j. Since the number of vertices in these communities is linear in n, and most of the hypothesis testing problems are approximately independent, one expects the probability that no error is made in any of these testing problems to be at most

    \[ \left( 1 - \Omega \left( n^{\varepsilon - 1} \right) \right)^{\Omega \left( n \right)} = \exp\left( - \Omega \left( n^{\varepsilon} \right) \right) = o(1). \]

It turns out that this indeed gives the recoverability threshold.

Theorem [Abbe and Sandon 2015]

Let k \in \mathbb{Z}_{+} denote the number of communities,

let p \in (0,1)^{k} with \left\| p \right\|_{1} = 1 denote the community prior,

let P = \mathrm{diag}(p),

and let Q \in \left( 0, \infty \right)^{k \times k} be a symmetric k \times k matrix with no two rows equal.

Exact recovery is solvable in \mathrm{SBM} \left( n, p, \ln(n) Q / n \right) if and only if

(5)   \begin{equation*} \min_{i,j \in [k], i \neq j} D_{+} \left( \left( PQ \right)_{i}, \left( PQ \right)_{j} \right) \geq 1. \end{equation*}

This theorem thus provides an operational meaning to the CH-divergence for the community recovery problem.
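
Numerically, condition (5) is easy to check for any given pair (p, Q); the sketch below reuses the hypothetical ch_divergence helper from earlier (again, the function names are my own).

    import numpy as np

    def exact_recovery_solvable(p, Q):
        """Check the CH-divergence condition (5): the minimum of
        D_+((PQ)_i, (PQ)_j) over pairs i < j must be at least 1."""
        p, Q = np.asarray(p, float), np.asarray(Q, float)
        PQ = np.diag(p) @ Q                  # column j is the community profile (PQ)_j
        k = len(p)
        divs = [ch_divergence(PQ[:, i], PQ[:, j])
                for i in range(k) for j in range(i + 1, k)]
        return min(divs) >= 1.0

    # Two symmetric communities: solvable iff |sqrt(a) - sqrt(b)| >= sqrt(2).
    print(exact_recovery_solvable([0.5, 0.5], [[9.0, 1.0], [1.0, 9.0]]))   # True  (D_+ = 2)
    print(exact_recovery_solvable([0.5, 0.5], [[3.0, 1.0], [1.0, 3.0]]))   # False (D_+ ~ 0.27)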

Example [Symmetric communities]

Consider again k symmetric communities, that is,

p_i = 1/k for all i \in [k],

Q_{i,j} = a if i = j,

and Q_{i,j} = b otherwise, with a, b > 0.

Then exact recovery is solvable in \mathrm{SBM} \left( n, p, \ln(n) Q / n \right) if and only if

(6)   \begin{equation*} \left| \sqrt{a} - \sqrt{b} \right| \geq \sqrt{k}. \end{equation*}

We note that in this case D_{+} is the same as the Hellinger divergence.
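
To see how (6) follows from the general criterion (5), note that in this symmetric case the community profiles \left( PQ \right)_{i} and \left( PQ \right)_{j} agree in all coordinates except the i-th and the j-th, where one equals a/k and the other b/k. The expression in (4) is concave in t and, here, symmetric about t = 1/2, so the maximum is attained at t = 1/2, giving

    \[ D_{+} \left( \left( PQ \right)_{i}, \left( PQ \right)_{j} \right) = \frac{a + b}{k} - \frac{2 \sqrt{ab}}{k} = \frac{\left( \sqrt{a} - \sqrt{b} \right)^{2}}{k} , \]

and the condition D_{+} \geq 1 is precisely (6).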

Above we have heuristically described why the condition (5) is necessary.

To see why it is sufficient, first note that if (5) holds, then the Lemma tells us that in the hypothesis testing problem between Poisson distributions the error of the MAP estimate is o(1/n).

Thus if the setting of the thought experiment applies to every vertex, then by looking at the degree profiles of the vertices we can correctly reclassify all vertices, and the probability that we make an error is o(1) by a union bound.

However, the setting of the thought experiment does not quite apply. Nonetheless, in this logarithmic degree regime it is possible to partially reconstruct the labels of the vertices, with only o(n) vertices being misclassified. The details of this partial reconstruction procedure would require a separate post; in brief, it determines whether two vertices are in the same community or not by looking at how their \log(n)-sized neighborhoods interact.

One can then apply two rounds of the degree-profiling step to each vertex, and it can be shown that all vertices are then correctly labelled whp.
