THE END

In addition to being the best comedy TV show ever, Seinfeld was a great source of wisdom. In one episode, Jerry counsels George: “When you hit that high note, you say goodnight and walk off.” Later, George has a good line at a meeting at Kruger Industrial Smoothing. Then he says: “Alright! That’s it for me. Goodnight everybody.” The show followed its own advice by ending production after Season 9, despite the fact that it could have continued for several more years.

Other shows have made the same wise choice. Breaking Bad ended while it was still going strong. So did Deadwood, Friday Night Lights, Flight of the Conchords, Rome, and, of course, Star Trek. (Not by choice, in the case of Star Trek.)

Then there are the shows that ended after their prime. Examples are Dexter (should have stopped after Trinity), MASH (three good seasons), and the ultimate example, Happy Days, which gave birth to the expression “jumping the shark” because of an episode where Fonzie literally did a waterski jump over a shark.

Which brings me to the point of this post which, to quote George Costanza, is: Alright! That’s it for me. Goodnight everybody.

I have pretty much covered all the topics I wanted to cover in this blog. So at this point I have three choices:

1. Start reporting on daily trivia such as the antics of my cats Dizzy and Miles. (Did I mention that Dizzy caught a mouse a few days ago?)

2. Start posting every other month. (No good: people stop reading.)

3. Stop.

I think the third option is the best.

To all of you who have been reading the blog: thanks. And a special thanks to those who have commented or written guest posts.

Who knows, perhaps I will start another blog someday. But for now, Hasta la vista, baby.

Stein’s Method

I have mentioned Stein’s method in passing a few times on this blog. Today I want to talk about it in a bit more detail.

1. What Is Stein’s Method?

Stein’s method, due to Charles Stein, is actually quite old, going back to 1972. But there has been a great deal of interest in the method lately. As in many things, Stein was ahead of his time.

Stein’s method is actually a collection of methods for showing that the distribution {F_n} of some random variable {X_n} is close to Normal. What makes Stein’s method so powerful is that it can be used in a variety of situations (including cases with dependence) and it tells you how far {F_n} is from Normal.

I will follow Chen, Goldstein and Shao (2011) quite closely.

2. The Idea

Let {X_1,\ldots, X_n\in \mathbb{R}} be iid with mean 0 and variance 1. Let

\displaystyle  X = \sqrt{n}\ \overline{X}_n =\frac{1}{\sqrt{n}}\sum_i X_i

and let {Z\sim N(0,1)}. We want to bound

\displaystyle  \Delta_n = \sup_z \Bigl| \mathbb{P}(X \leq z) - \mathbb{P}( Z \leq z)\Bigr| = \sup_z \Bigl| \mathbb{P}(X \leq z) - \Phi(z)\Bigr|

where {\Phi} is the cdf of a standard Normal. For simplicity, we will assume here

\displaystyle  |X_i| \leq B < \infty

but this is not needed. The key idea is this:

\displaystyle  \mathbb{E}[Y f(Y)] = \mathbb{E}[f'(Y)]

for all smooth {f} if and only if {Y\sim N(0,1)}. This suggests the following idea: if we can show that {\mathbb{E}[Y f(Y)-f'(Y)]} is close to 0, then {Y} should be almost Normal.
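
As a quick illustration (my own addition, not part of the original argument), here is a hedged Monte Carlo check of this characterization, assuming NumPy is available: the identity holds, up to simulation error, for a standard Normal but clearly fails for a centered Exponential.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sim = 10 ** 6

f = np.tanh                                    # a smooth, bounded test function
fprime = lambda y: 1.0 - np.tanh(y) ** 2

z = rng.standard_normal(n_sim)                 # Y ~ N(0,1)
w = rng.exponential(size=n_sim) - 1.0          # a mean-zero but non-Normal variable

print(np.mean(z * f(z)) - np.mean(fprime(z)))  # ~ 0: the identity holds for the Normal
print(np.mean(w * f(w)) - np.mean(fprime(w)))  # clearly nonzero: it fails otherwise
```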

More precisely, let {Z\sim N(0,1)} and let {h} be any function such that {\mathbb{E}|h(Z)| < \infty}. The Stein function {f} associated with {h} is a function satisfying

\displaystyle  f'(x) - x f(x) = h(x) - \mathbb{E}[h(Z)].

It then follows that

\displaystyle  \mathbb{E}[h(X)] - \mathbb{E}[h(Z)]= \mathbb{E}[f'(X) - Xf(X)]

and showing that {X} is close to normal amounts to showing that {\mathbb{E}[f'(X) - Xf(X)]} is small.

Is there such an {f}? In fact, letting {\mu = \mathbb{E}[h(Z)]}, the Stein function is

\displaystyle  f(x) =f_h(x)= e^{x^2/2}\int_{-\infty}^x[ h(y) - \mu] e^{-y^2/2} dy. \ \ \ \ \ (1)

More precisely, {f_h} is the unique solution to the above equation subject to the side condition {\lim_{x\rightarrow\pm\infty} e^{-x^2/2}f(x) = 0}.

Let’s be more specific. Choose any {z\in \mathbb{R}} and let {h(x) = I(x\leq z) - \Phi(z)}. Let {f_z} denote the Stein function for {h}; thus {f_z} satisfies

\displaystyle  f_z'(x)-xf_z(x) = I(x\leq z) - \Phi(z).

The unique bounded solution to this equation is

\displaystyle  f(x)\equiv f_z(x) = \begin{cases} \sqrt{2\pi}e^{x^2/2}\Phi(x)[1-\Phi(z)] & x\leq z\\ \sqrt{2\pi}e^{x^2/2}\Phi(z)[1-\Phi(x)] & x> z. \end{cases}

The function {f_z} has the following properties:

\displaystyle  \Bigl| (x+a)f_z(x+a) - (x+b)f_z(x+b) \Bigr| \leq ( |x| + c)(|a|+|b|)

where {c = \sqrt{2\pi}/4}. Also,

\displaystyle  ||f_z||_\infty \leq \sqrt{\frac{\pi}{2}},\ \ \ ||f'_z||_\infty \leq 2.

Let {{\cal F} = \{f_z:\ z\in \mathbb{R}\}}. From the equation {f'(x) - x f(x) = h(x) - \mathbb{E}[h(Z)]} it follows that

\displaystyle  \mathbb{P}(X \leq z) - \mathbb{P}(Z \leq z) = \mathbb{E}[f_z'(X) - Xf_z(X)]

and so

\displaystyle  \Delta_n = \sup_z \Bigl| \mathbb{P}(X \leq z) - \mathbb{P}( Z \leq z)\Bigr| \leq \sup_{f\in {\cal F}} \Bigl| \mathbb{E}[ f'(X) - X f(X)]\Bigr|.

We have reduced the problem of bounding {\sup_z \Bigl| \mathbb{P}(X \leq z) - \mathbb{P}( Z \leq z)\Bigr|} to the problem of bounding {\sup_{f\in {\cal F}} \Bigl| \mathbb{E}[ f'(X) - X f(X)]\Bigr|}.

3. Zero Bias Coupling

As I said, Stein’s method is really a collection of methods. One of the easiest to explain is “zero-bias coupling.”

Let {\mu_3 = \mathbb{E}[|X_i|^3]}. Recall that {X_1,\ldots,X_n} are iid, mean 0, variance 1. Let

\displaystyle  X = \sqrt{n}\ \overline{X}_n = \frac{1}{\sqrt{n}}\sum_i X_i = \sum_i \xi_i

where

\displaystyle  \xi_i = \frac{1}{\sqrt{n}} X_i.

For each {i} define the leave-one-out quantity

\displaystyle  X^i = X - \xi_i

which plays a crucial role. Note that {X^i} is independent of {X_i}. We also make use of the following simple fact: for any {y} and any {a\geq 0}, {\Phi(y+a) - \Phi(y)\leq a/\sqrt{2\pi}}, since the density of the Normal is bounded above by {1/\sqrt{2\pi}}.

Recall that {X\sim N(0,\sigma^2)} if and only if

\displaystyle  \sigma^2 \mathbb{E}[f'(X)] = \mathbb{E}[X f(X)]

for all absolutely continuous functions {f} (for which the expectations exist). Inspired by this, Goldstein and Reinert (1997) introduced the following definition. Let {X} be any mean 0 random variable with variance {\sigma^2}. Say that {X^*} has the {X}-zero bias distribution if

\displaystyle  \sigma^2 \mathbb{E} [f'(X^*)] = \mathbb{E}[X f(X)].

for all absolutely continuous functions {f} for which {\mathbb{E}| X f(X)|< \infty}. Zero-biasing is a transform that maps one random variable {X} into another random variable {X^*}. (More precisely, it maps the distribution of {X} into the distribution of {X^*}.) The Normal is the fixed point of this map. The following result shows that {X^*} exists and is unique.

Theorem 1 Let {X} be any mean 0 random variable with variance {\sigma^2}. There exists a unique distribution corresponding to a random variable {X^*} such that

\displaystyle  \sigma^2 \mathbb{E}[f'(X^*)] = \mathbb{E}[X f(X)]. \ \ \ \ \ (2)

for all absolutely continuous functions {f} for which {\mathbb{E}| X f(X)|< \infty}. The distribution of {X^*} has density

\displaystyle  p^*(x) = \frac{\mathbb{E}[ X I(X>x)]}{\sigma^2} = -\frac{\mathbb{E}[ X I(X\leq x)]}{\sigma^2}.

Proof: It may be verified that {p^*(x) \geq 0} and integrates to 1. Let us verify that (2) holds. For simplicity, assume that {\sigma^2=1}. Given an absolutely continuous {f} there is a {g} such that {f(x) = \int_0^x g}. Then

\displaystyle  \begin{array}{rcl}  \int_0^\infty f'(u)\ p^*(u) du &=& \int_0^\infty f'(u)\ \mathbb{E}[X I(X>u)] du = \int_0^\infty g(u)\ \mathbb{E}[X I(X>u)] du\\ &=& \mathbb{E}\left[X \int_0^\infty g(u) I(X>u) du \right]= \mathbb{E}\left[X \int_{0}^{X\vee 0} g(u) du \right]= \mathbb{E}[X f(X)I(X\geq 0)]. \end{array}

Similarly, {\int_{-\infty}^0 f'(u)\ p^*(u) du =\mathbb{E}[X f(X)I(X< 0)]}, and adding the two pieces gives (2). \Box
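
A concrete example (my own, not from Chen, Goldstein and Shao): if {X} takes the values {\pm 1} with probability 1/2 each, then {\sigma^2=1} and {p^*(x) = \mathbb{E}[X I(X>x)] = 1/2} for {x\in[-1,1]}, so {X^*} is uniform on {[-1,1]}. The following hedged Monte Carlo sketch, assuming NumPy, checks the defining identity (2) for this pair:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sim = 10 ** 6

x = rng.choice([-1.0, 1.0], size=n_sim)        # X: Rademacher, mean 0, variance 1
x_star = rng.uniform(-1.0, 1.0, size=n_sim)    # X*: the claimed X-zero-bias distribution

f, fprime = np.sin, np.cos                     # an absolutely continuous test function

print(np.mean(x * f(x)))                       # E[X f(X)]  (about sin(1) = 0.841)
print(np.mean(fprime(x_star)))                 # E[f'(X*)]; agrees up to Monte Carlo error
```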

Here is a way to construct {X^*} explicitly when dealing with a sum.

Lemma 2 Let {\xi_1,\ldots, \xi_n} be independent, mean 0 random variables and let {\sigma_i^2 = {\rm Var}(\xi_i)}, with {\sum_i \sigma_i^2 = 1}. Let {W =\sum_i \xi_i}. Let {\xi_1^*,\ldots, \xi_n^*} be independent, with {\xi_i^*} having the {\xi_i}-zero bias distribution and independent of {(\xi_j:\ j\neq i)}. Define

\displaystyle  W^* = W - \xi_J + \xi_J^*

where {J} is independent of everything else and {\mathbb{P}(J=i)=\sigma_i^2}. Then {W^*} has the {W}-zero bias distribution. In particular, suppose that {X_1,\ldots, X_n} have mean 0 and variance 1, and let {W = \frac{1}{\sqrt{n}}\sum_i X_i}. Let {J} be a random integer drawn uniformly from {\{1,\ldots,n\}}. Then {W^* = W - \frac{1}{\sqrt{n}}X_J + \frac{1}{\sqrt{n}}X_J^*} has the {W}-zero bias distribution.
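
Here is a minimal sketch of this construction (again my own illustration, assuming NumPy) for the Rademacher example, where the zero-bias distribution of each {X_i} is uniform on {[-1,1]}:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_sim = 20, 200_000

# W = (1/sqrt(n)) * sum of n Rademacher X_i, so each xi_i has variance 1/n and J is uniform.
x = rng.choice([-1.0, 1.0], size=(n_sim, n))
w = x.sum(axis=1) / np.sqrt(n)

j = rng.integers(0, n, size=n_sim)                    # uniformly chosen index J
x_j = x[np.arange(n_sim), j]
x_j_star = rng.uniform(-1.0, 1.0, size=n_sim)         # X_J^*: zero-bias for a Rademacher
w_star = w - x_j / np.sqrt(n) + x_j_star / np.sqrt(n)

f, fprime = np.sin, np.cos
print(np.mean(w * f(w)))                              # E[W f(W)]
print(np.mean(fprime(w_star)))                        # E[f'(W*)]; matches up to Monte Carlo error
```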

Now we can prove a Central Limit Theorem using zero-biasing.

Theorem 3 Suppose that {|X_i| \leq B}. Let {Y\sim N(0,1)}. Then

\displaystyle  \sup_z \Bigl| \mathbb{P}(X \leq z) - \mathbb{P}( Y \leq z)\Bigr| \leq \frac{6 B}{\sqrt{n}}.

Proof: Let {X_1^*,\ldots, X_n^*} be independent random variables where {X_i^*} is zero-bias for {X_i}. Let {J} be chosen randomly from {\{1,\ldots, n\}} and let

\displaystyle  X^* = X - \frac{1}{\sqrt{n}}(X_J - X_J^*).

Then, by the last lemma, {X^*} is zero-bias for {X} and hence

\displaystyle  \mathbb{E}[ X f(X)] = \mathbb{E}[f'(X^*)] \ \ \ \ \ (3)

for all absolutely continuous {f}. Also note that

\displaystyle  |X^* - X| \leq \frac{2B}{\sqrt{n}} \equiv \delta. \ \ \ \ \ (4)

So,

\displaystyle  \begin{array}{rcl}  \mathbb{P}(X \leq z) - \mathbb{P}(Y \leq z) &\leq & \mathbb{P}(X \leq z) - \mathbb{P}(Y \leq z+\delta) +\mathbb{P}(Y \leq z+\delta) -\mathbb{P}(Y \leq z) \leq \mathbb{P}(X \leq z) - \mathbb{P}(Y \leq z+\delta) + \frac{\delta}{\sqrt{2\pi}}\\ &\leq & \mathbb{P}(X^* \leq z+\delta) - \mathbb{P}(Y \leq z+\delta) + \frac{\delta}{\sqrt{2\pi}} \leq \sup_z |\mathbb{P}(X^* \leq z+\delta) - \mathbb{P}(Y \leq z+\delta)| + \frac{\delta}{\sqrt{2\pi}}\\ &=& \sup_z |\mathbb{P}(X^* \leq z) - \mathbb{P}(Y \leq z)| + \frac{\delta}{\sqrt{2\pi}}. \end{array}

By a symmetric argument, we deduce that

\displaystyle  \sup_z |\mathbb{P}(X \leq z) - \mathbb{P}(Y \leq z)| \leq \sup_z |\mathbb{P}(X^* \leq z) - \mathbb{P}(Y \leq z)| + \frac{\delta}{\sqrt{2\pi}}.

Let {f=f_z}. From the properties of the Stein function given earlier we have

\displaystyle  \begin{array}{rcl}  \sup_z |\mathbb{P}(X^* \leq z) - \mathbb{P}(Y \leq z)| &\leq & \sup_{f\in {\cal F}}\Big|\mathbb{E}[ f'(X^*) - X^* f(X^*)]\Bigr|= \sup_{f\in {\cal F}}\Big|\mathbb{E}[ X f(X) - X^* f(X^*)]\Bigr|\\ &\leq & \mathbb{E}\left[ \left( |X| + \frac{\sqrt{2\pi}}{4}\right)\ |X^* - X| \right]\leq \delta \left(1 + \frac{\sqrt{2\pi}}{4}\right) \end{array}

where the last step uses {|X^* - X|\leq \delta} and {\mathbb{E}|X| \leq \sqrt{\mathbb{E}[X^2]} = 1}.

Combining these inequalities we have

\displaystyle  \sup_z |\mathbb{P}(X \leq z) - \mathbb{P}(Y \leq z)| \leq \delta \left(1 + \frac{1}{\sqrt{2\pi}} + \frac{\sqrt{2\pi}}{4}\right)\leq 3\delta =\frac{6 B}{\sqrt{n}}.

\Box
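
Before moving on, here is a rough empirical check (my own illustration, not part of the proof, assuming NumPy and SciPy) of how the actual Kolmogorov distance for Rademacher sums compares with the admittedly loose {6B/\sqrt{n}} bound:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
B, n_sim = 1.0, 50_000
grid = np.linspace(-4, 4, 801)

for n in (10, 50, 200):
    x = rng.choice([-1.0, 1.0], size=(n_sim, n))       # |X_i| <= B = 1, mean 0, variance 1
    w = np.sort(x.sum(axis=1) / np.sqrt(n))
    emp_cdf = np.searchsorted(w, grid, side="right") / n_sim   # empirical CDF on a grid
    delta_n = np.max(np.abs(emp_cdf - norm.cdf(grid)))
    print(n, round(delta_n, 3), "bound:", round(6 * B / np.sqrt(n), 3))
```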

4. Conclusion

This is just the tip of the iceberg. If you want to know more about Stein’s method, see the references below. I hope I have given you a brief hint of what it is all about.

References

Chen, Goldstein and Shao. (2011). Normal Approximation by Stein’s Method. Springer.

Nourdin and Peccati (2012). Normal Approximations With Malliavin Calculus. Cambridge.

Ross, N. (2011). Fundamentals of Stein’s method. Probability Surveys, 210-293.

Nonparametric Regression, ABC and CNN

On Monday we had an interesting seminar by Samory Kpotufe on nonparametric regression. Samory presented a method for choosing the smoothing parameter, locally, in nonparametric regression. The method is simple and intuitive: construct confidence intervals using many different values of the smoothing parameter. Choose the value at which the confidence intervals stop intersecting. He has a theorem that shows that the estimator adapts to the local dimension and to the local smoothness. Very cool. The idea is similar in spirit to ideas that have been developed by Oleg Lepski and Alex Goldenshluger. I am looking forward to seeing Samory’s paper.

On Wednesday we were fortunate to have another seminar by my old friend Christian Robert. Christian and I have not seen each other for a while so it was fun to have dinner and hang out. Christian spoke about ABC (approximate Bayesian computation). But really, ABC was just an excuse to talk about a very fundamental question: if you replace the data by a summary statistic (not a sufficient one), when will the Bayes factor be consistent? He presented sufficient conditions and then did some interesting examples. The conditions are rather technical but I don’t think this can be avoided. In our era of Big Data, this type of question will arise all the time (not just in approximate Bayesian computation) and so it was nice to see that Christian and his colleagues are working on this.

On a third, and unrelated note, I was watching CNN today. Someone named Roshini Raj (I believe she is a doctor at NYU) discussed a study from Harvard that showed that many foods, like pasta and red meat, are associated with depression. These reports drive me crazy. I have not looked at the study so I can’t comment on the study itself. But I had hoped that at least one anchor or the doctor would raise the obvious objection: this could be association without being causal. It did not occur to any of them to even raise this question. Instead, they immediately assumed it was causal and started talking about the importance of avoiding pasta and red meat. I don’t find pasta or red meat depressing but I do find this kind of bad reporting depressing. Again, it is possible that the paper itself is careful about this; I don’t know. But the reporting sucks.

PubMed Commons

Don’t you wish there was a way to comment on papers?

Now there is. Thanks to the efforts of Rob Tibshirani, Pat Brown, Mike Eisen, David Lipman and others there is now a system called PubMed Commons.

PubMed is the central repository for biomedical research. PubMed Commons allows people to have active discussions of papers. A full description is available at Rob’s website here.

This is very good news. We need this. But there is more good news. Rob is working towards including all the major statistical and machine learning journals. Eventually there may even be some sort of link to arXiv. Note that there are discussion websites for arXiv, such as CosmoCoffee, but I don’t know of one for statistics.

Congratulations to Rob and his colleagues on this new effort.

The Terminator, Statistics, Fair Use and Hedge Funds

I am visiting the Department of Statistics at the University of Washington today and am enjoying their warm hospitality (thanks Emily!). I have a little breathing room now so I thought I would write a quick blog post.

Earlier this year, I posted my contribution to the collection celebrating the 50th anniversary of The Committee of Presidents of Statistical Societies (COPSS). My essay, entitled “Rise of the Machines,” was about the interaction of Machine Learning and Statistics.

Of course, I was paying homage to The Terminator and my essay began with this iconic image:

terminator

The publisher, Taylor and Francis, was unhappy with my use of the terminator image. Two of the people editing the collection, Xihong Lin and David L. Banks, did their best to help me in my quest to use the image. This is the image, in case you forgot:

terminator

According to the principle of Fair Use, I have the legal right to use this image. But Taylor and Francis, fearing a lawsuit, still did not want me to use the image.

So I wrote to Warner Brothers to request permission to use the image. I mean this image:

terminator

Warner Brothers quickly wrote back and said they did not own the copyright. They directed me to a company called Intermedia. I then asked Intermedia if I could use the image. Here is the image, in case you forgot:

terminator

But my attempts to contact Intermedia failed. It appears that they went broke and were bought by a hedge fund. Next we tried to reach the hedge fund but were unsuccessful. My only option at this point was to remove the image from the article. The image, by the way, looks like this:

terminator

So, fear of litigation has won the day. My article will appear without a picture of the terminator. This is a real shame. By the way, this is the image:

terminator

Assumption-Free High-Dimensional Inference

Richard Lockhart, Jonathan Taylor, Ryan Tibshirani and Rob Tibshirani have an interesting paper about significance tests for the lasso. The paper will appear in The Annals of Statistics. I was asked to write a discussion about the paper. Here is my discussion. (I suggest you read their paper before reading my discussion.)

Assumption-Free High-Dimensional Inference:
Discussion of LTTT
Larry Wasserman

The paper by Lockhart, Taylor, Tibshirani and Tibshirani (LTTT) is an important advancement in our understanding of inference for high-dimensional regression. The paper is a tour de force, bringing together an impressive array of results, culminating in a set of very satisfying convergence results. The fact that the test statistic automatically balances the effect of shrinkage and the effect of adaptive variable selection is remarkable.

The authors make very strong assumptions. This is quite reasonable: to make significant theoretical advances in our understanding of complex procedures, one has to begin with strong assumptions. The following question then arises: what can we do without these assumptions?

1. The Assumptions

The assumptions in this paper — and in most theoretical papers on high dimensional regression — have several components. These include:

  1. The linear model is correct.
  2. The variance is constant.
  3. The errors have a Normal distribution.
  4. The parameter vector is sparse.
  5. The design matrix has very weak collinearity. This is usually stated in the form of incoherence, eigenvalue restrictions or incompatibility assumptions.

To the best of my knowledge, these assumptions are not testable when {p > n}. They are certainly a good starting place for theoretical investigations but they are indeed very strong. The regression function {m(x) = \mathbb{E}(Y|X=x)} can be any function. There is no reason to think it will be close to linear. Design assumptions are also highly suspect. High collinearity is the rule rather than the exception especially in high-dimensional problems. An exception is signal processing, in particular compressed sensing, where the user gets to construct the design matrix. In this case, if the design matrix is filled with independent random Normals, the design matrix will be incoherent with high probability. But this is a rather special situation.

None of this is meant as a criticism of the paper. Rather, I am trying to motivate interest in the question I asked earlier, namely: what can we do without these assumptions?

Remark 1 It is also worth mentioning that even in low dimensional models and even if the model is correct, model selection raises troubling issues that we all tend to ignore. In particular, variable selection makes the minimax risk explode (Leeb and Potscher 2008; Leeb and Potscher 2005). This is not some sort of pathological risk explosion; rather, the risk is large in a neighborhood of 0, which is a part of the parameter space we care about.

2. The Assumption-Free Lasso

To begin it is worth pointing out that the lasso has a very nice assumption-free interpretation.

Suppose we observe {(X_1,Y_1),\ldots, (X_n,Y_n)\sim P} where {Y_i\in\mathbb{R}} and {X_i \in \mathbb{R}^d}. The regression function {m(x) = \mathbb{E}(Y|X=x)} is some unknown, arbitrary function. We have no hope to estimate {m(x)} nor do we have licence to impose assumptions on {m}.

But there is sound theory to justify the lasso that makes virtually no assumptions. In particular, I refer to Greenshtein and Ritov (2004) and Juditsky and Nemirovski (2000).

Let {{\cal L} = \{ x^t\beta:\ \beta\in\mathbb{R}^d\}} be the set of linear predictors. For a given {\beta}, define the predictive risk

\displaystyle  R(\beta) = \mathbb{E}( Y - \beta^T X)^2

where {(X,Y)} is a new pair. Let us define the best, sparse, linear predictor {\ell_*(x) = \beta_*^T x} (in the {\ell_1} sense) where {\beta_*} minimizes {R(\beta)} over the set {B(L) = \{ \beta:\ ||\beta||_1 \leq L\}}. The lasso estimator {\hat\beta} minimizes the empirical risk {\hat R(\beta) = \frac{1}{n}\sum_{i=1}^n (Y_i - \beta^T X_i)^2} over {B(L)}. For simplicity, I’ll assume that all the variables are bounded by {C} (but this is not really needed). We make no other assumptions: no linearity, no design assumptions, and no models. It is now easily shown that

\displaystyle  R(\hat\beta) \leq R(\beta_*) + \sqrt{ \frac{8 C^2 L^4}{n} \log \left(\frac{2d^2}{\delta}\right)}

except on a set of probability at most {\delta}.

This shows that the predictive risk of the lasso comes close to the risk of the best sparse linear predictor. In my opinion, this explains why the lasso “works.” The lasso gives us a predictor with a desirable property — sparsity — while being computationally tractable and it comes close to the risk of the best sparse linear predictor.
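
As an illustration of this point (my own sketch, not from Greenshtein and Ritov), the following code, assuming NumPy and scikit-learn, fits an {\ell_1}-penalized least-squares predictor when the true regression function is nonlinear and compares its estimated predictive risk with that of the best linear predictor; the penalized lasso is used here as a stand-in for the {\ell_1}-constrained problem.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)

def simulate(n, d=50):
    """Deliberately misspecified: m(x) is nonlinear, so no linear model is 'true'."""
    x = rng.uniform(-1, 1, size=(n, d))
    y = np.sin(3 * x[:, 0]) + x[:, 1] ** 2 + rng.normal(scale=0.5, size=n)
    return x, y

x_tr, y_tr = simulate(500)          # training sample
x_te, y_te = simulate(100_000)      # large sample used to approximate predictive risk

# Penalized lasso as a stand-in for l1-constrained least squares.
lasso = Lasso(alpha=0.05).fit(x_tr, y_tr)
risk_lasso = np.mean((y_te - lasso.predict(x_te)) ** 2)

# Proxy for the best linear predictor: least squares on the large sample
# (its risk lower-bounds R(beta_*) for any L, so the comparison is conservative).
best = LinearRegression().fit(x_te, y_te)
risk_best = np.mean((y_te - best.predict(x_te)) ** 2)

print(round(risk_lasso, 3), round(risk_best, 3))   # close, despite the nonlinear truth
```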

3. Interlude: Weak Versus Strong Modeling

When developing new methodology, I think it is useful to consider three different stages of development:

  1. Constructing the method.
  2. Interpreting the output of the method.
  3. Studying the properties of the method.

I also think it is useful to distinguish two types of modeling. In strong modeling, the model is assumed to be true in all three stages. In weak modeling, the model is assumed to be true for stage 1 but not for stages 2 and 3. In other words, one can use a model to help construct a method. But one does not have to assume the model is true when it comes to interpretation or when studying the theoretical properties of the method. My discussion is guided by my preference for weak modeling.

4. Assumption Free Inference: The HARNESS

Here I would like to discuss an approach I have been developing with Ryan Tibshirani. We call this: High-dimensional Agnostic Regression Not Employing Structure or Sparsity, or, the HARNESS. The method is a variant of the idea proposed in Wasserman and Roeder (2009).

The idea is to split the data into two halves, {{\cal D}_1} and {{\cal D}_2}. For simplicity, assume that {n} is even so that each half has size {m=n/2}. From the first half {{\cal D}_1} we select a subset of variables {S}. The method is agnostic about how the variable selection is done. It could be forward stepwise, lasso, elastic net, or anything else. The output of the first part of the analysis is the subset of predictors {S} and an estimator {\hat\beta = (\hat\beta_j:\ j\in S)}. The second half of the data {{\cal D}_2} is used to provide distribution free inferences for the following questions:

  1. What is the predictive risk of {\hat\beta}?
  2. How much does each variable in {S} contribute to the predictive risk?
  3. What is the best linear predictor using the variables in {S}?

All the inferences from {{\cal D}_2} are interpreted as being conditional on {{\cal D}_1}. (A variation is to use {{\cal D}_1} only to produce {S} and then construct the coefficients of the predictor from {{\cal D}_2}. For the purposes of this discussion, we use {\hat\beta} from {{\cal D}_1}.)

In more detail, let

\displaystyle  R = \mathbb{E}| Y-X^T \hat\beta|

where the randomness is over the new pair {(X,Y)}; we are conditioning on {{\cal D}_1}. Note that in this section I have changed the definition of {R} to be on the absolute scale, which is more interpretable. In the above equation, it is understood that {\hat\beta_j=0} when {j\notin S}. The first question refers to producing an estimate and confidence interval for {R} (conditional on {{\cal D}_1}). The second question refers to inferring

\displaystyle  R_j = \mathbb{E}| Y-X^T \hat\beta_{(j)}| - \mathbb{E}| Y-X^T \hat\beta|

for each {j\in S}, where {\hat\beta_{(j)}} is equal to {\hat\beta} except that {\hat\beta_j} is set to 0. Thus, {R_j} is the risk inflation from excluding {X_j}. The third question refers to inferring

\displaystyle  \beta^* = {\rm argmin}_{\beta \in \mathbb{R}^{|S|}} \mathbb{E}(Y - X_S^T \beta)^2

the coefficient of the best linear predictor for the chosen model. We call {\beta^*} the projected parameter. Hence, {x^T\beta^*} is the best linear approximation to {m(x)} on the linear space spanned by the selected variables.

A consistent estimate of {R} is

\displaystyle  \hat{R} = \frac{1}{m}\sum_{i=1}^m\delta_i

where the sum is over {{\cal D}_2}, and {\delta_i = |Y_i - X_i^T \hat\beta|}. An approximate {1-\alpha} confidence interval for {R} is {\hat R \pm z_{\alpha/2}\, s/\sqrt{m}} where {s} is the standard deviation of the {\delta_i}’s.

The validity of this confidence interval is essentially distribution free. In fact, if we want to be purely distribution free and avoid asymptotics, we could instead define {R} to be the median of the law of {|Y-X^T \hat\beta|}. Then the order statistics of the {\delta_i}’s can be used in the usual way to get a finite sample, distribution free confidence interval for {R}.

Estimates and confidence intervals for {R_j} can be obtained from {e_1,\ldots, e_m} where

\displaystyle  e_i = |Y_i - X_i^T \hat\beta_{(j)}| - |Y_i - X_i^T \hat\beta|.

Estimates and confidence intervals for {\beta^*} can be obtained by standard least squares procedures based on {{\cal D}_2}. The steps are summarized here:

The HARNESS

Input: data {{\cal D} = \{(X_1,Y_1),\ldots, (X_n,Y_n)\}}.

  1. Randomly split the data into two halves {{\cal D}_1} and {{\cal D}_2}.
  2. Use {{\cal D}_1} to select a subset of variables {S}. This can be forward stepwise, the lasso, or any other method.
  3. Let {R = \mathbb{E}( |Y-X^T \hat\beta| \mid {\cal D}_1)} be the predictive risk (on the absolute scale defined above) of the selected model on a future pair {(X,Y)}, conditional on {{\cal D}_1}.
  4. Using {{\cal D}_2} construct point estimates and confidence intervals for {R}, {(R_j:\ \hat\beta_j\neq 0)} and {\beta^*}.
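
Here is a minimal, hypothetical code sketch of these steps, assuming NumPy, SciPy and scikit-learn; it is my own illustration (with LassoCV as one possible selection rule and an intercept included for convenience), not an official implementation of the HARNESS.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LassoCV, LinearRegression

def harness(x, y, alpha=0.05, rng=None):
    """A minimal, hypothetical sketch of the HARNESS steps above."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(y)
    idx = rng.permutation(n)
    d1, d2 = idx[: n // 2], idx[n // 2:]

    # Steps 1-2: split, then select variables (and coefficients) on the first half.
    lasso = LassoCV(cv=5).fit(x[d1], y[d1])        # one possible selection rule
    beta, b0 = lasso.coef_, lasso.intercept_
    s = np.flatnonzero(beta)

    # Steps 3-4: distribution-free inference on the second half (absolute-error scale).
    m = len(d2)
    delta = np.abs(y[d2] - x[d2] @ beta - b0)
    r_hat = delta.mean()
    hw = norm.ppf(1 - alpha / 2) * delta.std(ddof=1) / np.sqrt(m)
    ci_r = (r_hat - hw, r_hat + hw)

    ci_rj = {}                                      # risk inflation from dropping each j in S
    for j in s:
        beta_j = beta.copy()
        beta_j[j] = 0.0
        e = np.abs(y[d2] - x[d2] @ beta_j - b0) - delta
        hw_j = norm.ppf(1 - alpha / (2 * len(s))) * e.std(ddof=1) / np.sqrt(m)  # Bonferroni
        ci_rj[j] = (e.mean() - hw_j, e.mean() + hw_j)

    # Projected parameters: least squares of Y on the selected X's, using the second half.
    proj = dict(zip(s, LinearRegression().fit(x[d2][:, s], y[d2]).coef_)) if len(s) else {}
    return ci_r, ci_rj, proj
```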

The HARNESS bears some similarity to POSI (Berk et al 2013), which is another inference method for model selection. They both eschew any assumption that the linear model is correct. But POSI attempts to make inferences that are valid over all possible selected models while the HARNESS restricts attention to the selected model. Also, the HARNESS emphasizes predictive inferential statements.

Here is an example using the wine dataset. (Thanks to the authors for providing the data.) Using the first half of the data we applied forward stepwise selection and used {C_p} to select a model. The selected variables are Alcohol, Volatile-Acidity, Sulphates, Total-Sulfur-Dioxide and pH. A 95 percent confidence interval for the predictive risk of the null model is (0.65,0.70). For the selected model, the confidence interval for {R} is (0.46,0.53). The (Bonferroni-corrected) 95 percent confidence intervals for the {R_j}’s are shown in the first plot below. The (Bonferroni-corrected) 95 percent confidence intervals for the parameters of the projected model are shown in the second plot below.

Wine1

Wine2

5. The Value of Data-Splitting

Some statisticians are uncomfortable with data-splitting. There are two common objections. The first is that the inferences are random: if we repeat the procedure we will get different answers. The second is that it is wasteful.

The first objection can be dealt with by doing many splits and combining the information appropriately. This can be done but is somewhat involved and will be described elsewhere. The second objection is, in my view, incorrect. The value of data splitting is that it leads to simple, assumption-free inference. There is nothing wasteful about this. Both halves of the data are being put to use. Admittedly, the splitting leads to a loss of power compared to ordinary methods when the model is correct. But this is a false comparison since we are trying to get inferences without assuming the model is correct. It’s a bit like saying that nonparametric function estimators have slower rates of convergence than parametric estimators. But that is only because the parametric estimators invoke stronger assumptions.

6. Conformal Prediction

Since I am focusing my discussion on regression methods that make weak assumptions, I would also like to briefly mention Vladimir Vovk’s theory of conformal inference. This is a completely distribution-free, finite-sample method for predictive regression. The method is described in Vovk, Gammerman and Shafer (2005) and Vovk, Nouretdinov and Gammerman (2009). Unfortunately, most statisticians seem to be unaware of this work, which is a shame. The statistical properties (such as minimax properties) of conformal prediction were investigated in Lei, Robins, and Wasserman (2012) and Lei and Wasserman (2013).

A full explanation of the method is beyond the scope of this discussion but I do want to give the general idea and say why it is related to the current paper. Given data {(X_1,Y_1),\ldots,(X_n,Y_n)} suppose we observe a new {X} and want to predict {Y}. Let {y\in\mathbb{R}} be an arbitrary real number. Think of {y} as a tentative guess at {Y}. Form the augmented data set

\displaystyle  (X_1,Y_1),\ldots, (X_n,Y_n), (X,y).

Now we fit a linear model to the augmented data and compute residuals {e_i} for each of the {n+1} observations. Now we test {H_0: Y=y}. Under {H_0}, the residuals are invariant under permutations and so

\displaystyle  p(y) = \frac{1}{n+1}\sum_{i=1}^{n+1} I( |e_i| \geq |e_{n+1}|)

is a distribution-free p-value for {H_0}.

Next we invert the test: let {C = \{y:\ p(y) \geq \alpha\}}. It is easy to show that

\displaystyle  \mathbb{P}(Y\in C) \geq 1-\alpha.

Thus {C} is a distribution-free, finite-sample prediction interval for {Y}. Like the HARNESS, the validity of the method does not depend on the linear model being correct. The set {C} has the desired coverage probability no matter what the true model is. Both the HARNESS and conformal prediction use the linear model as a device for generating predictions but neither requires the linear model to be true for the inferences to be valid. (In fact, in the conformal approach, any method of generating residuals can be used. It does not have to be a linear model. See Lei and Wasserman 2013.)
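
Here is a minimal sketch of the construction just described, assuming NumPy, with a grid over candidate values of {y}; it is my own illustration rather than code from the references.

```python
import numpy as np

def conformal_interval(x, y, x_new, alpha=0.1, grid_size=500):
    """Full conformal prediction interval for a new X, using linear-regression residuals."""
    n = len(y)
    x_aug = np.column_stack([np.ones(n + 1), np.vstack([x, x_new])])   # add an intercept
    y_grid = np.linspace(y.min() - 3 * y.std(), y.max() + 3 * y.std(), grid_size)
    kept = []
    for y0 in y_grid:                                  # tentative guesses at Y
        y_aug = np.append(y, y0)
        beta, *_ = np.linalg.lstsq(x_aug, y_aug, rcond=None)   # fit on the augmented data
        e = np.abs(y_aug - x_aug @ beta)
        if np.mean(e >= e[-1]) >= alpha:               # the p-value p(y0) from the text
            kept.append(y0)
    return (min(kept), max(kept)) if kept else None

# Example: nonlinear truth, linear working model; coverage is still about 1 - alpha.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(2 * x[:, 0]) + rng.normal(scale=0.3, size=200)
print(conformal_interval(x, y, x_new=np.array([0.5])))
```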

One can also look at how the prediction interval {C} changes as different variables are removed. This gives another assumption free method to explore the effects of predictors in regression. Minimizing the length of the interval over the lasso path can also be used as a distribution-free method for choosing the regularization parameter of the lasso.

On a related note, we might also be interested in assumption-free methods for the related task of inferring graphical models. For a modest attempt at this, see Wasserman, Kolar and Rinaldo (2013).

7. Causation

LTTT do not discuss causation. But in any discussion of the assumptions underlying regression, causation is lurking just below the surface. Indeed, there is a tendency to conflate causation and inference. To be clear: prediction, inference and causation are three separate ideas.

Even if the linear model is correct, we have to be careful how we interpret the parameters. Many articles and textbooks describe {\beta_j} as the change in {Y} if {X_j} is changed, holding the other covariates fixed. This is incorrect. In fact, {\beta_j} is the change in our prediction of {Y} if {X_j} is changed. This may seem like nit picking but this is the very difference between association and causation.

Causation refers to the change in {Y} as {X_j} is changed. Association (prediction) refers to the change in our prediction of {Y} as {X_j} is changed. Put another way, prediction is about {\mathbb{E}(Y| {\rm observe}\ X=x)} while causation is about {\mathbb{E}(Y| {\rm set}\ X=x)}. If {X} is randomly assigned, they are the same. Otherwise they are different. To make the causal claim, we have to include in the model, every possible confounding variable that could affect both {Y} and {X}. This complete causal model has the form

\displaystyle  Y = g(X,Z) + \epsilon

where {Z=(Z_1,\ldots,Z_k)} represents all confounding variables in the world. The relationship between {Y} and {X} alone is described as

\displaystyle  Y = f(X) + \epsilon'.

The causal effect — the change in {Y} as {X_j} is changed — is given by {\partial g(x,z)/\partial x_j}. The association (prediction) — the change in our prediction of {Y} as {X_j} is changed — is given by {\partial f(x)/\partial x_j}. If there are any omitted confounding variables then these will be different.
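
A toy simulation (my own addition, assuming NumPy) makes the distinction concrete: the slope from regressing {Y} on {X} alone estimates the predictive quantity {\partial f(x)/\partial x}, not the causal quantity {\partial g(x,z)/\partial x}, while adjusting for the confounder recovers the causal coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

z = rng.normal(size=n)                        # confounder affecting both X and Y
x = 0.8 * z + rng.normal(size=n)              # X depends on Z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)    # the causal effect of X on Y is 1.0

# Regressing Y on X alone gives the predictive slope (about 1.98 here), not the causal one.
slope_assoc = np.cov(x, y)[0, 1] / np.var(x)

# Adjusting for the confounder recovers the causal coefficient (about 1.0).
design = np.column_stack([np.ones(n), x, z])
slope_causal = np.linalg.lstsq(design, y, rcond=None)[0][1]

print(round(slope_assoc, 2), round(slope_causal, 2))
```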

Which brings me back to the paper. Even if the linear model is correct, we still have to exercise great caution in interpreting the coefficients. Most users of our methods are non-statisticians and are likely to interpret {\beta_j} causally no matter how many warnings we give.

8. Conclusion

LTTT have produced a fascinating paper that significantly advances our understanding of high dimensional regression. I expect there will be a flurry of new research inspired by this paper.

My discussion has focused on the role of assumptions. In low dimensional models, it is relatively easy to create methods that make few assumptions. In high dimensional models, low-assumption inference is much more challenging.

I hope I have convinced the authors that the low assumption world is worth exploring. In the meantime, I congratulate the authors on an important and stimulating paper.

Acknowledgements

Thanks to Rob Kass, Rob Tibshirani and Ryan Tibshirani for helpful comments.

References

Berk, Richard and Brown, Lawrence and Buja, Andreas and Zhang, Kai and Zhao, Linda. (2013). Valid post-selection inference. The Annals of Statistics, 41, 802-837.

Greenshtein, Eitan and Ritov, Ya’cov. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10, 971-988.

Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist., 28, 681-712.

Leeb, Hannes and Potscher, Benedikt M. (2008). Sparse estimators and the oracle property, or the return of Hodges’ estimator. Journal of Econometrics, 142, 201-211.

Leeb, Hannes and Potscher, Benedikt M. (2005). Model selection and inference: Facts and fiction. Econometric Theory, 21, 21-59.

Lei, Robins, and Wasserman (2012). Efficient nonparametric conformal prediction regions. Journal of the American Statistical Association.

Lei, Jing and Wasserman, Larry. (2013). Distribution free prediction bands. J. of the Royal Statistical Society Ser. B.

Vovk, V., Gammerman, A., and Shafer, G. (2005). Algorithmic Learning in a Random World. Springer.

Vovk, V., Nouretdinov, I., and Gammerman, A. (2009). On-line predictive linear regression. The Annals of Statistics, 37, 1566-1590.

Wasserman, L., Kolar, M. and Rinaldo, A. (2013). Estimating Undirected Graphs Under Weak Assumptions. arXiv:1309.6933.

Wasserman, Larry and Roeder, Kathryn. (2009). High dimensional variable selection. The Annals of Statistics, 37, 2178.

Estimating Undirected Graphs Under Weak Assumptions

Mladen Kolar, Alessandro Rinaldo and I have uploaded a paper to arXiv entitled “Estimating Undirected Graphs Under Weak Assumptions.”

As the name implies, the goal is to estimate an undirected graph {G} from random vectors {Y_1,\ldots, Y_n \sim P}. Here, each {Y_i = (Y_i(1),\ldots, Y_i(D))\in\mathbb{R}^D} is a vector with {D} coordinates, or features.

The graph {G} has {D} nodes, one for each feature. We put an edge between nodes {j} and {k} if the partial correlation {\theta_{jk}\neq 0}. The partial correlation {\theta_{jk}} is

\displaystyle  \theta_{jk} = - \frac{\Omega_{jk}}{\sqrt{\Omega_{jj}\Omega_{kk}}}

where {\Omega = \Sigma^{-1}} and {\Sigma} is the {D\times D} covariance matrix for {Y_i}.

At first sight, the problem is easy. We estimate {\Sigma} with the sample covariance matrix

\displaystyle  S = \frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y})(Y_i - \overline{Y})^T.

Then we estimate {\Omega} with {\hat\Omega = S^{-1}}. We can then use the bootstrap to get confidence intervals for each {\theta_{jk}} and then we put an edge between nodes {j} and {k} if the confidence interval excludes 0.
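
Here is a rough sketch of this low-dimensional procedure, assuming NumPy; it is my own illustration of the idea, not code from our paper, and it is only sensible when {D} is comfortably smaller than {n}.

```python
import numpy as np

def partial_correlations(y):
    """Matrix of partial correlations theta_{jk} = -Omega_{jk} / sqrt(Omega_jj Omega_kk)."""
    omega = np.linalg.inv(np.cov(y, rowvar=False))
    d = np.sqrt(np.diag(omega))
    return -omega / np.outer(d, d)

def bootstrap_graph(y, n_boot=1000, alpha=0.05, rng=None):
    """Keep edge (j,k) if the bootstrap CI for theta_{jk} excludes 0."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, D = y.shape
    boot = np.empty((n_boot, D, D))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)             # resample rows with replacement
        boot[b] = partial_correlations(y[idx])
    lo = np.percentile(boot, 100 * alpha / 2, axis=0)
    hi = np.percentile(boot, 100 * (1 - alpha / 2), axis=0)
    edges = (lo > 0) | (hi < 0)                      # CI excludes 0
    np.fill_diagonal(edges, False)
    return edges

# Example with D = 5 < n = 500 Gaussian features.
rng = np.random.default_rng(1)
y = rng.multivariate_normal(np.zeros(5), np.eye(5) + 0.5, size=500)
print(bootstrap_graph(y))
```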

But how close is the bootstrap distribution {\hat F} to the true distribution {F} of {\hat\theta_{jk}}? Our paper provides a finite sample bound on {\sup_t | \hat F(t) - F(t)|}. Not surprisingly, the bounds are reasonable when {D < n}.

What happens when {D>n}? In that case, estimating the distribution of {\hat\theta_{jk}} is not feasible unless one imposes strong assumptions. With these extra assumptions, one can use lasso-style technology. The problem is that the validity of the inferences then depends heavily on strong assumptions such as sparsity and eigenvalue restrictions, which are not testable if {D > n}. Instead, we take an atavistic approach: we first perform some sort of dimension reduction followed by the bootstrap. We basically give up on the original graph and instead estimate the graph for a dimension-reduced version of the problem.

If we were in a pure prediction framework I would be happy to use lasso-style technology. But, since we are engaged in inference, we take this more cautious approach.

One of the interesting parts of our analysis is that it leverages recent work on high-dimensional Berry-Esseen theorems, namely the results by Victor Chernozhukov, Denis Chetverikov and Kengo Kato, which can be found here.

The whole issue of what assumptions are reasonable in high-dimensional inference is quite interesting. I’ll have more to say about the role of assumptions in high dimensional inference shortly. Stay tuned. In the meantime, if I have managed to spark your interest, please have a look at our paper.

Two Announcements

A couple of announcements:

First: A message from Jeff Leek:

“We are hosting an “unconference” on Google Hangouts. We got some really amazing speakers to talk about the future of statistics. I wonder if you could help advertise the unconference on your blogs. Here is our post:

http://simplystatistics.org/2013/09/17/announcing-the-simply-statistics-unconference-on-the-future-of-statistics-futureofstats/

and the sign up page:

https://plus.google.com/events/cd94ktf46i1hbi4mbqbbvvga358

We also hope people will tweet their own ideas with the hashtag #futureofstatistics on Twitter.”

Second:

There is an interesting discussion at Deborah Mayo’s blog:

http://errorstatistics.com/2013/09/14/when-bayesian-inference-shatters-owhadi-scovel-and-sullivan-guest-post/

It is a guest post by Owhadi, Scovel, and Sullivan on their paper "When Bayesian Inference Shatters."

That’s all

Consistency, Sparsistency and Presistency

There are many ways to discuss the quality of estimators in statistics. Today I want to review three common notions: presistency, consistency and sparsistency. I will discuss them in the context of linear regression. (Yes, that’s presistency, not persistency.)

Suppose the data are {(X_1,Y_1),\ldots, (X_n,Y_n)} where

\displaystyle  Y_i = \beta^T X_i + \epsilon_i,

{Y_i\in\mathbb{R}}, {X_i\in\mathbb{R}^d} and {\beta\in\mathbb{R}^d}. Let {\hat\beta=(\hat\beta_1,\ldots,\hat\beta_d)} be an estimator of {\beta=(\beta_1,\ldots,\beta_d)}.

Probably the most familiar notion is consistency. We say that {\hat\beta} is consistent if

\displaystyle  ||\hat\beta - \beta|| \stackrel{P}{\rightarrow} 0

as {n \rightarrow \infty}.

In recent years, people have become interested in sparsistency (a term invented by Pradeep Ravikumar). Define the support of {\beta} to be the location of the nonzero elements:

\displaystyle  {\rm supp}(\beta) = \{j:\ \beta_j \neq 0\}.

Then {\hat\beta} is sparsistent if

\displaystyle  \mathbb{P}({\rm supp}(\hat\beta) = {\rm supp}(\beta) ) \rightarrow 1

as {n\rightarrow\infty}.

The last one is what I like to call presistence. I just invented this word. Some people call it risk consistency or predictive consistency. Greenshtein and Ritov (2004) call it persistency, but this creates confusion for those of us who work with persistent homology. Of course, presistence comes from shortening “predictive consistency.”

Let {(X,Y)} be a new pair. The predictive risk of {\beta} is

\displaystyle  R(\beta) = \mathbb{E}(Y-X^T \beta)^2.

Let {{\cal B}_n} be some set of {\beta}’s and let {\beta_n^*} be the best {\beta} in {{\cal B}_n}. That is, {\beta_n^*} minimizes {R(\beta)} subject to {\beta \in {\cal B}_n}. Then {\hat\beta} is presistent if

\displaystyle  R(\hat\beta) - R(\beta_n^*) \stackrel{P}{\rightarrow} 0.

This means that {\hat\beta} predicts nearly as well as the best choice of {\beta}. As an example, consider the set of sparse vectors

\displaystyle  {\cal B}_n = \Bigl\{ \beta:\ \sum_{j=1}^d |\beta_j| \leq L\Bigr\}.

(The dimension {d} is allowed to depend on {n}, which is why we have a subscript on {{\cal B}_n}.) In this case, {\beta_n^*} can be interpreted as the best sparse linear predictor. The corresponding sample estimator {\hat\beta}, which minimizes the sum of squares subject to being in {{\cal B}_n}, is the lasso estimator. Greenshtein and Ritov (2004) proved that the lasso is presistent under essentially no conditions.
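
To see presistence in action, here is a hedged simulation sketch (my own, assuming NumPy and scikit-learn) in which the linear model is wrong, so consistency and sparsistency are not even well defined, yet the excess predictive risk of the lasso (in its penalized form, as a stand-in for the constrained estimator) shrinks as {n} grows:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d = 30

def draw(n):
    """Nonlinear truth: 'the true beta' does not exist, so sparsistency is vacuous."""
    x = rng.uniform(-1, 1, size=(n, d))
    y = np.exp(x[:, 0]) - x[:, 1] ** 3 + rng.normal(scale=0.5, size=n)
    return x, y

x_big, y_big = draw(200_000)        # large sample used to approximate predictive risk

def risk(model):
    return np.mean((y_big - model.predict(x_big)) ** 2)

# Approximate the population-optimal l1-penalized linear predictor on the large sample.
best = Lasso(alpha=0.01).fit(x_big, y_big)

for n in (100, 1_000, 10_000):
    hat = Lasso(alpha=0.01).fit(*draw(n))
    print(n, round(risk(hat) - risk(best), 4))   # excess predictive risk shrinks with n
```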

This is the main message of this post: To establish consistency or sparsistency, we have to make lots of assumptions. In particular, we need to assume that the linear model is correct. But we can prove presistence with virtually no assumptions. In particular, we do not have to assume that the linear model is correct.

Presistence seems to get less attention than consistency or sparsistency, but I think it is the most important of the three.

Bottom line: presistence deserves more attention. And, if you have never read Greenshtein and Ritov (2004), I highly recommend that you read it.

Reference:

Greenshtein, Eitan and Ritov, Ya’Acov. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10, 971-988.

Is Bayesian Inference a Religion?

Time for a provocative post.

There is a nice YouTube video with Tony O’Hagan interviewing Dennis Lindley. Of course, Dennis is a legend and his impact on the field of statistics is huge.

At one point, Tony points out that some people liken Bayesian inference to a religion. Dennis claims this is false. Bayesian inference, he correctly points out, starts with some basic axioms and then the rest follows by deduction. This is logic, not religion.

I agree that the mathematics of Bayesian inference is based on sound logic. But, with all due respect, I think Dennis misunderstood the question. When people say that “Bayesian inference is like a religion,” they are not referring to the logic of Bayesian inference. They are referring to how adherents of Bayesian inference behave.

(As an aside, detractors of Bayesian inference do not deny the correctness of the logic. They just don’t think the axioms are relevant for data analysis. For example, no one doubts the axioms of Peano arithmetic. But that doesn’t imply that arithmetic is the foundation of statistical inference. But I digress.)

The vast majority of Bayesians are pragmatic, reasonable people. But there is a sub-group of die-hard Bayesians who do treat Bayesian inference like a religion. By this I mean:

  1. They are very cliquish.
  2. They have a strong emotional attachment to Bayesian inference.
  3. They are overly sensitive to criticism.
  4. They are unwilling to entertain the idea that Bayesian inference might have flaws.
  5. When someone criticizes Bayes, they think that critic just “doesn’t get it.”
  6. They mock people with differing opinions.

To repeat: I am not referring to most Bayesians. I am referring to a small subgroup. And, yes, this subgroup does treat it like a religion. I speak from experience because I went to all the Bayesian conferences for many years, and I watched witty people at the end-of-conference Cabaret perform plays and songs that merrily mocked frequentists. It was all in fun and I enjoyed it. But you won’t see that at a non-Bayesian conference.

No evidence you can provide would ever make the die-hards doubt their ideas. To them, Sir David Cox, Brad Efron and other giants in our field who have doubts about Bayesian inference are not taken seriously because they “just don’t get it.”

It is my belief that Dennis agrees with me on this. (If you are reading this Dennis, and you disagree, please let me know.) Here is my evidence. Many years ago, there was a debate about whether to start a Bayesian journal. Dennis argued strongly against it precisely because he feared it would make Bayesian inference appear like a sect. Instead, he argued, Bayesians should just think of themselves as statisticians and send their papers to JASA, the Annals, etc. I think Dennis was 100 percent correct. Dennis lost this fight, and we ended up with the journal Bayesian Analysis.

So is Bayesian inference a religion? For most Bayesians: no. But for the thin-skinned, inflexible die-hards who have attached themselves so strongly to their approach to inference that they make fun of, or get mad at, critics: yes, it is a religion.