WAIC and WBIC


Widely Applicable Information Criterion (WAIC) and Widely Applicable Bayesian Information Criterion (WBIC)






WAIC and WBIC are information criteria that go beyond the Laplace approximation and Fisher asymptotic theory.


(A) If you want to estimate the predictive loss, use WAIC.

(B) If you want to identify the true model, use WBIC.

(C) Both WAIC and WBIC are applicable even if the posterior distribution is far from any normal distribution. They can be used even if the Fisher information matrix is singular, or even if the true distribution is unrealizable by the statistical model. In particular, they remain applicable when the posterior is not log-concave.

(D) Both WAIC and WBIC have completely new and mathematically rigorous theoretical support: algebraic geometry and empirical process theory. Neither Fisher asymptotic theory nor the Laplace approximation is necessary.

(E) If you have MCMC software, WAIC and WBIC are very easy to implement; see the sketch after this list.

(F) Both WAIC and WBIC are useful in practical applications.
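
As an illustration of (E), the following is a minimal sketch (not taken from this page), assuming that your MCMC software can return the matrix of pointwise log-likelihoods log p(x_i | w_s) for S posterior draws w_s; the function names and the (S, n) array layout are illustrative choices.

    import numpy as np

    def waic(log_lik):
        # log_lik: (S, n) array with entries log p(x_i | w_s),
        # computed from S draws of the ordinary posterior.
        S, n = log_lik.shape
        # Training loss T_n: minus the average log posterior-predictive
        # density, using log-sum-exp for numerical stability.
        lppd = np.logaddexp.reduce(log_lik, axis=0) - np.log(S)
        T_n = -np.mean(lppd)
        # Functional variance V_n: the posterior variance of
        # log p(x_i | w) at each data point, summed over the n points.
        V_n = np.sum(np.var(log_lik, axis=0))
        return T_n + V_n / n

    def wbic(log_lik_tempered):
        # log_lik_tempered: (S, n) array of log p(x_i | w_s), where the
        # draws w_s come from the tempered posterior at beta = 1/log(n).
        return -np.mean(np.sum(log_lik_tempered, axis=1))

Here waic returns a value on the per-sample loss scale used in Watanabe's papers (some conventions report 2n times this value), while wbic returns a total, directly comparable to minus the log marginal likelihood.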


(Special Remark) Bayesian inference is much more useful than maximum likelihood when the posterior is far from the normal distribution. Both WAIC and WBIC work even in such cases.


(Special Remark) Bayesian inference is much more useful than maximum likelihood when a statistical model is built by combining complex, deep, and hierarchical structures. Both WAIC and WBIC work even in such cases.










Widely Applicable IC


The theorems supporting WAIC and WBIC hold
(1) even if the Fisher information matrix is not positive definite,
(2) even if the asymptotic normality of the MLE fails,
(3) and even if the Laplace approximation does not hold.
WAIC and WBIC are supported by singular learning theory.
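
For reference, the defining formulas, as given in the papers cited below (here E_w denotes the expectation over the posterior, and E_w^\beta the expectation over the tempered posterior proportional to \varphi(w) \prod_i p(X_i \mid w)^\beta):

    \mathrm{WAIC} = T_n + \frac{V_n}{n}, \qquad
    T_n = -\frac{1}{n} \sum_{i=1}^{n} \log E_w\big[\, p(X_i \mid w) \,\big],

    V_n = \sum_{i=1}^{n} \Big\{ E_w\big[ (\log p(X_i \mid w))^2 \big] - E_w\big[ \log p(X_i \mid w) \big]^2 \Big\},

    \mathrm{WBIC} = E_w^{\beta}\Big[ -\sum_{i=1}^{n} \log p(X_i \mid w) \Big], \qquad \beta = \frac{1}{\log n}.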








(Remark 1) Identifying the true model is different from estimating the predictive loss.

(Remark 2) If the posterior distribution cannot be approximated by any normal distribution, then neither BIC nor DIC can be used in statistical model evaluation. Both WAIC and WBIC can be employed in any circumstance, and both have rigorous theoretical support: asymptotically, WAIC has the same expectation value as the predictive loss, and WBIC agrees with the Bayes marginal likelihood as a random variable up to lower-order terms; see the relations below.
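
In the notation of the cited papers (G_n the generalization loss, F_n = -\log Z_n the Bayes free energy), these two statements read:

    E[G_n] = E[\mathrm{WAIC}] + o(1/n), \qquad
    F_n = \mathrm{WBIC} + O_p(\sqrt{\log n}).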

(Remark 3) If you want to check your MCMC software, it is a good idea to compare the theoretical RLCT (real log canonical threshold) with the empirical one.
* One method to obtain the RLCT is to use Theorem 2 in the WAIC paper.
* Another is to use equations (19) and (20) on p.877 of the WBIC paper; see the sketch after this list.
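
A minimal sketch of the second method, assuming (as in those equations) that E_w^\beta[n L_n(w)] \approx \mathrm{const} + \lambda/\beta for \beta of order 1/\log n; the function name and the particular pair of inverse temperatures are illustrative.

    import numpy as np

    def rlct_estimate(nll_b1, nll_b2, beta1, beta2):
        # nll_b1, nll_b2: 1-D arrays of n*L_n(w_s) = -sum_i log p(x_i | w_s),
        # evaluated on MCMC draws from the tempered posteriors at inverse
        # temperatures beta1 and beta2, respectively.
        # Since E^beta[n L_n] ~ const + lambda/beta, differencing the two
        # posterior means isolates the RLCT lambda.
        num = np.mean(nll_b1) - np.mean(nll_b2)
        den = 1.0 / beta1 - 1.0 / beta2
        return num / den

For example, one may run the two chains at beta1 = 1/log(n) and beta2 = 1/(2 log(n)).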








Let's look at the true posterior distribution.

(1) Even if the Fisher information matrix at the true parameter is positive definite, the posterior distribution may fail to be approximated by any normal distribution. In this example, the number of parameters is only two, whereas the number of empirical samples is 10000. Fisher asymptotic theory cannot be applied to this case.

(2) Neither AIC, BIC, TIC, DIC, nor MDL is applicable, even if n=10000.

(3) WAIC and WBIC are applicable in all cases: n=100, 1000, and 10000.

(4) If you are a mathematician, you will find algebraic geometry in this problem.

(5) If you are a statistician, you should have the courage to look at the true posterior distribution.



Example of Posterior Distributions


(Remark) One might think that, in a real-world problem, a true distribution seldom coincides with a singularity of a statistical model, so that Fisher asymptotic theory holds if the number of empirical samples is sufficiently large. However, the "sufficiently large" number is often far larger than one might expect, as the sketch below illustrates.
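
The model behind the example above is not reproduced here; as an illustrative stand-in (my own choice of model), the following sketch samples the posterior of a two-parameter Gaussian mixture whose true parameter lies near the singular set {ab = 0}. A scatter plot of the draws shows a curved ridge rather than an ellipse, even at n = 10000.

    import numpy as np

    rng = np.random.default_rng(0)

    def log_post(a, b, x):
        # Unnormalized log posterior for the mixture
        # p(x | a, b) = (1 - a) N(x | 0, 1) + a N(x | b, 1),
        # with a uniform prior on (0, 1) x (-10, 10).
        if not (0.0 < a < 1.0 and -10.0 < b < 10.0):
            return -np.inf
        lik = (1 - a) * np.exp(-0.5 * x**2) + a * np.exp(-0.5 * (x - b)**2)
        return np.sum(np.log(lik))

    # Data whose truth (a, b) = (0.5, 0.05) is close to the singular set.
    n = 10000
    x = np.where(rng.random(n) < 0.5,
                 rng.normal(0.05, 1.0, n), rng.normal(0.0, 1.0, n))

    # Plain random-walk Metropolis over (a, b).
    a, b, lp = 0.5, 0.5, log_post(0.5, 0.5, x)
    draws = []
    for t in range(20000):
        an, bn = a + 0.05 * rng.normal(), b + 0.05 * rng.normal()
        lpn = log_post(an, bn, x)
        if np.log(rng.random()) < lpn - lp:
            a, b, lp = an, bn, lpn
        if t >= 10000:
            draws.append((a, b))
    # `draws` concentrates along a ridge a*b ~ const, far from any
    # normal approximation, despite n = 10000.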





Related Article

Evaluation reports on WAIC have been written by expert statisticians; see in particular the Vehtari-Ojanen survey discussed in (Remark.2) below.


(Remark.1) Exact cross-validation in Bayesian inference could also be used, but it requires a very heavy computational cost, because the posterior distribution must be constructed n times (where n is the number of training samples).

(Remark.2) Importance-sampling leave-one-out cross-validation (ISLOOCV) can be calculated at the same computational cost as WAIC; a sketch follows this remark. In the Vehtari-Ojanen paper mentioned above (pp.189-190), it is explained in a linear-regression case that ISLOOCV may not satisfy the central limit theorem in posterior parameter sampling when a leverage sample is contained in the training samples (Peruggia, 1997; Epifani, MacEachern and Peruggia, 2008). Here a "leverage sample" is sometimes the same thing as an outlier. This phenomenon is caused by the large importance weight 1/p(x_i|w), or equivalently by the small p(x_i|w). In WAIC, no such problem occurs, because no importance weight is used.
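
For comparison with the WAIC sketch above, here is a minimal ISLOOCV sketch under the same assumptions (the same (S, n) matrix of pointwise log-likelihoods; the function name is illustrative). The exp(-log_lik) step is exactly where the possibly heavy-tailed importance weights 1/p(x_i | w) enter.

    import numpy as np

    def is_loo_cv(log_lik):
        # log_lik: (S, n) array of log p(x_i | w_s) over S posterior draws.
        # ISLOOCV approximates p(x_i | x_{-i}) by the harmonic mean
        # 1 / E_posterior[1 / p(x_i | w)], computed here in log space.
        S, n = log_lik.shape
        log_p_loo = np.log(S) - np.logaddexp.reduce(-log_lik, axis=0)
        return -np.mean(log_p_loo)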




References


The mathematical foundations of WAIC and WBIC can be found in the following references.


1. Mathematical background of algebraic geometry, empirical processes, and singular learning theory:

S. Watanabe, Algebraic Geometry and Statistical Learning Theory, Cambridge University Press, Cambridge, UK, September 2009.


2. Proof of WAIC in singular cases:

S. Watanabe, "Equations of states in singular statistical estimation," Neural Networks, Vol. 23, No. 1, pp. 20-34, January 2010. arXiv:0712.0653


3. Proof of WAIC in unrealizable cases:

S. Watanabe, "Equations of states in statistical learning for an unrealizable and regular case," IEICE Transactions, Vol. E93-A, No. 3, pp. 617-626, March 2010. arXiv:0906.0211


4. Proof of the asymptotic equivalence of WAIC and leave-one-out cross-validation:

S. Watanabe, "Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory," Journal of Machine Learning Research, Vol. 11 (Dec), pp. 3571-3591, 2010.


5. Proof of the asymptotic expansion of the Bayes marginal likelihood:

S. Watanabe, "Algebraic analysis for nonidentifiable learning machines," Neural Computation, Vol. 13, No. 4, pp. 899-933, 2001.


6. Proof of WBIC:

S. Watanabe, "A widely applicable Bayesian information criterion," Journal of Machine Learning Research, Vol. 14 (Mar), pp. 867-897, 2013.






Sumio Watanabe Homepage