
How to build simple, accurate, data-driven, model-free confidence intervals

An updated version with source code and detailed explanations can be found here.

If observations from a specific experiment (for instance, scores computed on 10 million credit card transactions) are assigned a random bin ID (labeled 1, ..., k), then you can easily build a confidence interval for any proportion or score computed on these k random bins, using the Analyticbridge theorem.
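
For readers who prefer code, here is a minimal R sketch of the procedure. It is a hedged illustration, not a quote of the theorem: it assumes (consistently with the k = 99, m = 1, 98% example below and with the R code in the comments) that the interval between the m-th smallest and the m-th largest bin estimates has confidence level 1 - 2m/(k+1), and it uses simulated placeholder scores.

scores <- runif(1000000)                                 # placeholder scores; substitute your own data
k <- 99                                                  # number of random bins
m <- 1                                                   # order statistic used for the bounds
bin_id <- sample(rep(1:k, length.out = length(scores)))  # random bin assignment
bin_means <- sort(tapply(scores, bin_id, mean))          # per-bin statistic, sorted
ci <- c(bin_means[m], bin_means[k - m + 1])              # model-free confidence interval
conf_level <- 1 - 2*m/(k + 1)                            # 0.98 with k = 99, m = 1 (assumed formula)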

The proof of this theorem relies on complicated combinatorial arguments and the use of the Beta function. Note that the final result does not depend on the distribution associated with your data; in short, your data does not have to follow a Gaussian (a.k.a. normal) or any pre-specified statistical distribution for the confidence intervals to be valid. You can find more details regarding the proof of the theorem in the book Statistics of Extremes by E.J. Gumbel, pages 58-59 (Dover edition, 2004).

Parameters in the Analyticbridge theorem can be chosen to achieve the desired level of precision, e.g. a 95%, 99%, or 99.5% confidence interval. The theorem also tells you what your sample size should be to achieve a pre-specified accuracy level. It is a fundamental result for computing simple, per-segment, data-driven, model-free confidence intervals in many contexts, in particular when generating predictive scores produced via logistic / ridge regression or decision trees / hidden decision trees (e.g. for fraud detection, or consumer or credit scoring).
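
As a hedged illustration of how the parameters drive precision, again assuming the confidence level is 1 - 2m/(k+1) (the formula implied by the k = 99, m = 1, 98% example below), the helper functions here are convenience wrappers, not part of the theorem itself:

conf_for <- function(k, m) 1 - 2*m/(k + 1)    # assumed confidence level for given k and m
conf_for(99, 1)     # 0.98 -> the 98% interval used in the application below
conf_for(199, 1)    # 0.99
conf_for(399, 1)    # 0.995
k_needed <- function(target, m) ceiling(2*m/(1 - target) - 1)
k_needed(0.99, 1)   # 199 random bins needed for a 99% interval with m = 1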

Application:

A scoring system designed to detect customers likely to fail on a loan is based on a rule set. On average, for an individual customer, the probability of failing is 5%. In a data set with 1 million observations (customers) and several metrics such as credit score, amount of debt, salary, etc., if we randomly select 99 bins, each containing 1,000 customers, the 98% confidence interval (per bin of 1,000 customers) for the failure rate is (say) [4.41%, 5.53%], based on the Analyticbridge theorem with k = 99 and m = 1 (read the theorem to understand what k and m mean; the meaning of these parameters is actually very easy to grasp).
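
The exact interval depends on the data, but the example can be reproduced in spirit with a small hedged simulation (simulated customers with a 5% failure probability, k = 99, m = 1; the resulting bounds will be in the same ballpark as, not exactly equal to, [4.41%, 5.53%]):

set.seed(1)
fail <- rbinom(1000000, 1, 0.05)                    # 1 = customer fails; 5% average failure rate
idx <- sample(length(fail), 99 * 1000)              # draw 99 bins x 1,000 customers at random
bin_id <- rep(1:99, each = 1000)
bin_rates <- sort(tapply(fail[idx], bin_id, mean))  # sorted per-bin failure rates
c(lower = unname(bin_rates[1]),                     # 98% CI with k = 99, m = 1
  upper = unname(bin_rates[99]))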

Now, looking at a non-random bin with 1,000 observations, consisting of customers with a credit score below 650 who are younger than 26, we see that the failure rate is 6.73%. We can thus conclude that the rule "credit score < 650 and age < 26" is actually a good rule for detecting customers likely to fail, because 6.73% is well above the upper bound of the [4.41%, 5.53%] confidence interval.

Indeed, we could test hundreds of rules and easily identify those with high predictive power, by systematically and automatically looking at how far the observed failure rate (for a given rule) lies from a standard confidence interval. This allows us to rule out the effect of noise, and to process and rank numerous rules at once based on their predictive power, that is, how far their failure rate sits above the confidence interval's upper bound.
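
A hedged sketch of that screening step in R (the rule names, observed rates, and the upper_bound value are hypothetical, chosen to match the example above):

upper_bound <- 0.0553                              # hypothetical: upper bound of the 98% CI above
rules <- data.frame(                               # hypothetical candidate rules and observed rates
  rule = c("score<650 & age<26", "debt>50k", "salary<20k"),
  failure_rate = c(0.0673, 0.0512, 0.0581)
)
rules$lift <- rules$failure_rate - upper_bound     # how far above the CI upper bound
rules[order(-rules$lift), ]                        # rank rules, strongest first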



Comment by Claudio Lucio do Val Lopes on May 3, 2013 at 1:04pm

Based on my understanding of this issue, I wrote some 'toy' code in R. I also created a video of the simulation.

VIDEO:

https://www.youtube.com/watch?v=72uhdRrf6gM

CODE:

set.seed(3.1416)
x <- rnorm(1000000, 10, 2)

# N = number of random bins; x = data; med, desv = mean and standard deviation,
# used only to set the plotting range
analytic_teorem <- function(N, x, med, desv)
{
    # compute the statistic (here, the mean) on each of the N random bins
    meansamples <- rep(NA, N)
    for (i in 1:N) {
        meansamples[i] <- mean(sample(x, length(x)/N, replace = TRUE))
    }
    meansamples <- sort(meansamples)

    # for m = 1, ..., N/2, the interval is [m-th smallest, m-th largest] bin mean;
    # confidence stores 2m/(N+1), the complement of the confidence level 1 - 2m/(N+1)
    confidence <- rep(NA, N/2)
    p_lower <- rep(NA, N/2)
    p_upper <- rep(NA, N/2)
    for (k in 1:(N/2)) {               # note: 1:(N/2), not 1:N/2
        confidence[k] <- 2*k/(N+1)
        p_lower[k] <- meansamples[k]
        p_upper[k] <- meansamples[N-k+1]
    }

    # median of the bin means, used as the horizontal reference line
    mean_t <- (meansamples[N/2] + meansamples[N/2 + 1])/2

    # plot the lower/upper bounds against m (labeled K on the x axis),
    # with the corresponding confidence level on the top axis
    x_a <- 1:(N/2)
    par(mar = c(5, 4, 4, 5) + .1)
    plot(x_a, p_upper, type = "l", ylim = range(c(med - desv, med + desv)),
         col = "blue", xlab = "K", ylab = "Values")
    lines(x_a, p_lower, col = "red")
    par(new = TRUE)
    plot(confidence*(N/2), rep(mean_t, N/2), type = "l", lty = 2, col = "black",
         ylim = range(c(med - desv, med + desv)),
         xaxt = "n", yaxt = "n", xlab = "", ylab = "")
    axis(3, at = seq(0, 1, by = .25)*(N/2), labels = c(1, 0.75, 0.50, 0.25, 0), las = 0)
    mtext("Confidence", side = 3, line = 3)
    legend("bottomright", col = c("red", "blue"), lty = 1,
           legend = c("Lower bound", "Upper bound"))
    legend("topright", legend = paste("N =", N))
}

for (i in seq(1000, 10000, 100)) analytic_teorem(i, x, 10, 2)

Is it right? Does it make sense?
Comment by Vincent Granville on April 30, 2013 at 1:07pm

This was an earlier post about the same result.

Comment by Claudio Lucio do Val Lopes on April 30, 2013 at 12:04pm

I'm a little bit confused about the relation to this post:

http://www.analyticbridge.com/forum/topics/easy-to-compute?commentI...

Easy to compute, distribution-free, fractional confidence intervals

Comment by Marc d. Paradis on April 15, 2013 at 12:20pm

Vincent: correct me if I am wrong...

k would appear to be the number of random bins;

m would appear to be the number of non-random bins;

5% would appear to be a given (average fail rate = 50,000 fails/1,000,000 customers);

6.73% would appear to be a given (67.3 fails/1,000 customers with credit score < 650 and age < 26), although I don't really understand how you get 0.3 of a fail for 1,000 customers, assuming that the whole customer, and not some fraction, fails.

 

What I do not understand is where [4.41%, 5.53%] comes from and how that is related to the 98% confidence interval given in the example.

 

I have to agree with Brt Dnk. Vincent, can you give us a spreadsheet or a dataset from which to replicate your example above (or one similar, but with perhaps only 100,000 total observations)?

 

-Marc d. Paradis

Comment by Brt Dnk on March 22, 2013 at 7:40pm

Hi Vincent,

This sounds great, but I'm having trouble understanding how to set the two parameters, and the PDF describing the theorem didn't help much. Could you please post a spreadsheet with a specific calculation example, and perhaps some pointers on selecting the right parameters?

Thank you in advance; this would be very useful if I knew how to get the right k and m...
