
How to build simple, accurate, data-driven, model-free confidence intervals

An updated version with source code and detailed explanations can be found here.

If observations from a specific experiment (for instance, scores computed on 10 million credit card transactions) are assigned a random bin ID (labeled 1, ..., k), then you can easily build a confidence interval for any proportion or score computed on these k random bins, using the Analyticbridge theorem.
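
For readers who prefer code, here is a minimal R sketch of the procedure. It is a hedged illustration, not a quote of the theorem: it assumes (consistently with the k = 99, m = 1, 98% example below and with the R code in the comments) that the interval between the m-th smallest and the m-th largest bin estimates has confidence level 1 - 2m/(k+1), and it uses simulated placeholder scores.

scores <- runif(1000000)                                 # placeholder scores; substitute your own data
k <- 99                                                  # number of random bins
m <- 1                                                   # order statistic used for the bounds
bin_id <- sample(rep(1:k, length.out = length(scores)))  # random bin assignment
bin_means <- sort(tapply(scores, bin_id, mean))          # per-bin statistic, sorted
ci <- c(bin_means[m], bin_means[k - m + 1])              # model-free confidence interval
conf_level <- 1 - 2*m/(k + 1)                            # 0.98 with k = 99, m = 1 (assumed formula)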

The proof of this theorem relies on complicated combinatorial arguments and the use of the Beta function. Note that the final result does not depend on the distribution associated with your data; in short, your data does not have to follow a Gaussian (a.k.a. normal) or any pre-specified statistical distribution for the confidence intervals to be valid. You can find more details regarding the proof of the theorem in the book Statistics of Extremes by E.J. Gumbel, pages 58-59 (Dover edition, 2004).

Parameters in the Analyticbridge theorem can be chosen to achieve the desired level of precision, e.g. a 95%, 99%, or 99.5% confidence interval. The theorem also tells you what your sample size should be to achieve a pre-specified accuracy level. It is a fundamental result for computing simple, per-segment, data-driven, model-free confidence intervals in many contexts, in particular when generating predictive scores produced via logistic / ridge regression or decision trees / hidden decision trees (e.g. for fraud detection, or consumer or credit scoring).
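
As a hedged illustration of how the parameters drive precision, again assuming the confidence level is 1 - 2m/(k+1) (the formula implied by the k = 99, m = 1, 98% example below), the helper functions here are convenience wrappers, not part of the theorem itself:

conf_for <- function(k, m) 1 - 2*m/(k + 1)    # assumed confidence level for given k and m
conf_for(99, 1)     # 0.98 -> the 98% interval used in the application below
conf_for(199, 1)    # 0.99
conf_for(399, 1)    # 0.995
k_needed <- function(target, m) ceiling(2*m/(1 - target) - 1)
k_needed(0.99, 1)   # 199 random bins needed for a 99% interval with m = 1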

Application:

A scoring system designed to detect customers likely to fail on a loan is based on a rule set. On average, for an individual customer, the probability of failing is 5%. In a data set with 1 million observations (customers) and several metrics such as credit score, amount of debt, salary, etc., if we randomly select 99 bins, each containing 1,000 customers, the 98% confidence interval (per bin of 1,000 customers) for the failure rate is (say) [4.41%, 5.53%], based on the Analyticbridge theorem with k = 99 and m = 1 (read the theorem to understand what k and m mean; the meaning of these parameters is actually very easy to grasp).
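
The exact interval depends on the data, but the example can be reproduced in spirit with a small hedged simulation (simulated customers with a 5% failure probability, k = 99, m = 1; the resulting bounds will be in the same ballpark as, not exactly equal to, [4.41%, 5.53%]):

set.seed(1)
fail <- rbinom(1000000, 1, 0.05)                    # 1 = customer fails; 5% average failure rate
idx <- sample(length(fail), 99 * 1000)              # draw 99 bins x 1,000 customers at random
bin_id <- rep(1:99, each = 1000)
bin_rates <- sort(tapply(fail[idx], bin_id, mean))  # sorted per-bin failure rates
c(lower = unname(bin_rates[1]),                     # 98% CI with k = 99, m = 1
  upper = unname(bin_rates[99]))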

Now, looking at a non-random bin with 1,000 observations, consisting of customers with a credit score below 650 who are younger than 26, we see that the failure rate is 6.73%. We can thus conclude that the rule "credit score < 650 and age < 26" is actually a good rule for detecting customers likely to fail, because 6.73% is well above the upper bound of the [4.41%, 5.53%] confidence interval.

Indeed, we could test hundreds of rules and easily identify those with high predictive power, by systematically and automatically looking at how far the observed failure rate (for a given rule) lies from a standard confidence interval. This allows us to rule out the effect of noise, and to process and rank numerous rules at once based on their predictive power, that is, how far their failure rate sits above the confidence interval's upper bound.
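
A hedged sketch of that screening step in R (the rule names, observed rates, and the upper_bound value are hypothetical, chosen to match the example above):

upper_bound <- 0.0553                              # hypothetical: upper bound of the 98% CI above
rules <- data.frame(                               # hypothetical candidate rules and observed rates
  rule = c("score<650 & age<26", "debt>50k", "salary<20k"),
  failure_rate = c(0.0673, 0.0512, 0.0581)
)
rules$lift <- rules$failure_rate - upper_bound     # how far above the CI upper bound
rules[order(-rules$lift), ]                        # rank rules, strongest first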



Comment by Claudio Lucio do Val Lopes on May 3, 2013 at 1:04pm

Based on my understanding of this issue, I wrote some 'toy' code in R. I also created a video of the simulation.

VIDEO:

https://www.youtube.com/watch?v=72uhdRrf6gM

CODE:

set.seed(3.1416)
x <- rnorm(1000000, 10, 2)

# N = number of random bins; x = data; med, desv = mean and standard deviation,
# used only to set the plotting range
analytic_teorem <- function(N, x, med, desv)
{
    # compute the statistic (here, the mean) on each of the N random bins
    meansamples <- rep(NA, N)
    for (i in 1:N) {
        meansamples[i] <- mean(sample(x, length(x)/N, replace = TRUE))
    }
    meansamples <- sort(meansamples)

    # for m = 1, ..., N/2, the interval is [m-th smallest, m-th largest] bin mean;
    # confidence stores 2m/(N+1), the complement of the confidence level 1 - 2m/(N+1)
    confidence <- rep(NA, N/2)
    p_lower <- rep(NA, N/2)
    p_upper <- rep(NA, N/2)
    for (k in 1:(N/2)) {               # note: 1:(N/2), not 1:N/2
        confidence[k] <- 2*k/(N+1)
        p_lower[k] <- meansamples[k]
        p_upper[k] <- meansamples[N-k+1]
    }

    # median of the bin means, used as the horizontal reference line
    mean_t <- (meansamples[N/2] + meansamples[N/2 + 1])/2

    # plot the lower/upper bounds against m (labeled K on the x axis),
    # with the corresponding confidence level on the top axis
    x_a <- 1:(N/2)
    par(mar = c(5, 4, 4, 5) + .1)
    plot(x_a, p_upper, type = "l", ylim = range(c(med - desv, med + desv)),
         col = "blue", xlab = "K", ylab = "Values")
    lines(x_a, p_lower, col = "red")
    par(new = TRUE)
    plot(confidence*(N/2), rep(mean_t, N/2), type = "l", lty = 2, col = "black",
         ylim = range(c(med - desv, med + desv)),
         xaxt = "n", yaxt = "n", xlab = "", ylab = "")
    axis(3, at = seq(0, 1, by = .25)*(N/2), labels = c(1, 0.75, 0.50, 0.25, 0), las = 0)
    mtext("Confidence", side = 3, line = 3)
    legend("bottomright", col = c("red", "blue"), lty = 1,
           legend = c("Lower bound", "Upper bound"))
    legend("topright", legend = paste("N =", N))
}

for (i in seq(1000, 10000, 100)) analytic_teorem(i, x, 10, 2)

Is it right? Does it make sense?
Comment by Vincent Granville on April 30, 2013 at 1:07pm

This was an earlier post about the same result.

Comment by Claudio Lucio do Val Lopes on April 30, 2013 at 12:04pm

I'm a little bit confused about the relation to this post:

http://www.analyticbridge.com/forum/topics/easy-to-compute?commentI...

Easy to compute, distribution-free, fractional confidence intervals

Comment by Marc d. Paradis on April 15, 2013 at 12:20pm

Vincent: correct me if I am wrong...

k would appear to be the number of random bins;

m would appear to be the number of non-random bins;

5% would appear to be a given (average fail rate = 50,000 fails/1,000,000 customers);

6.73% would appear to be a given (67.3 fails/1,000 customers with credit score < 650 and age < 26), although I don't really understand how you get 0.3 of a fail for 1,000 customers, assuming that the whole customer, and not some fraction, fails.

 

What I do not understand is where [4.41%, 5.53%] comes from and how that is related to the 98% confidence interval given in the example.

 

I have to agree with Brt Dnk. Vincent, can you give us a spreadsheet or a dataset from which to replicate your example above (or one similar, but with perhaps only 100,000 total observations)?

 

-Marc d. Paradis

Comment by Brt Dnk on March 22, 2013 at 7:40pm

Hi Vincent,

This sounds great, but I'm having trouble understanding how to set the two parameters, and the PDF describing the theorem didn't help much. Could you please post a spreadsheet with a specific calculation example, and perhaps some pointers on selecting the right parameters?

Thank you in advance; this would be very useful if I knew how to get the right k and m...
