Wednesday, January 31, 2007

Box plot and whisker plots in Excel 2007

2011-10-05 Update! A new version of this article and a FREE and enhanced Microsoft Excel 2007 example is available in my blog post at Cardinal Path: The math behind web analytics: box plot.

This articles aims to highlight the importance of box plots (or horizontal box, or Whisker plot) in analytics and provides a step by step example to make them in Microsoft Excel 2007.

What are box plots?

A box plot is a visualization method used in statistics. It shows, in a glimpse, several very important elements describing a data sample:
  • the smallest observation
  • the lower quartile (25%)
  • the median (50%)
  • the higher quartile (75%)
  • the largest observation
In addition, a box plot can help visualize two other elements:
  • abnormal data (outliers)
  • average

How can they be used in web analytics?

Boxplots are a quick visualization approach for examining one or more data sets. Because they can easily reveal the limits of acceptable data and any extremes, it becomes very easy to explain trends and abnormalities and communicate the right information.

We have to consider, however, than in term of data visualization, some ways of showing Probability Density Function can also reveal the same information, and more, in a very interesting way. However, drawing them in Excel might be very difficult...

How can I draw my own box plot?

For this example, we will use the fictitious data representing number of Visits to a Web site on a day by day basis, as shown below:

  1. The first thing to do is to sort the values (the Visits column) in ascending order.
  2. Get the following statistical information from your data. Those are the essential ones to build the box plot:
  3. We now need to calculate the width of each box and the lower/higher non-outliers limits:
  4. Select the 25th, 50th and 75th percentile values (yellow cells) and Insert a new 2-D Stacked Bar graph (or press Alt-F1 and then Change Chart Type). Then Switch Row/Colum (under Chart Tools/Design menu). You should get something like this:
  5. Some formatting will get us closer to our goal:
    • Delete the Legend
    • Delete the Horizontal (Value) Axis Major Gridlines
    • Set the 1st box and set it to be totally transparent: right-click, Format Data Series..., Fill, No fill
    • Set the 2nd and 3rd boxes to be transparent with a border: no fill as above, then Border Color, Solid Line
    • Set Vertical axis to cross at maximum value: select the Horizontal (Value) Axis, right-click, Format Axis..., Vertical axis crosses: Maximum axis value
    • Hide the vertical axis bar: select the Vertical (Category) Axis, right-click, Format Axis..., Line Color, No line
    • Set the Category axis label: right-click in the chart, Select Data, pick the Horizontal (Category) Axis labels, click the Edit button and pick the cell containing your data label
  6. Set the lower limit and higher limit ranges
    • Select the first box, click on Chart Tools, Layout. Select Error bars, More error bars options..., pick the Display Direction: Minus, indicate the Error Amount: Custom and click the Specify Value button. Leave the Positive Error Value as is and select the blue cell (Lower limit) for the Negative Error Value.
    • Same as above for the 3rd box, but choose the Display Direction: Plus and pick the Higher limit (green cell) cell for the Positive Error Value.
  7. We're getting close, we will now add the average:
    • Select the cells containing the average label and value (C3:D3). Do a copy (ctrl-c). Simply select the graph and paste this value (ctrl-v) as a new series.
    • It's being added as a new box in our graph. Select it and Change Series Chart Type..., pick the X Y (Scatter) with only markers.
    • This will show the marker, but it won't be positioned correctly in the graph. Remember we switched rows and columns earlier: right-click anywhere in the chart, Select Data... then pick the newly added series (Average). Edit and switch the X and Y values.
    • Pick the left-side Secondary Vertical (Value) Axis and rick-click to Format Axis..., set the Minimum value to -1 and the Maximum to 1. Close the dialog. You can now delete this extra axis scale indicator.
    • The average is now positioned correctly. You might want to change the indicator to a more visible symbol.
  8. Let's add the last touch! We want to show the outliers:
    • Right-click the graph and Select Data...
    • Click the Add button
    • Set the Series name to "Low outliers"
    • For the the Series X, select any value from the sorted data that is lower than the minimum valid range (D2)
    • For the Series Y Values, put ={0} so the new data points are shown correctly on the vertical scale.
    • Use the same technique for the "High outliers" with any value that is larger then the maximum valid range (D7)
  9. Some minor aesthetic adjustments and we're set!

Analysis

  • We see that Visits ranged from about 125 to 250.
  • We see the normal range to be between 165 and about 200.
  • We spot two abnormal values: one too low, probably an indication there was a problem on that day, and one two high, for example a spider or web bot crawling our site.
Comments and suggestions are always appreciated!

Some references