# Article

Understanding Descriptive Statistics

Introduction

The term statistics can have several meanings. In one sense, statistics refers to data. For instance, the number of people per county within a State is an example of numerical data. The type of vegetative cover found across a State is an example of non-numerical data. Statistics can also refer to specific mathematical operations performed on data.

When talking about statistics as mathematical operations, there are two basic divisions within the field: descriptive and inferential. Descriptive statistics use graphical and numerical summaries to give a 'picture' of a data set. Inferential statistics use mathematical probabilities to make generalizations about a large group based on data collected from a small sample of that group. This article focuses on descriptive statistics and their use.

Descriptive Statistics

To help explain descriptive statistics, we will use the total number of tornadoes recorded by State (including the District of Columbia) for the year 2000, as shown in Table 1.

Descriptive statistics can include graphical summaries that show the spread of the data, and numerical summaries that either measure the central tendency (a 'typical' data value) of a data set or that describe the spread of the data.

| State | Number of Tornadoes | State | Number of Tornadoes |
|---|---|---|---|
| Alabama | 44 | Montana | 10 |
| Alaska | 0 | Nebraska | 60 |
| Arizona | 0 | Nevada | 2 |
| Arkansas | 37 | New Hampshire | 0 |
| California | 9 | New Jersey | 0 |
| Colorado | 60 | New Mexico | 5 |
| Connecticut | 1 | New York | 5 |
| Delaware | 0 | North Carolina | 23 |
| District of Columbia | 0 | North Dakota | 28 |
| Florida | 77 | Ohio | 25 |
| Georgia | 28 | Oklahoma | 44 |
| Hawaii | 0 | Oregon | 3 |
| Idaho | 13 | Pennsylvania | 5 |
| Illinois | 55 | Rhode Island | 1 |
| Indiana | 13 | South Carolina | 20 |
| Iowa | 45 | South Dakota | 18 |
| Kansas | 59 | Tennessee | 27 |
| Kentucky | 23 | Texas | 147 |
| Louisiana | 43 | Utah | 3 |
| Maine | 2 | Vermont | 0 |
| Maryland | 8 | Virginia | 11 |
| Massachusetts | 1 | Washington | 3 |
| Michigan | 4 | West Virginia | 4 |
| Minnesota | 32 | Wisconsin | 18 |
| Mississippi | 27 | Wyoming | 5 |
| Missouri | 28 | | |

Table 1. The total number of recorded tornadoes in 2000, arranged alphabetically by State and including the District of Columbia. Source: National Oceanic and Atmospheric Administration's National Climatic Data Center

Graphical Summaries: Dispersion Graphs

Dispersion graphs (also called dot plots) are one kind of graphical summary. Researchers use dispersion graphs to identify patterns in data such as concentrations, locations of data 'gaps', or atypical data (i.e., observations that do not fit the general character of the data). A dispersion graph places individual data values along a number line, thereby representing the position of each data value in relation to all the other data values. Figure 1 shows a dispersion graph of the tornado data from Table 1. We can see that most of the data are concentrated at the lower end of the graph, indicating that most States had 20 or fewer tornadoes, and only a few States had more than 50 tornadoes. Texas had 147 tornadoes during 2000, and this data value is positioned on the far right-hand side of the graph. There are also several gaps in the data where there are no values; these are shown by the green rectangles on the graph. Since the data are concentrated toward the lower end of the dispersion graph, we can say that the number of tornadoes for Texas (147) is atypical of this particular data set. Atypical data values are also referred to as outliers.

Figure 1. A dispersion graph of the 2000 tornado data from Table 1.
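The patterns read off the dispersion graph can also be checked by counting. The following is a minimal sketch in Python, assuming the Table 1 values are typed in by hand as a sorted list; the names `TORNADOES`, `at_most_20`, `more_than_50`, and `largest_gap` are illustrative and not part of the original article.

```python
# Tornado counts per State for 2000 (Table 1), sorted in ascending order.
TORNADOES = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4,
    5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60,
    77, 147,
]

# Concentration at the lower end of the number line.
at_most_20 = sum(1 for count in TORNADOES if count <= 20)
more_than_50 = sum(1 for count in TORNADOES if count > 50)

# Widest gap between neighbouring values: (gap size, lower value, upper value).
largest_gap = max(
    (high - low, low, high) for low, high in zip(TORNADOES, TORNADOES[1:])
)

print(f"States with 20 or fewer tornadoes: {at_most_20}")
print(f"States with more than 50 tornadoes: {more_than_50}")
print(f"Widest gap: {largest_gap[0]} (between {largest_gap[1]} and {largest_gap[2]})")
```

The widest gap lies between Florida (77) and Texas (147), which is what makes the Texas value stand apart on the graph.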

Graphical Summaries: Histograms

Another kind of graphical summary is the histogram, which combines data into groups or classes as a way to generalize the details of a data set while at the same time illustrating the data's overall pattern. On a histogram, the x-axis represents the data values arranged into classes while the y-axis shows the number of occurrences in each class.

Figure 2. A histogram showing the 2000 tornado data.

In the Figure 2 histogram we see that the first class contains all the States that experienced between zero and nineteen tornadoes during 2000. Notice that each class has the same width along the x-axis. The decision to set the width of each class at twenty (so that the first class runs from 0 through 19) is arbitrary. A different width could easily be used and would likely change the overall appearance of the histogram. As with dispersion graphs, histograms can show gaps where no data values exist. In Figure 2, there are three empty classes: 80-99, 100-119, and 120-139.
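The grouping behind a histogram like Figure 2 can be sketched in a few lines of Python. This is an illustration only, assuming classes of 0-19, 20-39, and so on (each covering 20 whole-number values); `CLASS_WIDTH`, `class_counts`, and `empty_classes` are names chosen here, not taken from the article.

```python
from collections import Counter

# Tornado counts per State for 2000 (Table 1), sorted in ascending order.
TORNADOES = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4,
    5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60,
    77, 147,
]

CLASS_WIDTH = 20  # assumed: each class covers 20 values, e.g. 0 through 19

# Integer division assigns every value to the index of its class.
class_counts = Counter(count // CLASS_WIDTH for count in TORNADOES)
number_of_classes = max(TORNADOES) // CLASS_WIDTH + 1

empty_classes = []
for index in range(number_of_classes):
    low = index * CLASS_WIDTH
    label = f"{low}-{low + CLASS_WIDTH - 1}"
    occurrences = class_counts[index]  # Counter returns 0 for missing classes
    if occurrences == 0:
        empty_classes.append(label)
    # One '#' per State gives a rough text version of the histogram bars.
    print(f"{label:>7} | {'#' * occurrences} ({occurrences})")

print("Empty classes:", empty_classes)
```

Changing `CLASS_WIDTH` to another value, such as 10 or 25, is a quick way to see how the choice of class width alters the apparent shape of the data.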

When histogram data cluster to one side or the other, the shape of the histogram is described as 'skewed'. In Figure 2, the tornado data are clustered on the lower or left-hand side, which is known as positive skew. Due to the single outlier, the data is said to 'tail' to the positive side.

Figure 3 illustrates the different degrees of skew that are typical of data sets. Data sets that have a greater number of high values, with outliers on the low end of the data scale (data that 'tail' to the negative side), are said to have negative skew. Histogram B in Figure 3 is an example of data having a negative skew. Histogram C in Figure 3 is an example of a normal data set, which has no skew because its outliers are not concentrated on one particular side of the distribution.

Figure 3. Histograms displaying examples of different degrees of skew.

Numerical Summaries: Measures of Central Tendency

Measures of central tendency are numerical summaries used to summarize a data set with a single 'typical' number. Three commonly reported measures of central tendency are the mean, median, and mode. With large data sets, the calculation of a measure of central tendency is best handled through a computer software package that will minimize the chance of errors.

Mean
The mean, commonly called the average, is a mathematically computed value which represents a central value of a given data set. The mean is computed by adding all the data values together and dividing by n, where n represents the total number of data values. For our tornado data, adding all the data values together results in 1,076—the total number of tornadoes recorded in all States during 2000. Dividing this total by 51 gives us a mean of 21.1. If we examine the mean in relation to all data values (Figure 4), we can see that the mean lies toward the lower end of the dispersion graph, which makes sense because this is where the majority of the data values are concentrated.

The mean represents a generalization of the data and therefore, interpretation of its value must be done with care or else the value can be misleading. The mean suggests that for any given State there were, on average, 21.1 tornadoes during 2000. A quick glance at Table 2 shows that no State had exactly 21.1 tornadoes—each State had either more or fewer tornadoes than the mean value. However, note that there are a few States with a number of tornadoes close to 21.1. Also note that the mean is influenced by extremes in the data. In other words, in a data set having extremely high or low data values, the mean tends to be 'pulled' in the direction of those outliers and therefore can misrepresent the data's central tendency. Thus, it should not be surprising that the mean for our tornado data is pulled to the right by the value for Texas (147).

Figure 4. A dispersion graph showing the position of the mean number of tornadoes by State for 2000.
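The mean calculation, and the pull that an outlier exerts on it, can be reproduced with a short Python sketch. The hand-entered list and the names `mean` and `mean_without_texas` are assumptions made for this illustration.

```python
# Tornado counts per State for 2000 (Table 1), sorted in ascending order.
# Texas (147) is the last value.
TORNADOES = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4,
    5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60,
    77, 147,
]

# Mean = sum of all data values divided by n.
total = sum(TORNADOES)
n = len(TORNADOES)
mean = total / n

# Recompute without the Texas value to see how far one outlier moves the mean.
without_texas = TORNADOES[:-1]
mean_without_texas = sum(without_texas) / len(without_texas)

print(f"Total tornadoes: {total}, n = {n}, mean = {mean:.1f}")
print(f"Mean without Texas: {mean_without_texas:.1f}")
```

Dropping the single Texas value lowers the mean from about 21.1 to about 18.6, which illustrates how strongly one extreme value can shift this measure.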

Median
If we divide the data into two equal halves where each half contains 50% of the data, the numerical value where the data are divided is called the median. You can also think of the median as the 50th percentile: the point with an equal number of data values on either side of it. To compute the median, three steps are required. First, the data are ordered by rank, as has been done in Table 2. Second, the data position is calculated. This requires examining the data to determine whether there is an even or odd number of data values. The tornado data set has 51 data values, which is an odd number. In this case, where there is an odd number of data values, the following equation is used:

(n + 1)/2 = Rp
where Rp is the rank-position of the median in the rank-ordered data and n represents the number of data values.

Using this equation, we can insert the appropriate values for our data set:

(51 + 1)/2 = 26
which gives the data position of the median in the ranked-order tornado data set, not the median value. Third, to find the median value, look at data position 26 in the rank-ordered data set, which is Virginia. The data value associated with the rank of 26 is 11, which is the median for the tornado data set. The median in this case equally divides the data into two halves, so that there are exactly 25 data values above and 25 data values below the median value of 11.

| Rank | State | Number of Tornadoes | Rank | State | Number of Tornadoes |
|---|---|---|---|---|---|
| 1 | Alaska | 0 | 27 | Idaho | 13 |
| 2 | Arizona | 0 | 28 | Indiana | 13 |
| 3 | District of Columbia | 0 | 29 | South Dakota | 18 |
| 4 | Delaware | 0 | 30 | Wisconsin | 18 |
| 5 | Hawaii | 0 | 31 | South Carolina | 20 |
| 6 | New Hampshire | 0 | 32 | Kentucky | 23 |
| 7 | New Jersey | 0 | 33 | North Carolina | 23 |
| 8 | Vermont | 0 | 34 | Ohio | 25 |
| 9 | Connecticut | 1 | 35 | Mississippi | 27 |
| 10 | Massachusetts | 1 | 36 | Tennessee | 27 |
| 11 | Rhode Island | 1 | 37 | Georgia | 28 |
| 12 | Maine | 2 | 38 | Missouri | 28 |
| 13 | Nevada | 2 | 39 | North Dakota | 28 |
| 14 | Oregon | 3 | 40 | Minnesota | 32 |
| 15 | Utah | 3 | 41 | Arkansas | 37 |
| 16 | Washington | 3 | 42 | Louisiana | 43 |
| 17 | Michigan | 4 | 43 | Alabama | 44 |
| 18 | West Virginia | 4 | 44 | Oklahoma | 44 |
| 19 | New Mexico | 5 | 45 | Iowa | 45 |
| 20 | New York | 5 | 46 | Illinois | 55 |
| 21 | Pennsylvania | 5 | 47 | Kansas | 59 |
| 22 | Wyoming | 5 | 48 | Colorado | 60 |
| 23 | Maryland | 8 | 49 | Nebraska | 60 |
| 24 | California | 9 | 50 | Florida | 77 |
| 25 | Montana | 10 | 51 | Texas | 147 |
| 26 | Virginia | 11 | | | |

Table 2. The 2000 tornado data from Table 1 ranked in ascending order.

If we look at Figure 5, we see the tornado data median value of 11 on the dispersion graph. Note that for this data set, the median is positioned closer to the lower end of the data values than the mean. This shows that the median is not influenced by outliers as was the mean, but by the number of data values. When a data set has outliers, reporting the median as the central tendency of the data often gives a better 'typical' data value than the mean.

Figure 5. A dispersion graph comparing the median and mean values for the number of tornadoes by State for 2000.

How would you compute a median if there were an even number of data values? In the case where there is an even number of data values, the following equation is used:

Average of (n/2) and ((n/2) + 1) = Rp
where Rp is the rank-position of the median in the rank-ordered data and n represents the number of data values.

Unlike the first equation, the rank position for an even number of data values falls halfway between the two middle ranks, so the median is the average of the two middle data values. To illustrate, assume that we removed the data value for Texas, leaving us with only 50 data values. Begin with the data ranked in order as in Table 2. Substituting the appropriate values into the equation gives us the following rank positions: (50/2) = 25 and ((50/2) + 1) = 26, so Rp = 25.5. In Table 2, Montana is ranked 25th with 10 tornadoes and Virginia is ranked 26th with 11 tornadoes. If we average the data values corresponding to the 25th and 26th ranks (10 and 11, respectively), we have a median value of 10.5. It is important to remember that, regardless of which equation is used, the resulting Rp number is not the median value, but the rank which can then be used to find the median value.
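Both rank-position equations can be combined into one small Python function. This is a sketch rather than a definitive implementation: the function name `median_by_rank` is hypothetical, and the data are assumed to be already ranked in ascending order, as in Table 2.

```python
# Tornado counts per State for 2000, ranked in ascending order (Table 2).
TORNADOES = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4,
    5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60,
    77, 147,
]


def median_by_rank(ranked):
    """Return (rank position Rp, median value) for rank-ordered data."""
    n = len(ranked)
    if n % 2 == 1:
        # Odd n: Rp = (n + 1) / 2 points at a single data value.
        rank = (n + 1) // 2
        return rank, ranked[rank - 1]  # ranks start at 1, list indexes at 0
    # Even n: Rp lies halfway between ranks n/2 and (n/2) + 1,
    # and the median is the average of the values at those two ranks.
    lower_rank = n // 2
    upper_rank = lower_rank + 1
    rank = (lower_rank + upper_rank) / 2
    value = (ranked[lower_rank - 1] + ranked[upper_rank - 1]) / 2
    return rank, value


odd_case = median_by_rank(TORNADOES)        # all 51 values
even_case = median_by_rank(TORNADOES[:-1])  # Texas removed, 50 values

print(f"51 values: Rp = {odd_case[0]}, median = {odd_case[1]}")
print(f"50 values: Rp = {even_case[0]}, median = {even_case[1]}")
```

Note that removing the Texas value changes the median only from 11 to 10.5, a much smaller shift than the same removal causes in the mean.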

Mode
The mode is the data value that occurs the most frequently in a data set. Although not used as often as the mean and the median, by identifying the most commonly occurring data value the mode may suggest the central tendency of the data. For the tornado data, the mode is 0. There are eight States that did not experience any tornadoes in 2000. However, it would be misleading to suggest that the central tendency of this data set is 0, since it is obvious from the data values that the value of 0 is not 'central' to the range of values.
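Finding the mode amounts to tallying how often each value occurs. A minimal Python sketch, assuming the Table 1 values are entered by hand and using the standard library's `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# Tornado counts per State for 2000 (Table 1), sorted in ascending order.
TORNADOES = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4,
    5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60,
    77, 147,
]

# Tally every distinct value, then take the most frequent one.
value_counts = Counter(TORNADOES)
mode_value, mode_frequency = value_counts.most_common(1)[0]

print(f"Mode: {mode_value} (occurs {mode_frequency} times)")
```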

Numerical Summaries: Measures of Dispersion

While measures of central tendency summarize a data set with a single 'typical' number, it is also useful to describe the 'spread' of the data with a single number. Describing how a data set is distributed can be accomplished through one of the measures of dispersion: variance, standard deviation, or interquartile range.

Examine once again the dispersion graph in Figure 1. As mentioned earlier, a dispersion graph shows the distribution of the data along the number line. We described the tornado data as concentrated toward the lower end of the number line. However, the data ranges from a low of 0 to a high of 147, which may be considered to be quite a large range. Describing this spread with a single number rather than using words can be more convenient and is the basis of measures of dispersion.

Variance
One measure of dispersion is the variance. Suppose we subtracted the mean (21.1) from each State's tornado data value. The resulting value is called a deviation score and tells us the numerical distance between the data value and the data's 'typical' value. Notice in Table 3 that the sum of all the deviation scores equals zero. This results because the data values above and below the mean have positive and negative deviation scores, respectively. In other words, the positive and negative deviation scores cancel each other out. To remove the negative values we can square the deviation scores, and the sum of the squared deviation scores (36,096.5) is called the sum of squares. If we divide the sum of squares by the number of data values (51), the resulting value is the variance (707.8). The variance, then, is the average of the squared deviation scores. By itself, the variance is rarely reported, but it is necessary to compute the standard deviation, which is a more meaningful measure of dispersion. Table 3 lists the deviation scores and squared deviation scores for our tornado data.

| State | Number of Tornadoes | Deviation Scores | Squared Deviation Scores |
|---|---|---|---|
| Alaska | 0 | -21.1 | 445.21 |
| Arizona | 0 | -21.1 | 445.21 |
| District of Columbia | 0 | -21.1 | 445.21 |
| Delaware | 0 | -21.1 | 445.21 |
| Hawaii | 0 | -21.1 | 445.21 |
| New Hampshire | 0 | -21.1 | 445.21 |
| New Jersey | 0 | -21.1 | 445.21 |
| Vermont | 0 | -21.1 | 445.21 |
| Connecticut | 1 | -20.1 | 404.01 |
| Massachusetts | 1 | -20.1 | 404.01 |
| Rhode Island | 1 | -20.1 | 404.01 |
| Maine | 2 | -19.1 | 364.81 |
| Nevada | 2 | -19.1 | 364.81 |
| Oregon | 3 | -18.1 | 327.61 |
| Utah | 3 | -18.1 | 327.61 |
| Washington | 3 | -18.1 | 327.61 |
| Michigan | 4 | -17.1 | 292.41 |
| West Virginia | 4 | -17.1 | 292.41 |
| New Mexico | 5 | -16.1 | 259.21 |
| New York | 5 | -16.1 | 259.21 |
| Pennsylvania | 5 | -16.1 | 259.21 |
| Wyoming | 5 | -16.1 | 259.21 |
| Maryland | 8 | -13.1 | 171.61 |
| California | 9 | -12.1 | 146.41 |
| Montana | 10 | -11.1 | 123.21 |
| Virginia | 11 | -10.1 | 102.01 |
| Idaho | 13 | -8.1 | 65.61 |
| Indiana | 13 | -8.1 | 65.61 |
| South Dakota | 18 | -3.1 | 9.61 |
| Wisconsin | 18 | -3.1 | 9.61 |
| South Carolina | 20 | -1.1 | 1.21 |
| Kentucky | 23 | 1.9 | 3.61 |
| North Carolina | 23 | 1.9 | 3.61 |
| Ohio | 25 | 3.9 | 15.21 |
| Mississippi | 27 | 5.9 | 34.81 |
| Tennessee | 27 | 5.9 | 34.81 |
| Georgia | 28 | 6.9 | 47.61 |
| Missouri | 28 | 6.9 | 47.61 |
| North Dakota | 28 | 6.9 | 47.61 |
| Minnesota | 32 | 10.9 | 118.81 |
| Arkansas | 37 | 15.9 | 252.81 |
| Louisiana | 43 | 21.9 | 479.61 |
| Alabama | 44 | 22.9 | 524.41 |
| Oklahoma | 44 | 22.9 | 524.41 |
| Iowa | 45 | 23.9 | 571.21 |
| Illinois | 55 | 33.9 | 1149.21 |
| Kansas | 59 | 37.9 | 1436.41 |
| Colorado | 60 | 38.9 | 1513.21 |
| Nebraska | 60 | 38.9 | 1513.21 |
| Florida | 77 | 55.9 | 3124.81 |
| Texas | 147 | 125.9 | 15850.81 |
| | | Sum=0.0 | Sum=36096.5 |

Table 3. The 2000 tornado data's deviation scores, squared deviation scores, and their sums which are used to compute the variance and standard deviation.
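The Table 3 arithmetic can be reproduced with a short Python sketch. It assumes, as the text does, that the sum of squares is divided by the number of data values n; the variable names are illustrative. The sketch uses the unrounded mean, so individual deviation scores differ very slightly from the table, which uses 21.1.

```python
# Tornado counts per State for 2000 (Table 1), sorted in ascending order.
TORNADOES = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4,
    5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60,
    77, 147,
]

n = len(TORNADOES)
mean = sum(TORNADOES) / n

# Deviation score: data value minus the mean (negative below the mean).
deviations = [count - mean for count in TORNADOES]

# Squaring removes the signs so the scores no longer cancel out.
squared_deviations = [deviation ** 2 for deviation in deviations]
sum_of_squares = sum(squared_deviations)

# Variance as described in the text: sum of squares divided by n.
variance = sum_of_squares / n

print(f"Sum of deviation scores: {sum(deviations):.1f}")
print(f"Sum of squares: {sum_of_squares:.1f}")
print(f"Variance: {variance:.1f}")
```

Dividing by n corresponds to what Python's `statistics.pvariance` computes; `statistics.variance` divides by n - 1 instead and would give a slightly larger number.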

Standard Deviation
If we take the square root of the variance, the resulting number is called the standard deviation (26.6). The standard deviation is a measure of dispersion and gives us a way to describe where any given data value is located with respect to the mean. Using the standard deviation of 26.6 for the tornado data, we can create bounds around the mean that describe data positions that are ±1, ±2, or ±3 standard deviations. Figure 6 shows the standard deviation bounds around the mean of the tornado data. For example, if we add one standard deviation to and subtract one standard deviation from the mean, we arrive at 47.7 and -5.5, respectively. From Figure 6, we can see that most of the data (45 of the 51 values) fall within ±1 standard deviation of the mean, which suggests that the data are concentrated about the mean. Notice that as the number of standard deviations increases, fewer data values are found. In fact, only six data values are found beyond ±1 standard deviation from the mean. It is interesting to note that one data value is beyond ±3 standard deviations. When interpreting any standard deviation value, it is important to keep in mind that the greater the value of the standard deviation, the more spread out or dispersed a data set is likely to be.

Figure 6. A dispersion graph showing ±1, ±2, and ±3 standard deviations about the mean for the 2000 tornado data.
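The standard deviation bounds can be computed and checked against the data with a brief Python sketch. The variable names, including `outside_counts`, are assumptions made for this illustration, and the variance is divided by n as in the text.

```python
import math

# Tornado counts per State for 2000 (Table 1), sorted in ascending order.
TORNADOES = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4,
    5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60,
    77, 147,
]

n = len(TORNADOES)
mean = sum(TORNADOES) / n
variance = sum((count - mean) ** 2 for count in TORNADOES) / n
standard_deviation = math.sqrt(variance)

# For each multiple k, count the values lying outside mean +/- k standard deviations.
outside_counts = {}
for k in (1, 2, 3):
    lower = mean - k * standard_deviation
    upper = mean + k * standard_deviation
    outside = [count for count in TORNADOES if count < lower or count > upper]
    outside_counts[k] = len(outside)
    print(f"+/-{k} SD: {lower:.1f} to {upper:.1f}, values outside: {outside}")

print(f"Standard deviation: {standard_deviation:.1f}")
```

Because no State can have fewer than zero tornadoes, the negative lower bounds never exclude anything here; every value counted as outside lies above the upper bound.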

Interquartile Range
Another measure of dispersion is known as the interquartile range. To calculate the interquartile range, we first need to be familiar with the concept of a quartile. Quartiles are the values that divide an ordered data set into four equally-sized groups. This article labels each quartile by the percentage of data falling below it (strictly speaking, these labels are percentiles). You are already familiar with the 50th quartile, which is the median value and divides the data into two equal halves. The 25th quartile has 25% of the data falling below it and the 75th quartile has 75% of the data falling below it. The interquartile range describes the middle one-half (or 50%) of an ordered data set, so it represents the range between the data value of the 25th quartile and the data value of the 75th quartile.

In calculating the interquartile range, the first step is to compute the 25th and 75th quartiles and then find the difference between these two quartile values. It is important to realize that when computing a quartile, like the median, the calculation results in a data position in a rank-ordered data set and is not the data value itself.

The rank position of a quartile is found using the following equation:

(Qp/100) · (n+1)
where Qp is the quartile expressed as a percentage (25 or 75) and n is the number of data values.

For example, using the tornado data, the 25th quartile position is 0.25(51+1) = 13 and the 75th quartile position is 0.75(51+1) = 39. Returning to our ranked data in Table 2, we find that the 13th data position is Nevada (2 tornadoes) and the 39th position is North Dakota (28 tornadoes). Having located the 25th and 75th quartiles, we can now compute the interquartile range, which is simply the difference between the 75th and 25th quartiles. For the tornado data, this difference is (28-2) = 26. Figure 7 illustrates the bounds of the interquartile range for the tornado data.

Figure 7. The interquartile range for the 2000 tornado data.
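The quartile position equation and the interquartile range can be sketched in Python as follows. The function name `value_at_quartile` is hypothetical. The article does not say what to do when the computed position is not a whole number, so this sketch simply raises an error in that case rather than guessing at an interpolation rule.

```python
# Tornado counts per State for 2000, ranked in ascending order (Table 2).
TORNADOES = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4,
    5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60,
    77, 147,
]


def value_at_quartile(ranked, quartile_percent):
    """Return (rank position, data value) using position = (Qp/100) * (n + 1)."""
    position = quartile_percent * (len(ranked) + 1) / 100
    if not position.is_integer():
        # Assumption: fractional positions are outside the scope of this sketch.
        raise ValueError(f"rank position {position} is not a whole number")
    rank = int(position)
    return rank, ranked[rank - 1]  # ranks start at 1, list indexes at 0


rank_25, value_25 = value_at_quartile(TORNADOES, 25)
rank_75, value_75 = value_at_quartile(TORNADOES, 75)
interquartile_range = value_75 - value_25

print(f"25th quartile: rank {rank_25}, value {value_25}")
print(f"75th quartile: rank {rank_75}, value {value_75}")
print(f"Interquartile range: {interquartile_range}")
```

Statistical software often uses slightly different position rules and interpolates between neighbouring values, so quartiles reported by other tools may not match these exactly.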

A useful illustration of many of the concepts we have discussed in this section is shown in Figure 8, which is a box-and-whisker plot. The green-shaded box represents the interquartile range bounded by the data values that correspond to the 25th and 75th quartiles. Fifty percent of the data values fall within this box, and its length represents the interquartile range. The white line running through the green box is the median. The whiskers are the largest and smallest data values that are not outliers, where an outlier can be considered an atypical data value. Data values that are between 1.5 and 3 interquartile ranges below the 25th quartile or above the 75th quartile are considered outliers and are represented with an open circle. Data values that are more than 3 interquartile ranges below the 25th quartile or above the 75th quartile are called extreme values and are represented with an asterisk.

Using the box-and-whisker plot, you can see the position of the central tendency with respect to the interquartile range. In our case, the median is positioned toward the lower end of the data, which suggests that the data is positively skewed. You can also see the length of the interquartile range compared to the entire data set, and identify atypical data values and the degree to which those values are atypical. The numbers on top of the circle and asterisk indicate the rank of the value, and allow you to locate the specific data value in Table 2.

Figure 8. A box-and-whisker plot of the 2000 tornado data set.
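The outlier rules used by the box-and-whisker plot can be applied directly to the data. The Python sketch below takes the 25th and 75th quartile values (2 and 28) from the text. How a value lying exactly on a boundary is classified is an assumption made here, and the names `outliers`, `extreme_values`, and `whiskers` are illustrative.

```python
# Tornado counts per State for 2000, ranked in ascending order (Table 2).
TORNADOES = [
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4,
    5, 5, 5, 5, 8, 9, 10, 11, 13, 13, 18, 18, 20, 23, 23, 25,
    27, 27, 28, 28, 28, 32, 37, 43, 44, 44, 45, 55, 59, 60, 60,
    77, 147,
]

QUARTILE_25 = 2   # value at rank 13, from the text
QUARTILE_75 = 28  # value at rank 39, from the text
iqr = QUARTILE_75 - QUARTILE_25

# Distances are measured outward from the edges of the box.
inner_low, inner_high = QUARTILE_25 - 1.5 * iqr, QUARTILE_75 + 1.5 * iqr
outer_low, outer_high = QUARTILE_25 - 3 * iqr, QUARTILE_75 + 3 * iqr

outliers, extreme_values, typical = [], [], []
for rank, count in enumerate(TORNADOES, start=1):
    if count < outer_low or count > outer_high:
        extreme_values.append((rank, count))  # drawn as an asterisk
    elif count < inner_low or count > inner_high:
        outliers.append((rank, count))        # drawn as an open circle
    else:
        typical.append(count)

# Whiskers end at the smallest and largest values that are not outliers.
whiskers = (min(typical), max(typical))

print(f"Outliers (rank, value): {outliers}")
print(f"Extreme values (rank, value): {extreme_values}")
print(f"Whiskers: {whiskers}")
```

Under these rules Florida (rank 50) is an outlier and Texas (rank 51) is an extreme value, which matches the single circle and single asterisk described for Figure 8.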

Conclusion