Measures of Spread (Part 2)

(This page is Part 2. Click here for Part 1.)

Continuation of Discussion of Deviation from the Mean

We don't usually care whether data elements lie above or below the mean; we're more interested simply in the distances from the mean.

A reasonable idea is to sum the absolute values of the deviations from the mean, $\,|x_i - \bar{x}|\,.$ However, the absolute value function is not particularly easy to work with mathematically.

Instead, we get a good measure of spread by summing the squares of the deviations from the mean, $\,(x_i -\bar{x})^2\,.$ There's just one little problem to resolve first.
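
To make the comparison concrete, here is a minimal Python sketch (the variable names are only illustrative) that computes the deviations from the mean for the data set $\,\{-1,0,1,2,3\}\,$ and then compares the raw sum, the sum of absolute values, and the sum of squares. The raw deviations cancel out, which is why they can't be summed directly.

```python
# Illustrative data set (it also appears in the examples later in this lesson).
data = [-1, 0, 1, 2, 3]

mean = sum(data) / len(data)                 # 1.0
deviations = [x - mean for x in data]        # [-2.0, -1.0, 0.0, 1.0, 2.0]

raw_sum = sum(deviations)                    # positives and negatives cancel
abs_sum = sum(abs(d) for d in deviations)    # sum of absolute deviations
sq_sum = sum(d ** 2 for d in deviations)     # sum of squared deviations

print(raw_sum, abs_sum, sq_sum)              # 0.0 6.0 10.0
```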

DEFINITIONS population, sample
The entire collection of individuals or objects about which information is desired is called the population.

A nonempty proper subset (choosing some, but not all) of the population is called a sample.

The formulas for the population mean and the sample mean are identical: add up the numbers, and divide by how many there are.

The population mean is denoted by $\,\mu\,$ and a sample mean is denoted by $\,\bar{x}\,.$ In general, population statistics are reported using Greek letters, like $\,\mu\,$ (mu) and $\,\sigma\,$ (sigma), whereas sample statistics are reported using Roman letters, like $\,\bar{x}\,$ and $\,s\,.$

The common formulas for measures of spread are slightly different, depending upon whether you're looking at the entire population, or just a sample from this population, as shown next:

DEFINITIONS population variance, population standard deviation

Suppose a population has $\,N\,$ data values with mean $\,\mu\,.$

The variance of the population, denoted by $\,\sigma^2\,,$ is given by the formula $$ \cssId{s35}{\sigma^2 = \frac{\sum\ (x-\mu)^2}{N}}\ , $$ where the sum is over all data values $\,x\,$ in the population.

The standard deviation of the population, denoted by $\,\sigma\,,$ is the square root of the variance: $$ \cssId{s40}{\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum\ (x-\mu)^2}{N}}} $$

Thus, to find the variance of a population, you sum the squared deviations from the mean, and then divide by the number of data values.
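
The recipe above can be sketched in a few lines of Python. The function names `population_variance` and `population_std_dev` are made up for this illustration; they simply follow the formulas for $\,\sigma^2\,$ and $\,\sigma\,$ and assume the list holds every value in the population.

```python
import math

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    N = len(values)
    mu = sum(values) / N
    return sum((x - mu) ** 2 for x in values) / N

def population_std_dev(values):
    """Square root of the population variance."""
    return math.sqrt(population_variance(values))

population = [-1, 0, 1, 2, 3]
print(population_variance(population))   # 2.0
print(population_std_dev(population))    # 1.4142135623730951
```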

DEFINITIONS sample variance, sample standard deviation

Suppose a sample has $\,n\,$ data values with sample mean $\,\bar{x}\,.$

The sample variance, denoted by $\,s^2\,,$ is given by the formula $$ \cssId{s51}{s^2 = \frac{\sum\ (x-\bar{x})^2}{n-1}}\ , $$ where the sum is over all data values $\,x\,$ in the sample.

The sample standard deviation, denoted by $\,s\,,$ is the square root of the sample variance: $$ \cssId{s56}{s} \cssId{s57}{= \sqrt{s^2}} \cssId{s58}{= \sqrt{\frac{\sum\ (x-\bar{x})^2}{n-1}}} $$
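
Here is a minimal Python sketch of the sample formulas. The function names `sample_variance` and `sample_std_dev` are hypothetical, chosen for this illustration. The only change from the population version is the divisor, $\,n-1\,,$ which also means the sample must contain at least two values.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two data values")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_std_dev(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

sample = [-1, -1, 1, 3, 3]
print(sample_variance(sample))   # 4.0
print(sample_std_dev(sample))    # 2.0
```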

Observe the difference between population variance and sample variance: for the sample variance, you divide by one less than the number of data values, instead of the actual number of data values. Why is this?

Here's one way to understand why: if you randomly choose a sample from a population, what's the likelihood that you'll choose both the greatest and the least values, to represent the true variability in the data set? Not much!

A sample tends to underestimate the true variability in a population.

To compensate, we divide by $\,n-1\,$ instead of $\,n\,$; dividing by a smaller number adjusts the result so it's a bit larger. (The precise reason that you divide by $\,n-1\,$ is explored in a college-level statistics course.)
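
A small experiment can make this plausible. The sketch below is an illustration under a simplifying assumption, not a proof: it treats $\,\{-1,0,1,2,3\}\,$ as the whole population and lists every ordered sample of size $\,2\,$ drawn with replacement (so a value may repeat, unlike the 'proper subset' description above). Averaged over all these samples, dividing by $\,n\,$ comes out too small, while dividing by $\,n-1\,$ matches the population variance.

```python
from itertools import product

def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by the given divisor."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / divisor

population = [-1, 0, 1, 2, 3]
sigma_squared = variance(population, len(population))

n = 2
# Assumption: samples are drawn with replacement, giving 25 equally likely samples.
samples = list(product(population, repeat=n))

avg_divide_by_n = sum(variance(s, n) for s in samples) / len(samples)
avg_divide_by_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(sigma_squared)             # 2.0
print(avg_divide_by_n)           # 1.0
print(avg_divide_by_n_minus_1)   # 2.0
```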

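Standard deviation has a couple of advantages over variance: it has the same units as the original data (variance has squared units), which makes it easier to interpret as a typical distance from the mean.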

To conclude, let's return to the three simple data sets presented at the start of this lesson, and compute their measures of spread.

When computing sample variance and sample standard deviation, assume that the given data is part of some (unknown) larger population.

Examples

Question:  Consider the data set $\,\{1,1,1,1,1\}\,.$ $$ \begin{gather} \cssId{s82}{\text{mean }} \cssId{s83}{= \frac{1+1+1+1+1}5} \cssId{s84}{= \color{red}{1}}\cr\cr \cssId{s85}{\text{range }} \cssId{s86}{= 1 - 1} \cssId{s87}{= 0}\cr\cr \end{gather} $$ $$ \begin{align} &\cssId{s88}{\text{population variance }} \cr &\quad \cssId{s89}{= \sigma^2} \cr &\quad \cssId{s90}{= {\textstyle\frac{(1-\color{red}{1})^2 + (1-\color{red}{1})^2 + (1-\color{red}{1})^2 + (1-\color{red}{1})^2 + (1-\color{red}{1})^2}5}}\cr &\quad \cssId{s91}{= 0}\cr\cr &\cssId{s92}{\text{population standard deviation }} \cr &\quad \cssId{s93}{= \sigma}\cr &\quad \cssId{s94}{= \sqrt{0}}\cr &\quad \cssId{s95}{= 0}\cr\cr \end{align} $$ $$ \begin{align} &\cssId{s96}{\text{sample variance }}\cr &\quad \cssId{s97}{= s^2}\cr &\quad \cssId{s98}{= {\textstyle\frac{(1-\color{red}{1})^2 + (1-\color{red}{1})^2 + (1-\color{red}{1})^2 + (1-\color{red}{1})^2 + (1-\color{red}{1})^2}4}}\cr &\quad \cssId{s99}{= 0}\cr\cr &\cssId{s100}{\text{sample standard deviation }}\cr &\quad \cssId{s101}{= s}\cr &\quad \cssId{s102}{= \sqrt{0}}\cr &\quad \cssId{s103}{= 0} \end{align} $$

Notice that the variance, standard deviation, and range all reflect the fact that there is absolutely no variability about the mean in this data set.

Question:  Consider the data set $\,\{-1,0,1,2,3\}\,.$ $$ \begin{gather} \cssId{s107}{\text{mean }} \cssId{s108}{= \frac{-1+0+1+2+3}5} \cssId{s109}{= \frac{5}{5}} \cssId{s110}{= \color{red}{1}}\cr\cr \cssId{s111}{\text{range }} \cssId{s112}{= 3 - (-1)} \cssId{s113}{= 4}\cr\cr \end{gather} $$ $$ \begin{align} &\cssId{s114}{\text{population variance }}\cr &\quad \cssId{s115}{= \sigma^2}\cr &\quad \cssId{s116}{= {\textstyle\frac{(-1-\color{red}{1})^2 + (0-\color{red}{1})^2 + (1-\color{red}{1})^2 + (2-\color{red}{1})^2 + (3-\color{red}{1})^2}5}}\cr &\quad \cssId{s117}{= \frac{10}{5}}\cr &\quad \cssId{s118}{= 2}\cr\cr &\cssId{s119}{\text{population standard deviation }} \cr &\quad \cssId{s120}{= \sigma}\cr &\quad \cssId{s121}{= \sqrt{2}}\cr &\quad \cssId{s122}{\approx 1.4}\cr\cr \end{align} $$ $$ \begin{align} &\cssId{s123}{\text{sample variance }}\cr &\quad \cssId{s124}{= s^2}\cr &\quad \cssId{s125}{= {\textstyle\frac{(-1-\color{red}{1})^2 + (0-\color{red}{1})^2 + (1-\color{red}{1})^2 + (2-\color{red}{1})^2 + (3-\color{red}{1})^2}4}}\cr &\quad \cssId{s126}{= \frac{10}{4}}\cr &\quad \cssId{s127}{= 2.5}\cr\cr &\cssId{s128}{\text{sample standard deviation }}\cr &\quad \cssId{s129}{= s}\cr &\quad \cssId{s130}{= \sqrt{2.5}}\cr &\quad \cssId{s131}{\approx 1.6} \end{align} $$
Question:  Consider the data set $\,\{-1,-1,1,3,3\}\,.$ $$ \begin{gather} \cssId{s134}{\text{mean }} \cssId{s135}{= \frac{-1+(-1)+1+3+3}5} \cssId{s136}{= \frac{5}{5}} \cssId{s137}{= \color{red}{1}}\cr\cr \cssId{s138}{\text{range }} \cssId{s139}{= 3 - (-1)} \cssId{s140}{= 4}\cr\cr \end{gather} $$ $$ \begin{align} &\cssId{s141}{\text{population variance }}\cr &\quad \cssId{s142}{= \sigma^2}\cr &\quad \cssId{s143}{= {\textstyle\frac{(-1-\color{red}{1})^2 + (-1-\color{red}{1})^2 + (1-\color{red}{1})^2 + (3-\color{red}{1})^2 + (3-\color{red}{1})^2}5}}\cr &\quad \cssId{s144}{= \frac{16}{5}}\cr &\quad \cssId{s145}{= 3.2}\cr\cr &\cssId{s146}{\text{population standard deviation }}\cr &\quad \cssId{s147}{= \sigma}\cr &\quad \cssId{s148}{= \sqrt{3.2}}\cr &\quad \cssId{s149}{\approx 1.8}\cr\cr \end{align} $$ $$ \begin{align} &\cssId{s150}{\text{sample variance }} \cr &\quad \cssId{s151}{= s^2}\cr &\quad \cssId{s152}{= {\textstyle\frac{(-1-\color{red}{1})^2 + (-1-\color{red}{1})^2 + (1-\color{red}{1})^2 + (3-\color{red}{1})^2 + (3-\color{red}{1})^2}4}} \cr &\quad \cssId{s153}{= \frac{16}{4}}\cr &\quad \cssId{s154}{= 4}\cr\cr &\cssId{s155}{\text{sample standard deviation }}\cr &\quad \cssId{s156}{= s} \cr &\quad \cssId{s157}{= \sqrt{4}}\cr &\quad \cssId{s158}{= 2} \end{align} $$
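
If you'd like to check the three examples by computer, one option is Python's standard `statistics` module, which provides `pvariance` and `pstdev` (population) alongside `variance` and `stdev` (sample). The sketch below prints each data set followed by those four measures, with the standard deviations rounded to one decimal place as in the worked examples.

```python
import statistics

data_sets = [
    [1, 1, 1, 1, 1],
    [-1, 0, 1, 2, 3],
    [-1, -1, 1, 3, 3],
]

for data in data_sets:
    print(
        data,
        statistics.pvariance(data),           # population variance (divide by N)
        round(statistics.pstdev(data), 1),    # population standard deviation
        statistics.variance(data),            # sample variance (divide by n - 1)
        round(statistics.stdev(data), 1),     # sample standard deviation
    )
```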

Jump up to wolframalpha.com and type in, say:

{-1,-1,1,3,3}

Voila! Instant statistics! How easy is that?
