Continuation of
Discussion of Deviation from the Mean
We don't usually care whether data
elements lie above or below the mean;
we're more interested simply
in the distances from the mean.
A reasonable idea is to sum the
absolute values of the deviations from the mean,
$\,|x_i - \bar{x}|\,.$
However, the absolute value
function is not particularly easy
to work with mathematically.
Instead, we get a good measure of spread
by summing the squares of the deviations from
the mean,
$\,(x_i -\bar{x})^2\,.$
There's just one little problem to resolve first.
DEFINITIONS: population, sample
The entire collection of individuals or objects
about which information is desired
is called the population.
A nonempty proper subset
(choosing some, but not all) of the population
is called a sample.
The formulas for the population mean
and the sample mean are identical:
add up the numbers, and divide
by how many there are.
The population mean is denoted by $\,\mu\,$ and a sample mean is denoted by $\,\bar{x}\,.$
In general, population statistics
are reported using Greek letters, like
$\,\mu\,$ (mu) and $\,\sigma\,$ (sigma).
However, sample statistics
are reported using Roman letters,
like $\,\bar{x}\,$ and $\,s\,.$
The common formulas for measures
of spread are slightly different,
depending upon whether you're looking
at the entire population,
or just a sample from this population, as shown next:
DEFINITIONS: population variance, population standard deviation
Suppose a population has $\,N\,$ data values with mean $\,\mu\,.$
The variance of the population, denoted by $\,\sigma^2\,,$ is given by the formula
$$
\cssId{s35}{\sigma^2 = \frac{\sum\ (x-\mu)^2}{N}}\ ,
$$
where the sum is over all data values
$\,x\,$ in the population.
The standard deviation
of the population,
denoted by $\,\sigma\,,$ is the square root of the variance:
$$
\cssId{s40}{\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum\ (x-\mu)^2}{N}}}
$$
Thus, to find the variance
of a population,
you sum the squared deviations
from the mean,
and then divide by the
number of data values.
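The recipe above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lesson: the data list and the helper name `population_variance` are made up for the example, and the list is assumed to be the entire population.

```python
import math
import statistics

def population_variance(values):
    """Sum the squared deviations from the mean, then divide by N."""
    N = len(values)
    mu = sum(values) / N                      # population mean
    return sum((x - mu) ** 2 for x in values) / N

# Hypothetical data, treated here as the whole population
data = [2, 4, 4, 4, 5, 5, 7, 9]

sigma_sq = population_variance(data)          # variance
sigma = math.sqrt(sigma_sq)                   # standard deviation

print(sigma_sq)  # 4.0
print(sigma)     # 2.0

# The standard library's statistics.pvariance divides by N as well
print(statistics.pvariance(data) == sigma_sq)  # True
```

Here the mean is $\,5\,,$ the squared deviations sum to $\,32\,,$ and dividing by $\,N = 8\,$ gives a variance of $\,4\,.$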
DEFINITIONS: sample variance, sample standard deviation
Suppose a sample has $\,n\,$ data values with sample mean $\,\bar{x}\,.$
The sample variance, denoted by $\,s^2\,,$ is given by the formula
$$
\cssId{s51}{s^2 =
\frac{\sum\ (x-\bar{x})^2}{n-1}}\ ,
$$
where the sum is over all
data values $\,x\,$ in the sample.
The sample standard deviation, denoted by $\,s\,,$
is the square root of the
sample variance:
$$
\cssId{s56}{s}
\cssId{s57}{= \sqrt{s^2}}
\cssId{s58}{= \sqrt{\frac{\sum\ (x-\bar{x})^2}{n-1}}}
$$
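As a rough sketch, the sample formulas might be coded in Python as follows. The data list and the helper name `sample_variance` are hypothetical; the only assumption is that the list is a sample drawn from some larger population, so the divisor is $\,n-1\,.$

```python
import math
import statistics

def sample_variance(values):
    """Sum the squared deviations from the sample mean, then divide by n - 1."""
    n = len(values)
    xbar = sum(values) / n                    # sample mean
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

# Hypothetical data, treated here as a sample from a larger population
sample = [2, 4, 4, 4, 5, 5, 7, 9]

s_sq = sample_variance(sample)                # 32/7, about 4.571
s = math.sqrt(s_sq)                           # sample standard deviation

# statistics.variance and statistics.stdev also divide by n - 1
print(math.isclose(s_sq, statistics.variance(sample)))  # True
print(math.isclose(s, statistics.stdev(sample)))        # True
```

The squared deviations sum to $\,32\,$ and there are $\,n = 8\,$ values, so the sample variance is $\,32/7\,.$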
Observe the difference between
population variance and sample variance:
for the sample variance,
you divide by one less than the number of data values,
instead of the actual number of data values.
Why is this?
Here's one way to understand why:
if you randomly choose a sample from
a population,
what's the likelihood that you'll
choose both the greatest and the least values,
to represent the true variability
in the data set?
Not much!
A sample tends to underestimate
the true variability in a population.
To compensate, we divide by $\,n-1\,$
instead of
$\,n\,$;
dividing by a smaller number
adjusts the result so it's a bit larger.
(The precise reason that you
divide by $\,n-1\,$ is explored
in a college-level statistics course.)
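The underestimate can be seen directly on a tiny example. The sketch below is only an illustration under simplifying assumptions: the three-value population is made up, and samples of size $\,2\,$ are drawn with replacement (so a value may repeat, which is looser than the 'some, but not all' description of a sample above). It lists every possible sample, computes the spread with each divisor, and averages the results using exact fractions.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population, small enough to list every possible sample
population = [2, 4, 6]
N = len(population)
mu = Fraction(sum(population), N)
pop_var = sum((x - mu) ** 2 for x in population) / N

def spread(sample, divisor):
    """Sum of squared deviations from the sample mean, over the given divisor."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample) / divisor

n = 2
# Assumption: sampling WITH replacement, giving 9 equally likely ordered samples
samples = list(product(population, repeat=n))

avg_div_n = sum(spread(smp, n) for smp in samples) / len(samples)
avg_div_n_minus_1 = sum(spread(smp, n - 1) for smp in samples) / len(samples)

print(pop_var)            # 8/3
print(avg_div_n)          # 4/3
print(avg_div_n_minus_1)  # 8/3
```

In this setup, dividing by $\,n\,$ gives an average that is only half the population variance, while dividing by $\,n-1\,$ averages out to the population variance exactly.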
Standard deviation has a couple of
advantages over variance:
it has the same units as the data values,
and it may be informally
interpreted as the size of a ‘typical’
deviation from the mean.
To conclude, let's return to the three simple
data sets presented at the start of this lesson,
and compute their measures of spread.
When computing sample variance
and sample standard deviation,
assume that the given data
is part of some (unknown) larger population.