## Saturday, September 14, 2013

### BASICS OF STATISTICS for ADVANCED ALGEBRA STUDENTS

Some of the concepts in your first Statistics test are already part of your mathematical skill set.  You already know, for example, how to calculate average or mean, median, and mode.  These are measures of central tendency.  They tell typical values like (respectively) the score that everyone would have gotten on that test if all points earned had been evenly distributed among the students, the middle of the scores where half of the students got a higher grade and half a lower one, or the most frequent score received.
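As a quick refresher, here is a minimal sketch in Python of those three measures. The scores are made up for illustration, and the functions come from Python's standard `statistics` module.

```python
from statistics import mean, median, mode

# Hypothetical test scores, invented for this example.
scores = [70, 85, 85, 90, 100]

average = mean(scores)       # total points shared evenly among the students
middle = median(scores)      # middle score once the list is sorted
most_common = mode(scores)   # the score that shows up most often

print(average, middle, most_common)
```

For this small list, the mean is 86 while the median and mode are both 85.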

You probably understand the concept of range - the spread between two values, the difference between the highest score on a test and the lowest.  You may even have drawn box and whiskers diagrams and frequency graphs as long ago as elementary school.

In higher level Algebra courses, we add a few details to make statistical measures and displays even more useful.

Some useful SYMBOLS

$\mu$ (Pronounced "mew")  This Greek letter stands for the POPULATION MEAN.

$\bar{x}$ (Called "ex bar")  This symbol represents the SAMPLE MEAN.

$\sigma$ (Small sigma from the Greek alphabet)  Sigma is the symbol for POPULATION STANDARD DEVIATION.

$s$  The small ess is used to signify the SAMPLE STANDARD DEVIATION.

Using the MEDIAN

Median is used to construct a BOX and WHISKERS PLOT.

The middlemost element in a collection tells us where 50% of the subjects or items are higher and 50% lower.  The middle element (median) of the lower half and the middle element (median) of the upper half are the first and third quartiles (Q1 and Q3).

Quartiles divide the data into quarters, so at Q1, 1/4 of the scores are lower, while at Q3, 1/4 are higher.  The difference between Q3 and Q1 (that is, Q3 minus Q1) is the interquartile range, IQR.  It describes the spread of the middle 50% of items in the group and is less influenced by extreme values than some other measures.  In fact, we use the IQR to determine whether values are too far from the central tendency to be worthy of consideration.
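The median-of-halves idea can be sketched in a few lines of Python. The attendance counts below are hypothetical, and the helper name `quartiles` is my own. Textbooks differ on whether the middle element belongs to either half when the count is odd; this sketch leaves it out of both and assumes at least two values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median of each half."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]     # the lower half of the sorted values
    upper = data[-half:]    # the upper half (middle value left out when the count is odd)
    return median(lower), median(data), median(upper)

# Hypothetical attendance at 15 games, invented for this example.
attendance = [80, 90, 95, 100, 110, 120, 130, 150,
              160, 170, 180, 200, 210, 220, 700]

q1, med, q3 = quartiles(attendance)
iqr = q3 - q1
print(q1, med, q3, iqr)
```

With these counts, Q1 is 100, the median is 150, Q3 is 200, and the IQR is 100.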

Say we want to know how many students are likely to attend the basketball game this weekend.  We’ve kept count of the number of attendees at the past 15 games and the interquartile range is 100.  But last month, there was one game that was attended by 700 fans!  That statistic is so high, it could throw off measures of central tendency like the mean, so how do we tell for sure if we should count it as the highest value element or something else?

Identifying OUTLIERS

Using the IQR, we can calculate the limits beyond which certain values are too extreme to be of importance.  We call these outliers.

The limit of “reasonable” element values is 1.5 times the IQR either above Q3 or below Q1.

In our basketball game example, the interquartile range is 100.  If we say that Q3 is a count of 200 (an arbitrary figure for demonstration's sake), the reasonable upper limit would be 1.5 × 100, or 150, above Q3: 350 fans.  That 700 game was definitely an outlier.
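Here is the same arithmetic as a small Python sketch. The function name `outlier_fences` is hypothetical, and the Q1 value of 100 is an assumption that follows from a Q3 of 200 and an IQR of 100.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) limits: 1.5 times the IQR beyond each quartile."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Assumed quartiles: Q3 = 200 and IQR = 100, so Q1 = 100.
low, high = outlier_fences(100, 200)

# The 700-fan game lies above the upper limit, so it counts as an outlier.
is_outlier = 700 < low or 700 > high
print(low, high, is_outlier)
```

The lower limit comes out negative here, which just means no attendance count can be an outlier on the low side.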

On the box and whiskers plot, outliers are marked individually, perhaps with an X, but OUTSIDE the whiskers.

The measures forming the box plot (excluding any outliers) are sometimes called the “5 Number Summary”: the minimum, Q1, the median, Q3, and the maximum.
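Putting the pieces together, a box plot's numbers could be gathered with a sketch like the one below. The function name and the attendance data are hypothetical, and the quartiles use the median-of-halves convention (middle value left out of both halves, at least two values assumed).

```python
from statistics import median

def five_number_summary(values):
    """Return ((min, Q1, median, Q3, max), outliers), with outliers set aside."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[-half:])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Values inside the fences set the whisker ends; the rest are outliers.
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return (inside[0], q1, median(data), q3, inside[-1]), outliers

# Hypothetical attendance at 15 games, invented for this example.
attendance = [80, 90, 95, 100, 110, 120, 130, 150,
              160, 170, 180, 200, 210, 220, 700]

summary, outliers = five_number_summary(attendance)
print(summary, outliers)
```

Here the whiskers would run from 80 to 220, the box from 100 to 200 with the median line at 150, and the 700 game would get its own X.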

Using the MEAN

We’ve been calculating the arithmetic mean (average) since grammar school.  Almost every 5th grader will tell you that to figure out the average, “add ’um all up and divide by the number of ‘um.”  The use of the mean in statistics is yet another example of my guiding principle: you learned everything there is to know in math before you entered middle school.  From then on, we just use those things in new ways.

So let’s use the mean to figure out how things are distributed around that central tendency.  Let’s first calculate the VARIANCE by finding the distance of each value from the mean, $x_i - \bar{x}$...

then squaring it to remove the implication of more than or less than the average...

$$(x_i - \bar{x})^2$$

Now add up the squares and divide by the number of items, and we have an average of the squares...

$$\text{variance} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

In these equations, I've used the symbol for the SAMPLE MEAN.  To calculate the POPULATION VARIANCE, use mu in its place.  Some texts distinguish the Sample Variance from the Population Variance by using n-1 as the denominator in the former.  I've never made that distinction, but check with your text to see what your class requires.
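The variance steps can be followed one at a time in a short Python sketch. The scores are invented for illustration, and both denominators are shown so you can match whichever one your text uses.

```python
# Hypothetical test scores, invented for this example.
scores = [70, 85, 85, 90, 100]
n = len(scores)

x_bar = sum(scores) / n                     # the mean: 86.0
deviations = [x - x_bar for x in scores]    # distance of each score from the mean
squares = [d ** 2 for d in deviations]      # squaring removes the plus/minus signs

variance_n = sum(squares) / n               # divide by the number of items
variance_n_minus_1 = sum(squares) / (n - 1) # the n-1 version some texts use for samples

print(variance_n, variance_n_minus_1)
```

The squared deviations add up to 470, so the two versions come out to 94.0 and 117.5.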

Either variance equation gives us a measure of how spread out the items are, but squaring the differences has produced a number that isn’t the same “weight” as the individual items.  To find the STANDARD DEVIATION, we need to undo those squares by taking the square root...

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$

Use this equation for standard deviation when you are able to survey every member of the population.  If a smaller sample is used, errors could affect the standard deviation.  Experience has suggested a modest alteration in the equation in order to gain a more reliable result:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Use this equation for standard deviation when you only tested a sample of the entire population.  This minor alteration in the denominator was suggested after statisticians found that it reduces some of the inaccuracy created when a small sample is used to extrapolate to a full population.
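To see how much the denominator matters, here is a sketch that computes both standard deviations by hand and checks them against `pstdev` (population) and `stdev` (sample) from Python's standard `statistics` module. The scores are made up for illustration.

```python
from math import isclose, sqrt
from statistics import pstdev, stdev

# Hypothetical test scores, invented for this example.
scores = [70, 85, 85, 90, 100]
n = len(scores)

mean = sum(scores) / n
sum_of_squares = sum((x - mean) ** 2 for x in scores)

sigma = sqrt(sum_of_squares / n)       # population version: divide by n
s = sqrt(sum_of_squares / (n - 1))     # sample version: divide by n - 1

# The library functions should agree with the hand calculations.
print(isclose(sigma, pstdev(scores)), isclose(s, stdev(scores)))
print(round(sigma, 2), round(s, 2))
```

The sample version always comes out a little larger, and the gap shrinks as the number of items grows.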

Using Standard Deviation with a Normal Bell-shaped Curve

Given a normal curve (that bell-shaped graph that puts the mean in the middle along with median and mode and tapers off at both ends), we can predict that roughly 68% of the values will fall within one standard deviation of the mean, 95% within two standard deviations, and almost every value (99.7%) within 3 standard deviations.

Are you wondering where the percentages came from?  So did I.  It seems that statisticians for many, many years have collected and analyzed data and actually “discovered” that for things that approach a normal curve (like heights of adult males) the 68-95-99.7 standards are the areas under the curve within 1, 2, and 3 standard deviations of the mean.  I’m just taking their word for it because I am not willing to do as much arithmetic as it would take to verify the data.  I’m accepting the percents as a postulate.
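If you would rather let a computer do that arithmetic, the area under a normal curve within k standard deviations of the mean can be computed with the error function, `erf`, from Python's `math` module. This is a minimal sketch; the helper name `share_within` is my own.

```python
from math import erf, sqrt

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, share_within(k))
```

The three results are approximately 0.683, 0.954, and 0.997, which is where the rounded percentages come from.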

Using a Z score

When I want to compare two different surveys, like a student’s results on two separate Math tests, it would be handy to have a single measure of comparison.  Here’s where the Z-score comes in handy.  It tells how each score deviates from the mean of that particular test.

A Z-score of -1 is one standard deviation below the mean, while +2 means the grade was 2 standard deviations above.  To calculate the Z-score, find the difference between the score and the mean, and divide by the standard deviation...

$$z = \frac{x - \mu}{\sigma}$$
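Here is a small Python sketch of that comparison. The two tests, their means, and their standard deviations are invented for illustration, and `z_score` is a hypothetical helper name.

```python
def z_score(score, mean, std_dev):
    """How many standard deviations a score sits above (+) or below (-) the mean."""
    return (score - mean) / std_dev

# Hypothetical results for one student on two different Math tests.
z_test_a = z_score(82, 70, 8)    # scored 82 where the class mean was 70
z_test_b = z_score(90, 85, 10)   # scored 90 where the class mean was 85

print(z_test_a, z_test_b)
```

The Z-scores are 1.5 and 0.5, so even though 90 is the higher grade, the 82 was the stronger result relative to the rest of the class.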

These 5 equations are probably more than will be covered in the first Statistics test.  Much of what comes later in the course also relies on equations, so I suggest that you start a note card now, before too much information gets out of control.

Coming up, we’ll look at correlation coefficient, confidence intervals, and probability indexes.  Go figure!