Published on *Course-Notes.Org* (http://www.course-notes.org)

Statisticians use various kinds of measures computed from the collected data as an initial step toward drawing inferences about the population from which the observations were taken. Some measures reflect, in a sense, the center or middle point of a set of data; others measure the variability of the data. These measures can apply either to the population as a whole or to a sample taken from the population.

Subject:

Statistics [1]


Chebyshev's Theorem

The Russian mathematician P. L. Chebyshev (1821-1894) discovered that the fraction of observations falling between two values equidistant from the mean is related to the variance of the population. Chebyshev's Theorem gives a conservative lower bound for this fraction.

For any population or sample, at least (1 - (1 / k^{2})) of the observations in the data set fall within k standard deviations of the mean, where k ≥ 1.

Using the concept of z scores, we can restate Chebyshev's Theorem: for any population or sample, the proportion of all observations whose z score has an absolute value less than or equal to k is no less than (1 - (1 / k^{2})). For k = 1, the theorem states that the fraction of all observations having a z score between -1 and 1 is at least 1 - (1 / 1^{2}) = 0; of course, this is not a very helpful statement. But for k > 1, Chebyshev's Theorem provides a lower bound on the proportion of measurements that fall within a certain number of standard deviations of the mean. This lower bound can be very helpful when the distribution of a particular population is unknown or mathematically intractable.

EX. Chebyshev's Theorem can be applied for the following values of k:

k = 1.5: at least 1 - (1 / 1.5^{2}) = 0.5556 of all observations fall within 1.5σ of μ.

k = 2.0: at least 1 - (1 / 2.0^{2}) = 0.7500 of all observations fall within 2.0σ of μ.

k = 2.5: at least 1 - (1 / 2.5^{2}) = 0.8400 of all observations fall within 2.5σ of μ.

k = 3.0: at least 1 - (1 / 3.0^{2}) = 0.8889 of all observations fall within 3.0σ of μ.
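As a quick check, these bounds can be computed and compared against the actual proportions for the ten-value sample data set used in the worked examples in these notes. A minimal sketch in Python; the sample mean and standard deviation stand in for μ and σ:

```python
# Chebyshev lower bounds 1 - 1/k^2, checked against the ten-value sample
# data set used in the worked examples in these notes.
data = [1.2, 1.5, 1.9, 2.4, 2.4, 2.5, 2.6, 3.0, 3.5, 3.8]

n = len(data)
mean = sum(data) / n
# Sample standard deviation (denominator n - 1).
s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5

for k in [1.5, 2.0, 2.5, 3.0]:
    bound = 1 - 1 / k ** 2                  # Chebyshev's conservative bound
    observed = sum(abs(x - mean) <= k * s for x in data) / n
    print(f"k = {k}: bound = {bound:.4f}, observed = {observed:.4f}")
    assert observed >= bound                # the bound always holds
```

Note how conservative the bound is: for this fairly symmetric data set, the observed proportions are well above the guaranteed minimums.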


Measures of Variation

Statistical measures of variation are numerical values that indicate the variability inherent in a set of data measurements. The most common measures of variation are the range, variance and standard deviation.

Range

The range of a set of observations is the absolute value of the difference between the largest and smallest values in the set. It measures the length of the smallest contiguous interval of real numbers that contains all the data values.

EX. Given the following sorted data:

1.2, 1.5, 1.9, 2.4, 2.4, 2.5, 2.6, 3.0, 3.5, 3.8

The range of this set of data is 3.8 - 1.2 = 2.6.
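In code, the range is simply the maximum minus the minimum (a minimal sketch):

```python
# Range of a data set: largest value minus smallest value.
data = [1.2, 1.5, 1.9, 2.4, 2.4, 2.5, 2.6, 3.0, 3.5, 3.8]
data_range = max(data) - min(data)
print(round(data_range, 1))  # 2.6
```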


Variance and Standard Deviation

The variance of a set of data is a measure of spread based on the squared differences of the data values from the mean.

The population and sample variance are calculated as follows:

Given the set of data values x_{1}, x_{2}, ..., x_{N} from a finite population of size N, the population variance σ^{2} is calculated as

σ^{2} = (1 / N) ( (x_{1} - μ)^{2} + (x_{2} - μ)^{2} + ... + (x_{N} - μ)^{2} )

Given the set of data values x_{1}, x_{2}, ..., x_{n} from a sample of size n, the sample variance s^{2} is calculated as

s^{2} = (1 / (n - 1)) ( (x_{1} - x̄)^{2} + (x_{2} - x̄)^{2} + ... + (x_{n} - x̄)^{2} )

Note that the population variance is simply the arithmetic mean of the squared differences between each data value in the population and the population mean. The formula for the sample variance is similar, except that the denominator in the fraction is (n - 1) instead of n. With this denominator, the sample variance is an unbiased estimator of the variance of the population to which the sample belongs.

The standard deviation of a set of data is the positive square root of the variance.

EX. Given the following sorted data:

1.2, 1.5, 1.9, 2.4, 2.4, 2.5, 2.6, 3.0, 3.5, 3.8

x̄ = 2.48 as computed earlier

s^{2} = ( 1 / (10 - 1) ) × ( (1.2 - 2.48)^{2} + (1.5 - 2.48)^{2} + (1.9 - 2.48)^{2} + (2.4 - 2.48)^{2} + (2.4 - 2.48)^{2} + (2.5 - 2.48)^{2} + (2.6 - 2.48)^{2} + (3.0 - 2.48)^{2} + (3.5 - 2.48)^{2} + (3.8 - 2.48)^{2} )

= (1 / 9) × (1.6384 + 0.9604 + 0.3364 + 0.0064 + 0.0064 + 0.0004 + 0.0144 + 0.2704 + 1.0404 + 1.7424)

= 0.6684

s = ( 0.6684 )^{1/2} = 0.8176
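The computation above can be reproduced directly from the definitional formula (a minimal sketch in Python):

```python
# Sample variance (denominator n - 1) and standard deviation for the
# sorted data set from the worked example.
data = [1.2, 1.5, 1.9, 2.4, 2.4, 2.5, 2.6, 3.0, 3.5, 3.8]
n = len(data)
xbar = sum(data) / n                                # sample mean
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)   # definitional formula
s = s2 ** 0.5                                       # standard deviation
print(round(s2, 4), round(s, 4))  # 0.6684 0.8176
```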

The sample variance can also be calculated using the computational ("shortcut") formula:

s^{2} = ( 1 / (n(n - 1)) ) × ( n Σx_{i}^{2} - (Σx_{i})^{2} )

EX. Given the above data, we can calculate s^{2} using this formula:

Σx_{i}^{2} = 1.44 + 2.25 + 3.61 + 5.76 + 5.76 + 6.25 + 6.76 + 9.00 + 12.25 + 14.44

= 67.52

s^{2} = ( 1 / (10 × 9) ) × ( 10 × 67.52 - (24.8)^{2} )

= 0.6684
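The shortcut form can be verified against the definitional form (a sketch; both should agree up to floating-point error):

```python
# Computational ("shortcut") formula for the sample variance:
# s^2 = (n * sum(x_i^2) - (sum(x_i))^2) / (n * (n - 1))
data = [1.2, 1.5, 1.9, 2.4, 2.4, 2.5, 2.6, 3.0, 3.5, 3.8]
n = len(data)
sum_x = sum(data)                     # = 24.8
sum_x2 = sum(x * x for x in data)     # = 67.52
s2_shortcut = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

# Definitional formula, for comparison.
xbar = sum_x / n
s2_def = sum((x - xbar) ** 2 for x in data) / (n - 1)

print(round(s2_shortcut, 4))  # 0.6684
assert abs(s2_shortcut - s2_def) < 1e-9  # both formulas agree
```

The shortcut form avoids a second pass over the data, at the cost of some numerical cancellation for large values; for hand computation it is often more convenient.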


Parameters and Statistics

A parameter is a numerical quantity that describes some characteristic of a population. Parameters are often estimated, since their values are generally unknown, especially when the population is so large that it is impossible or impractical to obtain measurements for all observations. Parameters are normally represented by Greek letters. The most common parameters are the population mean and variance, represented by the Greek letters μ and σ^{2}, respectively.

A statistic is a quantitative value that is calculated from the observations in a sample. Statistics are usually represented by lowercase English letters, sometimes with additional symbols. The sample mean and variance, two of the most common statistics derived from samples, are denoted by the symbols x̄ and s^{2}, respectively.


Measures of Central Tendency

Statistical measures of central tendency or central location are numerical values that indicate the central point, or region of greatest frequency, of a set of data. The most common measures of central location are the mean, median and mode.

Mean

The statistical mean of a set of observations is the average of the measurements in the set. The population mean and sample mean are defined as follows:

Given the set of data values x_{1}, x_{2}, ..., x_{N} from a finite population of size N, the population mean μ is calculated as

μ = (1 / N) (x_{1} + x_{2} + ... + x_{N})

Given the set of data values x_{1}, x_{2}, ..., x_{n} from a sample of size n, the sample mean x̄ is calculated as

x̄ = (1 / n) (x_{1} + x_{2} + ... + x_{n})

The sample mean is often used as an estimator of the mean of the population from which the sample was taken. In fact, the sample mean is an unbiased estimator of the population mean.


Median

The median of a set of observations is that value that, when the observations are arranged in an ascending or descending order, satisfies the following condition:

- If the number of observations is odd, the median is the middle value.
- If the number of observations is even, the median is the average of the two middle values.

The median is the same as the 50th percentile of a set of data. It is often denoted by x̃.

Mode

The mode of a set of observations is the specific value that occurs with the greatest frequency. There may be more than one mode in a set of observations, if there are several values that all occur with the greatest frequency. A mode may also not exist; this is true if all the observations occur with the same frequency.

Another measure of central location that is occasionally used is the midrange. It is computed as the average of the smallest and largest values in a set of data.

Example of Central Tendency

EX. Given the following set of data

1.2, 1.5, 2.6, 3.8, 2.4, 1.9, 3.5, 2.5, 2.4, 3.0

It can be sorted in ascending order:

1.2, 1.5, 1.9, 2.4, 2.4, 2.5, 2.6, 3.0, 3.5, 3.8

The mean, median and mode are computed as follows:

x̄ = (1 / 10) · (1.2 + 1.5 + 2.6 + 3.8 + 2.4 + 1.9 + 3.5 + 2.5 + 2.4 + 3.0)

= 2.48

x̃ = (2.4 + 2.5) / 2

= 2.45

The mode is 2.4, since it is the only value that occurs twice.

The midrange is (1.2 + 3.8) / 2 = 2.5.

Note that the mean, median and mode of this set of data are very close to each other. This suggests that the data are fairly symmetrically distributed.
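The measures in this example can be reproduced with Python's standard library (a sketch; note that `statistics.mode` raises an error on ties in older Python versions, and returns the first mode encountered in Python 3.8+):

```python
import statistics

# Mean, median, mode and midrange of the example data set.
data = [1.2, 1.5, 2.6, 3.8, 2.4, 1.9, 3.5, 2.5, 2.4, 3.0]

mean = sum(data) / len(data)
median = statistics.median(data)        # average of 5th and 6th sorted values
mode = statistics.mode(data)            # most frequent value
midrange = (min(data) + max(data)) / 2  # average of the extremes

print(round(mean, 2), round(median, 2), mode, round(midrange, 1))
# 2.48 2.45 2.4 2.5
```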


Symmetry and Skewness

A set of observations is symmetrically distributed if its graphical representation (histogram, bar chart) is symmetric with respect to a vertical axis passing through the mean. For a symmetrically distributed population or sample, the mean, median and mode have the same value. Half of all measurements are greater than the mean, while half are less than the mean.

A set of observations that is not symmetrically distributed is said to be skewed. It is positively skewed if a greater proportion of the observations fall at or below the mean than at or above it; this indicates that the mean is larger than the median. The histogram of a positively skewed distribution generally has a long right tail; thus, such a distribution is also described as skewed to the right.

On the other hand, a negatively skewed distribution has more observations that are greater than or equal to the mean. Such a distribution has a mean that is less than the median. The histogram of a negatively skewed distribution generally has a long left tail; thus, the phrase skewed to the left is applied here.

The Pearson coefficient of skewness provides a numerical measure of the skewness of a distribution. Denoted by SK, it is calculated as follows:

SK = 3 (μ - μ̃) / σ for a population

SK = 3 (x̄ - x̃) / s for a sample

where μ̃ and x̃ denote the population and sample medians.

For a perfectly symmetric distribution, the mean and median will have the same value, and SK will have the value of 0. A distribution that is skewed to the right will have a mean that is larger than the median, and thus SK will have a positive value; thus, the distribution is also known as being positively skewed. A distribution that is skewed to the left will have a mean that is less than the median, and so SK will have a negative value; thus, the phrase "negatively skewed". In general, the values of SK will vary between -3 and 3.

EX. Given the following sorted data:

1.2, 1.5, 1.9, 2.4, 2.4, 2.5, 2.6, 3.0, 3.5, 3.8

x̄ = 2.48 as computed earlier

x̃ = 2.45 as computed earlier

s = 0.8176

SK = 3 (2.48 - 2.45) / 0.8176

= 0.1101
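The SK computation can be scripted directly from the data (a sketch; the sample mean, median and standard deviation are computed rather than taken from the rounded values above):

```python
import statistics

# Pearson coefficient of skewness: SK = 3 * (mean - median) / s.
data = [1.2, 1.5, 1.9, 2.4, 2.4, 2.5, 2.6, 3.0, 3.5, 3.8]
n = len(data)
mean = sum(data) / n
median = statistics.median(data)
s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5

sk = 3 * (mean - median) / s
print(round(sk, 4))  # 0.1101
```

The small positive value confirms the observation above: the data are nearly symmetric, with a very slight skew to the right.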


Z Scores

The z score of an observation x_{i}, taken from a population with mean μ and standard deviation σ, is denoted by z and is calculated as follows:

z = (x_{i} - μ) / σ

The z score measures the number of standard deviations by which an observation lies above or below the mean. Since σ is always positive, a positive z score indicates that the observation is above the mean, and a negative z score indicates that it is below the mean. Note that z is a dimensionless value, and is therefore a useful measure for comparing data values from two different populations (which may be measured in different units).
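A minimal sketch, using the sample mean and standard deviation from the earlier examples in place of μ and σ:

```python
# z scores for each observation: z = (x - mean) / s.
data = [1.2, 1.5, 1.9, 2.4, 2.4, 2.5, 2.6, 3.0, 3.5, 3.8]
n = len(data)
mean = sum(data) / n
s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5

z = [(x - mean) / s for x in data]
for x, zi in zip(data, z):
    print(f"x = {x}: z = {zi:+.2f}")

# z scores sum to (essentially) zero: deviations from the mean cancel out.
assert abs(sum(z)) < 1e-9
```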


**Links:**

[1] http://www.course-notes.org/Subject/Math/Statistics
