 Statistical
Inference and Hypothesis Testing
I. Statistical Inference: predicting population values,
or PARAMETERS, from sample data:
 Sampling theory
 If N cases are randomly
drawn from some unknown population, one can estimate
characteristics of that population with stated levels
of accuracy.
 First step is to compute the
sample statistic. Examples:
 Mean
 Standard
deviation
 Correlation between two
variables in the sample
 Regression
coefficient
 Intercept
 Second step is to determine its
sampling distribution.
 Conceive of drawing an
infinite number of samples of size N, for each of
which you compute the sample statistic.
 Due to sampling variation,
these statistics would not all have the same values
but would distribute symmetrically and usually
"normally" around some "expected" value: the
mean of the distribution of sample
values.
 The exact shape of the
sampling distribution depends on
 the sample statistic
involved (mean, correlation, etc.)
 the number of cases in
the sample (N)
 the variation among the
values in the population
 Third step is to compute the
sampling distribution's standard error.
 The dispersion of values
around the sampling mean can be measured by the
standard deviation, which becomes known as
the standard error of the sampling
distribution.
 For all statistics of
interest to us, formulas exist for calculating the
standard deviations of hypothetical sampling
distributions, which means that we can calculate
their standard errors.
 For example, the central
limit theorem states:
 If repeated samples of N
observations are drawn from a population with mean
µ and variance σ², then as N grows large, the
sample means will become normally distributed with
mean µ, variance σ²/N, and standard deviation
σ/√N.
 Thus, the formula for the
standard error of the mean is σ/√N.
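The central limit theorem result can be checked with a short simulation. This is only a sketch: the uniform population on [0, 10] and the sample sizes are arbitrary choices, not from the notes.

```python
# Sketch: verify empirically that the standard deviation of sample means
# (the standard error) approaches sigma / sqrt(N), as the CLT predicts.
import math
import random

random.seed(42)

N = 50               # cases per sample
num_samples = 5000   # number of repeated samples drawn

# Hypothetical uniform population on [0, 10]: mu = 5, sigma^2 = 100/12.
sigma = math.sqrt(100 / 12)

# Draw repeated samples of size N and record each sample mean.
means = [sum(random.uniform(0, 10) for _ in range(N)) / N
         for _ in range(num_samples)]

grand_mean = sum(means) / num_samples
observed_se = math.sqrt(sum((m - grand_mean) ** 2 for m in means)
                        / num_samples)
predicted_se = sigma / math.sqrt(N)

# The two values should be close.
print(round(observed_se, 3), round(predicted_se, 3))
```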

 Knowledge of the standard error
of a statistic allows us to place bounds on the
estimates of population parameters from sample data.
 Point v. interval
estimates of population parameters:
 Point estimates are simply the
best guess of the parameter, and they are based
on the observed sample statistic itself.
 If the sample statistic is
unbiased, the mean of the sampling
distribution equals the population parameter, and
thus the observed statistic is the best estimate of
the population parameter. Example: the
mean.
 Bias in a sample
statistic exists when the mean of the sampling
distribution does not equal the population
parameter. Example: the standard
deviation, which tends to underestimate the
population standard deviation and thus must be
corrected by using N - 1 in the
denominator.
 Corrections for bias are
routinely incorporated into the statistical
formulas that you encounter.
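The bias of the uncorrected variance, and the effect of the N - 1 correction, can be seen in a quick simulation. A sketch under assumed conditions: a standard normal population (true variance 1.0) and small samples of 5 cases.

```python
# Sketch: averaging over many samples, dividing by N underestimates the
# population variance; dividing by N - 1 removes the bias.
import random

random.seed(1)
N = 5
trials = 50000
# Standard normal population: true variance = 1.0.

biased_avg, corrected_avg = 0.0, 0.0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(N)]
    mean = sum(sample) / N
    ss = sum((x - mean) ** 2 for x in sample)
    biased_avg += ss / N          # divides by N: biased low
    corrected_avg += ss / (N - 1) # divides by N - 1: unbiased

biased_avg /= trials
corrected_avg /= trials
# Roughly 0.8 and 1.0: the uncorrected average falls short of 1.0.
print(round(biased_avg, 3), round(corrected_avg, 3))
```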
 Interval estimates state the
likely range of values in the population at a
chosen level of confidence  e.g., .95,
.99.
 A typical example of an
interval estimate in political research is
predicting the likely proportion of the vote for a
candidate. E.g., 95% sure that Reagan will win
52% of the vote, plus or minus 3 percentage points,
which yields a 95% confidence interval of 49% to
55% of the vote.
 Factors that determine the
width of the confidence interval:
 The chosen level of
confidence: 1 - alpha
 The standard error
of the observed sample statistic. In the case
of predicting the population proportion (a
special case of predicting the mean), the
s.e. depends on
 The sample
size, N
 The variation
in the population
 Sample size and
population variance are in the formula
for the standard error of the mean.
 The proportion
that the sample is of the population is not a
major factor in the accuracy of a
sample.
 Assuming sampling
without replacement, the complete
formula for the standard error of the sample
mean includes a "correction factor" to
adjust for sample size relative to the size
of the population.
 The correction
factor is √(1 - p), where
 p = the
proportion that the sample is of the
population
 the correction
factor is always less than
1.
 So multiplying
by the correction factor reduces
the standard error.
 But it is not
important unless the sample approaches
20% of the population, when the
"correction factor" begins to reduce
the s.e. substantially.
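The vote-share example with the correction factor applied can be sketched as follows. All numbers are hypothetical, and a small population is assumed so that the correction actually matters.

```python
# Sketch (hypothetical numbers): 95% confidence interval for a vote
# proportion, applying the finite-population correction sqrt(1 - p_frac),
# where p_frac is the proportion the sample is of the population.
import math

n = 1000          # sample size
pop_size = 4000   # assumed small population (sample is 25% of it)
phat = 0.52       # sample proportion favoring the candidate

se = math.sqrt(phat * (1 - phat) / n)  # s.e. of a proportion
p_frac = n / pop_size                  # sampling fraction
fpc = math.sqrt(1 - p_frac)            # correction factor, always < 1
se_corrected = se * fpc

low = phat - 1.96 * se_corrected
high = phat + 1.96 * se_corrected
print(round(low, 3), round(high, 3))  # 0.493 0.547
```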
 A "Rule of Thumb" for
interval estimates:
 Most sample statistics
distribute normally, or approximately so, and
standard errors can thus be interpreted as
z-scores in a table of areas under the normal
curve.
 Plus or minus one s.e.
would embrace about 68% of the occurrences in
the hypothetical sampling
distribution.
 Plus or minus two s.e.
would embrace about 95% of the sampling
distribution.
 Thus, doubling the
standard error on either side of the mean
approximates the 95% confidence interval for
estimating the population mean from sample
data.
 This rule applies as well
to other sample statistics, e.g., estimating
the confidence interval of a b-coefficient in a
regression equation from knowledge of the s.e.
of b.
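The 68% and 95% figures behind the rule of thumb can be checked directly against the normal curve, using the error function from Python's standard library instead of a printed table.

```python
# Sketch: area under the standard normal curve within +/- z standard
# errors of the mean, via the error function.
import math

def normal_area(z):
    """Area under the standard normal curve between -z and +z."""
    return math.erf(z / math.sqrt(2))

print(round(normal_area(1), 3))  # 0.683: about 68% within +/- 1 s.e.
print(round(normal_area(2), 3))  # 0.954: about 95% within +/- 2 s.e.
```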
II. Hypothesis testing
 Refer to the distinction between
the research hypothesis and the test, or null,
hypothesis.
 This has been discussed in
several places; see the 2/3
Review for
one.
 Hypothesis testing typically
translates into testing whether the observed value
differs "significantly" from some specified value.
 How big is the
difference in comparison with sampling
fluctuation?
 How does the test statistic
distribute (i.e., what's its standard
error)?
 How "significant" a difference
will you accept (i.e., your alpha
value)?
 Are you making a
one-tailed or a two-tailed test?
 General procedure in testing
significance of a statistic:
 Look at the value that you
observe in the sample.
 Subtract the value that you
expected to find.
 Compute a test statistic
(e.g., z-score or t-value) according to the
appropriate formula, usually dividing by the standard
error of the statistic.
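The three steps above can be sketched as a z-test for a mean. The numbers are hypothetical, and the population standard deviation is assumed known so that z (rather than t) applies.

```python
# Sketch of the procedure: observed value, minus expected value under
# the null hypothesis, divided by the standard error.
import math

observed_mean = 103.0  # value observed in the sample
expected_mean = 100.0  # value expected under the null hypothesis
sigma = 15.0           # assumed known population standard deviation
N = 100                # sample size

se = sigma / math.sqrt(N)                 # standard error of the mean
z = (observed_mean - expected_mean) / se  # test statistic
print(z)  # 2.0
```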
 Illustrative standard errors:
 Mean:
s.e. = σ/√N (estimated by s/√N)
 Proportion:
s.e. = √(p(1 - p)/N)
 Difference between
means:
s.e. = √(s₁²/N₁ + s₂²/N₂)
 Correlation
coefficient: typically, another approach is
used to test r. Where rho is assumed to be 0.0
(i.e., tested against the null hypothesis of no
correlation in the population), the t-test is
used:
 t = r√(N - 2) / √(1 - r²),
with N - 2 degrees of freedom
 b-coefficient (our
texts did not discuss the s.e. of b, but SPSS 10
calculated it).
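The t-test for a correlation coefficient against rho = 0 can be sketched directly from its formula; the values of r and N here are hypothetical.

```python
# Sketch: t statistic for testing r against the null hypothesis rho = 0,
# t = r * sqrt(N - 2) / sqrt(1 - r^2), with N - 2 degrees of freedom.
import math

def t_for_r(r, N):
    return r * math.sqrt(N - 2) / math.sqrt(1 - r ** 2)

# Hypothetical example: r = 0.5 in a sample of N = 27 cases (df = 25).
t = t_for_r(0.5, 27)
print(round(t, 2))  # 2.89
```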

 Enter the appropriate table
of the distribution of the test statistic to
determine its likelihood of occurrence (its
significance).
 Distribution of test
statistics:
 Normal
distribution
 t-distribution (df based
on N - 1 for means, or N - 2 for r)
 F-distribution (df based
on k - 1 and N - k)
 chi-square distribution
(df based on table size)
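The table-lookup step can be done in code for the normal case; this sketch converts a z test statistic into a two-tailed p-value with the standard library (for t, F, and chi-square, a statistics library would be used instead).

```python
# Sketch: two-tailed p-value from a z test statistic, i.e., the area in
# both tails beyond |z| under the standard normal curve.
import math

def two_tailed_p(z):
    upper_tail = 1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * upper_tail

print(round(two_tailed_p(1.96), 3))  # 0.05: the familiar alpha cutoff
```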
