Bootstrap estimate of Standard Error
The name for this idea comes from the idiom "to pull oneself up by one's
bootstraps," which connotes getting out of a hole without anything to stand on.
The idea of the bootstrap is to assume, for the purposes of estimating uncertainties,
that the sample is the population, then use the SE for sampling from the
sample to estimate the SE of sampling from the population.
For sampling from a box of numbers,
the SD of the sample is the bootstrap estimate of the SD of the box from which the
sample is drawn.
For sample percentages, this takes a particularly
simple form:
the SE of the sample percentage
of n
draws from a box, with replacement, is
SD(box)/n^{½},
where for a box that contains only zeros and ones, SD(box) = ((fraction
of ones in box)×(fraction of zeros in box))^{½}.
The bootstrap estimate
of the SE of the sample percentage
consists of estimating SD(box) by ((fraction of ones in sample)×(fraction
of zeros in sample))^{½}.
When the sample size is large, this approximation is
likely to be good.
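In code, the bootstrap estimate for a sample percentage amounts to plugging the sample fractions into the SE formula. A minimal sketch with hypothetical counts:

```python
import math

# Hypothetical sample of n = 400 zero-one responses, 240 of which are ones
n = 400
ones = 240
frac_ones = ones / n          # fraction of ones in the sample: 0.6
frac_zeros = 1 - frac_ones    # fraction of zeros in the sample: 0.4

# Bootstrap estimate of SD(box): ((fraction of ones)×(fraction of zeros))^(1/2)
sd_box_estimate = math.sqrt(frac_ones * frac_zeros)

# Bootstrap estimate of the SE of the sample percentage: SD(box)/n^(1/2)
se_estimate = sd_box_estimate / math.sqrt(n)
print(round(se_estimate, 4))  # 0.0245, i.e., about 2.45 percentage points
```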
Catchment Area (Health)
A geographic area defined and served by a health program or institution.
Causality
The relating of causes to the effects they produce. Causes are termed necessary
when they must always precede an effect and sufficient when they initiate or
produce an effect. Any of several factors may be associated with the potential
disease causation or outcome, including predisposing factors, enabling factors,
precipitating factors, reinforcing factors, and risk factors.
Cause of Death
Factors which produce cessation of all vital bodily functions. They can be
analyzed from an epidemiologic viewpoint.
Censuses
Enumerations of populations usually recording identities of all persons in every
place of residence with age or date of birth, sex, occupation, national origin,
language, marital status, income, relation to head of household, information on
the dwelling place, education, literacy, health-related data (e.g., permanent
disability), etc.
Chi-Square Distribution
A distribution in which a variable is distributed like the sum of the
squares of independent random variables, each of which has a normal
distribution with mean of zero and variance of one. The chi-square test is a
statistical test based on comparison of a test statistic to a chi-square
distribution. The oldest of these tests are used to detect whether two or more
population distributions differ from one another.
Chi-Square Methods
A group of
qualitative variable techniques whose results
are compared to values found in a theoretical
chi-square distribution table.
Clinical Trials
Preplanned studies of the safety, efficacy, or optimum dosage schedule (if
appropriate) of one or more diagnostic, therapeutic, or prophylactic drugs,
devices, or techniques selected according to predetermined criteria of
eligibility and observed for predefined evidence of favorable and unfavorable
effects. This concept includes clinical trials conducted both in the U.S. and in
other countries.
Clinical Trials, Phase I
Studies performed to evaluate the safety of diagnostic, therapeutic, or
prophylactic drugs, devices, or techniques in healthy subjects and to determine
the safe dosage range (if appropriate). These tests also are used to determine
pharmacologic and pharmacokinetic properties (toxicity, metabolism, absorption,
elimination, and preferred route of administration). They involve a small number
of persons and usually last about 1 year. This concept includes phase I studies
conducted both in the U.S. and in other countries.
Clinical Trials, Phase II
Studies that are usually controlled to assess the effectiveness and dosage (if
appropriate) of diagnostic, therapeutic, or prophylactic drugs, devices, or
techniques. These studies are performed on several hundred volunteers, including
a limited number of patients with the target disease or disorder, and last about
two years. This concept includes phase II studies conducted in both the U.S. and
in other countries.
Clinical Trials, Phase III
Comparative studies to verify the effectiveness of diagnostic, therapeutic, or
prophylactic drugs, devices, or techniques determined in phase II studies.
During these trials, patients are monitored closely by physicians to identify
any adverse reactions from long-term use. These studies are performed on groups
of patients large enough to identify clinically significant responses and
usually last about three years. This concept includes phase III studies
conducted in both the U.S. and in other countries.
Clinical Trials, Phase IV
Planned post-marketing studies of diagnostic, therapeutic, or prophylactic
drugs, devices, or techniques that have been approved for general sale. These
studies are often conducted to obtain additional data about the safety and
efficacy of a product. This concept includes phase IV studies conducted in both
the U.S. and in other countries.
Cochran-Mantel-Haenszel Method
A
chi-square method that permits statistical
comparison of odds ratios across subgroups and
also allows differences in those ratios to be
adjusted.
Controlled Clinical Trials
Clinical trials involving one or more test treatments, at least one control
treatment, specified outcome measures for evaluating the studied intervention,
and a bias-free method for assigning patients to the test treatment. The
treatment may be drugs, devices, or procedures studied for diagnostic,
therapeutic, or prophylactic effectiveness. Control measures include placebos,
active medicines, no-treatment, dosage forms and regimens, historical
comparisons, etc. When randomization using mathematical techniques, such as the
use of a random numbers table, is employed to assign patients to test or control
treatments, the trials are characterized as randomized controlled trials.
However, trials employing treatment allocation methods such as coin flips,
odd-even numbers, patient social security numbers, days of the week, medical
record numbers, or other such pseudo- or quasi-random processes, are simply
designated as controlled clinical trials.
Correlation Coefficient
In linear
regression, a measure of the closeness of data
points to the best-fit line. It can assume a
value between -1 and +1; the nearer the value to
either -1 or +1, the nearer are the points to
the line.
Cox Regression Method
An analytical
method in which event data for each group under
comparison are transformed to fit a linear
model. Models for each group are then compared
to determine whether they are equal. This method
assumes that hazard rates for each group are at
least proportional to each other.
Cluster Analysis
A set of statistical methods used to group variables or observations into
strongly interrelated subgroups. In epidemiology, it may be used to analyze a
closely grouped series of events or cases of disease or other health-related
phenomenon with well-defined distribution patterns in relation to time or place
or both.
Confidence Intervals
A range of values for a variable of interest, e.g., a rate, constructed so that
this range has a specified probability of including the true value of the
variable.
Confounding Factors (Epidemiology)
Factors that can cause or prevent the outcome of interest, are not intermediate
variables, and are not associated with the factor(s) under investigation. They
give rise to situations in which the effects of two processes are not separated,
or the contribution of causal factors cannot be separated, or the measure of the
effect of exposure or risk is distorted because of its association with other
factors influencing the outcome of the study.
Comorbidity
The presence of coexisting or additional diseases with reference to an initial
diagnosis or with reference to the index condition that is the subject of study. Comorbidity may affect the ability of affected individuals to function and also
their survival; it may be used as a prognostic indicator for length of hospital
stay, cost factors, and outcome or survival.
Cross-Sectional Study
In survey research, a study in which data are obtained only once.
Contrast with longitudinal studies in which a panel
of individuals is interviewed repeatedly over a period of time. Note that a
cross sectional study can ask questions about previous periods of time,
though.
Categorical Variable
A variable whose value ranges over categories, such as {red,
green, blue}, {male, female}, {Arizona, California, Montana, New York}, {short, tall},
{Asian, African-American, Caucasian, Hispanic, Native American, Polynesian}, {straight,
curly}, etc. Some categorical variables are ordinal. The
distinction between categorical variables and
qualitative variables
is a bit blurry. C.f. quantitative variable.
Causation, causal relation
Two variables are causally related if changes in the value of one cause the other to
change. For example, if one heats a rigid container filled with a gas, that causes the
pressure of the gas in the container to increase.
Two variables can be associated without
having any causal relation, and even if two
variables have a causal relation, their correlation can be
small or zero.
Central Limit Theorem
The central limit theorem states that the probability
histograms of the sample mean
and sample sum of n draws with replacement
from a box of labeled tickets converge to a
normal curve as the
sample size n grows, in the following sense:
As n grows, the area of the probability histogram for any
range of values approaches the area under the normal curve
for the same range of values, converted to standard units.
See also the normal approximation.
Certain Event
An event is certain if its
probability is 100%.
Even if an event is certain, it might not occur.
However, by the complement rule,
the chance that it does not occur is 0%.
Chance variation, chance error
A random variable can be decomposed into
a sum of its expected value and chance variation around
its expected value. The expected value of the chance variation is zero; the
standard error of the chance variation is the same as the
standard error of the random variable: the size of a
"typical" difference between the random variable
and its expected value.
See also sampling error.
Chebychev's Inequality
For lists: For every number k>0, the fraction of elements in a list that are
k SD's or further from the
arithmetic mean of
the list is at most 1/k^{2}.
For random variables:
For every number k>0, the
probability that a random variable X is k SEs or further from its expected value is at
most 1/k^{2}.
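Both forms of the inequality can be checked numerically; here is a small sketch for the list version, using made-up data:

```python
# Verify Chebychev's inequality for a hypothetical list with an outlier
data = [1, 2, 2, 3, 3, 3, 4, 4, 10]
n = len(data)
mean = sum(data) / n
sd = (sum((x - mean) ** 2 for x in data) / n) ** 0.5

k = 2.0
# Fraction of elements k SDs or further from the mean
frac_beyond = sum(1 for x in data if abs(x - mean) >= k * sd) / n
assert frac_beyond <= 1 / k**2  # Chebychev's bound: at most 25%
print(round(frac_beyond, 3))    # 0.111, well within the bound
```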
Chi-square curve
The chi-square curve is a family of curves that depend on a parameter called
degrees of freedom (d.f.).
The chi-square curve is an approximation to the
probability histogram of the
chi-square statistic
for the multinomial model if the
expected number of outcomes in each category is
large.
The chi-square curve is positive, and its total area is 100%, so we can think of
it as the probability histogram of a random variable.
The balance point of the curve is d.f., so the expected value of the
corresponding random variable would equal d.f.
The standard error of the corresponding random variable would be
(2×d.f.)^{½}.
As d.f. grows, the shape of the chi-square curve approaches the shape of
the normal curve.
Chi-square Statistic
The chi-square statistic is used to measure the agreement between
categorical data and a
multinomial model that predicts
the relative frequency of outcomes in each possible category.
Suppose there are n independent trials,
each of which can result in one of k possible outcomes.
Suppose that in each trial, the probability that outcome
i occurs is p_{i},
for i = 1, 2, . . . , k,
and that these probabilities are the same in every trial.
The expected number of times outcome 1 occurs in the n trials is
n×p_{1}; more generally, the expected number of
times outcome i occurs is
expected_{i} = n×p_{i}. If the model
is correct, we would expect the n trials to result in outcome
i about n×p_{i} times, give or take
a bit.
Let observed_{i} denote the number of times an outcome of type
i
occurs in the n trials, for i = 1, 2,
. . . , k.
The chi-squared statistic summarizes the discrepancies between the
expected number of times each outcome occurs (assuming that the model is true)
and the observed number of times each outcome occurs, by summing
the squares of the discrepancies, normalized by the expected numbers, over all
the categories:
chi-squared =
(observed_{1} - expected_{1})^{2}/expected_{1}
+
(observed_{2} - expected_{2})^{2}/expected_{2}
+
. . .
+
(observed_{k} - expected_{k})^{2}/expected_{k}.
As the sample size n increases, if the model is correct,
the sampling distribution of the chi-squared statistic
is approximated increasingly well by the chi-squared curve with
(#categories - 1) = k - 1
degrees of
freedom (d.f.), in the sense that the chance that the chi-squared statistic
is in any given range grows closer and closer to the area under the chi-squared curve over
the same range.
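The computation above can be sketched directly; the counts below are hypothetical, chosen for a fair-die (multinomial) model:

```python
# Chi-squared statistic for a hypothetical multinomial model: a fair die,
# n = 60 rolls, k = 6 categories, each with probability 1/6
probs = [1 / 6] * 6
observed = [8, 9, 11, 12, 10, 10]  # hypothetical observed counts (sum to 60)
n = sum(observed)

expected = [n * p for p in probs]  # each expected count is 10
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_squared)  # 1.0
# Under the model, this statistic is approximately distributed like the
# chi-squared curve with k - 1 = 5 degrees of freedom.
```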
Class Boundary
A point that is the left endpoint of one class interval,
and the right endpoint of another class interval.
Class Interval
In plotting a histogram, one starts by dividing the range of
values into a set of nonoverlapping intervals, called class intervals, in such a
way that every datum is contained in some class interval.
See the related entries class boundary and
endpoint
convention.
Cluster Sample
In a cluster sample, the sampling unit is a
collection of population units, not single population units.
For example, techniques for adjusting the U.S. census start with a sample of
geographic blocks, then
(try to) enumerate all inhabitants of the blocks in the sample to obtain a sample
of people.
This is an example of a cluster sample.
(The blocks are chosen separately from different strata, so the overall design is a
stratified cluster sample.)
Combinations
The number of combinations of n things taken k at a time is the number
of ways of picking a subset of k of the n things, without replacement,
and without regard to the order in which the elements of the subset are picked.
The number
of such combinations is _{n}C_{k} =
n!/(k!(n-k)!),
where k! (pronounced "k factorial")
is k×(k-1)×(k-2)× · · · × 1.
The numbers _{n}C_{k}
are also called the Binomial coefficients. From a set that has n
elements one can form a total of 2^{n} subsets of all sizes. For example,
from the set {a, b, c}, which has 3 elements, one can form the 2^{3} = 8 subsets
{}, {a}, {b}, {c}, {a,b}, {a,c}, {b,c}, {a,b,c}.
Because the number of subsets with k
elements one can form from a set with n
elements is _{n}C_{k},
and the total number of subsets of a set is the sum of the numbers of possible subsets of
each size, it follows that
_{n}C_{0}+_{n}C_{1}+_{n}C_{2}+
. . . +_{n}C_{n} = 2^{n}.
The calculator
has a button (nCm) that lets you compute the number of combinations of
m things chosen from a set of n things.
To use the button, first
type the value of n, then push the nCm button, then type the value of m,
then press the "=" button.
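In Python, the same quantities are available without a dedicated calculator; a short sketch:

```python
import math

# nCk = n!/(k!(n-k)!): ways to choose k of n things, order ignored
assert math.comb(5, 2) == 10

# The subsets of an n-element set: nC0 + nC1 + ... + nCn = 2^n of them
n = 3
total_subsets = sum(math.comb(n, k) for k in range(n + 1))
print(total_subsets)  # 8, matching the 2**3 subsets of {a, b, c} listed above
```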
Complement
The complement of a subset of a given set is
the collection of all elements of the set that
are not elements of the subset.
Complement rule
The probability of the complement of an event
is 100% minus the probability of the event: P(A^{c}) = 100% - P(A).
Conditional Probability
Suppose we are interested in the probability that some event A
occurs, and we learn that the event B occurred. How should we update
the probability of A to reflect this new knowledge? This is what the conditional
probability does: it says how the additional knowledge that B occurred should affect the
probability that A occurred quantitatively. For example, suppose that A and B are
mutually exclusive. Then if B occurred, A did not, so the
conditional
probability that A occurred given that B occurred is zero. At the other extreme,
suppose that B is a subset of A, so that A must occur whenever B
does. Then if we learn that B occurred, A must have occurred too, so the conditional
probability that A occurred given that B occurred is 100%. For in-between cases,
where A and B intersect, but B is not a subset of A, the conditional
probability of A given B is a number between zero and 100%. Basically, one
"restricts" the outcome space S to
consider only the part of S that is in B, because we know that B
occurred. For A to have happened given that B happened requires that
AB happened, so we are interested in the event
AB. To have a legitimate probability requires that
P(S)
= 100%, so if we are restricting the outcome space to B, we need to divide by the
probability of B to make the probability of this new S be 100%. On this
scale, the probability that AB happened is P(AB)/P(B). This is the definition of the
conditional probability of A given B, provided P(B) is not zero (division by zero is
undefined). Note that the special cases AB = {} (A and B are mutually
exclusive) and AB = B (B is a subset of A) agree with our
intuition as described at the top of this paragraph. Conditional probabilities satisfy the
axioms of probability, just as ordinary probabilities
do.
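The definition P(A|B) = P(AB)/P(B) can be illustrated by counting outcomes; the dice events below are hypothetical:

```python
from fractions import Fraction

# Outcome space S: the 36 equally likely rolls of two dice
S = [(i, j) for i in range(1, 7) for j in range(1, 7)]

A = {(i, j) for (i, j) in S if i + j == 7}  # event: the sum is 7
B = {(i, j) for (i, j) in S if i == 3}      # event: the first die shows 3

def P(E):
    return Fraction(len(E), len(S))

# Conditional probability of A given B: P(AB)/P(B)
p_A_given_B = P(A & B) / P(B)
print(p_A_given_B)  # 1/6
```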
Confidence Interval
A confidence interval for a parameter is a random interval
constructed from data in such a way that the probability that the interval contains the
true value of the parameter can be specified before the data are collected.
Confidence Level
The confidence level of a confidence interval is the
chance that the interval that will result once data are collected will contain the
corresponding parameter. If one computes confidence intervals
again and again from independent data, the longterm limit of the fraction of intervals
that contain the parameter is the confidence level.
Confounding
When the differences between the treatment and
control groups other than the treatment produce differences in
response that are not distinguishable from the effect of the
treatment,
those differences between the groups are said to be confounded with the effect of
the treatment (if any). For example, prominent statisticians questioned whether
differences between individuals that led some to smoke and others not to (rather than the
act of smoking itself) were responsible for the observed difference in the frequencies
with which smokers and nonsmokers contract various illnesses. If that were the case,
those factors would be confounded with the effect of smoking. Confounding is quite likely
to affect observational studies and
experiments
that are not randomized.
Confounding tends to be decreased by randomization.
See also Simpson's Paradox.
Continuity Correction
In using the normal approximation to the
binomial probability histogram,
one can get more accurate answers by finding the area under the normal curve corresponding
to half-integers, transformed to standard units.
This is clearest if we are seeking the chance of a particular number of successes.
For example, suppose we seek to approximate the chance of 10 successes in 25
independent
trials, each with probability p = 40% of success.
The number of successes in this
scenario has a binomial distribution with parameters n =
25 and p = 40%. The expected
number of successes is np
= 10, and the standard error is
(np(1-p))^{½}
= 6^{½} ≈ 2.45. If we consider the area under the
normal
curve at the point 10 successes, transformed to standard
units, we get zero: the area under a point is always zero. We get a better
approximation by considering 10 successes to be the range from 9 1/2 to 10 1/2 successes.
The only possible number of successes between 9 1/2 and 10 1/2 is 10, so this is exactly
right for the binomial distribution. Because the
normal curve is continuous
and a binomial random variable
is discrete, we need to "smear out"
the binomial
probability over an appropriate range. The lower endpoint of the range, 9 1/2 successes,
is (9.5 - 10)/2.45 = -0.20 standard units. The upper
endpoint of the range, 10 1/2 successes, is (10.5 - 10)/2.45 = +0.20
standard units.
The area under the normal
curve between -0.20 and +0.20 is about 15.8%.
The true binomial
probability is
_{25}C_{10}×(0.4)^{10}×(0.6)^{15}
= 16%. In a similar way, if we seek the normal
approximation to the probability that a binomial random variable is in the range from
i successes to k
successes, inclusive, we should find the area under the normal
curve from i-1/2 to k+1/2 successes, transformed to
standard units.
If we seek the probability of more than i
successes and fewer than k successes, we should find the area under
the normal curve corresponding to the range
i+1/2 to k-1/2
successes, transformed to standard units. If we seek the
probability of more than i but no more than k successes, we should find
the area under the normal curve corresponding to
the range i+1/2
to k+1/2 successes, transformed to
standard units.
If we seek the probability of at least i but fewer than k successes, we
should find the area under the normal curve corresponding to
the range i-1/2 to k-1/2 successes, transformed to
standard units.
Including or excluding the half-integer ranges
at the ends of the interval in this manner is called the continuity correction.
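The worked example above (n = 25, p = 40%, exactly 10 successes) can be reproduced numerically; a sketch using only the standard library:

```python
import math

def normal_cdf(z):
    # Area under the standard normal curve to the left of z
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p = 25, 0.4
mu = n * p                       # expected number of successes: 10
se = math.sqrt(n * p * (1 - p))  # standard error: 6**0.5, about 2.45

# Continuity correction: treat "exactly 10 successes" as the range
# from 9.5 to 10.5 successes, transformed to standard units
approx = normal_cdf((10.5 - mu) / se) - normal_cdf((9.5 - mu) / se)

# Exact binomial probability of exactly 10 successes
exact = math.comb(25, 10) * 0.4**10 * 0.6**15

print(round(approx, 3))  # 0.162
print(round(exact, 3))   # 0.161
```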
Continuous Variable
A quantitative variable is continuous if its set of
possible values is uncountable. Examples include temperature, exact height, exact age
(including parts of a second). In practice, one can never measure a continuous variable to
infinite precision, so continuous variables are sometimes approximated by
discrete variables.
A random variable
X is also called continuous if its set of possible values is uncountable, and the
chance that it takes any particular value is zero (in symbols, if P(X = x) = 0
for every real number x). A random variable is continuous if and
only if its cumulative probability distribution function
is a continuous function (a function with no jumps).
Contrapositive
If p and q are two logical propositions,
then the contrapositive of the proposition
(p IMPLIES q)
is the proposition
((NOT q) IMPLIES
(NOT p) ).
The contrapositive is logically equivalent to the original proposition.
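The logical equivalence can be verified by checking all four truth assignments; a tiny sketch:

```python
from itertools import product

def implies(a, b):
    # (a IMPLIES b) is false only when a is true and b is false
    return (not a) or b

# (p IMPLIES q) has the same truth value as ((NOT q) IMPLIES (NOT p))
for p, q in product([True, False], repeat=2):
    assert implies(p, q) == implies(not q, not p)
print("equivalent for all truth assignments")
```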
Control
There are at least three senses of "control" in statistics: a
member of the control group, to whom no treatment is given;
a controlled experiment, and to
control
for a possible confounding variable.
Controlled experiment
An experiment that uses the method of
comparison to evaluate the effect of a
treatment
by comparing treated subjects with a control group, who do not
receive the treatment.
Controlled, randomized experiment
A controlled experiment in which the
assignment of subjects to the treatment
group or control group is done at random, for example,
by tossing a coin.
Control for a variable
To control for a variable is to try to separate its effect from the treatment
effect, so it will not confound with the treatment.
There are many methods that try to control for variables.
Some are based on matching individuals between treatment and control; others
use assumptions about the nature of the effects of the variables to try
to model the effect mathematically, for example, using regression.
Control group
The subjects in a controlled
experiment who do not receive the treatment.
Convenience Sample
A sample drawn because of its convenience; not a
probability
sample. For example, I might take a sample of opinions in Columbus (where I live) by
just asking my 10 nearest neighbors. That would be a sample of convenience, and would be
unlikely to be representative of all of Columbus. Samples of convenience are not typically
representative, and it is not typically possible to quantify how unrepresentative results
based on samples of convenience will be.
Converge, convergence
A sequence of numbers x_{1}, x_{2},
x_{3}
. . . converges if there is a number
x such that for any number
E>0,
there is a number k (which can depend on E) such that
|x_{j} - x| < E whenever j >
k. If such a number x exists, it is called the
limit of the sequence x_{1},
x_{2}, x_{3} . . . .
Convergence in probability
A sequence of random variables
X_{1}, X_{2}, X_{3}
. . . converges in probability if there is a random
variable X such that for any number E>0, the sequence of numbers
P(|X_{1} - X| < E), P(|X_{2} - X| < E),
P(|X_{3} - X| < E),
. . .
converges to 100%.
Converse
If p and q are two logical propositions,
then the converse of the proposition
(p IMPLIES q)
is the proposition (q IMPLIES p).
Correlation
A measure of linear association
between two (ordered) lists.
Two variables can be strongly correlated without having any causal
relationship, and two variables can have a causal
relationship and yet be uncorrelated.
Correlation coefficient
The correlation coefficient r is a measure of how nearly a
scatterplot
falls on a straight line. The correlation coefficient is always between -1 and +1. To
compute the correlation coefficient of a list of pairs of measurements (X,Y),
first transform X and Y individually into
standard
units.
Multiply corresponding elements of the transformed pairs to get a single list
of numbers.
The correlation coefficient is the mean of that list of
products.
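The recipe (transform to standard units, multiply pairwise, average) translates directly; the data below are made up:

```python
import math

# Hypothetical paired measurements (X, Y)
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(X)

def standard_units(values):
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

# r is the mean of the products of corresponding values in standard units
products = [x * y for x, y in zip(standard_units(X), standard_units(Y))]
r = sum(products) / n
print(round(r, 3))  # 0.775
```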
Countable Set
A set is countable if its elements can be put in onetoone correspondence with a subset
of the integers. For example, the sets {0, 1, 7, 3}, {red, green, blue},
{ . . . , -2, -1, 0,
1, 2, . . . }, {straight, curly}, and the set of all fractions,
are countable.
If a set is not countable, it is uncountable.
The set of all real numbers is uncountable.
Cover
A confidence interval is said to cover if
the interval contains the true value of the parameter. Before the
data are collected, the chance that the confidence interval will contain the parameter
value is the coverage probability,
which equals the confidence level.
Once the data are collected and the confidence interval is actually computed,
the interval either covers the parameter or it does not.
Coverage probability
The coverage probability of a procedure for making
confidence intervals is the chance that the
procedure produces an interval that covers the truth.
Critical value
The critical value in an hypothesis test
is the value of the test statistic beyond which we
would reject the null hypothesis.
The critical value is set so that the probability that the
test statistic is beyond the critical value is
at most equal to the significance level if the
null hypothesis is true.
Cross-sectional study
A cross-sectional study compares different individuals to each
other at the same time: it looks at a cross-section of a population. The differences
between those individuals can confound with the effect being
explored. For example, in trying to determine the effect of age on sexual promiscuity, a
cross-sectional study would be likely to confound
the effect of
age with the effect of the mores the subjects were taught as children: the older
individuals were probably raised with a very different attitude towards promiscuity than
the younger subjects.
Thus it would be imprudent to attribute differences in promiscuity
to the aging process. C.f. longitudinal study.
Cumulative Probability Distribution Function (cdf)
The cumulative distribution function of a random variable
is the chance that the random variable is less than or equal to x, as a function
of x. In symbols, if F is the cdf of the
random
variable X, then F(x) = P( X <= x). The cumulative
distribution function must tend to zero as x approaches minus infinity, and must
tend to unity as x approaches infinity.
It is a positive function, and increases monotonically:
if y > x, then
F(y) >= F(x).
The cumulative distribution function completely characterizes the
probability distribution of a
random variable.
Data Collection
Systematic gathering of data for a particular purpose from various sources,
including questionnaires, interviews, observation, existing records, and
electronic devices. The process is usually preliminary to statistical analysis
of the data.
Data Interpretation, Statistical
Application of statistical procedures to analyze specific observed or assumed
facts from a particular study.
Death Certificates
Official records of individual deaths including the cause of death certified by
a physician, and any other required identifying information.
Demography
Statistical interpretation and description of a population with reference to
distribution, composition, or structure.
Density, Density Scale
The vertical axis of a histogram has units of percent per unit of the horizontal axis.
This is called a density scale; it measures how "dense" the observations are in
each bin. See also probability density.
Dental Health Surveys
A systematic collection of factual data pertaining to dental or oral health
and disease in a human population within a given geographic area.
Dependent Events, Dependent
Random Variables
Two events or random variables are
dependent if they are not independent.
Dependent Variable
In regression, the variable whose values are supposed to be
explained by changes in the other variable
(the independent
or explanatory variable). Usually one regresses the
dependent variable on the independent variable.
Deviation
A deviation is the difference between a datum and some reference value, typically the
mean
of the data. In computing the SD, one finds the rms
of the deviations from the mean, the differences between the
individual data and the mean of the data.
Diet Surveys
Systematic collections of factual data pertaining to the diet of a human
population within a given geographic area.
Discrete Variable
A quantitative variable whose set of possible
values is countable. Typical examples of discrete
variables are variables
whose possible values are a subset of the integers, such as Social Security numbers, the
number of people in a family, ages rounded to the nearest year, etc. Discrete
variables are "chunky." C.f. continuous
variable.
A discrete random variable is one whose set of possible
values is countable. A random variable is discrete if and only if
its cumulative probability distribution function is a stair-step
function; i.e., if it is piecewise constant and only increases by jumps.
Discriminant Analysis
A statistical analytic technique used with discrete dependent variables,
concerned with separating sets of observed values and allocating new values. It
is sometimes used instead of regression analysis.
DiseaseFree Survival
Period after successful treatment in which there is no appearance of the
symptoms or effects of the disease.
Disease Notification
Notification or reporting by a physician or other health care provider of the
occurrence of specified contagious diseases such as tuberculosis and HIV
infections to designated public health agencies. The United States system of
reporting notifiable diseases evolved from the Quarantine Act of 1878, which
authorized the US Public Health Service to collect morbidity data on cholera,
smallpox, and yellow fever; each state in the U.S. (as well as the USAF) has its own list of notifiable
diseases and depends largely on reporting by the individual health care
provider.
Disease Outbreaks
Sudden increase in the incidence of a disease. The concept includes
epidemics.
Disease Transmission
The transmission of infectious disease or pathogens. When transmission is
within the same species, the mode can be horizontal or vertical.
Disease Transmission, Horizontal
The transmission of infectious disease or pathogens from one individual to
another in the same generation.
Disease Transmission, PatienttoProfessional
The transmission of infectious disease or pathogens from patients to health
professionals or health care workers. It includes transmission via direct or
indirect exposure to bacterial, fungal, parasitic, or viral agents.
Disease Transmission, ProfessionaltoPatient
The transmission of infectious disease or pathogens from health professional or
health care worker to patients. It includes transmission via direct or indirect
exposure to bacterial, fungal, parasitic, or viral agents.
Disease Transmission, Vertical
The transmission of infectious disease or pathogens from one generation to
another. It includes transmission in utero or intrapartum by exposure to blood
and secretions, and postpartum exposure via breastfeeding.
Disease Vectors
Invertebrates or nonhuman vertebrates which transmit infective organisms from
one host to another.
Disjoint or Mutually Exclusive
Events
Two events are disjoint or mutually exclusive if the occurrence of
one is incompatible with the occurrence of the other; that is, if they can't both happen
at once (if they have no outcome in common). Equivalently, two events
are disjoint if their intersection is the
empty set.
Distribution
The distribution of a set of numerical data is how their values are distributed over the
real numbers. It is completely characterized by the empirical distribution
function. Similarly, the probability distribution of
a random variable is completely characterized by its probability
distribution function. Sometimes the word "distribution" is used as a
synonym for the empirical distribution function or the
probability
distribution function.
Distribution Function, Empirical
The empirical (cumulative) distribution function of a set of numerical data is, for each
real value of x, the fraction of observations that are less than or equal to
x.
A plot of the empirical distribution function is an uneven set of stairs. The width of the
stairs is the spacing between adjacent data; the height of the stairs depends on how many
data have exactly the same value. The distribution function is zero for small enough
(negative) values of x, and is unity for large enough values of x. It
increases monotonically:
if y > x, the empirical distribution function
evaluated at y is at least as large as the empirical distribution function
evaluated at x.
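As a sketch of the definition (the helper name `ecdf` is illustrative, not a standard library function):

```python
def ecdf(data):
    """Return the empirical (cumulative) distribution function of
    `data`: for each x, the fraction of observations <= x."""
    sorted_data = sorted(data)
    n = len(sorted_data)

    def F(x):
        # Step function: jumps at each data value, flat in between.
        return sum(1 for v in sorted_data if v <= x) / n

    return F

F = ecdf([1, 2, 2, 5])
# F(0) = 0, F(2) = 0.75 (three of four observations are <= 2), F(5) = 1
```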
Distribution (or Probability
Distribution)
A mathematical function
characterized by constants, called parameters,
that relate the values that a variable can
assume to the probability that a particular
value will occur.
Double-Blind, Double-Blind Experiment
In a double-blind experiment, neither the subjects nor the people
evaluating the subjects know who is in the treatment group
and who is in the control group.
This mitigates the placebo effect and guards
against conscious and unconscious
prejudice for or against the treatment on the part of the evaluators.
Double-Blind Method
A method of studying a drug or procedure in which both the subjects and
investigators are kept unaware of who is actually getting which specific
treatment.
Ecological Correlation
The correlation between
averages of groups of individuals, instead of individuals.
Ecological correlation can be misleading about the association of individuals.
Effect Modifiers (Epidemiology)
Factors that modify the effect of the putative causal factor(s) under study.
Empirical Law of Averages
The Empirical Law of Averages lies at the base of the
frequency
theory of probability. This law, which is, in fact, an assumption about how the world
works, rather than a mathematical or physical law, states that if one repeats a
random experiment
over and over, independently and under
"identical" conditions, the fraction of trials that result in a given outcome
converges to a limit as the number of trials grows without bound.
Empty Set
The empty set, denoted {} or Ø, is the set that
has no members.
Endpoint Convention
In plotting a histogram, one must decide whether to include a
datum that lies at a class boundary with the class interval
to the left or the right of the boundary. The rule for making this assignment is called an
endpoint convention. The two standard endpoint conventions are (1) to include the
left endpoint of all class intervals and exclude the right, except for the rightmost class
interval, which includes both of its endpoints, and (2) to include the right endpoint of
all class intervals and exclude the left, except for the leftmost interval, which includes
both of its endpoints.
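The first convention can be sketched in Python (the function name and interval edges are hypothetical):

```python
def bin_index(x, edges):
    """Assign x to a class interval under endpoint convention (1):
    each interval [edges[i], edges[i+1]) includes its left endpoint
    and excludes its right, except the rightmost interval, which
    includes both endpoints.  Returns the interval index, or None
    if x lies outside every interval."""
    last = len(edges) - 2      # index of the rightmost interval
    for i in range(last + 1):
        if edges[i] <= x < edges[i + 1]:
            return i
    if x == edges[-1]:         # right endpoint of the rightmost interval
        return last
    return None

# A datum at an interior boundary goes with the interval on its right:
bin_index(10, [0, 10, 20])    # interval 1, i.e. [10, 20]
bin_index(20, [0, 10, 20])    # also interval 1 (rightmost keeps its right end)
```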
Estimator
An estimator is a rule for "guessing" the value of a population
parameter based on a random sample
from the population. An estimator is a random variable,
because its value depends on which particular sample is obtained, which is random.
A canonical example of an estimator is the sample mean,
which is an estimator of the population mean.
Event
An event is a subset of
outcome space.
An event determined by a random variable
is an event of the form A = (X is in A). When the random variable X is observed, that
determines
whether or not A occurs: if the value of X happens to be in A, A occurs; if
not, A does not occur.
Exhaustive
A collection of events {A_{1}, A_{2}, A_{3},
. . . }
is exhaustive if at least one of them must occur; that is, if
S = A_{1} U A_{2}
U A_{3} U . . .
where S is the outcome space.
A collection of subsets exhausts another set if that set is contained in the
union of the collection.
Expectation, Expected Value
The expected value of a random variable is the longterm
limiting average of its values in independent repeated experiments. The expected value of
the random variable X is denoted EX or E(X). For a discrete random variable (one that has
a countable number of possible values) the expected value is the
weighted average of its possible values, where the weight assigned to each possible value
is the chance that the random variable takes that value. One can think of the expected
value of a random variable as the point at which its
probability
histogram would balance, if it were cut out of a uniform material. Taking the expected
value is a linear operation: if X and Y are two random variables,
the expected value of their sum is the sum of their expected values (E(X+Y) = E(X) +
E(Y)), and the expected value of a constant a times a random variable X is the
constant times the expected value of X (E(a×X ) =
a× E(X)).
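For instance, the expected value of one roll of a fair die, computed as a weighted average (a minimal sketch; `expected_value` is not a library function):

```python
from fractions import Fraction

def expected_value(pmf):
    """Expected value of a discrete random variable, given its
    probability mass function as {value: probability} pairs:
    the weighted average of the possible values."""
    return sum(x * p for x, p in pmf.items())

# A fair die: each face 1..6 has chance 1/6.
die = {x: Fraction(1, 6) for x in range(1, 7)}
expected_value(die)            # 7/2

# Linearity: the scaled variable 3*X has expected value 3*E(X).
scaled = {3 * x: p for x, p in die.items()}
expected_value(scaled)         # 21/2
```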
Experiment
What distinguishes an experiment from an observational study is
that in an experiment, the experimenter decides who receives the
treatment.
Explanatory Variable
In regression, the explanatory or independent variable
is the one that is supposed to "explain" the other. For example, in examining
crop yield versus quantity of fertilizer applied, the quantity of fertilizer would be the
explanatory or independent variable, and the crop
yield would be the dependent variable. In
experiments, the explanatory variable is the one that is
manipulated; the one that is observed is the dependent
variable.
Extrapolation
See interpolation.
Factor Analysis, Statistical
A set of statistical methods for analyzing the correlations among several
variables in order to estimate the number of fundamental dimensions that
underlie the observed data and to describe and measure those dimensions. It is
used frequently in the development of scoring systems for rating scales and
questionnaires.
Factorial
For an integer k that is greater than or equal to 1, k! (pronounced
"k factorial") is
k×(k - 1)×(k - 2)×
. . . ×1. By convention, 0! = 1. There are k!
ways of ordering k
distinct objects. For example, 9! is the number of batting orders of 9 baseball players,
and 52! is the number of different ways a standard deck of playing cards
can be ordered.
False Discovery Rate
In testing a collection of hypotheses, the false discovery rate is the fraction of
rejected null hypotheses that are rejected erroneously (the number of Type I errors
divided by the number of rejected null hypotheses), with the convention that if no
hypothesis is rejected, the false discovery rate is zero.
Family Characteristics
Size and composition of the family.
Fatal Outcome
Death resulting from the presence of a disease in an individual, as shown by
a single case report or a limited number of patients. This should be
differentiated from death, the physiological cessation of life and from
mortality, an epidemiological or statistical concept.
Finite Population Correction
When sampling without replacement, as in a simple random
sample, the SE of sample sums and sample means depends on the
fraction of the population that is in the sample: the greater the fraction, the smaller
the SE. Sampling with replacement is like sampling from an infinitely
large population. The adjustment to the SE for sampling without replacement is called the
finite population correction. The SE for sampling without replacement is
smaller than the SE for sampling with replacement by the finite
population correction factor ((N - n)/(N - 1))^{½}.
Note that for sample size n=1,
there is
no difference between sampling with and without replacement; the finite population
correction is then unity. If the sample size is the entire population of N units,
there is no variability in the result of sampling without replacement (every member of the
population is in the sample exactly once), and the SE should be zero.
This is indeed what the finite population correction gives (the numerator vanishes).
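A small numeric illustration of the correction factor (the function name and numbers are illustrative):

```python
from math import sqrt

def fpc(N, n):
    """Finite population correction factor ((N - n)/(N - 1))^(1/2):
    the SE for sampling without replacement is the with-replacement
    SE times this factor."""
    return sqrt((N - n) / (N - 1))

fpc(1000, 1)     # 1.0 -- one draw: with and without replacement agree
fpc(1000, 1000)  # 0.0 -- the whole population: no sampling variability
```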
Fisher's exact test (for the equality of two
percentages)
Consider two populations of zeros and ones.
Let p_{1} be the proportion of ones in the first population,
and let p_{2} be the proportion of ones in the second population.
We would like to test the null hypothesis that
p_{1} = p_{2}
on the basis of a simple random sample
from each population.
Let n_{1} be the size of the sample from population 1, and
let n_{2} be the size of the sample from population 2.
Let G be the total number of ones in both samples.
If the null hypothesis is true, the two samples are like one larger sample from
a single population of zeros and ones.
The allocation of ones between the two samples would be expected
to be proportional to the relative sizes of the samples, but would have
some chance variability.
Conditional on G and the two
sample sizes, under the null hypothesis, the tickets in the first sample are like
a random sample of size n_{1} without replacement from a collection of
N = n_{1} + n_{2} units of
which G are labeled with ones.
Thus, under the null hypothesis, the number of tickets labeled with ones
in the first sample has (conditional on G)
an hypergeometric distribution
with parameters N, G, and n_{1}.
Fisher's exact test uses this distribution to set the ranges of observed values of
the number of ones in the first sample for which we would reject the null hypothesis.
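A sketch of the test using the hypergeometric probabilities just described. The two-sided convention shown (summing the probabilities of all outcomes no more probable than the observed one) is one common choice among several; the function names and numbers are made up:

```python
from math import comb

def hypergeom_pmf(g, N, G, n):
    """Chance of exactly g ones in a simple random sample of size n
    from N units of which G are ones (the null distribution for
    Fisher's exact test, conditional on the margins)."""
    return comb(G, g) * comb(N - G, n - g) / comb(N, n)

def fisher_exact_p(g1, n1, n2, G):
    """Two-sided p-value: sum the probabilities of all possible counts
    whose conditional probability is no larger than that of the
    observed count g1 in the first sample."""
    N = n1 + n2
    lo, hi = max(0, G - n2), min(G, n1)
    p_obs = hypergeom_pmf(g1, N, G, n1)
    return sum(p for g in range(lo, hi + 1)
               if (p := hypergeom_pmf(g, N, G, n1)) <= p_obs + 1e-12)

# Samples of sizes 5 and 5; 2 ones in the first, 3 in the second:
# the observed split is as typical as possible, so the p-value is 1.
fisher_exact_p(2, 5, 5, 5)
```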
Football-Shaped Scatterplot
In a football-shaped scatterplot, most of the points lie within a tilted oval, shaped
more or less like a football. A football-shaped scatterplot is one in which the
data are homoscedastically
scattered about a straight
line.
Frame, sampling frame
A sampling frame is a collection of units from which
a sample will be drawn. Ideally, the frame is identical to the
population we want to learn about; more typically, the frame
is only a subset of the
population of interest. The difference between the
frame and the population can be a source of
bias in sampling design, if the parameter
of interest has a different value for the frame than it does for the
population. For example, one might desire to estimate
the current annual average income of 1998 graduates of the University of California
at Berkeley. I propose to use the sample mean income
of a sample of graduates drawn at random. To facilitate taking the sample and contacting
the graduates to obtain income information from them,
I might draw names at random from the list of 1998 graduates for whom the alumni
association has an accurate current address.
The population is the collection of 1998 graduates; the frame is those graduates
who have current addresses on file with the alumni association.
If there is a tendency for graduates with higher incomes to have up-to-date
addresses on file with the alumni association,
that would introduce a positive bias into the annual average
income estimated from the sample by the sample mean.
Frequency theory of probability
See Probability, Theories of.
Frequency table
A table listing the frequency (number) or relative frequency (fraction or percentage) of
observations in different ranges, called
class intervals.
Fundamental Rule of Counting
If a sequence of experiments or trials T_{1}, T_{2}, T_{3},
. . . , T_{k} could result, respectively, in n_{1},
n_{2},
n_{3}, . . . , n_{k} possible outcomes, and the
numbers n_{1},
n_{2}, n_{3}, . . . , n_{k} do not depend on
which outcomes actually occurred, the entire sequence of k experiments has
n_{1}× n_{2} ×
n_{3}×
. . . × n_{k} possible outcomes.
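For example, three experiments with 2, 3, and 4 possible outcomes each yield 2×3×4 = 24 possible sequences, which can be checked by enumeration (the outcome labels are arbitrary):

```python
from itertools import product

# Enumerate every possible sequence of outcomes of the three experiments.
outcomes = list(product("ab", "xyz", range(4)))
len(outcomes)  # 24 = 2 * 3 * 4
```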
Genetic Screening
Searching a population or individuals for persons possessing certain genotypes
or karyotypes that: (1) are already associated with disease or predispose to
disease; (2) may lead to disease in their descendants; or (3) produce other
variations not known to be associated with disease. Genetic screening may be
directed toward identifying phenotypic expression of genetic traits. It includes
prenatal genetic screening.
Geometric Distribution
The geometric distribution describes the number of trials up to and including the first
success, in independent trials with the same probability of success. The geometric
distribution depends only on the single parameter p, the probability of success in
each trial. For example, the number of times one must toss a fair coin until the first
time the coin lands heads has a geometric distribution with parameter p = 50%.
The geometric distribution assigns probability
p×(1 - p)^{k-1} to
the event that it takes k trials to the first success.
The expected
value of the geometric distribution is 1/p, and its SE is
(1 - p)^{½}/p.
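The probabilities and the expected value 1/p can be checked numerically by truncating the infinite sum (the helper name is made up):

```python
def geometric_pmf(k, p):
    """Chance that the first success comes on trial k (k = 1, 2, ...):
    p * (1 - p)**(k - 1)."""
    return p * (1 - p) ** (k - 1)

p = 0.5  # e.g. tossing a fair coin until the first head
total = sum(geometric_pmf(k, p) for k in range(1, 200))
mean = sum(k * geometric_pmf(k, p) for k in range(1, 200))
# total is (numerically) 1, and mean is (numerically) 1/p = 2
```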
Geometric Mean
The geometric mean of n numbers {x_{1},
x_{2},
x_{3}, . . . , x_{n}}
is the nth root of their product:
(x_{1}×x_{2}×x_{3}×
. . .
×x_{n})^{1/n}.
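As a sketch (the function name is illustrative):

```python
from math import prod

def geometric_mean(xs):
    """The nth root of the product of the n numbers in xs."""
    return prod(xs) ** (1 / len(xs))

geometric_mean([1, 4, 16])  # cube root of 64, i.e. approximately 4
```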
Geriatric Assessment
Evaluation of the level of physical, physiological, or mental functioning in
the older population group.
Graph of Averages
For bivariate data, a graph of averages is a plot of the
average values of one variable (say y) for small ranges of values of the other
variable (say x), against the value of the second variable (x) at the
midpoints of the ranges.
Gravidity
The number of pregnancies, complete or incomplete, experienced by a female. It
is different from parity, which is the number of offspring born.
Health Status
The level of health of the individual, group, or population as subjectively
assessed by the individual or by more objective measures.
Health Status Indicators
The measurement of the health status for a given population using a variety
of indices, including morbidity, mortality, and available health resources.
Health Surveys
A systematic collection of factual data pertaining to health and disease in
a human population within a given geographic area.
Health Transition
Demographic and epidemiologic changes that have occurred in the last five
decades in many developing countries and that are characterized by major growth
in the number and proportion of middle-aged and elderly persons and in the
frequency of the diseases that occur in these age groups. The health transition
is the result of efforts to improve maternal and child health via primary care
and outreach services and such efforts have been responsible for a decrease in
the birth rate; reduced maternal mortality; improved preventive services;
reduced infant mortality, and the increased life expectancy that defines the
transition.
Heteroscedasticity
"Mixed scatter." A scatterplot or
residual plot shows heteroscedasticity if the scatter in
vertical slices through the plot depends on where you take the slice.
Linear regression is not usually a good idea if the data are
heteroscedastic.
Histogram
A histogram is a kind of plot that summarizes how data are distributed. Starting with a
set of class intervals, the histogram is a set of rectangles
("bins") sitting on the horizontal axis. The bases of the
rectangles are the class intervals, and their heights are
such that their areas are proportional to the fraction of observations in the
corresponding class intervals. That is, the height of a
given rectangle is the fraction of observations in the corresponding
class interval, divided by the length of the corresponding
class interval. A histogram does not need a vertical scale,
because the total area of the histogram must equal 100%. The units of the vertical axis
are percent per unit of the horizontal axis. This is called the density scale.
The horizontal axis of a histogram needs a scale. If any observations coincide with the
endpoints of class intervals, the
endpoint convention is important.
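The density-scale heights can be computed directly (a sketch assuming the left-endpoint convention; names and data are made up):

```python
def histogram_heights(data, edges):
    """Heights of histogram bars on the density scale: the fraction of
    observations in each class interval divided by the interval's
    length, using the left-endpoint convention (rightmost interval
    closed on both ends).  The bar areas then sum to 1 (100%)."""
    n = len(data)
    heights = []
    for i in range(len(edges) - 1):
        left, right = edges[i], edges[i + 1]
        if i == len(edges) - 2:                   # rightmost interval
            count = sum(left <= x <= right for x in data)
        else:
            count = sum(left <= x < right for x in data)
        heights.append(count / n / (right - left))
    return heights

h = histogram_heights([1, 2, 2, 3, 9], [0, 5, 10])
# The areas h[0]*5 + h[1]*5 total 1, i.e. 100%.
```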
Historical Controls
Sometimes a treatment group is compared with
individuals from another epoch who did not receive the treatment; for example, in studying
the possible effect of fluoridated water on childhood cancer, we might compare cancer
rates in a community before and after fluorine was added to the water supply. Those
individuals who were children before fluoridation started would comprise an historical
control group. Experiments and studies with historical controls tend to be more
susceptible to confounding than those with contemporary controls, because many factors
that might affect the outcome other than the treatment tend to
change over time as well. (In this example, the level of other potential carcinogens in
the environment also could have changed.)
Homoscedasticity
"Same scatter." A scatterplot or
residual plot shows homoscedasticity if the scatter
in vertical slices through the plot does not depend much on where you take the slice.
C.f. heteroscedasticity.
Hospital Mortality
A vital statistic measuring or recording the rate of death from any cause in
hospitalized populations.
Hospital Records
Compilations of data on hospital activities and programs; excludes patient
medical records.
Hypergeometric Distribution
The hypergeometric distribution with parameters N, G and
n is the distribution of the number of "good"
objects in a simple random sample of size n
(i.e., a
random sample without replacement in which every subset of size n has the same
chance of occurring) from a population of N objects of which
G are "good."
The chance of getting exactly g good objects in such a sample is:
_{G}C_{g} ×
_{N-G}C_{n-g}/_{N}C_{n},
provided g <= n, g <= G, and
n - g <= N - G.
(The probability is zero otherwise.)
The expected value of the hypergeometric distribution is n×G/N,
and its standard error is:
((N - n)/(N - 1))^{½}
× (n ×
G/N × (1 - G/N)
)^{½}.
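The formulas for the chance, expected value, and standard error can be checked numerically (a sketch; the names and parameter values are made up):

```python
from math import comb, sqrt

def hyper_pmf(g, N, G, n):
    """Chance of exactly g good objects in a simple random sample of
    size n from N objects of which G are good."""
    if g < 0 or g > G or n - g < 0 or n - g > N - G:
        return 0.0
    return comb(G, g) * comb(N - G, n - g) / comb(N, n)

N, G, n = 20, 8, 5
mean = sum(g * hyper_pmf(g, N, G, n) for g in range(n + 1))
var = sum((g - mean) ** 2 * hyper_pmf(g, N, G, n) for g in range(n + 1))
se = sqrt(var)
# mean matches n*G/N = 2.0, and se matches
# sqrt((N-n)/(N-1)) * sqrt(n * G/N * (1 - G/N))
```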
Hypothesis testing
Statistical hypothesis testing is formalized as making a decision between rejecting or
not rejecting a null hypothesis, on the basis of a set of
observations.
Two types of errors can result from any decision rule (test): rejecting the
null hypothesis when it is true (a Type I error), and failing to
reject the null hypothesis when it is false (a Type II error).
For any hypothesis, it is possible to develop many different decision rules (tests).
Typically, one specifies ahead of time the chance of a Type I error one is willing to
allow.
That chance is called the significance level of the
test or decision rule.
For a given significance level, one way of deciding which decision
rule is best is to pick the one that has the smallest chance of a Type II error when a
given alternative hypothesis is true.
The chance of correctly
rejecting the null hypothesis when a given alternative hypothesis is true is
called the power of the test against that alternative hypothesis.
IFF, if and only if
If p and q are two logical propositions,
then (p IFF q) is a proposition that is true when
both p and q are true, and when both p and q are
false. It is logically equivalent to the proposition: ( (p
IMPLIES q)
AND
(q IMPLIES p) )
and to the proposition
( (p AND q)
OR ((NOT
p) AND (NOT q)) ).
Implies, logical implication
Logical implication is an operation on two logical propositions.
If p and q are two logical propositions,
(p IMPLIES q) is a logical proposition that is
true if p is false, or if both p and q are true.
The proposition (p IMPLIES q) is
logically equivalent to the proposition
((NOT p)
OR q).
Incidence
The number of new cases of a given disease during a given period in a specified
population. It also is used for the rate at which new events occur in a defined
population. It is differentiated from prevalence, which refers to all cases, new
or old, in the population at a given time.
Infant Mortality
Perinatal, neonatal, and infant deaths in a given population.
Independent and identically distributed (iid)
A collection of two or more random variables {X_{1}, X_{2},
. . . , }
is independent and identically distributed if the variables have the same
probability distribution,
and are independent.
Independent, independence
Two events A and B are (statistically) independent if the chance
that they both happen simultaneously is the product of the chances that each occurs
individually; i.e., if P(AB) = P(A)P(B). This is essentially equivalent to saying
that learning that one event occurs does not give any information about whether the other
event occurred too: the conditional probability of A given B is the same as the
unconditional probability of A, i.e., P(A|B) = P(A). Two random variables X and Y are independent if all events
they
determine are independent, for example, if the event
{a < X <= b}
is independent of the event {c < Y <= d} for
all
choices of a, b, c, and d.
A collection of more than two random variables is independent if for every proper subset
of the variables, every event determined
by that subset of the variables is independent of every event determined by the variables
in the complement of the subset. For example, the three random variables X, Y, and Z are
independent if every event determined by X is independent of every event
determined by Y and
every event determined by X is independent of every event determined by Y and Z
and every event determined by Y is
independent of every event determined by X and Z and every event determined by Z
is independent of every event determined by X and Y.
Independent Variable
In regression, the independent variable is the one that is
supposed to explain the other; the term is a synonym for "explanatory variable."
Usually, one regresses the "dependent variable" on the "independent
variable." There is not always a clear choice of the independent variable. The
independent variable is usually plotted on the horizontal axis. Independent in this
context does not mean the same thing as
statistically independent.
Indicator
Random Variable
The indicator [random variable] of the
event A, often written 1_{A}, is the
random variable that
equals unity if A occurs, and zero if A does not occur.
The expected
value of the indicator of A is the probability of A, P(A), and the
standard error of the indicator of A is
(P(A)×(1 - P(A)))^{½}.
The sum
1_{A} + 1_{B} + 1_{C} +
. . .
of the indicators of a
collection of events {A, B, C, . . . }
counts how many of the
events {A, B, C, . . . } occur in a given
trial.
The product of the indicators of a collection of events is the indicator of the
intersection of the events (the product equals one if and only if all of
indicators equal one).
The maximum of the indicators of a collection of events is the indicator
of the union of the events (the maximum equals one if any of the indicators equals one).
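A small illustration with one roll of a die (the event sets and names are made up):

```python
# Indicator of an event for a given outcome: 1 if the event occurs, else 0.
def indicator(event, outcome):
    return 1 if outcome in event else 0

A = {2, 4, 6}        # "the roll is even"
B = {4, 5, 6}        # "the roll is at least four"
outcome = 4

# The sum of indicators counts how many of the events occur:
indicator(A, outcome) + indicator(B, outcome)       # 2
# The product is the indicator of the intersection (both occur):
indicator(A, outcome) * indicator(B, outcome)       # 1
# The maximum is the indicator of the union (at least one occurs):
max(indicator(A, outcome), indicator(B, outcome))   # 1
```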
Insect Vectors
Insects that transmit infective organisms from one host to another or from an
inanimate reservoir to an animate host.
Interquartile Range (IQR)
The interquartile range of a list of numbers is the upper
quartile
minus the lower quartile.
Interpolation
Given a set of bivariate data (x, y), to
impute a value of y corresponding to some value of x at which there is
no measurement of y is called interpolation, if the value of x is within
the range of the measured values of x. If the value of x is outside the
range of measured values, imputing a corresponding value of y is called
extrapolation.
Intersection
The intersection of two or more sets is the set of elements that all the sets have in
common; the elements contained in every one of the sets.
The intersection of the events A and B is written
"A and B" and "AB." C.f.
union. See also Venn diagrams.
Intervention Studies
Epidemiologic investigations designed to test a hypothesized causeeffect
relation by modifying the supposed causal factor(s) in the study population.
Interviews
Conversations with an individual or individuals held in order to obtain
information about their background and other personal biographical data, their
attitudes and opinions, etc. It includes school admission or job interviews.
Joint Probability Distribution.
If X_{1}, X_{2}, . . . ,
X_{k} are
random variables,
their joint probability distribution gives the probability
of events determined by the collection of random variables:
for any collection of sets of numbers
{A_{1}, . . . , A_{k}},
the joint probability distribution determines
P( (X_{1} is in A_{1}) and
(X_{2} is in A_{2}) and . . . and
(X_{k} is in A_{k})
).
Kaplan-Meier Method (or Product Limit
Method).
A method for analyzing survival
data, based on the distribution of variable time
periods between events (or deaths).
Karnofsky Performance Status
A performance measure for rating the ability of a person to perform usual
activities, evaluating a patient's progress after a therapeutic procedure, and
determining a patient's suitability for therapy. It is used most commonly in the
prognosis of cancer therapy, usually after chemotherapy and customarily
administered before and after therapy.
Law of Averages
The Law of Averages says that the average of
independent
observations of random variables
that have the same probability distribution is
increasingly likely to be close
to the expected value of the
random
variables as the number of observations grows.
More precisely, if X_{1}, X_{2},
X_{3}, . . . , are independent
random variables with
the same probability distribution, and E(X) is their
common expected value, then for
every number E > 0,
P{ |(X_{1} + X_{2} + . . . + X_{n})/n - E(X)| < E }
converges to 100% as n grows.
This is equivalent to saying that the sequence of sample means
X_{1}, (X_{1}+X_{2})/2,
(X_{1}+X_{2}+X_{3})/3, . . .
converges in probability to E(X).
Law of Large Numbers
The Law of Large Numbers says that in repeated, independent
trials with the same probability p of success in each trial, the percentage of
successes is increasingly likely to be close to the chance of success as the number of
trials increases. More precisely, the chance that the percentage of successes differs from
the probability p by more than any fixed positive amount, E > 0,
converges to zero as the number of trials n goes to infinity.
Note that in contrast to the difference between the percentage of
successes and the probability of success, the difference between the number of
successes and the expected number of successes,
n×p,
tends to grow as n grows.
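A simulation sketch of the law (the parameters and seed are arbitrary):

```python
import random

random.seed(2)
p, n = 0.5, 100_000
# Count successes in n independent trials with success probability p.
successes = sum(random.random() < p for _ in range(n))

pct_gap = abs(successes / n - p)    # likely small, and shrinks as n grows
count_gap = abs(successes - n * p)  # in contrast, tends to grow with n
```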
Least-Squares Analysis
A principle of estimation in which the estimates of a set of parameters in a
statistical model are those quantities minimizing the sum of squared differences
between the observed values of a dependent variable and the values predicted by
the model.
Life Expectancy
A figure representing the number of years, based on known statistics, to which
any person of a given age may reasonably expect to live.
Life Tables
Summarizing techniques used to describe the pattern of mortality and survival in
populations. These methods can be applied to the study not only of death, but
also of any defined endpoint such as the onset of disease or the occurrence of
disease complications.
Life Table Method
A method for
analyzing survival data, based on the proportion
of study subjects surviving to fixed time
intervals after treatment or study initiation.
Likelihood Functions
Functions constructed from a statistical model and a set of observed data which
give the probability of that data for various values of the unknown model
parameters. Those parameter values that maximize the probability are the maximum
likelihood estimates of the parameters.
Limit
See converge.
Linear association
Two variables are linearly associated if a change in one is associated with a
proportional change in the other, with the same constant of proportionality throughout the
range of measurement. The correlation coefficient measures
the degree of linear association on a scale of -1 to 1.
Linear Models
Statistical models in which the value of a parameter for a given value of a
factor is assumed to be equal to a + bx, where a and b are constants. The models
predict a linear regression.
Linear Operation
Suppose f is a function or operation that acts on things we shall denote
generically by the lowercase Roman letters x and y. Suppose it makes
sense to multiply x and y by numbers (which we denote by
a),
and that it makes sense to add things like x and y together. We say that
f is linear if for every number a and every value of
x
and y for which f(x) and f(y) are defined,
(i) f( a×x ) is defined and equals
a×f(x),
and (ii) f( x + y ) is defined and equals
f(x)
+ f(y). C.f. affine.
Linear Regression Method
For a single
item, a method for determining the best-fit line
through points representing the paired values of
two measurement systems (one representing a
dependent variable and the other representing an
independent variable). Under certain conditions,
statistical tests of the slope and intercept can
be made, and confidence intervals about the line
can be computed.
Location, Measure of
A measure of location is a way of summarizing what a "typical" element of a
list is; it is a one-number summary of a distribution. See
also arithmetic mean, median, and
mode.
Log-Linear Modeling Techniques
Methods for analyzing qualitative data in which
a function of the probability that a particular
event will occur is logarithmically transformed
to fit a linear model.
Logistic Models
Statistical models which describe the relationship between a qualitative
dependent variable (that is, one which can take only certain discrete values,
such as the presence or absence of a disease) and an independent variable. A
common application is in epidemiology for estimating an individual's risk
(probability of a disease) as a function of a given risk factor.
Logistic Regression Method
A
specialized log-linear modeling technique in
which the logarithm of the proportion of a group
having a particular characteristic, divided by
one minus that proportion, is fit into a
multiple regression linear model.
Longitudinal study
A study in which individuals are followed over time, and compared
with themselves at different times, to determine, for example, the effect of aging on some
measured variable. Longitudinal studies provide much more persuasive
evidence about the effect of aging than do cross-sectional
studies.
Margin of error
A measure of the uncertainty in an estimate of a
parameter; unfortunately, not everyone
agrees what it should mean.
The margin of error of an estimate is typically
one or two times the estimated standard error of the estimate.
Markov's Inequality
For lists: If a list contains no negative numbers, the fraction of numbers in the list
at least as large as any given constant a>0 is no larger than the
arithmetic mean of the list, divided by a.
For random variables: if a random variable X must be
nonnegative, the chance that X exceeds any given constant a>0 is no larger than
the expected value of X, divided by a.
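A numeric check of the inequality for a list (the function name and numbers are made up):

```python
def markov_bound(xs, a):
    """Markov's inequality for a list of nonnegative numbers: the
    fraction of entries at least as large as a is at most
    mean(xs) / a."""
    assert all(x >= 0 for x in xs) and a > 0
    fraction = sum(x >= a for x in xs) / len(xs)
    bound = (sum(xs) / len(xs)) / a
    return fraction, bound

frac, bound = markov_bound([0, 1, 1, 2, 16], 10)
# frac = 0.2 (one of five entries is >= 10); bound = 4/10 = 0.4
```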
Mass Screening
Organized periodic procedures performed on large groups of people for the
purpose of detecting disease.
Matched-Pair Analysis
A type of analysis in which subjects in a study group and a comparison group are
made comparable with respect to extraneous factors by individually pairing study
subjects with the comparison group subjects (e.g., age-matched controls).
Maternal Mortality
Maternal deaths resulting from complications of pregnancy and childbirth in a
given population.
Maximum Likelihood Estimate (MLE)
The maximum likelihood estimate of a parameter from data is the
possible value of the parameter for which the chance of observing
the data is largest. That is, suppose that the parameter is p,
and that we observe data x. Then the maximum likelihood estimate of
p is:
estimate p by the value q that makes P(observing x when the
value of p is q) as large as possible.
For example, suppose we are trying to estimate the chance that a (possibly biased) coin
lands heads when it is tossed. Our data will be the number of times x the coin
lands heads in n independent tosses of the coin. The distribution of the number
of times the coin lands heads is binomial with
parameters n (known) and p (unknown). The chance
of observing x heads in n trials if the chance of heads in a given trial
is q is
_{n}C_{x} q^{x}(1-q)^{n-x}.
The maximum likelihood estimate of p would be the value of q that
makes that chance largest. We can find that value of q explicitly using calculus;
it turns out to be q = x/n, the fraction of times the coin is
observed to land heads in the n tosses. Thus the maximum likelihood estimate of
the chance of heads from the number of heads in n independent tosses of the coin
is the observed fraction of tosses in which the coin lands heads.
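The coin example can be checked numerically: maximize the binomial likelihood over a grid of candidate values of q and compare with the closed form x/n. A sketch in Python, with hypothetical data (7 heads in 10 tosses):

```python
from math import comb

def binom_likelihood(q, n, x):
    """Chance of x heads in n independent tosses if the chance of heads is q."""
    return comb(n, x) * q ** x * (1 - q) ** (n - x)

n, x = 10, 7  # hypothetical data: 7 heads in 10 tosses

# Brute-force search over a fine grid instead of calculus.
grid = [i / 1000 for i in range(1001)]
q_hat = max(grid, key=lambda q: binom_likelihood(q, n, x))

print(q_hat)  # 0.7, agreeing with the closed-form MLE x/n
```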
Mean, Arithmetic mean
The sum of a list of numbers, divided by the number of numbers.
See also average.
Mean Squared Error (MSE)
The mean squared error of an estimator of a
parameter is the expected value of the
square of the difference between the estimator and the parameter. In symbols, if X is an
estimator of the parameter t, then
MSE(X) = E( (X-t)^{2} ).
The MSE measures how far the estimator is off from what it is trying to estimate, on the
average in repeated experiments. It is a summary measure of the accuracy of the estimator.
It combines any tendency of the estimator to overshoot or undershoot the truth
(bias), and the variability of the estimator (SE).
The MSE can be written in terms of the bias and
SE of the estimator:
MSE(X) = (bias(X))^{2} +
(SE(X))^{2}.
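The decomposition MSE = bias² + SE² can be verified exactly for a toy estimator; a sketch in Python, with a hypothetical estimator X uniform on {0, 1, 2, 3} of a true value t = 1:

```python
# Toy estimator X taking each value in {0, 1, 2, 3} with chance 1/4,
# used to estimate a hypothetical true parameter t = 1.
values = [0, 1, 2, 3]
t = 1

ev = sum(values) / len(values)                          # E(X) = 1.5
mse = sum((x - t) ** 2 for x in values) / len(values)   # E((X-t)^2) = 1.5
bias = ev - t                                           # 0.5
var = sum((x - ev) ** 2 for x in values) / len(values)  # SE(X)^2 = 1.25

assert abs(mse - (bias ** 2 + var)) < 1e-12  # 1.5 = 0.25 + 1.25
```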
Median
"Middle value" of a list. The smallest number such that at least half the
numbers in the list are no greater than it. If the list has an odd number of entries, the
median is the middle entry in the list after sorting the list into increasing order. If
the list has an even number of entries, the median is the smaller of the two middle
numbers after sorting. The median can be estimated from a histogram by finding the
smallest number such that the area under the histogram to the left of that
number is 50%.
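The definition above (smaller of the two middle numbers for an even-length list, rather than their average) translates directly into code; a sketch in Python:

```python
def median(nums):
    """Median per the definition above: the smallest number such that at
    least half the numbers in the list are no greater than it.  For an
    even-length list this is the smaller of the two middle numbers."""
    s = sorted(nums)
    n = len(s)
    return s[n // 2] if n % 2 == 1 else s[n // 2 - 1]

print(median([3, 1, 2]))     # 2 (middle entry after sorting)
print(median([4, 1, 3, 2]))  # 2 (smaller of the two middle numbers, 2 and 3)
```

Note this differs from conventions (such as Python's `statistics.median`) that average the two middle numbers of an even-length list.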
Medical Records
Recording of pertinent information concerning patient's illness or illnesses.
Member of a set
Something is a member (or element) of a
set if it is one of the
things in the set.
Method of Comparison
The most basic and important method of determining whether a
treatment
has an effect: compare what happens to individuals who are treated
(the treatment group) with what happens to
individuals who are not
treated (the control group).
Mode
For lists, the mode is a most common (frequent) value. A list can have more than one
mode. For histograms, a mode is a relative maximum
("bump").
Models, Statistical
Statistical formulations or analyses which, when applied to data and found to
fit the data, are then used to verify the assumptions and parameters used in the
analysis. Examples of statistical models are the linear model, binomial model,
polynomial model, twoparameter model, etc.
Moment
The kth moment of a list is the average value of the elements raised to
the kth power; that is, if the list consists of the N elements
x_{1}, x_{2}, . . . ,
x_{N},
the kth moment of the list is: ( x_{1}^{k} + x_{2}^{k} + . . . +
x_{N}^{k} )/N. The kth moment of a
random variable X is
the expected value of X^{k},
E(X^{k}).
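For a list, the definition is a one-liner; a sketch in Python, with hypothetical data:

```python
def kth_moment(nums, k):
    """kth moment of a list: the average of the elements raised to the kth power."""
    return sum(x ** k for x in nums) / len(nums)

data = [1, 2, 3]  # hypothetical list
print(kth_moment(data, 1))  # 2.0 -- the first moment is the arithmetic mean
print(kth_moment(data, 2))  # (1 + 4 + 9)/3, approximately 4.667
```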
Morbidity
The proportion of patients with a particular disease during a given year per
given unit of population.
Mortality
All deaths reported in a given population.
Multimodal Distribution
A distribution with more than one
mode.
Multinomial Distribution
Consider a sequence of n
independent trials,
each of which can result in an outcome in any of k categories.
Let p_{j} be the probability that each trial results
in an outcome in category j, j = 1, 2, . . . ,
k,
so
p_{1} + p_{2} + . . . +
p_{k}
= 100%. The number of outcomes of each type has a
multinomial distribution.
In particular, the probability that the n trials result in
n_{1} outcomes of type 1, n_{2} outcomes of type 2,
. . . , and n_{k} outcomes of type k is
n!/(n_{1}! × n_{2}! × . . . × n_{k}!) ×
p_{1}^{n1} × p_{2}^{n2} × . . . × p_{k}^{nk},
if n_{1}, . . . , n_{k} are
nonnegative integers that sum to n; the chance is zero otherwise.
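The probability formula can be evaluated directly; a sketch in Python, using a hypothetical three-outcome trial with chances 50%, 30%, and 20%:

```python
from math import factorial

def multinomial_prob(counts, probs):
    """Chance that n = sum(counts) independent trials yield counts[j]
    outcomes of type j, where probs[j] is the chance of type j per trial."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)          # n!/(n_1! x n_2! x ... x n_k!)
    prob = float(coef)
    for c, pj in zip(counts, probs):
        prob *= pj ** c                # x p_1^{n_1} x ... x p_k^{n_k}
    return prob

# 4 trials; 2 outcomes of type 1, 1 of type 2, 1 of type 3:
print(multinomial_prob([2, 1, 1], [0.5, 0.3, 0.2]))  # 12 x 0.015 = 0.18
```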
Multiphasic Screening
The simultaneous use of multiple laboratory procedures for the detection of
various diseases. These are usually performed on groups of people.
Multiple Regression Analysis
A
multivariate extension of linear regression in
which two or more independent variables are fit
into a best linear model of a dependent
variable.
Multiplication rule
The chance that events A and B both occur (i.e.,
that event AB occurs), is the
conditional probability that A occurs given that B
occurs, times the unconditional probability that B occurs.
Multiplicity in hypothesis tests
In hypothesis testing, if more than one hypothesis is tested, the actual
significance level of the combined tests is not equal to the nominal
significance level of the individual tests.
Multivariate Analysis
A set of techniques used when variation in several variables has to be studied
simultaneously. In statistics, multivariate analysis is interpreted as any
analytic method that allows simultaneous study of two or more dependent
variables.
Multivariate Data
A set of measurements of two or more variables per individual.
See bivariate.
Mutually Exclusive
Two events are mutually exclusive if the occurrence of
one is incompatible with the occurrence of the other; that is, if they can't both happen
at once (if they have no outcome in common). Equivalently, two events
are disjoint if their intersection is the
empty set.
Nearly normal distribution
A population of numbers (a list of numbers) is said to have a
nearly normal
distribution if the histogram of its values in
standard units nearly
follows a normal curve.
More precisely, suppose that the mean of the
list is µ and the standard deviation
of the list is SD.
Then the list is nearly normally distributed if, for every two numbers
a < b, the fraction of numbers in the list that are between
a and b is approximately equal to the area under the normal
curve between (a - µ)/SD and (b - µ)/SD.
Negative Binomial Distribution
Consider a sequence of independent trials with the same probability
p of success in each trial. The number of trials up to and including
the rth success has the negative binomial distribution with parameters
p and r. If the random variable N has the negative
binomial distribution with parameters p and r, then
P(N=k) = _{k-1}C_{r-1} × p^{r} × (1-p)^{k-r},
for k = r, r+1, r+2, . . . , and zero for
k < r, because there must be at least r trials to have r
successes. The negative binomial distribution is derived as follows: for the
rth success to occur on the kth trial, there must have been
r-1 successes in the first k-1 trials, and the
kth trial must result in success. The
chance of the former is the chance of r-1 successes in k-1
independent trials with the same probability of success in each
trial, which, according to the binomial distribution with
parameters n = k-1 and p, has probability
_{k-1}C_{r-1} × p^{r-1} × (1-p)^{k-r}.
The chance of the latter event is p, by assumption. Because the trials are
independent, we can find the chance that both events
occur by multiplying their chances together, which gives the expression for P(N=k)
above.
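The probability function P(N=k) translates directly into code; a sketch in Python, checked on the chance that the 2nd head of a fair coin occurs on the 3rd toss:

```python
from math import comb

def neg_binom_pmf(k, r, p):
    """P(N = k): chance the rth success occurs on trial k, for independent
    trials with success probability p; zero for k < r."""
    if k < r:
        return 0.0
    return comb(k - 1, r - 1) * p ** r * (1 - p) ** (k - r)

# 2nd head on the 3rd toss of a fair coin: C(2,1) x 0.5^2 x 0.5 = 0.25
print(neg_binom_pmf(3, 2, 0.5))  # 0.25
```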
Neonatal Screening
The identification of selected parameters in newborn infants by various tests,
examinations, or other procedures. Screening may be performed by clinical or
laboratory measures. A screening test is designed to sort out healthy neonates
from those not well, but the screening test is not intended as a diagnostic
device; rather, it serves an epidemiologic purpose.
Nonlinear Association
The relationship between two variables is nonlinear if a change in one is associated
with a change in the other that depends on the value of the first; that is, if the
change in the second is not simply proportional to the change in the first, independent of
the value of the first variable.
Nonparametric Statistics
A class of statistical methods applicable to a large set of probability
distributions used to test for correlation, location, independence, etc. In most
nonparametric statistical tests, the original scores or observations are
replaced by another variable containing less information. An important class of
nonparametric tests employs the ordinal properties of the data. Another class of
tests uses information about whether an observation is above or below some fixed
value such as the median, and a third class is based on the frequency of the
occurrence of runs in the data.
Nonparametric Tests
Hypothesis tests
that do not require data to be consistent with
any particular theoretical distribution, such as the
normal distribution.
Nonresponse
In surveys, it is rare that everyone who is "invited" to participate (everyone whose
phone number is called, everyone who is mailed a questionnaire, everyone an interviewer
tries to stop on the street . . . ) in fact responds. The difference between the
"invited" sample sought, and that obtained, is the nonresponse.
Nonresponse bias
In a survey, those who respond may differ from those who do not, in ways that are
related to the effect one is trying to measure. For example, a telephone survey of how
many hours people work is likely to miss people who are working late, and are therefore
not at home to answer the phone. When that happens, the survey may suffer from nonresponse
bias. Nonresponse bias makes the result of a survey differ systematically
from the truth.
Nonresponse rate
The fraction of nonresponders in a survey:
the number of nonresponders divided by the number of people invited to participate
(the number sent questionnaires, the number of interview attempts, etc.)
If the nonresponse rate is appreciable, the survey may suffer from large
nonresponse bias.
Normal approximation
The normal approximation to data is to approximate areas under the
histogram
of data, transformed into standard units, by the
corresponding areas under the normal curve.
Many probability distributions can be approximated by a normal distribution, in the
sense that the area
under the probability histogram is close to the area under a corresponding part of the
normal curve. To find the corresponding part of the normal curve, the range must be
converted to standard units, by subtracting the expected value
and dividing by the standard error.
For example, the area under the binomial probability histogram for
n = 50 and p = 30% between 9.5 and 17.5 is 74.2%. To use the normal
approximation, we transform the endpoints to standard units, by subtracting the
expected value (for the binomial random variable, n×p = 15
for these values of n and p) and dividing the result by the standard error
(for a binomial, (n × p × (1-p))^{1/2} = 3.24
for these values of n and p).
The normal approximation to the area is the area under the normal curve between
(9.5 - 15)/3.24 = -1.697 and (17.5 - 15)/3.24 = 0.772; that area is 73.5%, slightly
smaller than the corresponding area under the binomial histogram. See also the
continuity correction.
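The worked example can be reproduced numerically; a sketch in Python, using the error function for areas under the normal curve:

```python
from math import comb, erf, sqrt

n, p = 50, 0.30
lo, hi = 9.5, 17.5

# Exact area under the binomial probability histogram between 9.5 and 17.5,
# i.e. P(10 <= X <= 17) for X binomial with n = 50 and p = 30%.
exact = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(10, 18))

# Normal approximation: convert the endpoints to standard units, then take
# the area under the normal curve, with Phi(z) = (1 + erf(z/sqrt(2)))/2.
ev = n * p                   # expected value, 15
se = sqrt(n * p * (1 - p))   # standard error, about 3.24
phi = lambda z: (1 + erf(z / sqrt(2))) / 2
approx = phi((hi - ev) / se) - phi((lo - ev) / se)

print(round(exact, 3), round(approx, 3))  # roughly 0.742 and 0.735, as above
```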
Normal curve
The normal curve is the familiar
"bell curve." The mathematical expression for the normal curve is
y = (2×pi)^{-½}E^{-x^{2}/2},
where pi is the ratio of the circumference of a circle to its diameter
(3.14159265 . . . ),
and E is the base
of the natural logarithm (2.71828 . . . ).
The normal curve is symmetric around the point
x=0, and
positive for every value of
x. The area under the normal curve is unity, and the
SD of the normal curve, suitably defined, is also unity. Many (but not most)
histograms, converted into
standard units,
approximately follow the normal curve.
Normal distribution
A random variable X has a normal distribution with mean
m and
standard error
s if for every pair of numbers a <= b, the chance that
a < (X-m)/s < b is
P(a < (X-m)/s < b) = area under the normal curve between
a and b.
If there are numbers
m and
s such that X has a normal
distribution with mean
m and standard error
s, then X is said to have
a normal distribution or to be normally distributed. If X has a normal
distribution with mean
m=0 and standard error
s=1, then X is said
to have a standard normal distribution. The notation X~N(m,s
^{2}) means that
X has a normal distribution with mean
m and
standard error
s; for example, X~N(0,1), means X has a standard normal distribution.
Normal Distribution
Continuous frequency distribution of infinite range. Its properties are as
follows: 1) continuous, symmetrical distribution with both tails extending to
infinity; 2) arithmetic mean, mode, and median identical; and 3) shape
completely determined by the mean and standard deviation.
NOT, Negation, Logical Negation
The negation of a logical proposition
p,
NOT p, is a proposition that is the logical opposite of
p.
That is, if
p is true,
NOT p is false, and
if
p is false,
NOT p is true. Negation takes
precedence over other logical operations.
Number Needed to Treat (NNT)
The number of patients who need to be treated to prevent 1 adverse outcome.
Null hypothesis
In
hypothesis testing, the hypothesis we wish to falsify
on the basis of the data. The null hypothesis is typically that something is not present,
that there is no effect, or that there is no difference between treatment and control.
Observational Study
C.f. controlled experiment.
Observer Variation
The failure by the observer to measure or identify a phenomenon accurately,
which results in an error. Sources for this may be due to the observer's missing
an abnormality, or to faulty technique resulting in incorrect test measurement,
or to misinterpretation of the data. Two varieties are interobserver variation
(the amount observers vary from one another when reporting on the same material)
and intraobserver variation (the amount one observer varies between
observations when reporting more than once on the same material).
Odds
The odds in favor of an event is the ratio
of the probability that the event occurs to the probability that the
event does not occur. For example, suppose an experiment can result in any of
n possible outcomes, all equally likely, and that
k of the outcomes result in a "win" and
n-k result in a "loss." Then the chance of
winning is k/n; the chance of not winning is
(n-k)/n; and the odds in favor of winning are
(k/n)/((n-k)/n) = k/(n-k), which is the number of favorable outcomes divided by the
number of unfavorable outcomes. Note that odds are not synonymous with probability, but
the two can be converted back and forth. If the odds in favor of an event are
q, then the probability of the event is q/(1+q). If the probability of an
event is p, the odds in favor of the event are p/(1-p) and the
odds against the event are (1-p)/p.
One-sided Test
C.f. two-sided test.
A hypothesis test of the null hypothesis
that the value of a parameter, µ, is equal to
a null value, µ_{0}, designed to have power against either
the alternative hypothesis that µ < µ_{0}
or the alternative µ > µ_{0} (but not both).
For example, a significance level 5%, one-sided z test
of the null hypothesis that the mean of a population equals zero against the alternative
that it is greater than zero, would reject the null hypothesis for values of
z = (sample mean)/SE(sample mean) > 1.64.
OR, Disjunction, Logical Disjunction
An operation on two logical propositions.
If p and q are two propositions, (p OR q)
is a proposition that is true if p is true or
if q is true (or both); otherwise, it is false. That is,
(p OR q) is true unless both p and q
are false. C.f. exclusive disjunction, XOR.
Ordinal Variable
A variable whose possible values have a natural order, such as
{short, medium, long}, {cold, warm, hot}, or {0, 1, 2, 3, . . . }. In contrast, a variable
whose possible values are {straight, curly} or {Arizona, California, Montana, New York}
would not naturally be ordinal. Arithmetic with the possible values of an ordinal variable
does not necessarily make sense, but it does make sense to say that one possible value is
larger than another.
Outcome Space
The outcome space is the set of all possible outcomes of a given
random experiment. The outcome space is often denoted
by the capital letter
S.
Outlier
An outlier is an observation that is many SD's from the
mean. It is sometimes tempting to discard outliers, but this is imprudent
unless the cause of the outlier can be identified, and the outlier is determined to be
spurious. Otherwise, discarding outliers can cause one to underestimate the true
variability of the measurement process.