A New View of Statistics
Sample Size for a Cross-Sectional Study
Just as reliability affected sample
size in experimental or longitudinal studies, validity impacts sample
size in descriptive or cross-sectional studies. In such studies, you
measure each variable only once, and your outcomes are relationships
between the variables. The lower the validity, the more the
relationships are degraded, so the bigger the sample size you need to
characterize them. For this application it's easier to discuss the
effects of validity by considering the validity correlation rather
than the typical error of the estimate.
The effect on the magnitude of the relationship between variables is proportional to the validity correlation of each variable. For example, suppose you are interested in the relationship between physical activity and health, and suppose that the true underlying relationship corresponds to a correlation of 0.50. If your measure of physical activity has a validity correlation of 0.7, then in your study of health and physical activity you will observe a correlation of only 0.5 × 0.7, or 0.35 (plus or minus sampling error, of course). The sample size required to detect a degraded relationship is inversely proportional to the square of the validity correlation coefficient of each variable in the relationship. In our example, 1/0.7² = 2.0, so you have to double the number of subjects. That's bad news, because most psychometric and subjective behavioral measures appear to have validities of 0.7 at best. Objective measures taken on lab instruments or in the field usually have validities of 0.8-0.9 or better, so you can often ignore the effect of validity of such variables on the magnitude of effects and the required sample size. Go to the section on sample size for cross-sectional studies for more information about the actual sample sizes you need.
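The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the two rules stated above (multiply the true correlation by each validity correlation; divide the sample size by the square of each validity correlation). The function names `observed_correlation` and `sample_size_multiplier` are made up for this sketch.

```python
def observed_correlation(true_r, *validities):
    """Degrade a true correlation by the validity correlation of each variable."""
    r = true_r
    for v in validities:
        r *= v
    return r


def sample_size_multiplier(*validities):
    """Factor by which the sample size grows: 1/v^2 for each variable."""
    factor = 1.0
    for v in validities:
        factor /= v ** 2
    return factor


# Example from the text: true correlation 0.50, validity of the
# physical-activity measure 0.7 (health assumed perfectly valid).
r_obs = observed_correlation(0.50, 0.7)
n_factor = sample_size_multiplier(0.7)
print(f"observed correlation: {r_obs:.2f}")
print(f"sample size multiplier: {n_factor:.1f}")
```

If both variables had a validity of 0.7, you would pass both values to each function, and the multiplier would be about 4 rather than 2.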
Assessing an Individual
When you use a prediction equation to estimate
a criterion value from a practical value (e.g., body fat from a sum of skinfolds),
you should take into account the typical error of the estimate in much the same
way as you do the typical error of measurement for a single measurement. You use
the same factors to generate the likely range of the predicted value of the criterion
(the factors for a single measurement in the table),
but you multiply them by the typical error of the estimate. If the typical error
is based on a study of fewer than 50 subjects, you will need to use a new-prediction
error instead of the typical error, as explained earlier.
The calculations are in the appropriate section of the spreadsheet
for assessing an individual.
Example: You measure a client's skinfolds. You dig around in the literature and find an estimation equation that was developed for predicting body fat as a percent of body mass (%BM) in a large number of subjects similar to your client. The client's predicted body fat is 26.4 %BM, and the typical error of the estimate for the equation based on a large sample of similar subjects is 2.1 %BM. From the table in the section on reliability, the factor to multiply by the typical error for an 80% likely range is 1.28, which makes the limits 26.4 ± 1.28 × 2.1, or 23.7 to 29.1. You say to the client: "Your predicted body fat is 26.4 %BM, but the odds are 4 to 1 that your true (DEXA) body fat is somewhere between 24 and 29 %BM." Use the spreadsheet to generate these limits, and also the likelihood that the client's true value is greater than some reference value. For example, the likelihood that her true body fat is greater than 25 %BM is 74%, or odds of 3 to 1.
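For readers without the spreadsheet, here is a rough Python sketch of the same calculation. It assumes the error in the predicted value is normally distributed with a standard deviation equal to the typical error of the estimate, which is reasonable only for an equation derived from a large sample. The spreadsheet may do the calculation slightly differently, so the likelihood from this sketch is close to the 74% quoted above but need not match it exactly. The function names are hypothetical.

```python
from statistics import NormalDist


def likely_range(predicted, typical_error, level=0.80):
    """Return (lower, upper) limits for the true value at the given level."""
    # Two-sided factor from the standard normal, about 1.28 for 80%.
    factor = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = factor * typical_error
    return predicted - half_width, predicted + half_width


def chance_true_value_exceeds(predicted, typical_error, reference):
    """Probability that the true value is greater than the reference value."""
    return 1 - NormalDist(mu=predicted, sigma=typical_error).cdf(reference)


# Values from the example: predicted 26.4 %BM, typical error 2.1 %BM.
lower, upper = likely_range(26.4, 2.1, level=0.80)
p_above = chance_true_value_exceeds(26.4, 2.1, reference=25.0)
print(f"80% likely range: {lower:.1f} to {upper:.1f} %BM")
print(f"chance true value > 25 %BM: {p_above:.0%}")
```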
Comparing Validity of Measures
Just as the typical error of measurement was
the best measure for comparing reliability of instruments, operators, or protocols,
the typical error of the estimate is the best measure for comparing their validity.
Do not compare the new-prediction errors, however derived: these are appropriate
only for assessing individuals. As I explained with comparing
measures of reliability, use the spreadsheet
for confidence limits to calculate 80% or 90% likely ranges for the ratio
of typical errors determined with different subjects, and to get likelihoods for
the true ratio being greater than a reference ratio. Get an expert to use mixed
modeling to estimate likely ranges when the same subjects are used to determine
the typical errors.
Validity for Monitoring Changes
Our discussion of validity thus far
has been concerned with the validity of a single measurement on an
individual. But we often use a practical measure to monitor for
changes in a criterion measure. For example, we use changes in
skinfolds to infer that there have been changes in a subject's body
fat. You might think that changes in skinfolds would have to reflect
changes in body fat, but what if the amount of non-fat tissue in a
skinfold is affected substantially by the subject's state of
hydration or the menstrual cycle? In this situation a change in
skinfold thickness may or may not represent a change in body fat, so
skinfold thickness would no longer be a trustworthy measure for
tracking body fat.
How do we decide whether skinfolds or some other practical measure is trustworthy? There are three approaches: correlation of spontaneous changes, correlation of induced changes, and correlation of original variables. The reliability of the practical and criterion measures usually has to be taken into account, so the statistics get quite complex. That might explain why no-one has yet published an adequate account of any of these approaches. I will therefore restrict this section to a qualitative overview.
Correlation of Spontaneous Changes
The obvious way to see how well
changes in a practical measure track changes in a criterion measure
is to measure some subjects, wait long enough for spontaneous changes
to occur in some of them, measure them again, then plot the changes
in the criterion measure against changes in the practical measure. If
you get a very strong correlation (>0.95) you know the practical
measure is trustworthy. The trouble is, you usually get a low
correlation. Why? Because the real changes between measurements are
usually of the same order of magnitude as the noise (the typical
errors) in each measurement. The change scores for each measure
therefore have a big contribution from the typical errors, which are
random and uncorrelated, so the correlated true changes get lost in
the noise in your plot of the change scores. You can estimate what
the true correlation would be with the typical errors out of the
picture, but if the observed correlation is poor, you will need
hundreds of subjects to get enough precision for the estimate of the
true correlation to decide whether the practical measure is any
good.
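To get a feel for how badly the noise degrades the correlation, here is a minimal Python sketch. It rests on assumptions that are not spelled out in the text: the errors in the two measures are independent of each other and of the true changes, and each change score is the difference of two measurements, so its error variance is twice the square of the typical error. The function name and the numbers in the example are invented for illustration.

```python
import math


def observed_change_correlation(true_r, sd_change_practical, sd_change_criterion,
                                te_practical, te_criterion):
    """Expected correlation between observed change scores.

    true_r: correlation between the true changes in the two measures.
    sd_change_*: standard deviation of the true changes in each measure.
    te_*: typical error of measurement of each measure.
    """
    # A change score is the difference of two measurements, so its noise
    # variance is 2 * (typical error)^2.
    var_p = sd_change_practical ** 2 + 2 * te_practical ** 2
    var_c = sd_change_criterion ** 2 + 2 * te_criterion ** 2
    true_cov = true_r * sd_change_practical * sd_change_criterion
    return true_cov / math.sqrt(var_p * var_c)


# Hypothetical case: the true changes track perfectly (true_r = 1.0), but the
# true changes are the same size as the typical errors in both measures.
r_seen = observed_change_correlation(1.0, 1.0, 1.0, 1.0, 1.0)
print(f"observed correlation of change scores: {r_seen:.2f}")
```

Under these assumptions, a practical measure that tracks the criterion perfectly would still show a correlation of only about a third between the observed change scores.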
Correlation of Induced Changes
Another approach is to make large
changes happen by giving some kind of treatment to half your
subjects. You then see how well the practical measure tracks the
criterion measure in that half relative to the other half by
correlating the change scores of all the subjects together. Even if
you are successful in finding an effective treatment and subjects
willing to undergo the treatment, you will have validated the
practical measure only for changes induced by that particular
treatment. In other words, you still won't know whether the practical
measure is good for tracking spontaneous changes or changes brought
about by other treatments.
Correlation of Original Variables
The third approach is to analyze
data from a standard validity or calibration study. If the
correlation between the practical measure and the criterion measure
is near enough to perfect (>0.95), the two measures are
effectively identical, so changes in the practical measure must track
changes in the criterion. All the previous remarks about the
correlation between change scores apply to the correlation between
raw scores: the observed correlation will usually be a lot less than
0.95, because the correlation between the true values of the
practical and criterion measures is degraded by the typical errors;
you can estimate the true validity correlation by taking the
concurrent retest reliability correlations into account; the true
correlation needs to be greater than 0.95; and if the typical errors
have a large degrading effect on the correlation, you will need
hundreds of subjects in the validity and reliability studies to make
a firm conclusion. You also need a reasonably good validity
correlation to start with, which you won't get if your subjects are a
homogeneous subgroup. Another problem is that even the true
correlation between the measures may turn out to be less than 0.95,
yet the practical measure will still track changes well. For example,
the amount of non-fat tissue in skinfolds might vary between
individuals with the same body fat (resulting in a relatively poor
correlation between skinfolds and body fat), but the amount of
non-fat tissue might not change with hydration status (so changes in
skinfolds will still mirror changes in fat). This problem does not
arise with the first two approaches, because the constant amount of
non-fat tissue in each subject's skinfolds disappears from the change
in skinfolds.
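One common way to estimate the true validity correlation from the observed one is the classical correction for attenuation, in which the observed correlation is divided by the square root of the product of the two retest reliability correlations. The text does not say which formula the author has in mind, so treat the following Python sketch as an assumption. The function name and the input values are hypothetical.

```python
import math


def disattenuated_correlation(observed_r, reliability_practical, reliability_criterion):
    """Classical correction for attenuation of a correlation."""
    return observed_r / math.sqrt(reliability_practical * reliability_criterion)


# Hypothetical inputs: observed validity correlation 0.80, concurrent retest
# reliability correlations 0.85 (practical) and 0.90 (criterion).
true_r_estimate = disattenuated_correlation(0.80, 0.85, 0.90)
print(f"estimated true validity correlation: {true_r_estimate:.2f}")
print("meets 0.95 threshold:", true_r_estimate > 0.95)
```

The estimate inherits the sampling error of all three correlations, which is why large validity and reliability studies are needed before the comparison with 0.95 means much.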
Each of these three approaches has its strengths and weaknesses. The third approach is best for a heterogeneous group of subjects, but only if it produces a very high and precise estimate of the true correlation. If the group is homogeneous, or if the true correlation is poor, you will have to use one of the two change-score approaches. Inducing changes with an appropriate treatment may give you a good estimate of the correlation between the change scores, but you end up validating the practical measure only for the treatment you used. The greatest strength of the first approach is that it validates the practical measure for tracking the changes that occur in the normal course of events, but the validation won't be clear cut if the changes are too small.
Sample Size for Validity Studies
As with reliability, sample size
for estimation of validity is dictated by the need for precision. In
this case precision of the typical error of the estimate or the
new-prediction error is the main consideration. You don't have the option
of performing more than two tests; instead, you have to get adequate
precision by increasing the number of subjects. For a reliability
study involving a noisy measure, I
recommended a minimum of 50 subjects tested three times. In terms
of degrees of freedom (which dictate the precision of estimates of
typical error), that is equivalent to about 100 subjects tested
twice, so that is the preferred minimum sample size for a validity
study of a noisy practical measure.
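The equivalence in degrees of freedom is easy to check. The sketch below uses the usual formulas: (n - 1)(k - 1) for a typical error of measurement from n subjects tested k times, and n minus the number of regressors minus 1 for a typical error of the estimate from a regression. The function names are invented for this illustration.

```python
def reliability_df(n_subjects, n_trials):
    """Degrees of freedom for the typical error of measurement."""
    return (n_subjects - 1) * (n_trials - 1)


def validity_df(n_subjects, n_regressors=1):
    """Degrees of freedom for the typical error of the estimate."""
    return n_subjects - n_regressors - 1


df_50x3 = reliability_df(50, 3)    # 50 subjects tested three times
df_100x2 = reliability_df(100, 2)  # 100 subjects tested twice
df_validity = validity_df(100)     # 100 subjects, one regressor
print(df_50x3, df_100x2, df_validity)
```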
When there are several independent variables (regressors) in the
prediction equation, an important consideration is ensuring that the
typical error is uniform across the range of each regressor (or
between subgroups represented by the regressor). Extrapolating from
what I said about sample size for
comparison of typical errors of measurement, I suggest adding 100
subjects for each extra regressor. (After all, if there are
substantial differences in the typical error of the estimate between
subgroups, and if the differences are resistant to transformation,
you will have to perform separate analyses for each subgroup, each of
which will require 100 subjects.) Many published validity studies
with multiple regressors have involved several hundred subjects, but
I don't think the choice of sample size in those studies was driven
by consideration of uniformity of error. Another important
consideration is keeping the new-prediction error from increasing
substantially. It's easy to show (using Item 3 of the spreadsheet
for a subject's true value) that increasing the number of
subjects by 50 for each regressor after the first will ensure the
new-prediction error is no more than 1% larger than the typical
error. No worry there, if you use 100 subjects per
regressor.
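The 1% figure can be checked with a rough Python sketch. It assumes the new-prediction error, averaged over subjects like those in the validity study, is the typical error multiplied by the square root of 1 + (p + 1)/n, where p is the number of regressors and n the number of subjects. That is a standard regression result, but the spreadsheet may compute the error differently, so take the formula here as an assumption. The function name is hypothetical.

```python
import math


def new_prediction_inflation(n_subjects, n_regressors):
    """Average ratio of new-prediction error to typical error of the estimate."""
    return math.sqrt(1 + (n_regressors + 1) / n_subjects)


# 100 subjects for the first regressor, plus 50 or 100 for each extra one.
for p in (1, 2, 3):
    n_plus_50 = 100 + 50 * (p - 1)
    n_plus_100 = 100 * p
    pct_50 = 100 * (new_prediction_inflation(n_plus_50, p) - 1)
    pct_100 = 100 * (new_prediction_inflation(n_plus_100, p) - 1)
    print(f"{p} regressor(s): +{pct_50:.2f}% with n={n_plus_50}, "
          f"+{pct_100:.2f}% with n={n_plus_100}")
```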
Last updated 20 Aug 01