A nonparametric two-sample comparison for skewed data with unequal
variances
Eva Skovlund*
University of Oslo, School of Pharmacy, Department of Pharmaceutical Biosciences, PO Box 1068, Blindern, 0316 Oslo, Norway
Accepted 28 September 2009
For practical situations, the pure shift model is probably
not very realistic, and it is thus important to assess the
robustness of different tests against deviations from this model.
It is well established that heteroscedasticity (unequal variances)
is at least as deleterious for the properties of the
Wilcoxon-Mann-Whitney (WMW) test as for the t-test [1] and that
the Welch test should replace the t-test when distributions
are approximately normal and variances unequal. It is also
worth remembering that the WMW test does not share the
asymptotic robustness properties of the t-test. In addition,
unequal variance is not the only problem frequently encountered.
Distributions can also be skewed, and the skewness
of the two distributions may differ. Unfortunately, but not
unexpectedly, even the Welch test is unable to maintain the
nominal significance level when distributions are skewed [1].
Stochastic simulation is a valuable tool to compare test
properties under different conditions. Based on simulation
studies, Neuhäuser [2] proposes that the modified or
generalized Wilcoxon test by Brunner and Munzel (BM) [3]
should be applied when it cannot be assumed that variances
are equal and distributions symmetric. In such situations,
Skovlund and Fenstad [1] have shown that the properties
of the most commonly used classical two-sample tests
(t-test, WMW, or Welch test) are not acceptable and thus
suggested that transformations are necessary. It will in
practice of course often be difficult to come up with satisfactory
transformations that will render an approximately
symmetric distribution. It is thus of great value to identify relevant
alternative tests and assess their properties. Based on
simulation models with data from normal and gamma
distributions, Neuhäuser concludes that the properties of the BM
test are acceptable in situations with skewed distributions
and heteroscedasticity. He also refers to similar findings
from other simulation studies. Unfortunately, the literature
is not unanimous on this issue. Fagerland and Sandvik [4]
have examined the significance level of six different tests,
including the BM test, and have shown that nonrobustness
is a serious problem with all tests if distributions with
different skewness are compared.
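The kind of simulation study referred to above can be sketched in a few lines. The following is a minimal illustration only, not the design used in [2] or [4]: the gamma parameters, sample sizes, and the function names `midranks`, `brunner_munzel_z`, and `rejection_rate` are assumptions chosen for this example, and the P-value uses the large-sample normal approximation rather than a small-sample t approximation. Both samples are drawn with mean 1 but with different variance and skewness, and the proportion of rejections at the 5% level is recorded.

```python
import math
import random
from statistics import NormalDist, mean


def midranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def brunner_munzel_z(x, y):
    """BM statistic with a two-sided P-value from the normal approximation.

    Assumption for this sketch: the standard normal is used as reference
    distribution, which is only reasonable for moderately large samples.
    """
    n1, n2 = len(x), len(y)
    pooled = midranks(list(x) + list(y))
    r1, r2 = pooled[:n1], pooled[n1:]          # ranks in the pooled sample
    w1, w2 = midranks(list(x)), midranks(list(y))  # ranks within each sample
    m1, m2 = mean(r1), mean(r2)
    s1 = sum((r - w - m1 + (n1 + 1) / 2) ** 2 for r, w in zip(r1, w1)) / (n1 - 1)
    s2 = sum((r - w - m2 + (n2 + 1) / 2) ** 2 for r, w in zip(r2, w2)) / (n2 - 1)
    denom = (n1 + n2) * math.sqrt(n1 * s1 + n2 * s2)
    if denom == 0:  # zero variance estimate, e.g. completely separated samples
        if m1 == m2:
            return 0.0, 1.0
        return math.copysign(math.inf, m2 - m1), 0.0
    stat = n1 * n2 * (m2 - m1) / denom
    return stat, 2 * (1 - NormalDist().cdf(abs(stat)))


def rejection_rate(n1, n2, reps=2000, alpha=0.05, seed=1):
    """Proportion of simulated data sets in which the test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # Illustrative choice: both gamma distributions have mean 1.
        x = [rng.gammavariate(1.0, 1.0) for _ in range(n1)]   # variance 1, skewness 2
        y = [rng.gammavariate(4.0, 0.25) for _ in range(n2)]  # variance 0.25, skewness 1
        _, p = brunner_munzel_z(x, y)
        rejections += p < alpha
    return rejections / reps


if __name__ == "__main__":
    for n1, n2 in [(20, 20), (20, 60), (60, 20)]:
        print(n1, n2, rejection_rate(n1, n2))
```

Whether the printed proportions should be read as type I error rates depends on which null hypothesis is taken to hold: the two distributions here have equal means but not equal medians, and their relative effect need not equal 0.5.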
The main message given by Neuhäuser [2] is that the BM
test controls type I error and should be applied when
nonsymmetrical distributions with unequal variance are compared.
Preservation of the type I error is a crucial property of a
significance test, and based on his findings, this seems a
reasonable recommendation. However, according to Fagerland and Sandvik
[4], even the BM test fails to maintain the nominal
significance level in many situations. For most settings studied,
even small differences in variance are shown to lead to
nonrobust significance levels if the degree of skewness is large. In
addition to shape and scale, an important factor is whether the
two samples differ in size. When sample sizes are
equal, their simulation studies show that the parametric tests
are superior to the rank-based tests under the null hypothesis
of equal means but not under the hypothesis of equality of
medians. However, among the rank-based tests, the BM test
is certainly better than the WMW and seems to be slightly
better than other modifications.
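The distinction between these null hypotheses matters because rank-based tests such as the WMW and BM tests respond to a third quantity, the relative effect P(X < Y) + 0.5 P(X = Y), which equals 0.5 when neither sample tends to give larger values than the other. The sketch below estimates this quantity by counting over all pairs of observations. The function name `relative_effect` and the data are hypothetical, invented purely for illustration; they show that equal means or equal medians do not imply a relative effect of 0.5.

```python
from statistics import mean, median


def relative_effect(x, y):
    """Estimate P(X < Y) + 0.5 * P(X = Y) by counting over all (x, y) pairs."""
    less = sum(xi < yj for xi in x for yj in y)
    ties = sum(xi == yj for xi in x for yj in y)
    return (less + 0.5 * ties) / (len(x) * len(y))


# Hypothetical data: equal means, different shapes.
x = [1, 1, 1, 1, 11]  # right-skewed, mean 3
y = [2, 3, 3, 3, 4]   # roughly symmetric, mean 3
print(mean(x), mean(y), relative_effect(x, y))      # relative effect is 0.8

# Hypothetical data: equal medians, different spread and skewness.
a = [1, 2, 3, 10, 20]  # median 3
b = [1, 2, 3, 3.5, 4]  # median 3
print(median(a), median(b), relative_effect(a, b))  # relative effect is 0.42
```

In both cases a location parameter is identical in the two samples, yet the estimated relative effect is away from 0.5, which is the situation in which a rank-based test may reject.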
A common question posed in clinical research is
whether the effect of one treatment is different from that of
another. Although this may not be obvious to all clinicians, the
question is far from specific. Any statistician would of course
immediately ask by what kind of metric or location parameter the
treatments differ. The paper by Neuhäuser implicitly
touches on this question. According to the author, testing
for the relative effect may be more appropriate than
comparing location parameters. Although he may be right that
the hypothesis is appropriate and although the parameter
P(X < Y) can be regarded as well established in the
nonparametric setting, it might prove a challenge to clarify
the correct interpretation to nonstatisticians. There are
many situations when either means or medians are equal,
but a rank-based test will reject the null hypothesis. The
interpretation of a significant P-value is then not
straightforward. The only conclusion that can be drawn is that
the two treatments differ in some way, but it might be
difficult to come up with a reasonable efficacy estimate
* Corresponding author. Tel: +472-285-6172; fax +472-285-4402.
E-mail address: eva.skovlund@farmasi.uio.no
0895-4356/$ - see front matter © 2010 Elsevier Inc. All rights reserved.
doi: 10.1016/j.jclinepi.2009.09.011
Journal of Clinical Epidemiology 63 (2010) 594-595