Bias

From Wikipedia, the free encyclopedia

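In statistics, the term bias is used for two different concepts: biased sampling and biased estimation. A biased sample is one in which the members of the population are not all equally likely to be selected, while a biased estimator is one that systematically overestimates or underestimates the quantity being estimated.

Bias is not necessarily harmful. Although a biased sample can be difficult to analyze and can lead to inaccurate or even wrong inferences, a biased estimator may in some cases have desirable properties, such as a smaller variance.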

Self-selected (voluntary-response) opinion polls, sometimes abbreviated SLOP, create biased samples.

Biased sample

A sample is biased if some members of the population are more likely to be chosen in the sample than others. A biased sample will generally misestimate the quantity of interest. For example, if the sample over-represents members with higher (or lower) values of that quantity, the estimate will tend to be higher (or lower) than the true value.

A famous case of what can go wrong when using a biased sample is found in the 1936 US presidential election polls. The Literary Digest held a poll that forecast that Alfred M. Landon would defeat Franklin Delano Roosevelt by 57% to 43%. George Gallup, using a much smaller sample (300,000 rather than 2,000,000), predicted Roosevelt would win, and he was right. What went wrong with the Literary Digest poll? The magazine had used lists of telephone and automobile owners to select its sample. In those days these were luxuries, so the sample consisted mainly of middle- and upper-class citizens. These groups mostly voted for Landon, whereas lower-income voters mostly voted for Roosevelt. Because the sample was biased towards wealthier citizens, the result was incorrect.

This kind of bias is usually regarded as a worse problem than statistical noise: problems with statistical noise can be lessened by enlarging the sample, but the bias of a sample does not shrink as the sample grows. In particular, a meta-analysis can distill good data from studies that individually suffer from statistical noise, but a meta-analysis of biased studies will itself be biased.
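The point that a larger sample does not cure sampling bias can be illustrated with a small simulation. The following is only a sketch under invented assumptions: the function name `poll_share_for_a`, the 60% true support for a hypothetical candidate A, and the reach probabilities (standing in for something like telephone ownership) are all made up for illustration and are not taken from the 1936 polls.

```python
import random

def poll_share_for_a(sample_size, rng, share_a=0.6, reach_a=0.2, reach_b=0.8):
    """Estimate candidate A's support from a possibly biased sampling frame.

    share_a: true fraction of voters who support A (hypothetical).
    reach_a, reach_b: probability that a supporter of A or of B appears
    in the sampling frame (hypothetical values).
    """
    votes_for_a = 0
    polled = 0
    while polled < sample_size:
        supports_a = rng.random() < share_a
        reachable = rng.random() < (reach_a if supports_a else reach_b)
        if reachable:  # only voters inside the frame can be polled
            polled += 1
            votes_for_a += supports_a
    return votes_for_a / sample_size

rng = random.Random(0)
for size in (100, 1_000, 10_000):
    biased = poll_share_for_a(size, rng)
    fair = poll_share_for_a(size, rng, reach_a=0.5, reach_b=0.5)
    print(f"n={size:>6}: biased frame {biased:.3f}, equal-reach frame {fair:.3f}")
```

With these made-up numbers the biased frame settles near 0.12 / 0.44 ≈ 0.273 rather than the true 0.6: enlarging the sample reduces the noise around the wrong value, not the error itself.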

Biased estimator

Another kind of bias in statistics does not involve biased samples, but rather the use of a statistic whose expectation differs from the value of the quantity being estimated. Suppose we are trying to estimate the parameter $\theta$ using an estimator $\hat{\theta}$ (that is, some function of the observed data). Then the bias of $\hat{\theta}$ is defined to be

$\operatorname{E}(\hat{\theta})-\theta.$

In words, this is "the expected value of the estimator $\hat{\theta}$ minus the true value $\theta$". This may be rewritten as

$\operatorname{E}(\hat{\theta}-\theta),$

which reads "the expected value of the difference between the estimator and the true value" (the two forms agree because $\theta$ is a constant, so its expected value is $\theta$).

For example, suppose $X_1,\ldots,X_n$ are independent and identically distributed random variables with expectation $\mu$ and variance $\sigma^2$. Let

$\overline{X}=(X_1+\cdots+X_n)/n$

be the "sample average", and let

$S^2=\frac{1}{n}\sum_{i=1}^n(X_i-\overline{X}\,)^2$

be a "sample variance". Then $S^2$ is a "biased estimator" of $\sigma^2$ because

$\operatorname{E}(S^2)=\frac{n-1}{n}\sigma^2\neq\sigma^2.$
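The article states this expectation without proof. One standard way to obtain it, sketched here as an added derivation, starts from the identity

$\sum_{i=1}^n(X_i-\overline{X}\,)^2=\sum_{i=1}^n(X_i-\mu)^2-n(\overline{X}-\mu)^2.$

Taking expectations, with $\operatorname{E}(X_i-\mu)^2=\sigma^2$ and $\operatorname{E}(\overline{X}-\mu)^2=\sigma^2/n$ (the variance of the average of $n$ independent terms), gives

$\operatorname{E}\left(\sum_{i=1}^n(X_i-\overline{X}\,)^2\right)=n\sigma^2-\sigma^2=(n-1)\sigma^2.$

Dividing by $n$ yields $\operatorname{E}(S^2)=\frac{n-1}{n}\sigma^2$, so the bias of $S^2$ is $-\sigma^2/n$, which vanishes as $n$ grows.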

Note that when a transformation is applied to an unbiased estimator, the result is not necessarily an unbiased estimator of the correspondingly transformed parameter. That is, for a non-linear function f and an unbiased estimator U of a parameter p, f(U) is usually not an unbiased estimator of f(p). For example, the square root of the unbiased estimator of the population variance is not an unbiased estimator of the population standard deviation: because the square root is concave, Jensen's inequality implies that it underestimates the standard deviation on average.

Bias is not the only consideration when choosing a statistic, however. Bias refers to the central tendency of the sampling distribution of a statistic, but the variance of the sampling distribution can also be an important consideration. Specifically, statistics with smaller sampling variances will yield greater statistical power. For example, while $S^2$ above is biased and the traditional sample variance

$S_{\text{sample}}^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\overline{X}\,)^2$

is unbiased, $S^2$ has the lower sampling variance, since $S^2=\frac{n-1}{n}S_{\text{sample}}^2$. Therefore, for some applications (where the amount of bias can be equated between groups or conditions) the biased estimator will prove to be a more powerful, and therefore more useful, statistic.
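The trade-off between bias and variability can be checked numerically. The sketch below is an illustration only: it assumes normally distributed data, and the values n = 5, σ² = 4, 20,000 trials, and the fixed seed, as well as the function names, are arbitrary choices made here. It computes both variance estimators on the same simulated samples and reports their average value and mean squared error.

```python
import random

def variance_estimates(xs):
    """Return (S^2 with divisor n, sample variance with divisor n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    sum_sq = sum((x - mean) ** 2 for x in xs)
    return sum_sq / n, sum_sq / (n - 1)

def compare_estimators(n=5, sigma2=4.0, trials=20000, seed=0):
    """Simulate normal samples and summarize both estimators as (mean, MSE)."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    biased, unbiased = [], []
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        b, u = variance_estimates(xs)
        biased.append(b)
        unbiased.append(u)

    def summarize(values):
        mean = sum(values) / trials
        mse = sum((v - sigma2) ** 2 for v in values) / trials
        return mean, mse

    return summarize(biased), summarize(unbiased)

(mean_b, mse_b), (mean_u, mse_u) = compare_estimators()
print(f"divisor n:     mean {mean_b:.3f}, MSE {mse_b:.3f}")
print(f"divisor n - 1: mean {mean_u:.3f}, MSE {mse_u:.3f}")
```

Under the normality assumption, the divisor-n estimator should average near (n − 1)σ²/n = 3.2 instead of 4, yet show the smaller mean squared error. For other distributions the comparison of mean squared errors may come out differently.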

A far more extreme case of a biased estimator being better than any unbiased estimator is well-known: Suppose X has a Poisson distribution with expectation λ. It is desired to estimate

$\operatorname{P}(X=0)^2=e^{-2\lambda}.$

The only function of the data constituting an unbiased estimator is

$\delta(X)=(-1)^X.$

If the observed value of X is 100, then the estimate is 1, although the true value of the quantity being estimated is obviously very likely to be near 0, which is the opposite extreme. And if X is observed to be 101, then the estimate is even more absurd: it is −1, although the quantity being estimated obviously must be positive. The (biased) maximum-likelihood estimator

$e^{-2X}$

is better than this unbiased estimator in the sense that the mean squared error

$e^{-4\lambda}-2e^{\lambda(1/e^2-3)}+e^{\lambda(1/e^4-1)}$

is smaller. Compare the unbiased estimator's MSE of

$1-e^{-4\lambda}.$

The MSE is a function of the true value $\lambda$. The bias of the maximum-likelihood estimator, that is, its expectation minus the true value, is

$e^{\lambda(1/e^2-1)}-e^{-2\lambda},$

which is positive for every $\lambda>0$.
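The closed-form expressions in this example can be checked by summing directly over the Poisson probability mass function. The sketch below does so for a few arbitrary values of λ; the helper names and the truncation point `k_max=200` are choices made here for illustration, and the bias is taken as the expectation of the estimator minus the true value, following the definition given earlier.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed via logs for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def expectation(g, lam, k_max=200):
    """E[g(X)], truncating the sum at k_max (negligible tail for small lam)."""
    return sum(g(k) * poisson_pmf(k, lam) for k in range(k_max + 1))

def mse(estimator, lam):
    """Mean squared error of an estimator of exp(-2 * lam)."""
    target = math.exp(-2 * lam)
    return expectation(lambda k: (estimator(k) - target) ** 2, lam)

def unbiased(k):
    return (-1) ** k          # the unbiased estimator delta(X)

def mle(k):
    return math.exp(-2 * k)   # the maximum-likelihood estimator

for lam in (0.5, 1.0, 3.0):
    mse_mle_closed = (math.exp(-4 * lam)
                      - 2 * math.exp(lam * (math.exp(-2) - 3))
                      + math.exp(lam * (math.exp(-4) - 1)))
    mse_unbiased_closed = 1 - math.exp(-4 * lam)
    bias_mle = expectation(mle, lam) - math.exp(-2 * lam)
    print(f"lambda={lam}: MSE(MLE)={mse(mle, lam):.4f} "
          f"(closed form {mse_mle_closed:.4f}), "
          f"MSE(unbiased)={mse(unbiased, lam):.4f} "
          f"(closed form {mse_unbiased_closed:.4f}), "
          f"bias(MLE)={bias_mle:.4f}")
```

For each of these values of λ the numerical sums should agree with the closed forms, with the maximum-likelihood estimator showing the smaller mean squared error despite its bias.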

The bias of maximum-likelihood estimators can be substantial. Consider a case where n tickets numbered from 1 through n are placed in a box and one is selected at random, giving a value X. If n is unknown, then the maximum-likelihood estimator of n is X, even though the expectation of X is only (n + 1)/2; we can only be certain that n is at least X, and it is probably larger. In this case, the natural unbiased estimator is 2X − 1.
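Because X is uniform on 1, ..., n, the expectations in the ticket example can be computed exactly. The short sketch below does this with rational arithmetic; the function names and the choice n = 10 are illustrative only.

```python
from fractions import Fraction

def expected_value(estimator, n):
    """Exact E[estimator(X)] when X is uniform on {1, ..., n}."""
    return sum(Fraction(estimator(x), n) for x in range(1, n + 1))

def mle(x):
    return x            # maximum-likelihood estimate of n

def unbiased(x):
    return 2 * x - 1    # unbiased estimate of n

n = 10
print("E[X]      =", expected_value(mle, n))       # (n + 1) / 2
print("E[2X - 1] =", expected_value(unbiased, n))  # n
```

For n = 10 the maximum-likelihood estimator averages 11/2, roughly half the true value, while 2X − 1 averages exactly 10.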