2013-07-11

線形回帰の信頼区間

線形回帰信頼区間 ggplot2 R

適当に点をとって、信頼区間を計算する関数predict()の出力を確認してみる
線形回帰の結果の中身を覗く

lm.out <- lm(y~x)
summary(lm.out)

> summary(lm.out)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6354 -0.2825  0.1845  0.5036  1.6517 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.8206     0.2625  18.365 4.18e-13 ***
x             0.3212     0.3395   0.946    0.357    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.084 on 18 degrees of freedom
Multiple R-squared:  0.04737,   Adjusted R-squared:  -0.005552 
F-statistic: 0.8951 on 1 and 18 DF,  p-value: 0.3566

切片に着目する。切片とはx=0をよぎるところのこと。点推定値(Estimate)が4.8206で標準偏差が0.2625とある。検定はt検定量を使っている

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.8206     0.2625  18.365 4.18e-13 ***

95%信頼区間を考えるとき、下限・上限のクオンタイルは0.025と0.975
このt分布の自由度は18(とsummary(lm.out)にも書いてあるし、点の数が20なので、20-1-1=18)である
自由度18のt分布の0.025,0.975クオンタイル値は

> qt(c(0.025,0.975),18)
[1] -2.100922  2.100922

今、切片の標準偏差が0.2625となっているから

> summary(lm.out)$coefficients
             Estimate Std. Error    t value     Pr(>|t|)
(Intercept) 4.8206304  0.2624963 18.3645642 4.178112e-13
x           0.3212255  0.3395294  0.9460904 3.566309e-01
> summary(lm.out)$coefficients[1,1]
[1] 4.82063
> summary(lm.out)$coefficients[1,2]
[1] 0.2624963
> summary(lm.out)$coefficients[1,2]*qt(c(0.025,0.975),18)
[1] -0.5514843  0.5514843
> summary(lm.out)$coefficients[1,1] + summary(lm.out)$coefficients[1,2]*qt(c(0.025,0.975),18)
[1] 4.269146 5.372115

切片の信頼区間がどのような計算になっているかの様子が見えた
では、切片以外の場所のそれ(以降に示した図)でも同様になっているかを覗いてみよう
以下の信頼区間の描図ではpredict()を使っている。実際にはlm()の出力オブジェクト用のpredict()ということでpredict.lm()関数が使われている。関数の中を覗くと、t分布を用いて計算していることがわかるが、それは見ればわかるので置いておいて、実際、出力が上の計算と合っているかどうかを調べてみよう

# 回帰式に則った推定値を知りたいxの値を決める
new <- data.frame(x = seq(-2, 2, 0.1))
# 知りたいxの値について、"confidence"を指定することで信頼区間も出させる(信頼区間のレベルは95%がデフォルト値)
pred.w.clim <- predict(lm(y ~ x), new, interval = "confidence")

点推定値と上下限値との３列行列が返ってきている

> head(pred.w.clim)
       fit      lwr      upr
1 4.178179 2.860550 5.495809
2 4.210302 2.958156 5.462448
3 4.242424 3.055088 5.429761
4 4.274547 3.151230 5.397864
5 4.306670 3.246437 5.366902
6 4.338792 3.340533 5.337051

今、y切片はx=0だから、newとして指定したxの値が0になっているのはnewの何番目の要素かを調べて、それに対応するpred.w.climを取り出してみよう

> pred.w.clim[which(new==0),]
     fit      lwr      upr 
4.820630 4.269146 5.372115

確かに、summary(lm(y~x))で求めたy切片の信頼区間に一致している
２通りで線形回帰結果の図を描いてみる

# データ点作成
x <- rnorm(20)
y <- 0.2*x + rnorm(20) + 5

xy <- data.frame(x,y)

# ggplot2を使う
library(ggplot2)
c <- ggplot(xy, aes(x,y))
c + stat_smooth(method = "lm") + geom_point()

# ggplot2を使わない
dev.new()
# 信頼区間を計算するにはpredict() (実体はpredict.lm())を使う
predict(lm(y ~ x))
# 推定値・信頼区間値を計算するx軸の値を決める
new <- data.frame(x = seq(-2, 2, 0.1))
# ２つの方法で推定する
pred.w.plim <- predict(lm(y ~ x), new, interval = "prediction")
pred.w.clim <- predict(lm(y ~ x), new, interval = "confidence")
matplot(new$x, cbind(pred.w.clim, pred.w.plim[,-1]),
        lty = c(1,2,2,3,3), type = "l", ylab = "predicted y")predict(lm(y ~ x))
abline(v=seq(from=-2,to=2,by=0.5))
abline(h=seq(from=3,to=7,by=0.5))