Regression notes
- Rmd
回帰手習い: Introduction to Regression
- Author: ryamada
- Release date: 2016/07/21
- Media: Kindle edition
---
title: "Regression"
author: "ryamada"
date: "Thursday, July 21, 2016"
output: html_document
---

# 1. Linear regression

## Log-normal distribution

```{r}
# sample size
n <- 10000
# y = a x + b + error
# error ~ normal distribution
a <- 2
b <- 0.5
# x: log-normal distribution
x <- exp(rnorm(n, 1, 0.5))
hist(x)
```

## Linear model with Gaussian error

$y = ax + b + \epsilon$, $\epsilon \sim N(0, \sigma)$

```{r}
y.theory <- a * x + b
err <- rnorm(n, 0, 3)
hist(err)
y.obs <- y.theory + err
plot(x, y.obs, pch = 20)
abline(b, a, col = 2)
```

The lm() function estimates the coefficients a and b.

```{r}
lm.out <- lm(y.obs ~ x)
lm.out
```

## Linear algebra gives the answer

$\beta = (X^T X)^{-1} X^T y$

Our x object holds values that differ from sample to sample. To estimate the intercept coefficient, we need a dummy variable that takes the same value for every sample.

```{r}
x.mat <- cbind(rep(1, n), x) # dummy column is added
head(x.mat)
# linear algebraic solution
beta <- solve(t(x.mat) %*% x.mat) %*% t(x.mat) %*% y.obs
beta
lm.out
```

# 2. Likelihood when the error is normally distributed

## Linear model without intercept

$y = ax + \epsilon$, $\epsilon \sim N(0, \sigma)$

```{r}
# no-intercept assumption
# y = a x + error
# no dummy variable
x.mat <- matrix(x, ncol = 1)
beta <- solve(t(x.mat) %*% x.mat) %*% t(x.mat) %*% y.obs
beta
plot(x, y.obs)
abline(0, beta, col = "red")
abline(v = 0)
abline(h = 0)
```

```{r}
# "no intercept" has to be specified with "-1"
lm.out <- lm(y.obs ~ x - 1)
lm.out
beta
```

## Probability density function (pdf) of the normal distribution

$$\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-m)^2}{2\sigma^2}}$$

For each sample there are two y values: the observed value and the value estimated by the linear model $\hat{y} = a x$. In the formula above, m and x represent the estimated (believed-to-be-true) value and the observed value, respectively.

The deviation between the two values is assumed to follow a normal distribution.

The $\sigma$ of the normal distribution is not known.

The slope a and $\sigma$ are both unknown variables. Treating them as the estimation targets, we can calculate the likelihood of the observed x and y for pairs of a and $\sigma$.

Grid search is simple and easy, and the log-likelihood is much easier to compute than the likelihood itself.

```{r}
# want to know "a" and the error's distribution (mean = 0, sd ??)
# grid of slope 'a' and grid of sigma
as <- seq(from = -3, to = 3, length = 50)
sds <- seq(from = 0.1, to = 1.5, length = 50)
# their combinations
as.sds <- expand.grid(as, sds)
log.like <- rep(0, length(as.sds[, 1]))
log.like.2 <- log.like
for(i in 1:length(log.like)){
	a <- as.sds[i, 1]
	s <- as.sds[i, 2]
	y.pred <- x * a
	diff.y <- y.obs - y.pred # deviation from expected
	log.like[i] <- sum(log(1/sqrt(2*pi*s^2)) + (-diff.y^2/(2*s^2)))
	# we can do it this way as well; dnorm() is the pdf of the normal distribution
	log.like.2[i] <- sum(dnorm(diff.y, 0, s, log = TRUE))
}
plot(log.like, log.like.2)
abline(0, 1)
```

## Visualize the log-likelihood

```{r setup,rgl=TRUE}
library(rgl)
library(knitr)
knit_hooks$set(rgl = hook_rgl)
```

```{r,rgl=TRUE}
plot3d(cbind(as.sds, log.like))
```

```{r,rgl=TRUE}
# one more logarithm may make the shape easier to grasp
plot3d(cbind(as.sds, -log(-log.like)))
```

Another way to visualize: add a vertical line at the estimated slope coefficient.

```{r}
image(as, sds, matrix(log(-log.like), ncol = length(sds)))
abline(v = beta)
```

Regardless of the $\sigma$ value, the maximum likelihood estimate of the slope under a fixed $\sigma$ is the same as the linear regression estimate. Linear regression minimizes the sum of squared distances between the estimates and the observations; maximum likelihood estimation under normal error maximizes a sum of normal log-density terms, each of which is, up to a constant, proportional to the negative squared distance between the estimate and the observation.

# 3. Logistic regression

## Risk is a function of the explanatory variable

The outcome is binary. Assume a single explanatory variable x determines the probability of causing '1', in a way that deviates somewhat from a sigmoid curve, as below.

```{r}
# sample size
n <- 100
# x: log-normal distribution
x <- exp(rnorm(n, 1, 0.5))
# the probability of having phenotype 1 depends on x;
# it depends on x monotonically but is not S-shaped
prob.x <- function(x){
	p <- 0.5*x - 1
	p[which(p < 0.2)] <- 0.2
	p[which(p > 0.8)] <- 0.8
	return(p)
}
pr <- prob.x(x)
plot(x, pr, ylim = c(0, 1))
```

## Stochastic phenomenon simulation

Simply assign the phenotype based on the probability.

```{r}
r <- runif(length(x))
y <- as.numeric(r < pr) # stochastic assignment of y = 1
plot(x, y, pch = 20, cex = 0.1) # a bit difficult to see...
```

```{r}
plot(x, pr, ylim = c(0, 1))
# jitter() adds small deviations for easier display
points(x, jitter(y, 0.1), pch = 20, cex = 0.1)
```

## glm() function: logistic regression in R

```{r}
# glm() is designed for various regressions,
# so "logistic" should be specified with the family option
glm.out <- glm(y ~ x, family = binomial)
glm.out
glm.out$coefficients
```

The strength of risk is linear on the log-odds scale.

```{r}
tmp <- glm.out$coefficients[1] + x * glm.out$coefficients[2]
```

The odds are $\frac{p}{1-p} = e^{tmp}$, so the probability is $p = \frac{e^{tmp}}{1+e^{tmp}}$.

```{r}
pr.est <- exp(tmp)/(1 + exp(tmp))
plot(x, pr.est, ylim = c(0, 1))
points(x, jitter(y, 0.1), pch = 20, cex = 0.1)
points(x, pr, col = "red")
```
- 1. Linear regression
```r
# sample size
n <- 10000
# y = a x + b + error
# error ~ normal distribution
a <- 2
b <- 0.5
# x: log-normal distribution
x <- exp(rnorm(n, 1, 0.5))
hist(x)

y.theory <- a * x + b
err <- rnorm(n, 0, 3)
hist(err)
y.obs <- y.theory + err
plot(x, y.obs, pch = 20)
abline(b, a, col = 2)

lm.out <- lm(y.obs ~ x)
lm.out
```
- The intercept coefficient is the coefficient of a dummy explanatory variable whose value is the same for every sample.
```r
x.mat <- cbind(rep(1, n), x) # dummy column for the intercept
head(x.mat)
beta <- solve(t(x.mat) %*% x.mat) %*% t(x.mat) %*% y.obs
beta
```
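As a sanity check (not part of the original listing), the algebraic solution should reproduce the coefficients of the lm() fit above:

```r
# compare the linear-algebra solution with lm()'s estimates
coef(lm.out) # (Intercept), x
beta         # same values, up to numerical precision
```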
- 2. Likelihood when the error is normally distributed
- Model: y = ax + error (no intercept)
- When the true value is p and the observed value is q, and the error is assumed to be normally distributed with mean 0 and variance $\sigma^2$, the likelihood of the deviation from p to q is $$\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(q-p)^2}{2\sigma^2}}$$
```r
# no-intercept assumption
# y = a x + error
x.mat <- matrix(x, ncol = 1)
beta <- solve(t(x.mat) %*% x.mat) %*% t(x.mat) %*% y.obs
beta
plot(x, y.obs)
abline(0, beta, col = "red")
abline(v = 0)
abline(h = 0)

# "no intercept" is specified with "-1"
lm.out <- lm(y.obs ~ x - 1)
lm.out
```
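With a single explanatory column and no intercept, $(X^T X)^{-1} X^T y$ collapses to a scalar; a one-line check of the same estimate (not in the original):

```r
# scalar form of the no-intercept least-squares slope
sum(x * y.obs) / sum(x^2) # equals beta
```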
- Grid search over the slope a and the sd of the error's normal distribution
- Regardless of the sd of the error, the maximum likelihood estimate of the slope a is the same.
- The normal probability density function has $-\frac{(q-p)^2}{2\sigma^2}$ in its exponent, so minimizing the sum of squared distances and maximizing the likelihood under normally distributed error correspond to each other; see the derivation after this list.
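A minimal derivation of this correspondence (not in the original): writing out the log-likelihood of the no-intercept model,

$$\log L(a, \sigma) = \sum_{i=1}^{n} \left( -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - a x_i)^2}{2\sigma^2} \right) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - a x_i)^2$$

For any fixed $\sigma$, the first term does not depend on a, so maximizing $\log L$ over a is exactly minimizing $\sum_{i}(y_i - a x_i)^2$, the least-squares criterion.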
```r
# want to know "a" and the error's distribution (mean = 0, sd ??)
as <- seq(from = -3, to = 3, length = 50)
sds <- seq(from = 0.1, to = 1.5, length = 50)
as.sds <- expand.grid(as, sds)
log.like <- rep(0, length(as.sds[, 1]))
for(i in 1:length(log.like)){
	a <- as.sds[i, 1]
	s <- as.sds[i, 2]
	y.pred <- x * a
	diff.y <- y.obs - y.pred
	log.like[i] <- sum(log(1/sqrt(2*pi*s^2)) + (-diff.y^2/(2*s^2)))
}
library(rgl)
plot3d(cbind(as.sds, log.like))
plot3d(cbind(as.sds, -log(-log.like)))
image(as, sds, matrix(log(-log.like), ncol = length(sds)))
abline(v = beta)
```
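As a quick numerical check (not in the original), the grid point with the highest log-likelihood can be pulled out and compared with the algebraic estimate beta:

```r
# grid point maximizing the log-likelihood
mle.grid <- as.sds[which.max(log.like), ]
mle.grid # first column: slope a, second column: sigma
beta     # the grid slope should be close to this
```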
- 3. Logistic regression
```r
# sample size
n <- 100
# x: log-normal distribution
x <- exp(rnorm(n, 1, 0.5))
# the probability of having phenotype 1 depends on x;
# it depends on x monotonically but is not S-shaped
prob.x <- function(x){
	p <- 0.5*x - 1
	p[which(p < 0.2)] <- 0.2
	p[which(p > 0.8)] <- 0.8
	return(p)
}
pr <- prob.x(x)
plot(x, pr, ylim = c(0, 1))
```
- Simply assign the phenotype based on the probability.
```r
r <- runif(length(x))
y <- as.numeric(r < pr) # stochastic assignment of y = 1
plot(x, y, pch = 20, cex = 0.1) # a bit difficult to see...

plot(x, pr, ylim = c(0, 1))
points(x, jitter(y, 0.1), pch = 20, cex = 0.1) # jitter y for easier display
```
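Equivalently (an alternative not used in the original), the same kind of Bernoulli draws can be made with rbinom():

```r
# each element is a Bernoulli(pr[i]) draw
y2 <- rbinom(length(x), size = 1, prob = pr)
mean(y2) # should be close to mean(pr)
```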
- Logistic regression using the glm() function
```r
glm.out <- glm(y ~ x, family = binomial)
glm.out
glm.out$coefficients
```
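For standard errors and p-values in addition to the point estimates (not shown in the original), summary() can be applied to the fitted object:

```r
# coefficient table with standard errors, z values and p-values
summary(glm.out)
```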
- The strength of risk is linear on the log-odds scale; tmp below is the linear predictor (the log odds).
```r
tmp <- glm.out$coefficients[1] + x * glm.out$coefficients[2] # linear predictor (log odds)
```
- The odds are $\frac{p}{1-p} = e^{tmp}$, so the probability is $p = \frac{e^{tmp}}{1+e^{tmp}}$.
```r
pr.est <- exp(tmp)/(1 + exp(tmp))
plot(x, pr.est, ylim = c(0, 1))
points(x, jitter(y, 0.1), pch = 20, cex = 0.1)
points(x, pr, col = "red") # true probabilities in red
```
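The manual back-transformation above should agree with R's own fitted probabilities; a quick check (not in the original) uses predict() with type = "response":

```r
# fitted probabilities straight from the glm object
pr.glm <- predict(glm.out, type = "response")
max(abs(pr.glm - pr.est)) # should be essentially zero
```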