# 回帰メモ

• Rmd

もしくはこちら

---
title: "Regression 回帰"
date: "Thursday, July 21, 2016"
output: html_document
---

# 1. Linear regression 線形回帰

## Log-normal distribution. 対数正規分布

{r}
# sample size
n <- 10000
# y = a x + b + error
# error ~ normal distribution
a <- 2
b <- 0.5
# x log normal distribution

x <- exp(rnorm(n,1,0.5))
hist(x)


## Linear model with Gaussian error 正規分布誤差を伴う線形モデル

$y = ax + b + \epsilon$, $\epsilon ~ N(0,sd)$

{r}
y.theory <- a * x + b
err <- rnorm(n,0,3)
hist(err)
y.obs <- y.theory + err

plot(x,y.obs,pch=20)
abline(b,a,col=2)


lm() function estimates coefficients a and b. lm()関数で係数a,bを推定する。

{r}
lm.out <- lm(y.obs~x)
lm.out


## Linear algebra gives the answer 線形代数で答えが出る
$\beta = (X^T X)^{-1} X^T y$

Our x object has values of a variable that are different among sample.
In order to estimate the coefficient of the intercept, we need a dummy variable for which everybody has the same value.

{r}
x.mat <- cbind(rep(1,n),x) # dummy column is added.

# Linear algebraic solution
beta <- solve(t(x.mat) %*% x.mat) %*% t(x.mat) %*% y.obs
beta
lm.out


#2. Likelihood when error in normal distribution 正規乱数仮定での尤度

## Linear model without intercept. 切片なしのモデル

$y = ax + \epsilon$, $\epsilon ~ N(0,sd)$

{r}
# No intercept assumption
# y = a x + error

# No dummy variable
x.mat <- matrix(x,ncol=1)

beta <- solve(t(x.mat) %*% x.mat) %*% t(x.mat) %*% y.obs
beta

plot(x,y.obs)
abline(0,beta,col="red")
abline(v=0)
abline(h=0)

{r}
# "No intercept" has to be specified by "-1"
lm.out <- lm(y.obs ~ x -1)
lm.out
beta


## Probability Density Function (pdf) of Normal Distribution

$$\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-m)^2}{2\sigma^2}}$$

For each sample, there are two y values; one is the observed and the other is estimaetd in the linear model $\hat{y} = a x$.
m and x in the formula above represent estimated(believed to be true) and observed, respectively.

サンプルごとに2つのyの値がある。1つは観測値、もう一つは線形回帰モデルで推定した値。

The deviation between the two values are assumed in normal distribution.

2つの値のずれは正規分布に従っていると仮定している。

The $/sigma$ of the normal distribution is not known.

The slope 'a' and $\sigma$ are both unknown variables.

We can calculate likelihood to observe the x and y for pairs of a and $\sigma$.

これらを推定対象として、観測データに照らして尤度がどうなるを調べてみる。

Grid search is simple and easy.
Log-likelihood is much easier to calculate.

グリッドサーチは簡単だし、対数尤度の計算はさらに簡単。

{r}
# Want to know "a" and error's distribution (mean = 0, sd ??)
# Grid of slope 'a' and grid of sigma
as <- seq(from = -3, to =3, length=50)
sds <- seq(from = 0.1, to = 1.5, length=50)
# Their combination
as.sds <- expand.grid(as,sds)

log.like <- rep(0,length(as.sds[,1]))
log.like.2 <- log.like
for(i in 1:length(log.like)){
a <- as.sds[i,1]
s <- as.sds[i,2]
y.pred <- x * a
diff.y <- y.obs - y.pred # deviation from expected
log.like[i] <- sum(log(1/sqrt(2*pi*s^2)) + (-diff.y^2/(2*s^2)))
# we can do this way as well; dnorm() function is pdf of normal distribution
log.like.2[i] <- sum(dnorm(diff.y,0,s,log=TRUE))
}
plot(log.like,log.like.2)
abline(0,1)


## Visualize the logilikelihood
{r setup,rgl=TRUE}
library(rgl)
library(knitr)
knit_hooks$set(rgl = hook_rgl)  {r,rgl=TRUE} plot3d(cbind(as.sds,log.like))  {r,rgl=TRUE} # One more logarithm may be better to grab the perspective plot3d(cbind(as.sds,-log(-log.like)))  Another way to visualize. 別の表示 Add a line of slope coefficient.傾きの推定値を表す直線を加える {r} image(as,sds,matrix(log(-log.like),ncol=length(sds))) abline(v=beta)  Regardless of$\sigma$value, the maximum likelihood estimate under a fixed$\sigma$value is the same as the linear regression estimate. Linear regression minimuzes sum of squared distance from the estimates. Maximum likelihood estimates based on the normal error sum up log-likelihood of normal distribution pdf that are proportinal to squared distance between the estimate and the observed.$\simga$の値によらず、$\sigma$の値を固定して考えれば、最尤値を与える傾き係数の値は同じ。 線形回帰では、推定値と観測値の違いの二乗和を最小にしている。 正規乱数を仮定した最尤推定法では、正規分布の確率密度の対数尤度を最大にしているが、それは、推定値と観測値との差の二乗和に比例している。 # 3. Logistic regression ## Risk is a funciton of explanatory variable. リスクは説明変数の関数 Outcome is binary. 従属変数が2値 Assume an explanatory variable x determines the probability to cause '1', which is somewhat away form sigmoidcurve as below. 説明変数１個が従属変数を1にする確率を決めるものとする。ただしその決め方は、シグモイドカーブから逸脱したものとして以下のように作成する。 {r} n <- 100 # x log normal distribution x <- exp(rnorm(n,1,0.5)) # probability to have a phenotype 1 depends on x. # prob depends on x monotonically but not in the shape of S. # sample size prob.x <- function(x){ p <- 0.5*x-1 p[which(p<0.2)] <- 0.2 p[which(p>0.8)] <- 0.8 return(p) } pr <- prob.x(x) plot(x,pr,ylim=c(0,1))  ## Stochastic phenomenon simulation 確率事象データのシミュレーション作成 Assign phenotype based on the probability simply. {r} r <- runif(length(x)) y <- as.numeric(r < pr) # Stochastic assignment of y=1 plot(x,y,pch=20,cex=0.1) # A bit difficult to see...  {r} plot(x,pr,ylim=c(0,1)) # jitter() function adds small deviations for easier diplay. points(x,jitter(y,0.1),pch=20,cex=0.1)  ## glm() function : Logistic regression of R {r} # glm() function is designed for various regressions # therefore "logistic" should be specified with family option glm.out <- glm(y~x,family=binomial) glm.out glm.out$coefficients


Strength of risk is linear.
{r}
tmp <- glm.out$coefficients[1] + x*glm.out$coefficients[2]


probability is $\frac{p}{1-p} = exp(tmp)$.

$p = \frac{exp(tmp)}{1+exp(tmp)}$.

{r}
pr.est <- exp(tmp)/(1+exp(tmp))

plot(x,pr.est,ylim=c(0,1))
points(x,jitter(y,0.1),pch=20,cex=0.1)
points(x,pr,col="red")


• 1. Linear regression
# sample size
n <- 10000
# y = a x + b + error
# error ~ normal distribution
a <- 2
b <- 0.5
# x log normal distribution

x <- exp(rnorm(n,1,0.5))
hist(x)

y.theory <- a * x + b
err <- rnorm(n,0,3)
hist(err)
y.obs <- y.theory + err

plot(x,y.obs,pch=20)
abline(b,a,col=2)

lm.out <- lm(y.obs~x)
lm.out

• Coefficient for the intercept is one for the explanatory variable the value of which are the same throughout the samples.
x.mat <- cbind(rep(1,n),x)

beta <- solve(t(x.mat) %*% x.mat) %*% t(x.mat) %*% y.obs
beta

• When the true value is p and the observed value is q, and when you assume error in normal distribution with mean 0 and vaiance , likelihood of the deviation from p to q is
• 2. Likelihood when error in normal distribution
• Model: y=ax + error
# No intercept assumption
# y = a x + error

x.mat <- matrix(x,ncol=1)

beta <- solve(t(x.mat) %*% x.mat) %*% t(x.mat) %*% y.obs
beta

plot(x,y.obs)
abline(0,beta,col="red")
abline(v=0)
abline(h=0)

lm.out <- lm(y.obs ~ x -1)
lm.out

• Grid search for the slope a and sd of normal distribution of error
• Regardless of the sd of error, maximum likelihood estimate of slope a is the same.
• Normal distribution's probability density function has , therefore the minimization of sum of squared distance and maximum likelihood estimate based on normal distribution error correspond each other.
# Want to know "a" and error's distribution (mean = 0, sd ??)

as <- seq(from = -3, to =3, length=50)
sds <- seq(from = 0.1, to = 1.5, length=50)

as.sds <- expand.grid(as,sds)

log.like <- rep(0,length(as.sds[,1]))
for(i in 1:length(log.like)){
a <- as.sds[i,1]
s <- as.sds[i,2]
y.pred <- x * a
diff.y <- y.obs - y.pred
log.like[i] <- sum(log(1/sqrt(2*pi*s^2)) + (-diff.y^2/(2*s^2)))
}

library(rgl)
plot3d(cbind(as.sds,log.like))
plot3d(cbind(as.sds,-log(-log.like)))

image(as,sds,matrix(log(-log.like),ncol=length(sds)))
abline(v=beta)

• 3. Logistic regression
# probability to have a phenotype 1 depends on x.
# prob depends on x monotonically but not in the shape of S.
# sample size
n <- 100
# y = a x + b + error
# error ~ normal distribution
a <- 2
b <- 0.5
# x log normal distribution

x <- exp(rnorm(n,1,0.5))

prob.x <- function(x){
p <- 0.5*x-1
p[which(p<0.2)] <- 0.2
p[which(p>0.8)] <- 0.8
return(p)
}

pr <- prob.x(x)
plot(x,pr,ylim=c(0,1))

• Assign phenotype based on the probability simply.
r <- runif(length(x))
y <- as.numeric(r < pr)
plot(x,y,pch=20,cex=0.1) # A bit difficult to see...
plot(x,pr,ylim=c(0,1))
points(x,jitter(y,0.1),pch=20,cex=0.1)

• Logistic regression using glm() funciton
glm.out <- glm(y~x,family=binomial)
glm.out
glm.out$coefficients  • Strength of risk is linear. tmp <- glm.out$coefficients[1] + x*glm.out\$coefficients[2]

• probability is
pr.est <- exp(tmp)/(1+exp(tmp))

plot(x,pr.est,ylim=c(0,1))
points(x,jitter(y,0.1),pch=20,cex=0.1)
points(x,pr,col="red")