Regression notes
- Rmd
回帰手習い: Introduction to Regression
- Author: ryamada
- Release date: 2016/07/21
- Media: Kindle edition
---
title: "Regression"
author: "ryamada"
date: "Thursday, July 21, 2016"
output: html_document
---

# 1. Linear regression

## Log-normal distribution

```{r}
# sample size
n <- 10000
# y = a x + b + error
# error ~ normal distribution
a <- 2
b <- 0.5
# x: log-normal distribution
x <- exp(rnorm(n, 1, 0.5))
hist(x)
```

## Linear model with Gaussian error

$y = ax + b + \epsilon$, $\epsilon \sim N(0, \sigma)$

```{r}
y.theory <- a * x + b
err <- rnorm(n, 0, 3)
hist(err)
y.obs <- y.theory + err
plot(x, y.obs, pch = 20)
abline(b, a, col = 2)
```

The lm() function estimates the coefficients a and b.

```{r}
lm.out <- lm(y.obs ~ x)
lm.out
```

## Linear algebra gives the answer

$\beta = (X^T X)^{-1} X^T y$

Our x object holds values that differ from sample to sample. To estimate the intercept coefficient, we need a dummy variable that takes the same value for every sample.

```{r}
x.mat <- cbind(rep(1, n), x) # dummy column is added
head(x.mat)
# linear algebraic solution
beta <- solve(t(x.mat) %*% x.mat) %*% t(x.mat) %*% y.obs
beta
lm.out
```

# 2. Likelihood when the error is normally distributed

## Linear model without intercept

$y = ax + \epsilon$, $\epsilon \sim N(0, \sigma)$

```{r}
# no-intercept assumption
# y = a x + error
# no dummy variable
x.mat <- matrix(x, ncol = 1)
beta <- solve(t(x.mat) %*% x.mat) %*% t(x.mat) %*% y.obs
beta
plot(x, y.obs)
abline(0, beta, col = "red")
abline(v = 0)
abline(h = 0)
```

```{r}
# "no intercept" has to be specified with "-1"
lm.out <- lm(y.obs ~ x - 1)
lm.out
beta
```

## Probability density function (pdf) of the normal distribution

$$\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-m)^2}{2\sigma^2}}$$

For each sample there are two y values: the observed value and the value estimated by the linear model $\hat{y} = a x$. In the formula above, m and x represent the estimated (believed-to-be-true) value and the observed value, respectively.

The deviation between the two values is assumed to follow a normal distribution.

The $\sigma$ of the normal distribution is not known.

The slope a and $\sigma$ are both unknown variables. Treating them as the estimation targets, we can calculate the likelihood of the observed x and y for pairs of a and $\sigma$.

Grid search is simple and easy, and the log-likelihood is much easier to compute than the likelihood itself.

```{r}
# want to know "a" and the error's distribution (mean = 0, sd ??)
# grid of slope 'a' and grid of sigma
as <- seq(from = -3, to = 3, length = 50)
sds <- seq(from = 0.1, to = 1.5, length = 50)
# their combinations
as.sds <- expand.grid(as, sds)
log.like <- rep(0, length(as.sds[, 1]))
log.like.2 <- log.like
for(i in 1:length(log.like)){
	a <- as.sds[i, 1]
	s <- as.sds[i, 2]
	y.pred <- x * a
	diff.y <- y.obs - y.pred # deviation from expected
	log.like[i] <- sum(log(1/sqrt(2*pi*s^2)) + (-diff.y^2/(2*s^2)))
	# we can do it this way as well; dnorm() is the pdf of the normal distribution
	log.like.2[i] <- sum(dnorm(diff.y, 0, s, log = TRUE))
}
plot(log.like, log.like.2)
abline(0, 1)
```

## Visualize the log-likelihood

```{r setup,rgl=TRUE}
library(rgl)
library(knitr)
knit_hooks$set(rgl = hook_rgl)
```

```{r,rgl=TRUE}
plot3d(cbind(as.sds, log.like))
```

```{r,rgl=TRUE}
# one more logarithm may make the shape easier to grasp
plot3d(cbind(as.sds, -log(-log.like)))
```

Another way to visualize: add a vertical line at the estimated slope coefficient.

```{r}
image(as, sds, matrix(log(-log.like), ncol = length(sds)))
abline(v = beta)
```

Regardless of the $\sigma$ value, the maximum likelihood estimate of the slope under a fixed $\sigma$ is the same as the linear regression estimate. Linear regression minimizes the sum of squared distances between the estimates and the observations; maximum likelihood estimation under normal error maximizes a sum of normal log-density terms, each of which is, up to a constant, proportional to the negative squared distance between the estimate and the observation.

# 3. Logistic regression

## Risk is a function of the explanatory variable

The outcome is binary. Assume a single explanatory variable x determines the probability of causing '1', in a way that deviates somewhat from a sigmoid curve, as below.

```{r}
# sample size
n <- 100
# x: log-normal distribution
x <- exp(rnorm(n, 1, 0.5))
# the probability of having phenotype 1 depends on x;
# it depends on x monotonically but is not S-shaped
prob.x <- function(x){
	p <- 0.5*x - 1
	p[which(p < 0.2)] <- 0.2
	p[which(p > 0.8)] <- 0.8
	return(p)
}
pr <- prob.x(x)
plot(x, pr, ylim = c(0, 1))
```

## Stochastic phenomenon simulation

Simply assign the phenotype based on the probability.

```{r}
r <- runif(length(x))
y <- as.numeric(r < pr) # stochastic assignment of y = 1
plot(x, y, pch = 20, cex = 0.1) # a bit difficult to see...
```

```{r}
plot(x, pr, ylim = c(0, 1))
# jitter() adds small deviations for easier display
points(x, jitter(y, 0.1), pch = 20, cex = 0.1)
```

## glm() function: logistic regression in R

```{r}
# glm() is designed for various regressions,
# so "logistic" should be specified with the family option
glm.out <- glm(y ~ x, family = binomial)
glm.out
glm.out$coefficients
```

The strength of risk is linear on the log-odds scale.

```{r}
tmp <- glm.out$coefficients[1] + x * glm.out$coefficients[2]
```

The odds are $\frac{p}{1-p} = e^{tmp}$, so the probability is $p = \frac{e^{tmp}}{1+e^{tmp}}$.

```{r}
pr.est <- exp(tmp)/(1 + exp(tmp))
plot(x, pr.est, ylim = c(0, 1))
points(x, jitter(y, 0.1), pch = 20, cex = 0.1)
points(x, pr, col = "red")
```
- 1. Linear regression
```r
# sample size
n <- 10000
# y = a x + b + error
# error ~ normal distribution
a <- 2
b <- 0.5
# x: log-normal distribution
x <- exp(rnorm(n, 1, 0.5))
hist(x)

y.theory <- a * x + b
err <- rnorm(n, 0, 3)
hist(err)
y.obs <- y.theory + err
plot(x, y.obs, pch = 20)
abline(b, a, col = 2)

lm.out <- lm(y.obs ~ x)
lm.out
```
- The intercept coefficient is the coefficient of a dummy explanatory variable whose value is the same for every sample.
```r
x.mat <- cbind(rep(1, n), x) # dummy column for the intercept
head(x.mat)
beta <- solve(t(x.mat) %*% x.mat) %*% t(x.mat) %*% y.obs
beta
```
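As a sanity check (not part of the original listing), the algebraic solution should reproduce the coefficients of the lm() fit above:

```r
# compare the linear-algebra solution with lm()'s estimates
coef(lm.out) # (Intercept), x
beta         # same values, up to numerical precision
```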
- 2. Likelihood when the error is normally distributed
- Model: y = ax + error (no intercept)
- When the true value is p and the observed value is q, and the error is assumed to be normally distributed with mean 0 and variance $\sigma^2$, the likelihood of the deviation from p to q is $$\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(q-p)^2}{2\sigma^2}}$$
```r
# no-intercept assumption
# y = a x + error
x.mat <- matrix(x, ncol = 1)
beta <- solve(t(x.mat) %*% x.mat) %*% t(x.mat) %*% y.obs
beta
plot(x, y.obs)
abline(0, beta, col = "red")
abline(v = 0)
abline(h = 0)

# "no intercept" is specified with "-1"
lm.out <- lm(y.obs ~ x - 1)
lm.out
```
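With a single explanatory column and no intercept, $(X^T X)^{-1} X^T y$ collapses to a scalar; a one-line check of the same estimate (not in the original):

```r
# scalar form of the no-intercept least-squares slope
sum(x * y.obs) / sum(x^2) # equals beta
```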
- Grid search over the slope a and the sd of the error's normal distribution
- Regardless of the sd of the error, the maximum likelihood estimate of the slope a is the same.
- The normal probability density function has $-\frac{(q-p)^2}{2\sigma^2}$ in its exponent, so minimizing the sum of squared distances and maximizing the likelihood under normally distributed error correspond to each other; see the derivation after this list.
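A minimal derivation of this correspondence (not in the original): writing out the log-likelihood of the no-intercept model,

$$\log L(a, \sigma) = \sum_{i=1}^{n} \left( -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(y_i - a x_i)^2}{2\sigma^2} \right) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - a x_i)^2$$

For any fixed $\sigma$, the first term does not depend on a, so maximizing $\log L$ over a is exactly minimizing $\sum_{i}(y_i - a x_i)^2$, the least-squares criterion.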
```r
# want to know "a" and the error's distribution (mean = 0, sd ??)
as <- seq(from = -3, to = 3, length = 50)
sds <- seq(from = 0.1, to = 1.5, length = 50)
as.sds <- expand.grid(as, sds)
log.like <- rep(0, length(as.sds[, 1]))
for(i in 1:length(log.like)){
	a <- as.sds[i, 1]
	s <- as.sds[i, 2]
	y.pred <- x * a
	diff.y <- y.obs - y.pred
	log.like[i] <- sum(log(1/sqrt(2*pi*s^2)) + (-diff.y^2/(2*s^2)))
}
library(rgl)
plot3d(cbind(as.sds, log.like))
plot3d(cbind(as.sds, -log(-log.like)))
image(as, sds, matrix(log(-log.like), ncol = length(sds)))
abline(v = beta)
```
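As a quick numerical check (not in the original), the grid point with the highest log-likelihood can be pulled out and compared with the algebraic estimate beta:

```r
# grid point maximizing the log-likelihood
mle.grid <- as.sds[which.max(log.like), ]
mle.grid # first column: slope a, second column: sigma
beta     # the grid slope should be close to this
```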
- 3. Logistic regression
```r
# sample size
n <- 100
# x: log-normal distribution
x <- exp(rnorm(n, 1, 0.5))
# the probability of having phenotype 1 depends on x;
# it depends on x monotonically but is not S-shaped
prob.x <- function(x){
	p <- 0.5*x - 1
	p[which(p < 0.2)] <- 0.2
	p[which(p > 0.8)] <- 0.8
	return(p)
}
pr <- prob.x(x)
plot(x, pr, ylim = c(0, 1))
```
- Simply assign the phenotype based on the probability.
```r
r <- runif(length(x))
y <- as.numeric(r < pr) # stochastic assignment of y = 1
plot(x, y, pch = 20, cex = 0.1) # a bit difficult to see...

plot(x, pr, ylim = c(0, 1))
points(x, jitter(y, 0.1), pch = 20, cex = 0.1) # jitter y for easier display
```
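Equivalently (an alternative not used in the original), the same kind of Bernoulli draws can be made with rbinom():

```r
# each element is a Bernoulli(pr[i]) draw
y2 <- rbinom(length(x), size = 1, prob = pr)
mean(y2) # should be close to mean(pr)
```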
- Logistic regression using the glm() function
```r
glm.out <- glm(y ~ x, family = binomial)
glm.out
glm.out$coefficients
```
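For standard errors and p-values in addition to the point estimates (not shown in the original), summary() can be applied to the fitted object:

```r
# coefficient table with standard errors, z values and p-values
summary(glm.out)
```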
- The strength of risk is linear on the log-odds scale; tmp below is the linear predictor (the log odds).
```r
tmp <- glm.out$coefficients[1] + x * glm.out$coefficients[2] # linear predictor (log odds)
```
- The odds are $\frac{p}{1-p} = e^{tmp}$, so the probability is $p = \frac{e^{tmp}}{1+e^{tmp}}$.
```r
pr.est <- exp(tmp)/(1 + exp(tmp))
plot(x, pr.est, ylim = c(0, 1))
points(x, jitter(y, 0.1), pch = 20, cex = 0.1)
points(x, pr, col = "red") # true probabilities in red
```
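The manual back-transformation above should agree with R's own fitted probabilities; a quick check (not in the original) uses predict() with type = "response":

```r
# fitted probabilities straight from the glm object
pr.glm <- predict(glm.out, type = "response")
max(abs(pr.glm - pr.est)) # should be essentially zero
```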