即効データシミュレーション実習 - ryamadaの遺伝学・遺伝統計学メモ

Rの初心者だけれど、「もう大人」なので、なんとか使えるようになろうという意識は高い人には、「つまらない」例ではなくて、少々わかりにくくても、「できるといいな」という例がよい
こちらのJuly15-の週の" Click to download today's text File "
こんなことも、あんなことも…というのはあるけれど、「敷居をまたぐ」ことがこの段階では一番大事なので、「理論」はなるべくなしで。
自力htlm化が面倒なら…→Kindle $1
作者: ryamada
発売日: 2014/06/21
メディア: Kindle版
この商品を含むブログ (1件) を見る
以下、Rmdファイル
# Multivariate Data 多変量のデータ

# Two variables 二変量、Independence 独立

## Discrete vs. Discrete 離散 対 離散

Variable 1 V1 変量１：Sex 性別

Variable 2 V2 変量２ : Pass/Failure 合否

* V1: nm males and mf females.男nm人、女nf人
* V2 : Overall passing rate p. 全体の合格率p
* V1 and V2 are independent, no association between V1 and V2. V1とV2は独立、無関連
```{r}
nm <- 18
nf <- 12
# 1: male, 2: female
V1 <- c(rep(1,nm),rep(2,nf))
p <- 0.7
# 0: faillure, 1: pass
V2 <- sample(c(0,1),nm+nf,prob=c(1-p,p),replace=TRUE)
# What V1 and V2 are like?
V1
V2
# count 
table(V1)
table(V2)
```

* V2 : Pass or Failure. np passed
```{r}
np <- (nm+nf)*p
V2 <- c(rep(1,np),rep(0,nm+nf-np))
V2
table(V2)
```

* Combine V1 and V2 2変量を結合する
```{r}
V1V2 <- data.frame(V1,V2)
V1V2
table(V1V2)
```
* Independence 独立
```{r}
V1.1 <- sample(V1)
V2.1 <- sample(V2)
V1V2.1 <- data.frame(V1.1,V2.1)
table(V1V2.1)
```

### Independent table 独立なテーブル
* male, female fraction : pm,1-pm　男女の割合
* pass,failure fraction : p 1-p 合否の割合
* Independence is multiplication 独立は掛け算

```{r}
pm <- 0.4
pmf <- c(pm,1-pm)
p <- 0.7
p.pass.failure <- c(p,1-p)
# Independence
pm.pass <- pmf[1] * p.pass.failure[1]
pm.fail <- pmf[1] * p.pass.failure[2]
pf.pass <- pmf[2] * p.pass.failure[1]
pf.fail <- pmf[2] * p.pass.failure[2]
p.mf.pf <- matrix(c(pm.pass,pm.fail,pf.pass,pf.fail),byrow=TRUE,2,2)
p.mf.pf
# Same calculation
p.mf.pf2 <- outer(pmf,p.pass.failure)
p.mf.pf2
sum(p.mf.pf)
```

### Make a table with independent fractions 独立な場合の割合でテーブルを作る

* Total n individuals 全部でn人
```{r}
n <- 110
number.of.table <- 1
table1 <- rmultinom(number.of.table,n,c(p.mf.pf))
table1
sum(table1)
```

### Test independence 独立性をテストする
```{r}
matrix(table1,2,2)
test.out <- chisq.test(matrix(table1,2,2),correct=FALSE)
test.out
test.out$p.value
```

### Generate many tables and test independence たくさんのテーブルを作って独立性をテストする

Each column is four cells. 各列が４セルの数
```{r}
number.of.table <- 3
tables <- rmultinom(number.of.table,n,c(p.mf.pf))
tables
tables[,1]
tables[,2]
```

Loop to calculate p values of every table 各テーブルのp値を計算するためにループを回す
```{r}
number.of.table <- 1000
tables <- rmultinom(number.of.table,n,c(p.mf.pf))
# Stocker of p-values
p.vals <- rep(0,number.of.table)
for(i in 1:number.of.table){
  four.vals <- tables[,i]
  test.out <- chisq.test(matrix(four.vals,2,2),correct=FALSE)
  p.vals[i] <- test.out$p.value
}
min(p.vals)
max(p.vals)
mean(p.vals)
summary(p.vals)
# Uniform distribution
hist(p.vals)
plot(sort(p.vals))
abline(0,1/number.of.table,col=2)
```

### Not independent, Dependent, Associated 独立でない、相互に依存している、関連がある

* Male's pass-probability pass.m, female's pass.f
* Fraction of male : pm
```{r}
pass.m <- 0.6
pass.f <- 0.7
pm <- 0.4

p.mf.pf3 <- c(pm * c(pass.m,1-pass.m), (1-pm) * c(pass.f,1-pass.f))
p.mf.pf3
```

* Many tables with this assumption (Alternative hypothesis) この仮定でたくさんのテーブルを
```{r}
number.of.table <- 1000
tables <- rmultinom(number.of.table,n,c(p.mf.pf3))
# Stocker of p-values
p.vals <- rep(0,number.of.table)
for(i in 1:number.of.table){
  four.vals <- tables[,i]
  test.out <- chisq.test(matrix(four.vals,2,2),correct=FALSE)
  p.vals[i] <- test.out$p.value
}
min(p.vals)
max(p.vals)
mean(p.vals)
summary(p.vals)
# Non-Uniform distribution
hist(p.vals)
plot(sort(p.vals))
abline(0,1/number.of.table,col=2)
```

## Discrete vs. Discrete II 離散 対 離散

Variable 1 V1 変量１：Treatment types 1,2,3 治療タイプ

Variable 2 V2 変量２：Number of syncopal episodes 失神エピソード数

* V1: ns (No. individuals per type)
* V2 : Poisson distribution (mean)

```{r}
n.types <- 3
types <- 1:3
ns <- c(50,40,60)
means <- c(2.1,3.2,0.3)
V1 <- V2 <- c()
for(i in 1:n.types){
  V1 <- c(V1,rep(types[i],ns[i]))
  V2 <- c(V2,rpois(ns[i],means[i]))
}
V12 <- data.frame(V1,V2)
V12
table(V12)
```

## Discrete vs. Continuous 離散 対 連続
Variable 1 V1 変量１：Stage of a disease 1,2,3,4 疾患のステージ

Variable 2 V2 変量２ : A clinical marker 臨床マーカー

* V1: ns (No. individuals per stage)
* V2 : Normally distributed (means and sds) 

```{r}
n.stages <- 4
ns <- c(50,40,30,10)
V1 <- c()
for(i in 1:n.stages){
  V1 <- c(V1,rep(i,ns[i]))
}
table(V1)

means <- c(100,100,100,100)
sds <- c(10,10,10,10)

V2 <- c()
for(i in 1:n.stages){
  V2 <- c(V2,rnorm(ns[i],means[i],sds[i]))
}
plot(V1,V2)
boxplot(V2~V1)
```

* Stages and marker values are associated ステージとマーカー値は関連している
* Means and stages are in linear relation 平均値とステージが線形関係
* Sds and stages are in linear 標準偏差とステージも線形
```{r}
stages <- 1:4
a <- 5
b <- 90
means <- a * stages +b
means
a2 <- 2
b2 <- 9
sds <- a2 * stages + b2
V2 <- c()
for(i in 1:n.stages){
  V2 <- c(V2,rnorm(ns[i],means[i],sds[i]))
}
plot(V1,V2)
abline(b,a,col=2)
boxplot(V2~V1)
abline(b,a,col=2)
```

* A bit complicated 少し複雑に
```{r}
stages <- 1:4
a <- 5
b <- 90
means <- a * stages^2 +b
means
a2 <- 2
b2 <- 9
sds <- a2 * stages + b2
V2 <- c()
for(i in 1:n.stages){
  V2 <- c(V2,rnorm(ns[i],means[i],sds[i]))
}
plot(V1,V2)
x <- seq(from=0,to=5,length=100)
points(x,a*x^2+b,type="l",col=2)
boxplot(V2~V1)
points(x,a*x^2+b,type="l",col=2)
```

## Continuous vs. Continuous 連続 対 連続

Variable 1 V1 変量１：Height 身長

Variable 2 V2 変量２ : Weight 体重

* V1: Nornal distribution
* V2 : Normal distribution
* V1 and V2 are associated V2 = a*V1 + b + z; z : normal dist randomness with mean 0 sd = k * V1 V1とV2は関連があって、V2はa*V1+bに平均0、標準偏差がV1に比例

Weight is in the distribution conditional to Height 体重は身長に関する条件付き分布に従う

```{r}
n <- 500
m.height <- 165
sd.height <- 10
V1 <- rnorm(n,m.height,sd.height)
a <- 0.3
b <- 10
k <- 1/50
V2 <- a*V1 + b
for(i in 1:n){
  V2[i] <- V2[i] + rnorm(1,0,k*V1[i])
}
plot(V1,V2)
lm.out <- lm(V2~V1)
abline(lm.out$coefficients[1],lm.out$coefficients[2],col=2)
anova(lm.out)
```


* If Height and Weight are mutually independent? 身長・体重が独立だったら？

```{r}
V2.shuffle <- sample(V2)
lm.out.2 <- lm(V2.shuffle~V1)
plot(V1,V2.shuffle,pch=20)
points(V1,V2,col=3,pch=20,cex=1)
abline(lm.out$coefficients[1],lm.out$coefficients[2],col=2)
abline(lm.out.2$coefficients[1],lm.out.2$coefficients[2],col=3)
anova(lm.out.2)
```

### A bit complicated 少し複雑に

* There are two populations; one is healthy and the other is diseased ２集団ある。健康群と病気群
* There are two markers that segregate two groups ２つのマーカーがあって、それにより２群は分けられる
* One marker is weight (W) and the other marker is serum marker M １つのマーカーは体重、もう１つのマーカーは血清マーカーM
* W and M are positively correlated WとMとは正の相関がある
* Disease is indicated by relatively high M value respect to W 体重に照らして相対的にMが大きいとき疾患である

```{r}
# Healthy
nh <- 500
m.weight <- 55
sd.weight <- 5
W.h <- rnorm(nh,m.weight,sd.weight)
a <- 0.03
b <- -0.5
M.h <- a*W.h+b+rnorm(nh,0,0.1)
plot(W.h,M.h)
```

b represents relative highness of M Mが相対的に高い、というのは変数bで表す
```{r}
# Diseased
nd <- 500
m.weight <- 55
sd.weight <- 5
W.d <- rnorm(nd,m.weight,sd.weight)
a <- 0.03
b <- -0.3 # modified
M.d <- a*W.d+b+rnorm(nd,0,0.1)
ylim <- range(c(M.h,M.d))
plot(W.h,M.h,pch=20,ylim=ylim)
points(W.d,M.d,col=2,pch=20)
```

A bit more complicated もう少し複雑に

"Relative highness of M" should be more explicit when W is larger 体重が重い方が相対的な高さが顕著である
a represents it 変数aにて表現する
```{r}
# Diseased model 2
nd <- 500
m.weight <- 55
sd.weight <- 5
W.d <- rnorm(nd,m.weight,sd.weight)
a <- 0.035 # modified 
b <- -0.3 # previously modified
M.d.2 <- a*W.d+b+rnorm(nd,0,0.1)
ylim <- range(c(M.h,M.d,M.d.2))
plot(W.h,M.h,pch=20,ylim=ylim)
points(W.d,M.d,col=2,pch=20)
points(W.d,M.d.2,col=3,pch=20)
```

A further more complicated さらに複雑に

Diseased population's weight is smaller 病気群は体重が軽め

```{r}
# Diseased model 2
nd <- 500
m.weight <- 50
sd.weight <- 5
W.d.2 <- rnorm(nd,m.weight,sd.weight)
a <- 0.035 # modified 
b <- -0.3 # previously modified
M.d.3 <- a*W.d.2+b+rnorm(nd,0,0.1)
xlim <- range(c(W.d,W.d.2))
ylim <- range(c(M.h,M.d,M.d.2,M.d.3))
plot(W.h,M.h,pch=20,xlim=xlim,ylim=ylim)
points(W.d,M.d,col=2,pch=20)
points(W.d,M.d.2,col=3,pch=20)
points(W.d.2,M.d.3,col=4,pch=20)
```

It is better to handle data as they are, but you can make an index with W and M データをそのまま用いる方がよいけれど、２変量W,Mからインデックスを作ってもよい

```{r}
MW.h <- M.h/W.h
MW.d <- M.d.3/W.d.2
h <- hist(c(MW.h,MW.d),plot=FALSE)
hist(MW.h,breaks=h$breaks,density=17)
hist(MW.d,breaks=h$breaks,col=2,density=20,add=TRUE)
```

## One more step もう１歩

* So far we assumed two distinct populations, healthy and diseased but in reality they might be continuous ここまでは2群が存在すると仮定してきたが、実際は連続かもしれない
* Healthy/diseased phenotype (P) is now converted to continuous 健・病フェノタイプ(P)を連続値にする
* P determines parameter a, b and affects Weight Pはa,bを定め、体重にも影響する


```{r}
# Total sample size
n <- 1000
# Healthy/Disease phenotype
P <- rexp(n)
hist(P)
# a, b determined by P
as <- 0.03 + 0.01 * P
bs <- -0.5 + 0.05 * P

# Weight is in normal distribution with reduction by P
m.weight <- 55
sd.weight <- 5
W <- rnorm(n,m.weight,sd.weight) - P * 2

# M
M <- as * W + bs + rnorm(n,0,0.5)

hist(M/W)

# How good M/W as an index of P?
plot(P,M/W)

# P is standardized between 0 and 1
P.st <- (P-min(P))/(max(P)-min(P))
# Color based on phenotype
col <- rgb(P.st,1-P.st,0)
plot(W,M,col=col,pch=20)
# size of dots are proportional to P.st
plot(W,M,cex=P.st*2,pch=20)
```

## More than two variables 3個以上の変量

* V1,V2,V3,... affect Y with coeffecients a1,a2,a3...,b1,b2,b3 with normal distribution biase

```{r}
n <- 1000
V1 <- sample(0:1,n,replace=TRUE)
V2 <- rexp(n, 3)
V3 <- rnorm(n,300,20)
a1 <- 50
b1 <- 0
a2 <- 10 
b2 <- 0
a3 <- 1
b3 <- -200
Y <- (a1 * V1 + b1) + (a2 *V2  + b2) + (a3 * V3 + b3) + rnorm(n,0,50) 
D <- data.frame(V1,V2,V3,Y)
plot(D,pch=20,cex=0.2)
lm.out <- lm(Y ~ V1+V2+V3)
lm.out
anova(lm.out)
```
## Homework this month

Describe your data model and generate data set(s) with plot(s) explaining the data well. 
Parameterize the effect size of your model and make multiple data sets with multiple plots with which you can explain effect size in the plot.

自身の研究対象に応じてモデルを作り、相当するデータセット(1個以上)を作れ。データの様相がよくわかるプロットをつけること。
また、モデルで設定したエフェクトサイズをパラメタ化し、複数のパラメタ値に対して同様にデータセットとプロットを作ること。プロットはエフェクトサイズがどこにどのように現れているかを説明するためのプロットであるとして作成せよ