医科学のためのデータシミュレーション 2 - ryamadaの遺伝学・遺伝統計学メモ

こちらの続き
医学・生物学データの扱いに慣れるための、データシミュレーションセミナーの資料(医学研究科大学院コース　ゲノム遺伝学＠京大 2014:こちら)
htmlファイルはこちら(Login as a guest でログインすると見られます)
こちらを参考にすればhtml化可能です
これをやる前にこちらの内容はできることが必要です。
# Rを用いた医科学のためのデータシミュレーションの基礎2 Basics of Data Simulation for Medical Sciene with R 2

## 確率密度分布と累積分布、Probability Density Distribution and Cumulative Density Distribution

値が分布をなしているとき、大きく分けて2つのとらえ方がある。
その一つが密度分布、もう一つが累積分布である。

When values are in a disrtibution, there are two ways to grab them. One is a density distribution and the other is a cumulative distribution.

```{r}
# 正規分布に従う乱数を発生させる
# Generate random numbers in a normal distribution.
n <- 100
x <- rnorm(n)
boxplot(x)
```
```{r}
hist(x)
```
```{r}
plot(density(x))
```
```{r}
plot(sort(x))
```

ソートしてプロットすると累積に関する情報が得られる。
いくつかのやり方がある。
Plot a sorted value set, then you get an idea on their cumulative distribution.
There are several ways for it. 

```{r}
n <- 10
x <- rnorm(n)
s.x <- sort(x)
plot(s.x)
```

x-y 反転。
Exchange x and y axes.
```{r}
s <- seq(from=0,to=1,length=length(x)+1)
s <- s[-1]
plot(s.x,s)
```

階段状に。
Like steps
```{r}
plot(s.x,s,type="s")
```

打点の高さを少しいじる。
Modify the y-axis values a bit.

便利な関数もある。
You may find a useful function.

```{r}
plot(s.x,ppoints(n,a=0.5),type="b")
```
```{r}
plot.stepfun(s.x)
points(s.x,s,col=2)
s.2 <- seq(from=0,to=1,length=length(x)+1)
s.2 <- s.2[-length(s.2)]
points(s.x,s.2,col=3)
points(s.x,ppoints(n,a=0.5),col=4,type="b")
```

## 色々な分布 Various Distributions

### 二項分布 Binomial distribution

治療成功率がpであるとき、n人を治療するとn人のうちk=0,1,...,n人で治療が成功する確率は以下のように計算できる。

Assume a sequence of n independent treatments of patients and the success rate of treatmen is p.
The number of successes, k, in a n treatments ranges from 0 to n and the probability of the number being k is given as below.


$$
\begin{equation}
\begin{pmatrix}
n\\k
\end{pmatrix} p^k (1-p)^{n-k} = \frac{n!}{k!(n-k)!}p^k (1-p)^{n-k}
\end{equation}
$$

```{r}
n <- 5
# 整数列を作る Integer sequence generation
ks <- 0:n
# 二項分布の確率 Probability of binomial distribution
p <- 0.7
pr.binom <-dbinom(ks,n,p)
# 全事象の確率の和は1. Sum of probability of all events should be 1.
sum(pr.binom)
plot(ks,pr.binom,type="h")
```

今、n人の治療をして、k人で成功したとき、成功確率pの推定をしたいとする。
Assume you observed k successes in n treatments and you want to estimate the success probability p.

pの値はわからないので、取りうる値をたくさん仮定してしらみつぶしにn人中k回の成功がおきる確率を計算してみる。

Because we have no idea on the value of p, let's calculate probability to observe k successes in n experiments with many values for p.

pは0から1の範囲にあるから、そこに均等に値を取ることにする。１次元格子を作るとも言う。

p ranges from 0 to 1. Therefore we generate multiple p values with fixed intervals. This is generation of a lttice in 1 dimensional space.

```{r}
n.pt <- 100
ps <- seq(from=0, to=1, length=n.pt)
n <- 5
k<-2
probs <- dbinom(k,n,ps)
plot(ps,probs,type="l")
```

pの値を均等割りにする代わりに、たくさんの乱数を使うこともできる。
Instead of generating p values with fixed intervals, we can use random numbers.

一様分布からの乱数を用いる。
Random numbers in uniform distribution are used.

```{r}
n.pt <- 100
ps.2 <- runif(n.pt)
n <- 5
k<-2
probs.2 <- dbinom(k,n,ps.2)
plot(ps.2,probs.2,type="l",col=2)
```

作成した乱数列をソートすればきちんと描ける。
Sort random numbers first, then you will get the appropriate curve.

```{r}
ps.2.sorted <- sort(ps.2)
probs.2.sorted <- dbinom(k,n,ps.2.sorted)
plot(ps.2.sorted,probs.2.sorted,type="l",col=2)
```

実は、この分布は、ベータ分布と同じ形をしている。
Actually this distribution has the same shape with beta distribution.

```{r}
# kとn-kに1を加えているのがポイント
# Add 1 to k and n-k
beta.prob <- dbeta(ps,k+1,n-k+1)
plot(ps,beta.prob,type="l")
```

Exercise.

* n=50人を治療し、k=42人の治療が成功したという。この治療法の成功率pがどのくらいなのかを、dbeta()関数を使う方法とそうでない方法とでグラフとして示せ。Assume you treated n=50 patients and k=42 patients were successfully treated. Draw a probability graph for success rate p with two different ways; using dbeta() function and not using the function.

## 二項分布で分割表 Contingency tables with binomial distribution

2つの治療法A,Bがあって、それぞれの成功率はpa,pbであるとする。全部でn人を2群にランダムに割り付け治療するとする。

Assume 2 treatments, A and B and their success rates are pa and pb, respectively. Now we treat n patients with random assignment. 

```{r}
n <- 200
pa <- 0.6
pb <- 0.65
na <- rbinom(1,n,0.5)
nb <- n-na
a.success <- rbinom(1,na,pa)
a.failure <- na-a.success
b.success <- rbinom(1,nb,pb)
b.failure <- nb-b.success
fisher.test(matrix(c(a.success,a.failure,b.success,b.failure),2,2))
```

これを何度も繰り返してみる。

```{r}
n <- 200
pa <- 0.6
pb <- 0.65
n.iter <- 500
pvals <- nas <- nbs <- rep(0,n.iter)
for(i in 1:n.iter){
  nas[i] <- rbinom(1,n,0.5)
  nbs[i] <- n-nas[i]
  a.success <- rbinom(1,nas[i],pa)
  a.failure <- nas[i]-a.success
  b.success <- rbinom(1,nbs[i],pb)
  b.failure <- nbs[i]-b.success
  tmp.out <- fisher.test(matrix(c(a.success,a.failure,b.success,b.failure),2,2))
  pvals[i] <- tmp.out$p.value
}
hist(pvals,main="p.values")
```
```{r}
plot.stepfun(sort(pvals),main="p.values cumulative")
```
```{r}
hist(nas,main="nas")
```

2つの治療法A,Bがあって、それぞれの成功率はpa,pbであるとする。全部でn人を2群にランダムに割り付けながら治療するとする。人数を増やしながら逐一検定するとどうなるかをシミュレーションしてみる。

Assume 2 treatments, A and B and their success rates are pa and pb, respectively. Now we treat n patients with random assignment. Along with the assignment, let's test the difference between A and B.

```{r}
n <- 500
pa <- 0.6
pb <- 0.7
pvals <- rep(0,n)
assignment <- sample(0:1,n,replace=TRUE)
tbl <- matrix(0,2,2)
for(i in 1:n){
  if(assignment[i] == 0){#A
    tmp.success <- sample(0:1,1,prob=c(1-pa,pa))
    if(tmp.success==0){# failure
      tbl[1,1] <- tbl[1,1]+1
    }else{# success
      tbl[1,2] <- tbl[1,2]+1
    }
  }else{#B
    tmp.success <- sample(0:1,1,prob=c(1-pb,pb))
    if(tmp.success==0){# failure
      tbl[2,1] <- tbl[2,1]+1
    }else{# success
      tbl[2,2] <- tbl[2,2]+1
    }
  }
  tmp.test.out <- fisher.test(tbl)
  pvals[i] <- tmp.test.out$p.value
}
plot(pvals,type="l")
```

この推移を1度の治験だとすれば、複数回の治験をすればそれごとのp値の推移がシミュレーションできる。

This broken line is on one clinical trial. Multiple clinical trials will produce multiple lines.

```{r}
n.iter <- 10
n <- 500
pa <- 0.6
pb <- 0.7
#pvals <- rep(0,n)
pvals <- matrix(0,n.iter,n)
for(j in 1:n.iter){
  assignment <- sample(0:1,n,replace=TRUE)
  tbl <- matrix(0,2,2)
  for(i in 1:n){
    if(assignment[i] == 0){#A
      tmp.success <- sample(0:1,1,prob=c(1-pa,pa))
      if(tmp.success==0){# failure
        tbl[1,1] <- tbl[1,1]+1
     }else{# success
       tbl[1,2] <- tbl[1,2]+1
      }
    }else{#B
     tmp.success <- sample(0:1,1,prob=c(1-pb,pb))
     if(tmp.success==0){# failure
        tbl[2,1] <- tbl[2,1]+1
      }else{# success
        tbl[2,2] <- tbl[2,2]+1
     }
    }
    tmp.test.out <- fisher.test(tbl)
    pvals[j,i] <- tmp.test.out$p.value
  }
  
}
matplot(t(pvals),type="l")
```

## ポアソン分布 Poisson distribution

ある確率でランダムに起きる事象がある。
一定期間内に、または一定の区間内にその事象が起きる回数についての分布。
その期間・区間中に何回起きたかは0,1,2,...$\infty$回になるけれど、その確率の分布。

During a fixed period, or in a segment with fixed length, a kind of events occurs at random. The number of events that really occur in a period/segment ranges from 0 to $\infty$. Poisson distribution tells the probability that the events occurs k times in a period/segment. 

今、1人の女性が1生に産む子どもの数の平均がk人だとする。
子どもの数が0,1,2,...人であって、これがポアソン分布に従うと仮定する。

Assume the mean number of offsprings that one female has per her life is k. The number of offsprings can be 0,1,2,... and assume the probability is in Poisson distribution.

```{r}
k <- 1.6
ks <- 0:20
pr.kids <- dpois(ks,k)
sum(pr.kids)
# 平均子ども数 mean of number of children
sum(pr.kids * ks)
k
plot(ks,pr.kids,type="h")
```

この調査を子どものいる女性だけで調査するとどうなるだろうか。

Assume you survey the number of children by asking women with any children.

```{r}
k <- 1.6
ks <- 0:20
pr.kids.2 <- dpois(ks,k)
# 子どものいない女性は除かれる Women without any child is out of count
pr.kids.2[1] <- 0
sum(pr.kids.2)
# 平均子ども数 mean of number of children
# カウントされた女性の相対頻度で計算する
# The relative fraction of women with children should be used for the calculation
pr.kids.2.rel <- pr.kids.2/sum(pr.kids.2)
sum(pr.kids.2.rel * ks)
k
```


## 次世代シークエンシングとポアソン分布 Next-generation sequencing (NGS) and Poisson distribution

次世代シークエンシングにおいて、ゲノムの各箇所が平均nデプスで読まれるとき、デプス別の分布はポアソン分布に近似できる。

NGS returns various loci with various depth. The distibution of number of loci per depths is close to Poisson distribution.

この仮定の下で計算すると、どのくらいの割合で「全く読まれない」のだろうか？

How often do we miss loci completely?

```{r}
mean.depth <- 10
ds <- 0:50
pr.pois <- dpois(ds,mean.depth)
sum(pr.pois)
pr.pois[1]
```

今、NGSを用いてSNVの箇所の検出をしているとする。

Assume we are to detect SNVs with NGS.

SNVでは理想的には、アレルごとのDNA分子数が等量である。
Number of DNA molecules per alleles are identical, theoretically.

アレルごとの平均デプスはローカスことの平均デプスの半分である。

Mean depth per alleles is the half of mean depth per loci.

2アレルM,mごとのデプス別確率を求める。

Probability per depths for 2 alleles, M and m.

処理内容を理解しやすくするため、mean.depthを小さくしてやってみる。

For easier understanding the procedures, let's make mean.depth smaller.

```{r}
mean.depth <- 6
mean.depth.allele <- mean.depth/2
ds <- 0:20
pr.pois.M <- dpois(ds,mean.depth.allele)
pr.pois.m <- pr.pois.M
pr.M.m <- outer(pr.pois.M,pr.pois.m,"*")
image(ds,ds,pr.M.m)
```

2アレルの本数の合計がn.all 以上の場合にデータとして採用し、マイナーアレルのリード数がn.minor以上の場合にSNVであるとみなすとすると、どのくらいの確率でSNVであると認められるだろうか。

When you accept your data only when sum of two alleles is equal or more than n.all and when the number of reads of minor allele is equal or more than n.minor, you judge they are SNV. Then you can calculate the probability to identify the loci SNV.

```{r}
n.both <- outer(ds,ds,"+")
n.both
n.all <- 5
n.minor <- 2
ok.n.all <- which(n.both>=n.all,arr.ind=TRUE)

# ok.minor
pr.pois.M.ok <- pr.pois.m.ok <- pr.pois.M
pr.pois.M.ok[1:n.minor] <- pr.pois.m.ok[1:n.minor] <- 0
pr.M.m.ok <- outer(pr.pois.M.ok,pr.pois.m.ok,"*")
image(ds,ds,pr.M.m.ok)
```

n.minorの基準を満たさない図の左下隅が除かれた。

Left-bottom cells were removed due to n.minor restriction.

```{r}
# n.all okをかぶせる
pr.M.m.ok.all.ok <- matrix(0,length(ds),length(ds))
pr.M.m.ok.all.ok[ok.n.all] <- pr.M.m.ok[ok.n.all]
image(ds,ds,pr.M.m.ok.all.ok)
```

n.all基準により右下がりの対角線より上側のみが採用された。
The upper side of the declining line was accepted due to n.all restriction.

```{r}
# n.allもn.minorも基準を満たした割合
# The fraction with both n.all and n.minor adequate.
sum(pr.M.m.ok.all.ok)
```

## Exercise
* 平均デプスと、アレル合算デプス基準値、アレル別デプス基準値を適当に与え、SNV検出感度を算出せよ。 Set mean depth value and n.all cutoff and n.minor cutoff as you like and calculate the sensitivity to detect SNV.

* 数えられる病変数はポアソン分布で表すことができるかもしれない。たとえば、身体のあちこちに現れる皮膚病変、ランダムに起きているように思われる神経変性部位、相当するうある関節のうちの罹患関節数。何かしら身近な例を取ってシミュレーションしてみよ。 Number of countable lesions can take Poisson distributions, e.g., skin lesions scattered out on the body surface, neural degenerative lesions appearing at random, affected joints among substantial number of them. Simulate a medical/biological phenomenon that may show Poisson distribution.


## 正規分布 Normal distribution

血清コレステロール値が群Aでは平均ma,分散vaの、群Bでは、それぞれmb,vbの正規分布に従っているという。

Assume serum cholesterol concentration in group A is in a normal distribution with mean being ma and variance being va and in group B, mean is mb and variance is vb.

```{r}
na <- 150
nb <- 120
ma <- 200
va <- 100
mb <- 180
vb <- 80
chol.a <- rnorm(na,ma,sqrt(va))
chol.b <- rnorm(nb,mb,sqrt(vb))
h <- hist(c(chol.a,chol.b))
hist(chol.a,density=20,col=2,add=TRUE,breaks=h$breaks)
hist(chol.b,density=17,col=3,add=TRUE,breaks=h$breaks)
```

t検定をしてみる。
Let's t-test these data.

```{r}
t.test(chol.a,chol.b)
```

もし、群Aと群Bとが血清コレステロール値に関して純粋な分類であれば、値のまとまりが良い(分散が小さい,vx)が、さまざまな関与因子があってその影響が加わっているとすると、次のようにばらつきを加えることになる。

```{r}
vx <- 100
chol.a.1 <- chol.a + rnorm(na,0,sqrt(vx))
chol.b.1 <- chol.b + rnorm(nb,0,sqrt(vx))
t.test(chol.a.1,chol.b.1)
h <- hist(c(chol.a.1,chol.b.1))
hist(chol.a.1,density=20,col=2,add=TRUE,breaks=h$breaks)
hist(chol.b.1,density=17,col=3,add=TRUE,breaks=h$breaks)
```

ここでは、群Aに関して、まず分散vaの正規分布で値を生成し、そこにさらなるばらつきの項を分散vxの正規乱数で加えた。

We generated random values in normal distribution with variance va then added normal random values of variance vx.

正規分布では、分散を足すことができるので、以下のようにもできる。

You can add variances for normal distribution.

2つの方法で生成した分布を比べてみる。
両者はほとんど同じである。

Let's compare two distributions.
Both two are almost identical.

```{r}
chol.a.2 <- rnorm(na,ma,sqrt(va+vx))
summary(chol.a.1)
summary(chol.a.2)
h <- hist(c(chol.a.1,chol.a.2))
hist(chol.a.1,density=20,col=2,add=TRUE,breaks=h$breaks)
hist(chol.a.2,density=17,col=3,add=TRUE,breaks=h$breaks)
```

2つの分布が近いことはソートして散布図を描くことで示すこともできる。

Coplot of sorted values can show the similarity of two distributions.

```{r}
plot(sort(chol.a.1),sort(chol.a.2))
abline(0,1,col=2)
```

では、2群の平均は固定し、2群の分散をだんだん大きくして、そのt検定の結果を比べてみる。

Now fix two groups' mean and change their variance and see their effect on the result of t-test.

```{r}
na <- 100
nb <- 100
ma <- 200
mb <- 180
vabs <- 2^(seq(from=0,to=10,length=100))
ps <- rep(0,length(vabs))
for(i in 1:length(vabs)){
  chol.a <- rnorm(na,ma,sqrt(vabs[i]))
  chol.b <- rnorm(nb,mb,sqrt(vabs[i]))
  tmp.out <- t.test(chol.a,chol.b)
  ps[i] <- tmp.out$p.value
}
plot(log(vabs),log(ps))
```

上の例では、2つの群を決める何かがあり、それが群の平均値を決めるというモデルであった。
なぜ正規分布を取るかというと、群(とその平均値)を決める以外の要因が正規分布を作り出しているというモデルであった。

We assumed two groups that have mean value per groups and other factors added normal variation from the means in the above.

今度は、複数の要因がそれぞれ正規分布様の寄与をしているときに、そのうちの1要因について効果のある介入をするというようなシミュレーションをしてみる。

Our next example is as follows. Multiple factors are related to a phenotype and each of them contributes in a normal distribution-like fashion. Toward the phenomenon, an intervention specific to one of the factors is given.

介入は、第1因子をk倍(k<1)に小さくする働きをもち、その効果にはばらつきがあるものとする。

Assume the intervention make the 1st factor x k (k<1) with random variation.


```{r}
n.factor <- 4
ms <- runif(n.factor)
vs <- runif(n.factor)^2

n <- 100
x <- matrix(0,n,n.factor)
for(i in 1:n.factor){
  x[,i] <- rnorm(n,ms[i],sqrt(vs[i]))
}
X <- apply(x,1,sum)
hist(X)
```
```{r}
h <- hist(c(x))
for(i in 1:n.factor){
  hist(x[,i],density = 10 +i*3,add=TRUE, breaks=h$breaks,col=i+1)
}
```
```{r}
# k is interventional effect
k <- 0.4
# v.tr is variance of treatment effect
v.tr <- 1
x1.post <- x[,1] * k + rnorm(n,0,sqrt(v.tr))
x.post <- x
x.post[,1] <- x1.post
X.post <- apply(x.post,1,sum)
h <- hist(c(X,X.post),density =20)
hist(X,density=17,add=TRUE,col=2,breaks=h$breaks)
hist(X.post,density=17,add=TRUE,col=3,breaks=h$breaks)
```

治療効果が表れて、介入後の分布が左にシフトしている。

Intervention is successful and the distribution shifted smaller.

治療前と治療後との分布の違いを検定することはできる。

You can test the difference between the pre- and post-interventional distributions.

```{r}
t.test(X,X.post)
```

しかしながら、取られたデータは以下のように、治療前後の対応関係を持っている。

However the pre- and post-data are paired.

```{r}
xlim <- range(c(X,X.post))
plot(X,X.post,xlim=xlim,ylim=xlim)
```

従って、その対応を考慮して効果を判定する方がよい。

```{r}
t.test(X,X.post,paired=TRUE)
```

p値がぐっと小さくなることがわかる。

## Exercise
* この例で、次の4変数について、個別にその値を振り、その他の変数の値は固定し、値を振った変数がpaired検定に及ぼす影響についてシミュレーション的に評価せよ
 * 第1因子の寄与ms[1]
 * 第1因子のばらつきvs[1]
 * 介入の効果k
 * 介入の効果のばらつきv.tr