- 100 samples x 25000 genes
- You can predict perfectly when you are allowed to use explanatory variables as many as sample size.
n <- 10
m <- 10
p <- sample(0:1,n,replace=TRUE)
g <- matrix(sample(0:1,n*m,replace=TRUE),n,m)
p
g
lm.out <- lm(p~g)
p.est <- predict(lm.out)
plot(p,p.est)
print(cbind(p,p.est))
print(cbind(p,round(p.est)))
print(round(lm.out$coefficients))
> p
[1] 1 1 0 0 0 1 1 0 0 0
> g
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0 1 1 0 1 1 0 1 1 0
[2,] 1 0 0 0 0 1 1 0 0 1
[3,] 0 0 0 0 1 1 1 0 0 1
[4,] 0 1 1 0 1 1 1 0 0 1
[5,] 0 1 0 0 0 0 1 0 0 1
[6,] 1 1 1 0 0 0 1 1 0 0
[7,] 1 0 0 0 0 0 1 0 0 0
[8,] 1 0 1 0 0 0 1 1 0 0
[9,] 0 1 1 1 1 0 0 0 0 0
[10,] 1 0 0 0 1 1 1 0 0 0
> print(cbind(p,p.est))
p p.est
1 1 1.000000e+00
2 1 1.000000e+00
3 0 -3.333333e-01
4 0 8.881784e-16
5 0 3.333333e-01
6 1 6.666667e-01
7 1 6.666667e-01
8 0 3.333333e-01
9 0 1.110223e-16
10 0 3.333333e-01
> print(cbind(p,round(p.est)))
p
1 1 1
2 1 1
3 0 0
4 0 0
5 0 0
6 1 1
7 1 1
8 0 0
9 0 0
10 0 0
>
> print(round(lm.out$coefficients))
(Intercept) g1 g2 g3 g4 g5 g6 g7 g8
1 1 0 0 -1 -1 0 -1 0
g9 g10
NA NA
- When you have p >> n explanatory variabls, you would find many variable sets, each of which perfectly predict dependent variable.