Cox比例ハザードモデルによる生存分析を試してみた

Tokyo.R #46 Cox比例ハザードモデルとその周辺拝見してから生存時間分析試したいと思ってたらちょうど良い題材が見つかったのでやってみた。

<a href="http://stackoverflow.com/questions/27080207/survival-analysis-for-telecom-churn-using-r">Survival Analysis for Telecom Churn using R</a>stackoverflow.com

とりえあずそのままコピペ

まずはコピペして実行。

> # データロード
> nm <- read.csv("http://www.sgi.com/tech/mlc/db/churn.names", 
+                skip=4, colClasses=c("character", "NULL"), header=FALSE, sep=":")[[1]]
> dat <- read.csv("http://www.sgi.com/tech/mlc/db/churn.data", header=FALSE, col.names=c(nm, "Churn"))
> 
> # 生存時間データオブジェクト作成
> library(survival)
> s <- with(dat, Surv(account.length, as.numeric(Churn)))
> 
> # モデル作成
> model <- coxph(s ~ total.day.charge + number.customer.service.calls, data=dat[, -4])
> 
> # モデル確認
> summary(model)
Call:
coxph(formula = s ~ total.day.charge + number.customer.service.calls, 
    data = dat[, -4])

  n= 3333, number of events= 483 

                                 coef exp(coef) se(coef)     z Pr(>|z|)    
total.day.charge              0.05373   1.05520  0.00502 10.70   <2e-16 ***
number.customer.service.calls 0.31043   1.36401  0.02726 11.39   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                              exp(coef) exp(-coef) lower .95 upper .95
total.day.charge                  1.055     0.9477     1.045     1.066
number.customer.service.calls     1.364     0.7331     1.293     1.439

Concordance= 0.702  (se = 0.015 )
Rsquare= 0.066   (max possible= 0.872 )
Likelihood ratio test= 228.2  on 2 df,   p=0
Wald test            = 244.6  on 2 df,   p=0
Score (logrank) test = 246.6  on 2 df,   p=0

> plot(survfit(model))

f:id:tak95:20150602174344p:plain

stepAICによる変数選択

[R]R言語で生存時間分析 - yokkunsの日記にもあるようにCox比例ハザードモデルでもAICによる変数選択が行なえる。

> model <- coxph(s ~.-account.length-phone.number-total.day.charge-Churn, data=dat)
> AIC(model)
[1] 6459.798

> library(MASS)
> model.step <- stepAIC(model)
(中略)
> AIC(model.step)
[1] 6428.962
> summary(model.step)
Call:
coxph(formula = s ~ international.plan + voice.mail.plan + number.vmail.messages + 
    total.day.minutes + total.eve.minutes + total.night.minutes + 
    total.intl.calls + total.intl.charge + number.customer.service.calls, 
    data = dat)

  n= 3333, number of events= 483 

                                    coef  exp(coef)   se(coef)      z
international.plan yes         1.1902999  3.2880672  0.1028265 11.576
voice.mail.plan yes           -2.1294370  0.1189042  0.4783759 -4.451
number.vmail.messages          0.0432365  1.0441848  0.0147541  2.930
total.day.minutes              0.0091907  1.0092330  0.0008652 10.623
total.eve.minutes              0.0041881  1.0041969  0.0008873  4.720
total.night.minutes            0.0030797  1.0030845  0.0008930  3.449
total.intl.calls              -0.0701120  0.9322894  0.0209824 -3.341
total.intl.charge              0.2024902  1.2244481  0.0615919  3.288
number.customer.service.calls  0.3037536  1.3549352  0.0271930 11.170
                              Pr(>|z|)    
international.plan yes         < 2e-16 ***
voice.mail.plan yes           8.53e-06 ***
number.vmail.messages         0.003384 ** 
total.day.minutes              < 2e-16 ***
total.eve.minutes             2.36e-06 ***
total.night.minutes           0.000563 ***
total.intl.calls              0.000833 ***
total.intl.charge             0.001010 ** 
number.customer.service.calls  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

                              exp(coef) exp(-coef) lower .95 upper .95
international.plan yes           3.2881     0.3041   2.68791    4.0222
voice.mail.plan yes              0.1189     8.4101   0.04656    0.3037
number.vmail.messages            1.0442     0.9577   1.01442    1.0748
total.day.minutes                1.0092     0.9909   1.00752    1.0109
total.eve.minutes                1.0042     0.9958   1.00245    1.0059
total.night.minutes              1.0031     0.9969   1.00133    1.0048
total.intl.calls                 0.9323     1.0726   0.89473    0.9714
total.intl.charge                1.2244     0.8167   1.08521    1.3816
number.customer.service.calls    1.3549     0.7380   1.28461    1.4291

Concordance= 0.772  (se = 0.015 )
Rsquare= 0.124   (max possible= 0.872 )
Likelihood ratio test= 440.4  on 9 df,   p=0
Wald test            = 469.1  on 9 df,   p=0
Score (logrank) test = 497  on 9 df,   p=0

ボイスメール、昼間・夕方・夜間の総通話時間、国際電話の呼数・通話料、カスタマーセンターへの呼数などが重要なようです。

ノモグラムのプロット

@kikurag1001氏のスライドを参考にノモグラムを描いてみる

> library(rms)
> ddist <- with(dat, datadist(international.plan, 
+                             voice.mail.plan,
+                             number.vmail.messages,
+                             total.day.minutes,
+                             total.eve.minutes,
+                             total.night.minutes, 
+                             total.intl.calls,
+                             total.intl.charge,
+                             number.customer.service.calls))
> options(datadist='ddist')
> f <- cph(Surv(account.length, as.numeric(Churn))~international.plan
+          +voice.mail.plan
+          +number.vmail.messages
+          +total.day.minutes
+          +total.eve.minutes
+          +total.night.minutes
+          +total.intl.calls
+          +total.intl.charge
+          +number.customer.service.calls
+          , data=dat, x=T, y=T, surv=T
+ )
> 
> survfun <- Survival(f)
> m50 <- function(lp) survfun(50, lp)
> m100 <- function(lp) survfun(100, lp)
> at.surv <- c(.001, .01, seq(.1,.9, by=.1), .95, .99, .999)
> 
> nom <- with(dat, nomogram(f1, 
+                           fun=list(m50, m100),
+                           fun.at=list(at.surv, at.surv),
+                           lp=F,
+                           funlabel=c("50 months later", "100 months later")))
> plot(nom)
>

f:id:tak95:20150602175035p:plain

描けた!

予測結果

各レコードの生存確率はsurvfitで求められる。

> d.fit <- survfit(f, newdata=dat)
> 
> prob <- as.data.frame(t(d.fit$surv))
> colnames(prob) <- d.fit$time
> head(prob)
          1         2        12        13        16        17        19
1 0.9998239 0.9996475 0.9994702 0.9992925 0.9991142 0.9987570 0.9985781
2 0.9999111 0.9998221 0.9997325 0.9996428 0.9995528 0.9993724 0.9992820
3 0.9998321 0.9996640 0.9994949 0.9993256 0.9991556 0.9988151 0.9986445
4 0.9990591 0.9981178 0.9971718 0.9962249 0.9952754 0.9933751 0.9924243
5 0.9991587 0.9983170 0.9974709 0.9966240 0.9957747 0.9940746 0.9932239
6 0.9994656 0.9989308 0.9983931 0.9978547 0.9973145 0.9962328 0.9956913
         21        23        24        25        27        28        29
1 0.9983987 0.9980390 0.9974984 0.9973180 0.9971372 0.9969559 0.9967742
2 0.9991914 0.9990097 0.9987365 0.9986453 0.9985539 0.9984622 0.9983704
3 0.9984735 0.9981306 0.9976152 0.9974432 0.9972709 0.9970980 0.9969248
4 0.9914722 0.9895648 0.9867034 0.9857500 0.9847957 0.9838391 0.9828812
5 0.9923719 0.9906647 0.9881031 0.9872495 0.9863948 0.9855381 0.9846801
6 0.9951488 0.9940612 0.9924280 0.9918835 0.9913381 0.9907912 0.9902433
（中略）
        184       188       193       197       201        208        209
1 0.7011099 0.6937232 0.6832193 0.6713584 0.6562353 0.63522574 0.61162993
2 0.8359023 0.8314451 0.8250663 0.8178048 0.8084532 0.79528275 0.78023111
3 0.7128446 0.7056836 0.6954944 0.6839802 0.6692852 0.64884397 0.62584874
4 0.1499163 0.1416661 0.1305734 0.1189073 0.1052758 0.08847216 0.07226888
5 0.1832854 0.1742403 0.1619897 0.1489861 0.1336182 0.11437779 0.09545318
6 0.3404308 0.3296630 0.3147479 0.2984579 0.2785201 0.25233274 0.22495074
         212        224         225
1 0.57410858 0.49796963 0.383945846
2 0.75569131 0.70332058 0.616803440
3 0.58919620 0.51446662 0.401516500
4 0.05152452 0.02408735 0.006001394
5 0.07053651 0.03574049 0.010316663
6 0.18563213 0.12054360 0.054757376