Bolin Wu

Survival Analysis 4: Cox proportional hazards model

2023-08-30 · 15 min read
survival_analysis R

This post will briefly share the derivation, estimation, assumption and application of the Cox proportional hazards (PH) model. In addition, it will also mention using ANOVA to test two nested models.

Model derivation

  • Assume we have a vector of p explanatory variables: X=(X1,X2,...,Xp)X = (X_{1}, X_{2},..., X_{p}).
  • We can express the hazard function h(tX)=h0(t)g(X)h(t|X) = h_{0}(t)g(X) as a function of time t and the vector of explanatory variables X. The baseline hazard h0(t)h_{0}(t) only depends of survival time t but not on X. The g(X)g(X) is a function of the explanatory variables X which are assumed to be time-invariant.\
  • When the function g(X)g(X) is set to be

g(X)=exp(βX)=exp(β1X1+β2X2+...+βpXp)=exp(i=1pβiXi)g(X) = exp(\beta X) = exp(\beta_{1}X_{1} + \beta_{2}X_{2} + ... + \beta_{p}X_{p}) = exp(\sum_{i = 1}^{p} \beta_{i}X_{i})

  • The hazard function becomes

h(tX)=h0(t)g(X)=h0exp(i=1pβiXi)ln(h(tX))=ln[h0(t)]+i=1pβiXi\begin{aligned} h(t|X) &= h_{0}(t)g(X) = h_{0}exp(\sum_{i = 1}^{p} \beta_{i}X_{i}) \\ ln(h(t|X)) &= ln[h_{0}(t)] + \sum_{i = 1}^{p} \beta_{i}X_{i} \end{aligned}

Where

  • h0(t)h_{0}(t) is the baseline hazard.
  • ln[h0(t)]ln[h_{0}(t)] plays the role of the random error term in ordinary multiple regression and
  • exp(i=1pβiXi)exp(\sum_{i = 1}^{p} \beta_{i}X_{i}) is the relative hazard.

In case you wonder why exp(i=1pβiXi)exp(\sum_{i = 1}^{p} \beta_{i}X_{i}) is the relative hazard:

  • Consider just one categorical variable X with values 0 and 1, then,

h(tX=0)=h0(t)exp(βX)=h0exp(β×0)=h0(t)h(tX=1)=h0(t)exp(βX)=h0exp(β×1)=h0exp(β)\begin{aligned} h(t|X = 0) &= h_{0}(t) exp(\beta X) = h_{0}exp(\beta \times 0) = h_{0}(t) \\ h(t|X = 1) &= h_{0}(t) exp(\beta X) = h_{0}exp(\beta \times 1) = h_{0}exp(\beta) \end{aligned}

  • Thus we have

h(tX=1)h(tX=0)=h0(t)exp(β)h0(t)=exp(β)\frac{h(t|X = 1)}{h(t|X = 0)} = \frac{h_{0}(t)exp(\beta)}{h_{0}(t)} = exp(\beta)

  • The proportion is a constant that is independent of survival time t. That is the reason we call such models as proportional hazards models. No matter what the survival time t distribution looks like, it does not affect the hazards' proportion. This attribute makes it popular in research fields.

Model estimation

  • At each uncensored data point tit_{i}, assume R(ti)R(t_{i}) to be the set of observations still at risk immediately before tit_{i}:
  • Conditional on R(ti)R(t_{i}), the probability that point (xi,ti)(x_{i}, t_{i}) is the one which occurs is given by

exp(βixi)jR(ti)exp(βjxj)\frac{exp(\beta_{i} x_{i})}{\sum_{j \in R(t_{i})}exp(\beta_{j} x_{j})}

  • The conditional likelihood L(β)L(\beta) is given by the product of all the uncensored points:

L(β)=exp(βixi)jR(ti)exp(βjxj)L(\beta) = \prod \frac{exp(\beta_{i} x_{i})}{\sum_{j \in R(t_{i})} exp(\beta_{j} * x_{j})}

  • The estimation of β\beta is obtained through iteration using the second derivatives of the likelihood L(β)L(\beta). But sorry to be honest, I am not sure how the algorithm exactly works in details.

Model assumption

As mentioned above, Cox PH model focuses on studying the two hazard functions with the assumption that they are proportional to each other. In the Cox PH model, people do not need to specify the probability distribution for the baseline hazard function. As a result, it is very popular for researchers who uses observational data in their fields.

One useful post about testing the assumptions is here. In short there are 3 common methods to test the assumption of proportional hazards:

  • Graphic approach: since h(tX=1)h(tX=0)=exp(β)\frac{h(t|X=1)}{h(t|X = 0)} = exp(\beta) is a constant over t, we can plot the estimated hazard functions of the different levels of the covariate and examine if they are proportional, i.e., they do not cross. One can also use the log-log plots of estimated survivor functions. Under the PH assumption the two curves ln[ln[S^1(t)]]-ln[-ln[\hat{S}_{1}(t)]] and ln[ln[S^0(t)]]-ln[-ln[\hat{S}_{0}(t)]] should be parallel for all values of t.
  • Time-dependent variables: introduce a interaction term of a potential time-dependent variable and a time's function to assess whether the time-depended variable is independent of time or not.
  • Goodness-of-fit-test: one can use cox.zph() in R.

Model application

I would use an example data to demonstrate the application in R.

  • Read data.
## read data
df <- readxl::read_excel("cox_model/data/Survival-2023-HomeExam-14-20-March-2023.xlsx")
## show the first few lines for readers to know what the df looks like
head(df) |> kable()
Cluster BirCoh Residence Educ ExpMonths FirstBirth
1 2 3 0 27 0
1 4 3 0 11 1
1 4 3 0 63 0
1 7 3 0 5 1
1 1 3 0 4 1
1 7 3 0 29 1
  • Create a survival object with Surv() function.
# create a survival object
birth_time <- Surv(df$ExpMonths, df$FirstBirth)

Model the intensity of experiencing the event of interest as a function of one covariate (Birth Cohort). Use the first level of the covariate as baseline (reference) level.

As mentioned above, the proportional hazard model is estimated as the form of

h(tX)=h0(t)g(X)=h0exp(i=1pβiXi)ln(h(tX))=ln[h0(t)]+i=1pβiXi\begin{aligned} h(t|X) &= h_{0}(t)g(X) = h_{0}exp(\sum_{i = 1}^{p} \beta_{i}X_{i}) \\ ln(h(t|X)) &= ln[h_{0}(t)] + \sum_{i = 1}^{p} \beta_{i}X_{i} \end{aligned}

Where

  • h0(t)h_{0}(t) is the baseline hazard and
  • exp(i=1pβiXi)exp(\sum_{i = 1}^{p} \beta_{i}X_{i}) is the relative hazard.

Please note: in proportional hazard model, the baseline hazard is left unspecified. Therefore we can not get it from the R output.

  • By default, coxph in R uses the first level as baseline.
  • The summary result from coxph is as follows (rounded to 3 digits):
cph_biroh <- coxph(birth_time ~ as.factor(BirCoh), data = df)
round(summary(cph_biroh)$coefficients, 3) |> kable()
coef exp(coef) se(coef) z Pr(>|z|)
as.factor(BirCoh)2 -0.314 0.730 0.068 -4.615 0
as.factor(BirCoh)3 -0.511 0.600 0.066 -7.696 0
as.factor(BirCoh)4 -0.709 0.492 0.068 -10.388 0
as.factor(BirCoh)5 -0.809 0.445 0.067 -12.036 0
as.factor(BirCoh)6 -1.014 0.363 0.071 -14.200 0
as.factor(BirCoh)7 -1.171 0.310 0.072 -16.320 0
  • Interpretation
    • 'coef': this is the βi\beta_{i} listed above. It is usually obtained through iteration of the log-likelihood function: L(β)=exp(βixi)exp(βjxj)L(\beta) = \prod \frac{exp(\beta_{i}x_{i})}{\sum exp(\beta_{j} x_{j})}. (Reference: lecture 5 notes)
    • 'exp(coef)': relative hazard that we are interested.
    • 'z': the z score = β^SE(β^)\frac{\hat{\beta}}{SE(\hat{\beta})}. It is used to get the p-value.
    • 'Pr': it is derived by comparing z value and N(0,1)N(0,1).
  • Result
    1. All the birth cohort levels are significant at 0.05 level.
    2. When cohort = 1 (youngest cohort) is the baseline: The relative hazard of the birth cohort = 2 is 0.730, i.e. at any given time, cohort 2 is 27% less likely to have the first child. Similar interpretation may apply to other cohorts.
    3. The conclusion is that older cohorts are less likely to have the first child.
    4. This result matches the Kaplan-Merier curve below.
library(ggsurvfit)
survfit2(birth_time ~ as.factor(BirCoh), data = df) %>%
  ggsurvfit() +
  labs(
    x = "Months",
    y = "Survival probability",
    title = "Survival curve: Kaplan-Meier estimates",
    color = "Birth cohort"
  ) +
  theme(plot.title = element_text(hjust = 0.5)) +
  add_confidence_interval()

Model the intensity of experiencing the event of interest as a function of two covariates (Birth Cohort and Residence). Use the first levels of the covariates as baseline (reference) levels.

cph_biroh_resid <- coxph(birth_time ~ as.factor(BirCoh) + as.factor(Residence), data = df
)
round(summary(cph_biroh_resid)$coefficients, 3) |> kable()
coef exp(coef) se(coef) z Pr(>|z|)
as.factor(BirCoh)2 -0.302 0.739 0.068 -4.438 0
as.factor(BirCoh)3 -0.493 0.611 0.066 -7.420 0
as.factor(BirCoh)4 -0.696 0.499 0.068 -10.187 0
as.factor(BirCoh)5 -0.790 0.454 0.067 -11.741 0
as.factor(BirCoh)6 -0.997 0.369 0.071 -13.964 0
as.factor(BirCoh)7 -1.151 0.316 0.072 -16.057 0
as.factor(Residence)2 0.227 1.255 0.048 4.754 0
as.factor(Residence)3 0.409 1.506 0.042 9.633 0
  • Result
    1. The birth cohort part is similar to the interpretation above.
    2. When residence = 1 (Metropolitan City) is baseline, the relative hazard of residence = 2 (Other Towns) is 1.255; the relative hazard of residence = 3 (rural area) is 1.506.
    3. The conclusion is that at a given time, married women from other towns are 25.5% more likely and women from rural area are 50.6% more likely to have the first kid, compared with women from metropolitan city.
    4. This result matches the Kaplan-Merier curve below.
survfit2(birth_time ~ as.factor(Residence), data = df) %>%
  ggsurvfit() +
  labs(
    x = "Months",
    y = "Survival probability",
    title = "Survival curve: Kaplan-Meier estimates",
    color = "Residence level"
  ) +
  theme(plot.title = element_text(hjust = 0.5)) +
  add_confidence_interval()

Test assumption

Please note that the code above is just for demonstration purpose. One can further use cox.zph() to test if the PH assumption is fulfilled. In fact, the example data does not fulfill the assumption.

cox.zph(cph_biroh_resid)
##                      chisq df       p
## as.factor(BirCoh)    23.12  6 0.00076
## as.factor(Residence)  1.66  2 0.43610
## GLOBAL               24.71  8 0.00174
cox.zph(cph_biroh)
##                   chisq df       p
## as.factor(BirCoh)  24.4  6 0.00043
## GLOBAL             24.4  6 0.00043

Test improvement

For the two cases above, one uses birth cohort only, the other uses both birth cohort and education level. Natually people would like to know whether adding the additional factor could improve the model performance. In this case, we can use ANOVA with setting below:

  • We have model 1 with all variables, model 2 with restricted variables.
  • The null hypothesis is that all the coefficients in model 1 are 0, except for the ones in model 2.

In R, it looks like below:

anova(cph_biroh_resid, cph_biroh)
## Analysis of Deviance Table
##  Cox model: response is  birth_time
##  Model 1: ~ as.factor(BirCoh) + as.factor(Residence)
##  Model 2: ~ as.factor(BirCoh)
##   loglik  Chisq Df Pr(>|Chi|)    
## 1 -40265                         
## 2 -40322 112.85  2  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Result:

  • The p-value is smaller than 0.05, therefore we reject the null hypothesis that all the coefficients except for the ones in model 2 are 0.
  • The conclusion is that we need to take model 1: adding education level improves the model.

End

Thank you for reading this far. I hope you can get something useful from this post. In the next posts I will share the Accelerated Failure-time (AFT) models.

Prudence is a fountain of life to the prudent.