This post will briefly share the derivation, estimation, assumption and application of the Cox proportional hazards (PH) model. In addition, it will also mention using ANOVA to test two nested models.
Model derivation
Assume we have a vector of p explanatory variables: X=(X1,X2,...,Xp).
We can express the hazard function h(t∣X)=h0(t)g(X) as a function of time t and the vector of explanatory variables X. The baseline hazard h0(t) only depends of survival time t but not on X. The g(X) is a function of the explanatory variables X which are assumed to be time-invariant.\
The proportion is a constant that is independent of survival time t. That is the reason we call such models as proportional hazards models. No matter what the survival time t distribution looks like, it does not affect the hazards' proportion. This attribute makes it popular in research fields.
Model estimation
At each uncensored data point ti, assume R(ti) to be the set of observations still at risk immediately before ti:
Conditional on R(ti), the probability that point (xi,ti) is the one which occurs is given by
∑j∈R(ti)exp(βjxj)exp(βixi)
The conditional likelihood L(β) is given by the product of all the uncensored points:
L(β)=∏∑j∈R(ti)exp(βj∗xj)exp(βixi)
The estimation of β is obtained through iteration using the second derivatives of the likelihood L(β). But sorry to be honest, I am not sure how the algorithm exactly works in details.
Model assumption
As mentioned above, Cox PH model focuses on studying the two hazard functions with the assumption that they are proportional to each other. In the Cox PH model, people do not need to specify the probability distribution for the baseline hazard function. As a result, it is very popular for researchers who uses observational data in their fields.
One useful post about testing the assumptions is here. In short there are 3 common methods to test the assumption of proportional hazards:
Graphic approach: since h(t∣X=0)h(t∣X=1)=exp(β) is a constant over t, we can plot the estimated hazard functions of the different levels of the covariate and examine if they are proportional, i.e., they do not cross. One can also use the log-log plots of estimated survivor functions. Under the PH assumption the two curves −ln[−ln[S^1(t)]] and −ln[−ln[S^0(t)]] should be parallel for all values of t.
Time-dependent variables: introduce a interaction term of a potential time-dependent variable and a time's function to assess whether the time-depended variable is independent of time or not.
Goodness-of-fit-test: one can use cox.zph() in R.
Model application
I would use an example data to demonstrate the application in R.
Read data.
## read data
df <- readxl::read_excel("cox_model/data/Survival-2023-HomeExam-14-20-March-2023.xlsx")
## show the first few lines for readers to know what the df looks like
head(df) |> kable()
Cluster
BirCoh
Residence
Educ
ExpMonths
FirstBirth
1
2
3
0
27
0
1
4
3
0
11
1
1
4
3
0
63
0
1
7
3
0
5
1
1
1
3
0
4
1
1
7
3
0
29
1
Create a survival object with Surv() function.
# create a survival object
birth_time <- Surv(df$ExpMonths, df$FirstBirth)
Model the intensity of experiencing the event of interest as a function of one covariate (Birth Cohort). Use the first level of the covariate as baseline (reference) level.
As mentioned above, the proportional hazard model is estimated as the form of
'coef': this is the βi listed above. It is usually obtained through iteration of the log-likelihood function: L(β)=∏∑exp(βjxj)exp(βixi). (Reference: lecture 5 notes)
'exp(coef)': relative hazard that we are interested.
'z': the z score = SE(β^)β^. It is used to get the p-value.
'Pr': it is derived by comparing z value and N(0,1).
Result
All the birth cohort levels are significant at 0.05 level.
When cohort = 1 (youngest cohort) is the baseline: The relative hazard of the birth cohort = 2 is 0.730, i.e. at any given time, cohort 2 is 27% less likely to have the first child. Similar interpretation may apply to other cohorts.
The conclusion is that older cohorts are less likely to have the first child.
This result matches the Kaplan-Merier curve below.
library(ggsurvfit)
survfit2(birth_time ~ as.factor(BirCoh), data = df) %>%
ggsurvfit() +
labs(
x = "Months",
y = "Survival probability",
title = "Survival curve: Kaplan-Meier estimates",
color = "Birth cohort"
) +
theme(plot.title = element_text(hjust = 0.5)) +
add_confidence_interval()
Model the intensity of experiencing the event of interest as a function of two covariates (Birth Cohort and Residence). Use the first levels of the covariates as baseline (reference) levels.
The birth cohort part is similar to the interpretation above.
When residence = 1 (Metropolitan City) is baseline, the relative hazard of residence = 2 (Other Towns) is 1.255; the relative hazard of residence = 3 (rural area) is 1.506.
The conclusion is that at a given time, married women from other towns are 25.5% more likely and women from rural area are 50.6% more likely to have the first kid, compared with women from metropolitan city.
This result matches the Kaplan-Merier curve below.
survfit2(birth_time ~ as.factor(Residence), data = df) %>%
ggsurvfit() +
labs(
x = "Months",
y = "Survival probability",
title = "Survival curve: Kaplan-Meier estimates",
color = "Residence level"
) +
theme(plot.title = element_text(hjust = 0.5)) +
add_confidence_interval()
Test assumption
Please note that the code above is just for demonstration purpose. One can further use cox.zph() to test if the PH assumption is fulfilled. In fact, the example data does not fulfill the assumption.
cox.zph(cph_biroh_resid)
## chisq df p
## as.factor(BirCoh) 23.12 6 0.00076
## as.factor(Residence) 1.66 2 0.43610
## GLOBAL 24.71 8 0.00174
cox.zph(cph_biroh)
## chisq df p
## as.factor(BirCoh) 24.4 6 0.00043
## GLOBAL 24.4 6 0.00043
Test improvement
For the two cases above, one uses birth cohort only, the other uses both birth cohort and education level. Natually people would like to know whether adding the additional factor could improve the model performance. In this case, we can use ANOVA with setting below:
We have model 1 with all variables, model 2 with restricted variables.
The null hypothesis is that all the coefficients in model 1 are 0, except for the ones in model 2.
The p-value is smaller than 0.05, therefore we reject the null hypothesis that all the coefficients except for the ones in model 2 are 0.
The conclusion is that we need to take model 1: adding education level improves the model.
End
Thank you for reading this far. I hope you can get something useful from this post. In the next posts I will share the Accelerated Failure-time (AFT) models.