Bolin Wu
#
Survival Analysis 4: Cox proportional hazards model

# Model derivation

# Model estimation

# Model assumption

# Model application

### Model the intensity of experiencing the event of interest as a function of one covariate (Birth Cohort). Use the first level of the covariate as baseline (reference) level.

## Model the intensity of experiencing the event of interest as a function of two covariates (Birth Cohort and Residence). Use the first levels of the covariates as baseline (reference) levels.

### Test assumption

### Test improvement

# End

This post will briefly share the derivation, estimation, assumption and application of the *Cox proportional hazards (PH) model*. In addition, it will also mention using ANOVA to test two nested models.

- Assume we have a vector of p explanatory variables: $X = (X_{1}, X_{2},..., X_{p})$.
- We can express the hazard function $h(t|X) = h_{0}(t)g(X)$ as a function of time t and the vector of explanatory variables X. The baseline hazard $h_{0}(t)$ only depends of survival time t but not on X. The $g(X)$ is a function of the explanatory variables X which are assumed to be time-invariant.\
- When the function $g(X)$ is set to be

$g(X) = exp(\beta X) = exp(\beta_{1}X_{1} + \beta_{2}X_{2} + ... + \beta_{p}X_{p}) = exp(\sum_{i = 1}^{p} \beta_{i}X_{i})$

- The hazard function becomes

$\begin{aligned} h(t|X) &= h_{0}(t)g(X) = h_{0}exp(\sum_{i = 1}^{p} \beta_{i}X_{i}) \\ ln(h(t|X)) &= ln[h_{0}(t)] + \sum_{i = 1}^{p} \beta_{i}X_{i} \end{aligned}$

Where

- $h_{0}(t)$ is the baseline hazard.
- $ln[h_{0}(t)]$ plays the role of the random error term in ordinary multiple regression and
- $exp(\sum_{i = 1}^{p} \beta_{i}X_{i})$ is the relative hazard.

In case you wonder why $exp(\sum_{i = 1}^{p} \beta_{i}X_{i})$ is the relative hazard:

- Consider just one categorical variable X with values 0 and 1, then,

$\begin{aligned} h(t|X = 0) &= h_{0}(t) exp(\beta X) = h_{0}exp(\beta \times 0) = h_{0}(t) \\ h(t|X = 1) &= h_{0}(t) exp(\beta X) = h_{0}exp(\beta \times 1) = h_{0}exp(\beta) \end{aligned}$

- Thus we have

$\frac{h(t|X = 1)}{h(t|X = 0)} = \frac{h_{0}(t)exp(\beta)}{h_{0}(t)} = exp(\beta)$

- The proportion is a constant that is independent of survival time t. That is the reason we call such models as
**proportional hazards**models. No matter what the survival time t distribution looks like, it does not affect the hazards' proportion. This attribute makes it popular in research fields.

- At each uncensored data point $t_{i}$, assume $R(t_{i})$ to be the set of observations still at risk immediately before $t_{i}$:
- Conditional on $R(t_{i})$, the probability that point $(x_{i}, t_{i})$ is the one which occurs is given by

$\frac{exp(\beta_{i} x_{i})}{\sum_{j \in R(t_{i})}exp(\beta_{j} x_{j})}$

- The conditional likelihood $L(\beta)$ is given by the product of all the uncensored points:

$L(\beta) = \prod \frac{exp(\beta_{i} x_{i})}{\sum_{j \in R(t_{i})} exp(\beta_{j} * x_{j})}$

- The estimation of $\beta$ is obtained through iteration using the second derivatives of the likelihood $L(\beta)$. But sorry to be honest, I am not sure how the algorithm exactly works in details.

As mentioned above, *Cox PH model* focuses on studying the two hazard functions with the assumption that they are proportional to each other. In the Cox PH model, people do not need to specify the probability distribution for the baseline hazard function. As a result, it is very popular for researchers who uses observational data in their fields.

One useful post about testing the assumptions is here. In short there are 3 common methods to test the assumption of proportional hazards:

- Graphic approach: since $\frac{h(t|X=1)}{h(t|X = 0)} = exp(\beta)$ is a constant over t, we can plot the estimated hazard functions of the different levels of the covariate and examine if they are proportional, i.e., they do not cross. One can also use the log-log plots of estimated survivor functions. Under the PH assumption the two curves $-ln[-ln[\hat{S}_{1}(t)]]$ and $-ln[-ln[\hat{S}_{0}(t)]]$ should be parallel for all values of t.
- Time-dependent variables: introduce a interaction term of a potential time-dependent variable and a time's function to assess whether the time-depended variable is independent of time or not.
- Goodness-of-fit-test: one can use
`cox.zph()`

in R.

I would use an example data to demonstrate the application in R.

- Read data.

```
## read data
df <- readxl::read_excel("cox_model/data/Survival-2023-HomeExam-14-20-March-2023.xlsx")
## show the first few lines for readers to know what the df looks like
head(df) |> kable()
```

Cluster | BirCoh | Residence | Educ | ExpMonths | FirstBirth |
---|---|---|---|---|---|

1 | 2 | 3 | 0 | 27 | 0 |

1 | 4 | 3 | 0 | 11 | 1 |

1 | 4 | 3 | 0 | 63 | 0 |

1 | 7 | 3 | 0 | 5 | 1 |

1 | 1 | 3 | 0 | 4 | 1 |

1 | 7 | 3 | 0 | 29 | 1 |

- Create a survival object with
`Surv()`

function.

```
# create a survival object
birth_time <- Surv(df$ExpMonths, df$FirstBirth)
```

As mentioned above, the proportional hazard model is estimated as the form of

$\begin{aligned} h(t|X) &= h_{0}(t)g(X) = h_{0}exp(\sum_{i = 1}^{p} \beta_{i}X_{i}) \\ ln(h(t|X)) &= ln[h_{0}(t)] + \sum_{i = 1}^{p} \beta_{i}X_{i} \end{aligned}$

Where

- $h_{0}(t)$ is the baseline hazard and
- $exp(\sum_{i = 1}^{p} \beta_{i}X_{i})$ is the relative hazard.

*Please note: in proportional hazard model, the baseline hazard is left unspecified. Therefore we can not get it from the R output.*

- By default,
`coxph`

in R uses the first level as baseline. - The summary result from
`coxph`

is as follows (rounded to 3 digits):

```
cph_biroh <- coxph(birth_time ~ as.factor(BirCoh), data = df)
round(summary(cph_biroh)$coefficients, 3) |> kable()
```

coef | exp(coef) | se(coef) | z | Pr(>|z|) | |
---|---|---|---|---|---|

as.factor(BirCoh)2 | -0.314 | 0.730 | 0.068 | -4.615 | 0 |

as.factor(BirCoh)3 | -0.511 | 0.600 | 0.066 | -7.696 | 0 |

as.factor(BirCoh)4 | -0.709 | 0.492 | 0.068 | -10.388 | 0 |

as.factor(BirCoh)5 | -0.809 | 0.445 | 0.067 | -12.036 | 0 |

as.factor(BirCoh)6 | -1.014 | 0.363 | 0.071 | -14.200 | 0 |

as.factor(BirCoh)7 | -1.171 | 0.310 | 0.072 | -16.320 | 0 |

- Interpretation
- 'coef': this is the $\beta_{i}$ listed above. It is usually obtained through iteration of the log-likelihood function: $L(\beta) = \prod \frac{exp(\beta_{i}x_{i})}{\sum exp(\beta_{j} x_{j})}$. (
*Reference: lecture 5 notes*) - 'exp(coef)': relative hazard that we are interested.
- 'z': the z score = $\frac{\hat{\beta}}{SE(\hat{\beta})}$. It is used to get the p-value.
- 'Pr': it is derived by comparing z value and $N(0,1)$.

- 'coef': this is the $\beta_{i}$ listed above. It is usually obtained through iteration of the log-likelihood function: $L(\beta) = \prod \frac{exp(\beta_{i}x_{i})}{\sum exp(\beta_{j} x_{j})}$. (
- Result
- All the birth cohort levels are significant at 0.05 level.
- When cohort = 1 (youngest cohort) is the baseline: The relative hazard of the birth cohort = 2 is 0.730, i.e. at any given time, cohort 2 is 27% less likely to have the first child. Similar interpretation may apply to other cohorts.
- The conclusion is that older cohorts are less likely to have the first child.
- This result matches the Kaplan-Merier curve below.

```
library(ggsurvfit)
survfit2(birth_time ~ as.factor(BirCoh), data = df) %>%
ggsurvfit() +
labs(
x = "Months",
y = "Survival probability",
title = "Survival curve: Kaplan-Meier estimates",
color = "Birth cohort"
) +
theme(plot.title = element_text(hjust = 0.5)) +
add_confidence_interval()
```

```
cph_biroh_resid <- coxph(birth_time ~ as.factor(BirCoh) + as.factor(Residence), data = df
)
round(summary(cph_biroh_resid)$coefficients, 3) |> kable()
```

coef | exp(coef) | se(coef) | z | Pr(>|z|) | |
---|---|---|---|---|---|

as.factor(BirCoh)2 | -0.302 | 0.739 | 0.068 | -4.438 | 0 |

as.factor(BirCoh)3 | -0.493 | 0.611 | 0.066 | -7.420 | 0 |

as.factor(BirCoh)4 | -0.696 | 0.499 | 0.068 | -10.187 | 0 |

as.factor(BirCoh)5 | -0.790 | 0.454 | 0.067 | -11.741 | 0 |

as.factor(BirCoh)6 | -0.997 | 0.369 | 0.071 | -13.964 | 0 |

as.factor(BirCoh)7 | -1.151 | 0.316 | 0.072 | -16.057 | 0 |

as.factor(Residence)2 | 0.227 | 1.255 | 0.048 | 4.754 | 0 |

as.factor(Residence)3 | 0.409 | 1.506 | 0.042 | 9.633 | 0 |

- Result
- The birth cohort part is similar to the interpretation above.
- When residence = 1 (Metropolitan City) is baseline, the relative hazard of residence = 2 (Other Towns) is 1.255; the relative hazard of residence = 3 (rural area) is 1.506.
- The conclusion is that at a given time, married women from other towns are 25.5% more likely and women from rural area are 50.6% more likely to have the first kid, compared with women from metropolitan city.
- This result matches the Kaplan-Merier curve below.

```
survfit2(birth_time ~ as.factor(Residence), data = df) %>%
ggsurvfit() +
labs(
x = "Months",
y = "Survival probability",
title = "Survival curve: Kaplan-Meier estimates",
color = "Residence level"
) +
theme(plot.title = element_text(hjust = 0.5)) +
add_confidence_interval()
```

Please note that the code above is just for demonstration purpose. One can further use `cox.zph()`

to test if the PH assumption is fulfilled. In fact, the example data does not fulfill the assumption.

```
cox.zph(cph_biroh_resid)
```

```
## chisq df p
## as.factor(BirCoh) 23.12 6 0.00076
## as.factor(Residence) 1.66 2 0.43610
## GLOBAL 24.71 8 0.00174
```

```
cox.zph(cph_biroh)
```

```
## chisq df p
## as.factor(BirCoh) 24.4 6 0.00043
## GLOBAL 24.4 6 0.00043
```

For the two cases above, one uses birth cohort only, the other uses both birth cohort and education level. Natually people would like to know whether adding the additional factor could improve the model performance. In this case, we can use ANOVA with setting below:

- We have model 1 with all variables, model 2 with restricted variables.
- The null hypothesis is that all the coefficients in model 1 are 0, except for the ones in model 2.

In R, it looks like below:

```
anova(cph_biroh_resid, cph_biroh)
```

```
## Analysis of Deviance Table
## Cox model: response is birth_time
## Model 1: ~ as.factor(BirCoh) + as.factor(Residence)
## Model 2: ~ as.factor(BirCoh)
## loglik Chisq Df Pr(>|Chi|)
## 1 -40265
## 2 -40322 112.85 2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Result:

- The p-value is smaller than 0.05, therefore we reject the null hypothesis that all the coefficients except for the ones in model 2 are 0.
- The conclusion is that we need to take model 1: adding education level improves the model.

Thank you for reading this far. I hope you can get something useful from this post. In the next posts I will share the Accelerated Failure-time (AFT) models.

- Model derivation
- Model estimation
- Model assumption
- Model application

*- Model the intensity of experiencing the event of interest as a function of one covariate (Birth Cohort). Use the first level of the covariate as baseline (reference) level.
- Model the intensity of experiencing the event of interest as a function of two covariates (Birth Cohort and Residence). Use the first levels of the covariates as baseline (reference) levels.

- End