Bolin Wu

Survival Analysis 2: Non-Parametric Estimation of Survival Functions

Survival Analysis 2: Non-Parametric Estimation of Survival Functions
2023-07-21 · 17 min read
survival_analysis R

Concepts of survival function estimations and corresponding calculations both manually and in R.

Introduction

In the Survival Analysis 1: Basic concepts and three fundamental functions I
shared about the three basic blocks of survival analysis: density function,
survival function and hazard function.

Next, it is natural to think about: now we know their meaning, but how
to estimate them? It brings us to the next topic of non-parametric estimation
of survival functions
. The word non-parametric means there is no
assumption
about the distribution of survival time T.

A simple case without censoring

No censoring indicates all samples are observed until they experience the event of
interest (e.g., death, unemployment, account deletion). Then we can
estimate survival function S(t) by:

S^ti=number of individuals survived longer than ttotal number of individuals in the sample\hat{S}_{t_{i}} = \frac{\text{number of individuals survived longer than t}}{\text{total number of individuals in the sample}}

Assume we are observing 8 mice’ survival time

Time Accumulate death number S^ti\hat{S}_{t_{i}}
2 3 838=58=0.625\frac{8-3}{8} = \frac{5}{8} = 0.625
4 5 858=38=0.375\frac{8-5}{8} = \frac{3}{8} = 0.375
12 8 888\frac{8-8}{8} = 0

Estimation with consoring

The method above is applicable only in situations where all individuals
are observed until the events of interest occur (no censoring). However,
in real life, such a scenario is rare because of unfeasible study
design, high expenses, etc. As a result, we have to adjust the risk
set accordingly when new observations are entering the study or
previous ones are being censored. Two common methods are Life-Table
(Actuarial)
and Product-Limit (Kaplan-Meier) methods.

The Life-Table (Actuarial) method

This method objectively defines the time intervals by split points and
assume the observations being censored within an interval were at risk
for half of the length of that interval. Let me first introduce some
jargon:

  • The split points = 0τ1τ2...τk0 \le \tau_{1} \le \tau_{2} ...\le \tau_{k}
  • Then the time intervals = Ir=[τr,τr+1)I_{r} = [\tau_{r}, \tau_{r+1})
  • The number of observations enter the rthr^{th} interval = NrN_{r}
  • The number of observations experiencing the event in the rthr^{th}
    interval = ErE_{r}
  • The number of observations censored in the rthr^{th} interval = ZrZ_{r}
  • Therefore we have Nr=Nr1Er1Zr1N_{r} = N_{r-1} - E_{r-1} - Z_{r-1}
  • The risk set (observations exposed to risk of experiencing the event)
    = Rr=Nr0.5ZrR_{r} = N_{r} - 0.5*Z_{r}. The “0.5Zr0.5 * Z_{r}” part comes from the
    assumption mentioned above.

Next, we can compute conditional probability of experiencing the
event in the rthr^{th} interval:

q^r=ErRr\hat{q}_{r} = \frac{E_{r}}{R_{r}}

The conditional probability of surviving the rthr^{th} interval:

p^r=1q^r\hat{p}_{r} = 1 - \hat{q}_{r}

Finally we can compute the survivor function as

  • when r = 0:

S^0=1\hat{S}_{0} = 1

  • for r = 1, 2, 3, …

S^r=p^r1S^r1=p^r1(p^r2S^r2)=...=i=1r1p^i\hat{S}_{r} = \hat{p}_{r-1} * \hat{S}_{r-1} = \hat{p}_{r-1} * (\hat{p}_{r-2} * \hat{S}_{r-2}) = ... = \prod_{i=1}^{r-1}\hat{p}_{i}

  • with standard error estimation

SE(S^r)=S^r2(i=1rq^ip^iRi)SE(\hat{S}_{r}) = \sqrt{\hat{S}_{r}^{2} (\sum_{i=1}^{r} \frac{\hat{q}_{i}}{\hat{p}_{i} R_{i}})}

  • density function

f^r=S^r1S^rτrτr1\hat{f}_{r} = \frac{\hat{S}_{r-1} - \hat{S}_{r}}{\tau_{r} - \tau_{r-1}}

  • with standard error estimation

SE(f^r)=(q^rS^rτr+1τr)2(i=1r1q^ip^iRi+p^rq^rRr)SE(\hat{f}_{r}) = \sqrt{(\frac{\hat{q}_{r}\hat{S}_{r}}{\tau_{r+1}-\tau_{r}})^{2} (\sum_{i=1}^{r-1} \frac{\hat{q}_{i}}{\hat{p}_{i}R_{i}} + \frac{\hat{p}_{r}}{\hat{q}_{r}R_{r}})}

  • the hazard function is estimated by

h^r=Er(RrEr2)(τr+1τr)\hat{h}_{r} = \frac{E_{r}}{(R_{r} - \frac{E_{r}}{2}) (\tau_{r+1}-\tau_{r})}

  • with standard error estimation

SE(h^r)=h^r2q^rRr[1(h^(τr+1τr)2)2]SE(\hat{h}_{r}) = \sqrt{\frac{\hat{h}^{2}_{r}}{\hat{q}_{r}R_{r}} [1 - (\frac{\hat{h}(\tau_{r+1} - \tau_{r})}{2})^{2}]}

A simple exercise

The following table contains data on the recidivism-example in Allison
(2010, p. 45):

interval_start num_entering num_withdrawn num_exposed_risk num_events
0 432 0 432 14
10 418 0 418 21
20 397 0 397 23
30 374 0 374 23
40 351 0 351 26
50 325 318 166 7

Example data

Based on the values in the table, let’s manually estimate the survival
function, density function, and hazard function together with their
standard errors.

  • When r = 0

Sr^=1\hat{S_{r}} = 1

  • When r = 1

ErE_{r} = 14, ZrZ_{r} = 0, NrN_{r} = 432, RrR_{r} = 432

qr^=ErRr=14432=0.03240741\hat{q_{r}} = \frac{E_{r}}{R_{r}} = \frac{14}{432} = 0.03240741

pr^=1qr^=114432=0.9675926\hat{p_{r}} = 1 - \hat{q_{r}} = 1 - \frac{14}{432} = 0.9675926

Sr^=i=1rp^i=0.9675926\hat{S_{r}} = \prod_{i = 1}^{r}\hat{p}_{i} = 0.9675926

SE(Sr^)=Sr^2(i=1r1q^ip^iRi)=12(0.032407410.9675926432)=0.0087SE(\hat{S_{r}}) = \sqrt{\hat{S_{r}}^{2} (\sum_{i=1}^{r-1} \frac{\hat{q}_{i}}{\hat{p}_{i} R_{i}})} = \sqrt{1^{2} (\frac{0.03240741}{0.9675926 * 432})} = 0.0087

fr^=S^r1S^rτrτr1=10.967592610=0.00324074\hat{f_{r}} = \frac{\hat{S}_{r-1} - \hat{S}_{r}}{\tau_{r} - \tau_{r-1}} = \frac{1 - 0.9675926}{10} = 0.00324074

SE(fr^)=(qr^S^rτr+1τr)2(i=1r1qi^p^iRi+p^rq^rRr)=(0.03210)2(0.0320.968432+0.9680.032432)=0.000847243\begin{aligned} SE(\hat{f_{r}}) & = \sqrt{(\frac{\hat{q_{r}}\hat{S}_{r}}{\tau_{r+1}-\tau_{r}})^{2} (\sum_{i=1}^{r-1} \frac{\hat{q_{i}}}{\hat{p}_{i}R_{i}} + \frac{\hat{p}_{r}}{\hat{q}_{r}R_{r}})} \\ &= \sqrt{(\frac{0.032}{10})^{2}(\frac{0.032}{0.968*432} + \frac{0.968}{0.032*432})} \\ & = 0.000847243 \end{aligned}

h^r=Er(RrEr2)(τr+1τr)=14(43214/2)10=0.003294118\hat{h}_{r} = \frac{E_{r}}{(R_{r} - \frac{E_{r}}{2}) (\tau_{r+1}-\tau_{r})} = \frac{14}{(432 - 14/2)10} = 0.003294118

SE(h^r)=h^r2q^rRr[1(h^r(τr+1τr)2)2]=0.00329411820.03240741432[1(0.003294118102)2]=0.0008802706\begin{aligned} SE(\hat{h}_{r}) & = \sqrt{\frac{\hat{h}^{2}_{r}}{\hat{q}_{r}R_{r}} [1 - (\frac{\hat{h}_{r}(\tau_{r+1} - \tau_{r})}{2})^{2}]} \\ &= \sqrt{\frac{0.003294118^{2}}{0.03240741 * 432} [1 - (\frac{0.003294118* 10}{2})^{2}]} \\ & = 0.0008802706 \end{aligned}

  • For r > 2, I will calculate them with R.
# input data
tau_interval <- 10
r <- 1:6
i <- seq(0, 50, 10)
Nr <- c(432, 418, 397, 374, 351, 325)
Zr <- c(0, 0, 0, 0, 0, 318)
Rr <- c(432, 418, 397, 374, 351, 166)
Er <- c(14, 21, 23, 23, 26, 7)

# by definition
qr <- Er / Rr
pr <- 1 - qr
hr <- Er / ((Rr - Er / 2) * tau_interval)
Sr <- c()
Sr[1] <- pr[1]
for (i in 2:6) {
  Sr[i] <- Sr[i - 1] * pr[i]
}

fr <- c()
fr[1] <- 1 - Sr[1]

for (i in 2:6) {
  fr[i] <- (Sr[i - 1] - Sr[i]) / tau_interval
}


# find se

se_sr <- c()
inner_sum <- qr[1] / (pr[1] * Rr[1])
se_sr[1] <- sqrt(Sr[1]^2 * inner_sum)
for (i in 2:6) {
  inner_sum <- inner_sum + qr[i] / (pr[i] * Rr[i])
  se_sr[i] <- sqrt(Sr[i]^2 * inner_sum)
}

se_fr <- c()
inner_sum <- qr[1] / (pr[1] * Rr[1])
se_fr[1] <- sqrt(((qr[1] * Sr[1] / tau_interval)^2) * (inner_sum + pr[1] / (qr[1] * Rr[1])))
# se_fr[1] = sqrt(((qr[1] * Sr[1]/ tau_interval)^2) * (1 + pr[1]/(qr[1] * Rr[1])))
for (i in 2:6) {
  inner_sum <- inner_sum + qr[i] / (pr[i] * Rr[i])
  se_fr[i] <- sqrt((qr[i] * Sr[i] / tau_interval)^2 * (inner_sum + pr[i] / (qr[i] * Rr[i])))
}

se_hr <- c()
for (i in 1:6) {
  se_hr[i] <- sqrt((hr[i]^2 / (qr[i] * Rr[i])) * (1 - (hr[i] * tau_interval) / 2)^2)
}

df1 <- tibble::tibble(r, i, Nr, Zr, Rr, Er, qr, pr)
df2 <- tibble::tibble(Sr, se_sr, fr, se_fr, hr, se_hr)
r i Nr Zr Rr Er qr pr
1 6 432 0 432 14 0.032 0.968
2 6 418 0 418 21 0.050 0.950
3 6 397 0 397 23 0.058 0.942
4 6 374 0 374 23 0.061 0.939
5 6 351 0 351 26 0.074 0.926
6 6 325 318 166 7 0.042 0.958

Example data

Sr se_sr fr se_fr hr se_hr
0.968 0.009 0.032 0.001 0.003 0.001
0.919 0.013 0.005 0.001 0.005 0.001
0.866 0.016 0.005 0.001 0.006 0.001
0.812 0.019 0.005 0.001 0.006 0.001
0.752 0.021 0.006 0.001 0.008 0.001
0.721 0.023 0.003 0.001 0.004 0.002

Derived functions and corresponding standard error

The Kaplan-Meier (Product-Limit) Method

One helpful YouTube video to cover this method is from “zedstatistics”
here.

Its estimation is similar to life-table method above:

S^(t)=r:τrt(1ErRr)\hat{S}(t) = \prod_{r:\tau_{r}\le t}(1 - \frac{E_{r}}{R_{r}})

V(ln(S(t)))=t(i)tErRr(RrEr)V(ln(S(t))) = \sum_{t(i)\le t}\frac{E_{r}}{R_{r}(R_{r} - E_{r})}

I really enjoy the fact that YouTube video step by step deducts the
stand error
V(ln(S(t)))=t(i)tErRr(RrEr)V(ln(S(t))) = \sum_{t(i)\le t}\frac{E_{r}}{R_{r}(R_{r} - E_{r})}.
Please check it out if interested!

If you plot out the survival function versus the time, then you can get
the survival curve, which is often used in the observational data
analysis.

A simple exercise

The example data is from the survival analysis course at SU.

  • To estimate the survival curves (Kaplan-Meier method), we need to use
    survival library in R.
library(survival)
library(tidyverse)
  • Create a survival object with Surv() function.
# read data
df <- readxl::read_excel("nonparametric_estimation/data/example_data.xlsx")
# create a survival object
birth_time <- Surv(df$ExpMonths, df$FirstBirth)
  • Kaplan-Meier estimation for the different levels of birth cohort
km_BirCoh <- survfit(birth_time ~ as.factor(BirCoh), data = df)
  • Plot the curve
plot(km_BirCoh, main = "Survival curve: Kaplan-Meier estimates")
  • or we can have a better looking curve by using survfit2 from
    ggsurvfit package.
library(ggsurvfit)
survfit2(birth_time ~ as.factor(BirCoh), data = df) %>%
  ggsurvfit() +
  labs(
    x = "Months",
    y = "Survival probability",
    title = "Survival curve: Kaplan-Meier estimates",
    color = "Birth cohort"
  ) +
  theme(plot.title = element_text(hjust = 0.5)) +
  add_confidence_interval()

Please note: since the confidence intervals are symmetric, when the
estimated survivor function is close to 0 or 1, they may sit outside the
(0,1) range. One solution is to manually enclose interval with 0 or 1.
This is also mentioned in the video above.

  • The cumulative hazard function H(t) is estimated by

H^(t)=log[S^(t)]\hat{H}(t) = - log[\hat{S}(t)]

If we plot H(t) versus the time t, finding the slope is constant, then
it indicates that the hazard rate is constant over time.

Life-table VS Kaplan-Meier

  1. Life-table uses aggregated data, with arbitrary choice of interval.
    This may lead to loss of information; Kaplan-Meier does not demand
    grouping of data. It records every spot when event of interest
    happens. Therefore we can get more information from Kaplan-Meier.
  2. However, one shortcoming of Kaplan-Meier is that if one prints out
    the estimations, it could be a very long table.

End thoughts

The derivation of life-table and Kaplan-Meier methods to estimate the
survival functions are intuitive and straightforward. It does not involve
complicated algebraic calculations. I feel the “statistical spices”
comes from the confidence interval and standard error estimation, which
are based on the theory of sampling distribution of the mean.

So now we know how to estimate the functions. In the next post, we will look at the comparison of two survival functions.

Prudence is a fountain of life to the prudent.