Survival Analysis 1: Basic Concepts and Three Fundamental Functions

2023-05-28 · 11 min read

This post covers the concepts behind, and the relationships among, the three pillars of survival analysis: the survivor function, the density function, and the hazard function.

The cover picture is from Jim Gruman’s survival analysis post on
myTidyTuesday.

Introduction

During February and March 2023, I attended the course Analysis of
Survival Data with Demographic Applications at Stockholm University,
taught by Prof. Gebrenegus Ghilagaber. Having finished the exam, I am
writing a series of posts to share what I learned from this course and
my thoughts on different subjects.

I work as a statistician in the field of aging research, where survival
analysis is widely used. In these posts I will try to avoid presenting
only mathematical equations and statistical theorems, and instead
combine practical applications with the statistics behind them.

In this post, I will share the concepts of:

Censoring & a motivating example.
Survivor function, density function, hazard function.
The relationships among these functions.

Censoring

Survival analysis is a branch of statistics for analyzing the expected
duration of time until one event occurs. Censoring is a form of missing
data problem in which the time to event is not observed, for reasons
such as the study ending before all recruited subjects have shown the
event of interest, or a subject leaving the study before experiencing
the event.

If we ignore censoring, we may be tempted to simply delete the
observations that did not experience the event of interest. However,
doing so throws away information contained in the data: a censored
subject is still known to have been event-free up to the censoring time.
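To make this concrete, here is a minimal Python sketch of how right-censored data are commonly recorded: each subject gets a follow-up time and an indicator of whether the event was observed. The `Observation` class and the four records are made up for illustration and do not come from the course.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    time: float   # follow-up time in years (hypothetical values)
    event: bool   # True = event observed, False = censored


# Hypothetical records, invented for illustration only.
records = [
    Observation(0.5, True),
    Observation(1.0, False),  # left the study after 1 year, still event-free
    Observation(1.5, True),
    Observation(2.0, False),  # study ended before the event occurred
]

# Deleting censored subjects keeps only those who had the event ...
complete_cases = [r for r in records if r.event]

# ... and discards the event-free follow-up time the others contributed.
total_follow_up = sum(r.time for r in records)
discarded_follow_up = sum(r.time for r in records if not r.event)
```

In this toy data set, deleting the censored rows would throw away more than half of the observed follow-up time.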

A motivating example:

Suppose 1000 participants join a cohort study at the
beginning of 2000. By the end of 2000, 100 have died and 900 survive.
200 more participants join the study at the beginning of 2001.
During 2001, another 100 of the participants who entered the study in
2000 die, as do 20 of the participants who joined in 2001.
The study ends at the end of 2001.

Question: Estimate the proportion of participants surviving for 2
years or longer.

Challenge: The 2nd group is followed only for one year.

Solution:

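1. Ignore the participants in the 2nd group and estimate two-year
survival from the 1st group alone, of whom 1000 - 100 - 100 = 800 are
still alive at the end of 2001: $\hat{S}(2) = \frac{800}{1000} = 0.8$ .
2. Participants who survive two years must survive the first year and
then survive the second year, given that they survived the first. Thus
we can apply the chain rule of probability, using both groups for the
first-year term (900 + 180 survivors out of 1000 + 200 participants)
and the 1st group alone for the second-year term:

$$
\begin{aligned} \hat{S}(2) &= P(\textrm{surviving 1st year THEN 2nd year}) \\ &= P(\textrm{surviving 1st year}) \times P(\textrm{surviving 2nd year | survived 1st year})\\ &= \left(\frac{900 + 180}{1000+200}\right) \left(\frac{800}{900}\right) \\ &= 0.9 \times \frac{8}{9} \\ &= 0.8 \end{aligned}
$$

With these particular counts the two point estimates coincide, because
both groups happen to have the same first-year survival (90%). The
second estimate, however, draws on all 1200 participants for the
first-year term instead of only 1000.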
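Both approaches can be recomputed directly from the counts stated in the example. The sketch below is only a check of the arithmetic; the variable names are mine.

```python
# Counts taken from the motivating example.
n_2000 = 1000            # joined at the beginning of 2000
died_2000_year1 = 100    # 2000 cohort, died during 2000
died_2000_year2 = 100    # 2000 cohort, died during 2001
n_2001 = 200             # joined at the beginning of 2001
died_2001_year1 = 20     # 2001 cohort, died during 2001 (their first year)

# Approach 1: use only the cohort that was followed for two full years.
alive_after_2_years = n_2000 - died_2000_year1 - died_2000_year2
s2_first_cohort_only = alive_after_2_years / n_2000

# Approach 2: chain rule, pooling both cohorts for the first-year term.
survived_year1 = (n_2000 - died_2000_year1) + (n_2001 - died_2001_year1)
p_year1 = survived_year1 / (n_2000 + n_2001)
# Only the 2000 cohort was observed during a second year of follow-up.
p_year2_given_year1 = alive_after_2_years / (n_2000 - died_2000_year1)
s2_chained = p_year1 * p_year2_given_year1

print(round(s2_first_cohort_only, 3), round(s2_chained, 3))
```

Changing `died_2001_year1` shows how the second approach reacts to the extra cohort while the first approach ignores it entirely.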

Which solution do you think is better?

If you think the second is better, then welcome to the world of survival
analysis! The essential reason why statisticians developed this field is
to make the most use of the censored data.

Your preference for the second solution may come from intuition.
Next I will introduce the three fundamental building blocks of
survival analysis: the survival function S(t), the density function
f(t) and the hazard function h(t). They provide the mathematical
reasoning that shows your intuition is sound.

Functions of survival time

Once we have survival data, a natural first step is to describe the
survival times numerically and graphically. Another post will cover
the graphical description of survival time. Here I mainly introduce
the numerical approach.

In classical statistics, random variables are described by a
probability density function (PDF) or a cumulative distribution function
(CDF). In survival analysis, three functions are used:

the survival function, S(t)
the density function, f(t)
the hazard function, h(t) or $\lambda (t)$

Survival function S(t)

The survival function S(t) can be described in two equivalent ways:

the probability of surviving beyond time point t
the probability that the event of interest has not occurred by time
point t

$$
S(t) = P(T > t) = 1 - P(T \le t) = 1 - F(t)
$$

where $F(t)$ is the cumulative distribution function of T.

Therefore S(0) = 1 and S( $\infty$ ) = 0, and S(t) never increases as t
grows. The graph of S(t) versus t is called the survival curve.
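As a small illustration, here is a sketch of S(t) = 1 - F(t) for one specific distribution. The exponential distribution and the rate of 0.5 are my own assumptions, chosen only because the CDF has a simple closed form.

```python
import math


def cdf_exponential(t: float, rate: float) -> float:
    """F(t) = P(T <= t) for an exponential survival time."""
    return 1.0 - math.exp(-rate * t) if t >= 0 else 0.0


def survival(t: float, rate: float) -> float:
    """S(t) = P(T > t) = 1 - F(t)."""
    return 1.0 - cdf_exponential(t, rate)


rate = 0.5  # assumed event rate per unit time, chosen arbitrarily

# Points on the survival curve at t = 0, 1, ..., 10.
curve = [(t, survival(t, rate)) for t in range(0, 11)]
```

The list `curve` starts at 1 for t = 0 and decreases towards 0, which is the shape every survival curve has.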

Density function f(t)

It is defined as the limiting probability per unit time that the event
of interest takes place within a small time interval starting at t:

$$
f(t) = \lim_{\Delta t \to 0} \frac{P(t<T<t+\Delta t)}{\Delta t}
$$

with the properties

$f(t) \ge 0$ for $t \ge 0$
$f(t) = 0$ for $t < 0$
$\int_{0}^{\infty} f(t) dt = 1$

Please note the difference between probability and probability
density. For a continuous survival time, the probability of the event
occurring at any exact time point is zero, P(T = t) = 0, but this does
not mean f(t) = 0. Similarly, a probability can never exceed 1, but
f(t) can.
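A quick numerical illustration of this difference, assuming an exponential survival time with rate 2 (my choice, not from the course): the density at t = 0 is larger than 1, yet every probability computed from it, being an area under the curve, stays at or below 1.

```python
import math

rate = 2.0  # assumed rate; any rate above 1 makes f(0) exceed 1


def density(t: float) -> float:
    """f(t) = rate * exp(-rate * t) for t >= 0, and 0 otherwise."""
    return rate * math.exp(-rate * t) if t >= 0 else 0.0


# The density itself can be larger than 1 ...
f_at_zero = density(0.0)

# ... but probabilities, which are areas under f, cannot.
# Midpoint Riemann sum of f over [0, 20]; the tail beyond 20 is negligible.
step = 0.001
area = sum(density((i + 0.5) * step) * step for i in range(20_000))

# P(0.1 < T < 0.2) is the area under f over that short interval.
p_interval = sum(density(0.1 + (i + 0.5) * step) * step for i in range(100))
```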

Hazard function h(t)

The hazard function is the instantaneous rate at which the event of
interest occurs at time t, given that it has not occurred before
time t.

$$
h(t) = \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t | T \geq t)}{\Delta t}
$$

Please note that the hazard function is a rate, not a probability: it
is non-negative but has no upper bound, $0 \le h(t) < \infty$ .
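The limit in the definition can be approximated numerically by shrinking the interval. The sketch below assumes a Weibull survival function with shape 2 and scale 1, S(t) = exp(-t^2), picked only because its hazard has the known closed form h(t) = 2t; the function names are mine.

```python
import math


def survival(t: float) -> float:
    """S(t) = exp(-t**2): an assumed Weibull survival function (shape 2, scale 1)."""
    return math.exp(-t ** 2)


def hazard_finite_difference(t: float, dt: float) -> float:
    """Approximate h(t) by P(t < T < t + dt | T >= t) / dt."""
    p_event_in_interval = survival(t) - survival(t + dt)
    p_conditional = p_event_in_interval / survival(t)
    return p_conditional / dt


t = 1.5
# Shrinking dt moves the approximation towards the exact hazard.
approximations = {dt: hazard_finite_difference(t, dt) for dt in (0.1, 0.01, 0.001)}
exact = 2 * t  # closed form for this particular Weibull
```

Here the hazard at t = 1.5 is 3, well above 1, which is only possible because it is a rate and not a probability.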

Relationships among 3 survival functions

It is natural to ask questions like

Why do we need to know these three survival functions?
What is the relationship among them?

The hazard function h(t) is simply the ratio of the density function
f(t) to the survival function S(t).

$$
h(t) = \frac{f(t)}{S(t)}
$$

f(t) is related to the probability that an observation experiences
the event of interest at time t (uncensored).
S(t) is related to the probability that an observation survives
beyond time t (censored at time t).
Therefore, the hazard function contains information on both censored
and uncensored observations. This is exactly why the hazard function
is of core interest in survival analysis: when we work with the hazard
function, we make good use of both the censored and the uncensored
observations.
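One standard way to make this statement concrete, which goes slightly beyond what this post covers, is the likelihood for right-censored data: an observed event at time t contributes f(t) = h(t)S(t), while an observation censored at t contributes only S(t). The sketch below assumes a constant hazard (exponential survival times) and non-informative censoring, and uses made-up data. Under those assumptions the maximum-likelihood estimate of the hazard is the number of events divided by the total follow-up time.

```python
import math

# Hypothetical follow-up data: (time in years, event observed?)
data = [(0.5, True), (1.0, False), (1.5, True), (2.0, False), (2.0, True)]


def log_likelihood(rate: float) -> float:
    """Exponential log-likelihood with right censoring.

    An observed event at time t contributes log f(t) = log h + log S(t);
    a censored observation contributes only log S(t) = -rate * t.
    """
    total = 0.0
    for t, event in data:
        log_s = -rate * t
        total += (math.log(rate) + log_s) if event else log_s
    return total


events = sum(1 for _, event in data if event)
total_time = sum(t for t, _ in data)
rate_all_data = events / total_time  # maximiser of log_likelihood

# Dropping the censored rows shrinks the denominator and inflates the rate.
time_events_only = sum(t for t, event in data if event)
rate_events_only = events / time_events_only
```

Comparing `rate_all_data` with `rate_events_only` shows how much the estimated hazard is overstated when the censored observations are deleted.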

For the mathematical details, please see the section below.

Mathematical derivation

We can begin with the density function f(t). For a small $\Delta t$ :

$$
\begin{aligned} f(t) &= \lim_{\Delta t \to 0} \frac{P(t<T<t+\Delta t)}{\Delta t} \\ f(t)\Delta t &\cong P(t<T<t+\Delta t) \end{aligned}
$$

With the hazard function we have

$$
\begin{aligned} h(t) &= \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t | T \geq t)}{\Delta t} \\ h(t) \Delta t & \cong P(t < T < t + \Delta t | T \geq t) \end{aligned}
$$

By the definition of conditional probability

$$
\begin{aligned} P(t < T < t + \Delta t | T \geq t) = \frac{P(t < T < t + \Delta t, T \geq t)}{P(T \geq t)} \end{aligned}
$$

The numerator can be simplified, because the event
$t < T < t + \Delta t$ already implies $T \geq t$ :
$P(t < T < t + \Delta t, T \geq t) = P(t < T < t + \Delta t)$ . For a
continuous survival time the denominator is $P(T \geq t) = S(t)$ .
Therefore we have

$$
\begin{aligned} h(t) &= \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t | T \geq t)}{\Delta t} \\ h(t) \Delta t & \cong P(t < T < t + \Delta t | T \geq t) \\ & \cong \frac{P(t < T < t + \Delta t)}{P(T \geq t)}\\ & \cong \frac{f(t) \Delta t}{S(t)} \end{aligned}
$$

If we divide both sides by $\Delta t$ and let $\Delta t \to 0$ , we have

$$
h(t) = \frac{f(t)}{S(t)}
$$
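As a quick worked check of this result, take an exponential survival time with rate $\lambda$ . This distribution is my own choice of example, not part of the derivation. Using $f(t) = -\frac{d}{dt} S(t)$ , which follows from $S(t) = 1 - F(t)$ :

$$
\begin{aligned} S(t) &= e^{-\lambda t} \\ f(t) &= -\frac{d}{dt} S(t) = \lambda e^{-\lambda t} \\ h(t) &= \frac{f(t)}{S(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda \end{aligned}
$$

So for the exponential distribution the hazard is constant over time, even though both f(t) and S(t) decrease with t.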

End thoughts

In this post I have briefly shared the motivation for survival analysis
and its three fundamental functions. I have also explained why the
hazard function is so crucial in survival analysis.

Personally speaking, the moment I got to the end of the
mathematical derivation of $h(t) = \frac{f(t)}{S(t)}$ , I gasped with
admiration in my heart: it is so beautiful!
