Bolin Wu
# Survival Analysis 1: Basic Concepts and Three Fundamental Functions

# Introduction

# Censoring

# Functions of survival time

## Survival function S(t)

## Density function f(t)

## Hazard function h(t)

## Relationships among 3 survival functions

## Mathematics deduction

# End thoughts

2023-05-28 · 11 min read

survival_analysis
This post covers the three pillars of survival analysis (the survivor function, the density function, and the hazard function) and the relationships among them.

*The cover picture is from Jim Gruman’s survival analysis post on
myTidyTuesday.*

During February and March 2023, I attended the course Analysis of
Survival Data with Demographic Applications at Stockholm University by
Prof. Gebrenegus Ghilagaber. After finishing the exam, I am writing a
series of posts to share what I learned from this course and my thoughts
on different subjects.

I work as a statistician in the field of aging research, where survival
analysis is widely used. In these posts I try to avoid presenting only
equations and theorems, and instead combine practical applications with
the statistics behind them.

In this post, I will share the concepts of:

- Censoring & a motivating example.
- Survivor function, density function, hazard function.
- The relationships among the functions above.

Survival analysis is a branch of statistics for analyzing the expected
duration of time until an event occurs.

Censoring is a form of missing-data problem in which the time to event
is not observed, for example because the study ends before all recruited
subjects have experienced the event of interest, or because a subject
leaves the study before experiencing the event.

If we ignore censoring, we may be tempted to simply delete the
observations that did not experience the event of interest. However,
this throws away information contained in the data set.

**A motivating example:**

- Let’s say 1000 participants join a cohort study at the beginning of
  2000. By the end of 2000, 100 have died and 900 survive.
- 200 more participants join the study at the beginning of 2001.
- By the end of 2001, 100 participants who entered the study in 2000
  and 20 participants who joined in 2001 have died.
- The study ends at the end of 2001.

**Question:** Estimate the proportion of participants surviving for 2
years or longer.

**Challenge:** The 2nd group is followed only for one year.

**Solution:**

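- Ignore the participants in the 2nd group and use only the first
  group, which was followed for two full years. Of its 1000
  participants, 100 died in 2000 and 100 more in 2001, so
  $\hat{S}(2) = \frac{800}{1000} = 0.8$.
- Participants who survived two years survived the first year **and
  then** survived the second year. Thus we can apply the chain rule for
  joint probability, estimating the first-year factor from both groups
  (1200 participants, of whom 100 + 20 died in their first year):

$\begin{aligned} \hat{S}(2) &= P(\textrm{surviving 1st year THEN 2nd year}) \\ &= P(\textrm{surviving 1st year}) \times P(\textrm{surviving 2nd year | survived 1st year})\\ &= \left(\frac{900 + 180}{1000+200}\right) \left(\frac{800}{900}\right) \\ &= 0.9 \times \frac{8}{9} \\ &= 0.8 \end{aligned}$

With these particular counts the two estimates agree, because both
groups have the same first-year survival rate (90%). The second
estimate, however, bases its first-year factor on 1200 participants
instead of 1000, so it also uses the information from the group that was
followed for only one year.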
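As a quick check, the two approaches can be recomputed directly from the counts stated in the example. This is only a sketch: the variable names are my own, and the counts are read off the bullet points of the motivating example.

```python
# Counts taken from the motivating example (variable names are illustrative).
cohort_2000_size = 1000
cohort_2000_deaths_year1 = 100  # died during 2000
cohort_2000_deaths_year2 = 100  # died during 2001
cohort_2001_size = 200
cohort_2001_deaths_year1 = 20   # died during 2001, their first year in the study

# Approach 1: use only the 2000 cohort, which was followed for two years.
survivors_two_years = (
    cohort_2000_size - cohort_2000_deaths_year1 - cohort_2000_deaths_year2
)
s2_first_cohort_only = survivors_two_years / cohort_2000_size

# Approach 2: chain rule. The first-year factor pools both cohorts.
at_risk_year1 = cohort_2000_size + cohort_2001_size
survived_year1 = at_risk_year1 - cohort_2000_deaths_year1 - cohort_2001_deaths_year1
p_year1 = survived_year1 / at_risk_year1

# Only the 2000 cohort was observed during a second year of follow-up.
at_risk_year2 = cohort_2000_size - cohort_2000_deaths_year1
p_year2_given_year1 = (at_risk_year2 - cohort_2000_deaths_year2) / at_risk_year2

s2_chained = p_year1 * p_year2_given_year1

print(f"First cohort only: S(2) = {s2_first_cohort_only:.3f}")
print(f"Chain rule:        S(2) = {p_year1:.3f} * {p_year2_given_year1:.3f} = {s2_chained:.3f}")
print(f"Participants used for the first-year factor: {at_risk_year1}")
```

Running it prints the two estimates side by side, so they can be compared with the hand calculation.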

**Which solution do you think is better?**

If you think the second is better, then welcome to the world of survival

analysis! The essential reason why statisticians developed this field is

to make the most use of the censored data.

Your preference for the second solution may come from intuition. Next I
will introduce the three fundamental building blocks of survival
analysis: the survival function S(t), the density function f(t) and the
hazard function h(t). They provide the mathematical reasoning that shows
your intuition is sound.

Once we have survival data, a natural first step is to describe the
survival time numerically and graphically. Another post will cover the
graphical description of survival time; here I mainly introduce the
numerical approach.

In **classical statistics**, random variables are described by
probability density functions (PDF) or cumulative distribution functions
(CDF). In **survival analysis**, there are three functions:

- the survival function, S(t)
- the density function, f(t)
- the hazard function, h(t) or $\lambda (t)$

The survival function S(t) can be described in two equivalent ways

- the probability of surviving beyond time point t
- the probability that the event of interest does not occur until

time-point t

$S(t) = P(T>t) = 1 - P(T \le t) = 1 - F(t)$

where $F(t)$ is the cumulative distribution function of T.

Therefore S(0) = 1 and S(t) $\to$ 0 as t $\to \infty$. The graph of S(t)
versus t is called the **survival curve**.
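To make $S(t) = 1 - F(t)$ concrete, here is a minimal sketch that assumes an exponentially distributed survival time. The exponential model and the rate of 0.5 are my own illustrative choices, not something taken from the course.

```python
import math


def exponential_cdf(t, rate):
    """F(t) = P(T <= t) for an exponential survival time (assumed model)."""
    return 1.0 - math.exp(-rate * t) if t >= 0 else 0.0


def survival(t, rate):
    """S(t) = P(T > t) = 1 - F(t)."""
    return 1.0 - exponential_cdf(t, rate)


rate = 0.5  # hypothetical event rate per unit of time

# S(t) starts at 1 and decreases towards 0 as t grows.
for t in [0, 1, 2, 5, 10, 50]:
    print(f"t = {t:>2}: S(t) = {survival(t, rate):.4f}")
```

Plotting these values against t would give the survival curve for this assumed model.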

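The density function f(t) is the probability that the event of interest
takes place within a small time interval, divided by the length of that
interval, in the limit as the interval shrinks to zero:

$f(t) = \lim_{\Delta t \to 0} \frac{P(t<T<t+\Delta t)}{\Delta t}$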

where

- f(t) $\ge$ 0 for t $\ge$ 0
- f(t) = 0 for t < 0
- $\int_{0}^{\infty} f(t) dt = 1$

Please note the difference between **probability** and **probability
density**. For a continuous survival time, the probability of the event
at any single time point is zero, P(T = t) = 0, but that does not mean
f(t) = 0. Similarly, a probability cannot exceed 1, but f(t) can.
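The difference between a probability and a density can be checked numerically. The sketch below assumes an exponential density with rate 2 (my own choice of example): its value at t = 0 is 2, which is larger than 1, while the total area under the curve is still 1.

```python
import math


def exponential_pdf(t, rate):
    """f(t) for an exponential survival time (assumed model)."""
    return rate * math.exp(-rate * t) if t >= 0 else 0.0


def trapezoid_integral(func, lower, upper, steps):
    """Approximate the integral of func over [lower, upper] with the trapezoid rule."""
    width = (upper - lower) / steps
    total = 0.5 * (func(lower) + func(upper))
    for i in range(1, steps):
        total += func(lower + i * width)
    return total * width


rate = 2.0  # hypothetical rate, chosen so that the density exceeds 1 near t = 0

density_at_zero = exponential_pdf(0.0, rate)
# The upper limit 20 stands in for infinity; the tail beyond it is negligible.
total_probability = trapezoid_integral(
    lambda t: exponential_pdf(t, rate), 0.0, 20.0, 200_000
)
# Probability of an event in the short interval (0, 0.1): an area, so at most 1.
interval_probability = trapezoid_integral(
    lambda t: exponential_pdf(t, rate), 0.0, 0.1, 1_000
)

print(f"f(0) = {density_at_zero:.2f}")
print(f"Integral of f over [0, 20] = {total_probability:.6f}")
print(f"P(0 < T < 0.1) = {interval_probability:.4f}")
```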

The hazard function is the **instantaneous rate** at which the event of
interest happens within a small time interval, given that it has not
occurred before time t.

$h(t) = \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t | T \geq t)}{\Delta t}$

Please note that the hazard function is a rate, not a probability. It is
non-negative but has no upper bound: $0 \le h(t) < \infty$.
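One way to see that the hazard is a rate is to estimate it from simulated data. The sketch below is an illustration under my own assumptions: survival times are drawn from an exponential distribution with rate 3, and the hazard at t = 0.2 is approximated by the share of subjects still at risk who fail in the next short interval, divided by the interval length.

```python
import random

random.seed(2023)  # fixed seed so the simulation is reproducible

rate = 3.0    # hypothetical true hazard (events per unit of time)
n = 200_000   # number of simulated subjects
times = [random.expovariate(rate) for _ in range(n)]


def empirical_hazard(times, t, dt):
    """Estimate P(t < T < t + dt | T >= t) / dt from fully observed times."""
    at_risk = [x for x in times if x >= t]
    events = sum(1 for x in at_risk if x < t + dt)
    return events / len(at_risk) / dt


estimate = empirical_hazard(times, t=0.2, dt=0.01)
print(f"Estimated hazard at t = 0.2: {estimate:.2f} (true rate: {rate})")
```

The estimate should land close to 3, a value above 1 that would be impossible for a probability.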

It is natural to ask questions like

- Why do we need to know these three survival functions?
- What is the relationship among them?

The hazard function h(t) is simply the ratio of the density function
f(t) to the survival function S(t).

$h(t) = \frac{f(t)}{S(t)}$

- f(t) is related to the probability that an observation **has
  experienced** the event of interest (uncensored).
- S(t) is related to the probability that an observation survives
  **beyond** time t (censored at time t).
- Therefore, the hazard function contains information on **both censored
  and uncensored** observations. This is the exact reason why the hazard
  function is of core interest in survival analysis. When we have the
  hazard function, we are making good use of both the censored and the
  uncensored observations.
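The ratio $h(t) = \frac{f(t)}{S(t)}$ can be checked against a distribution whose hazard has a known closed form. The sketch below assumes a Weibull survival time with shape 1.5 and scale 2; the distribution and both parameter values are my own choices for illustration.

```python
import math


def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def weibull_density(t, shape, scale):
    """f(t) = (shape / scale) * (t / scale) ** (shape - 1) * S(t)."""
    return (shape / scale) * (t / scale) ** (shape - 1) * weibull_survival(t, shape, scale)


def hazard_from_ratio(t, shape, scale):
    """h(t) computed as f(t) / S(t)."""
    return weibull_density(t, shape, scale) / weibull_survival(t, shape, scale)


def weibull_hazard_closed_form(t, shape, scale):
    """Textbook Weibull hazard: (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


shape, scale = 1.5, 2.0  # hypothetical parameters

for t in [0.5, 1.0, 2.0, 4.0]:
    ratio = hazard_from_ratio(t, shape, scale)
    closed = weibull_hazard_closed_form(t, shape, scale)
    print(f"t = {t}: f/S = {ratio:.6f}, closed form = {closed:.6f}")
```

With a shape parameter above 1 the hazard increases with t, even though both f(t) and S(t) eventually decrease.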

For the mathematical details, please see the section below.

We can begin with the density function f(t)

$\begin{aligned} f(t) &= \lim_{\Delta t \to 0} \frac{P(t<T<t+\Delta t)}{\Delta t} \\ f(t)\Delta t &\cong P(t<T<t+\Delta t) \end{aligned}$

With the hazard function we have

$\begin{aligned} h(t) &= \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t | T \geq t)}{\Delta t} \\ h(t) \Delta t & \cong P(t < T < t + \Delta t | T \geq t) \end{aligned}$

By the definition of conditional probability,

$\begin{aligned} P(t < T < t + \Delta t | T \geq t) = \frac{P(t < T < t + \Delta t, T \geq t)}{P(T \geq t)} \end{aligned}$

The numerator can be simplified, because the event
$t < T < t + \Delta t$ already implies $T \geq t$:
$P(t < T < t + \Delta t, T \geq t) = P(t < T < t + \Delta t)$. Therefore
we have

$\begin{aligned} h(t) &= \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t | T \geq t)}{\Delta t} \\ h(t) \Delta t & \cong P(t < T < t + \Delta t | T \geq t) \\ & \cong \frac{P(t < T < t + \Delta t)}{P(T \geq t)}\\ & \cong \frac{f(t) \Delta t}{S(t)} \end{aligned}$

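The last step uses $P(T \geq t) = S(t)$, which holds for a continuous survival time. If we cancel out $\Delta t$ on both sides of the equation and let $\Delta t \to 0$, we have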

$h(t) = \frac{f(t)}{S(t)}$
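The limiting argument in this deduction can also be watched numerically. The sketch below assumes the survival function $S(t) = e^{-t^2}$ (a Weibull with shape 2 and scale 1, my own choice), for which $f(t) = 2te^{-t^2}$ and therefore $f(t)/S(t) = 2t$. It evaluates the conditional probability divided by $\Delta t$ for shrinking values of $\Delta t$.

```python
import math


def survival(t):
    """Assumed survival function S(t) = exp(-t^2)."""
    return math.exp(-t ** 2)


def density(t):
    """Matching density f(t) = -dS/dt = 2 t exp(-t^2)."""
    return 2 * t * math.exp(-t ** 2)


def conditional_rate(t, dt):
    """P(t < T < t + dt | T >= t) / dt, written in terms of S."""
    return (survival(t) - survival(t + dt)) / survival(t) / dt


t = 1.0
limit = density(t) / survival(t)  # f(t) / S(t)

errors = []
for dt in [0.1, 0.01, 0.001, 0.0001]:
    approx = conditional_rate(t, dt)
    errors.append(abs(approx - limit))
    print(f"dt = {dt:<7}: rate = {approx:.5f}, f/S = {limit:.5f}")
```

The gap between the finite-interval rate and f(t)/S(t) shrinks as the interval gets smaller, which is what the approximation sign in the deduction expresses.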

In this post I have briefly shared the motivation for survival analysis
and its three fundamental functions, and explained why the hazard
function is so crucial in survival analysis.

Personally speaking, when I reached the end of the mathematical
deduction of $h(t) = \frac{f(t)}{S(t)}$, I gasped with admiration in my
heart: it is so beautiful!