For those unfamiliar with survival analysis, there is a cursory introduction in Mini Appendix D, at the end of this notebook.
This notebook is published under the Creative Commons CC BY-NC-SA 4.0 license by mathias.elsner.
You may share and adapt this material for non-commercial purposes under the same license, giving attribution.
Of course, this paragraph is supplied tongue-in-cheek, as there will be no reason at all for someone to re-use this stuff.
This is a walk-through of performing statistical time-to-event analysis, exclusively employing open source tools. My background is that of clinician, not statistician, thus I have chosen a historical medical dataset to showcase the methods.
The Framingham Heart Study, inaugurated in 1948, arguably is the single most famous observational study in the history of medicine.
The study was seminal for the concept of cardiovascular risk factors, which seems so self-evident three generations later. The very first publication, in 1957, covered 4 years of follow-up, which was later remarkably extended to 24 years. The importance of the Framingham Heart Study and it's multiple extensions and spinoffs may be appreciated by the fact that, over time, it gave rise to more than 3000 original publications in peer-reviewed journals...
Further background information can be found on Wikipedia or on the Framingham Study's homepage. You could even watch a twelve minute YouTube video on the 70th anniversary of the study by clicking the thumbnail:
A dilettante (Italian dilettare from Latin delectare "to rejoice", "to delight") is a lover of an art or science who deals with it without formal education, and not professionally. As an amateur or layman, he does a thing for its own sake, that is, out of interest, pleasure or passion. That's me with regard to programming languages and to statistics ;-)
I do not have access to commercial statistics software like SPSS, MedCalc or SAS. I neither have experience in R programming nor MATLAB scripting. Python rules! because of versatility, readability and community.
I previously performed my fledgling data analyses in Python through scripts written with conventional IDEs or text editors, because that's where I came from. Reminiscence:
A fleeting stream of text and numeric output (from statistical test procedures etc.) would cascade through the terminal window, while layered plots cluttered my desktop. The more intricate an analysis got, the more changes accumulated, the more difficult it became to keep track of everything. Versioning and troubleshooting were veritable PITAs...
Jupyter Notebooks seem such a logical choice in hindsight. Yet, I am very late to the party.
Lifelines is a comprehensive Python library for survival analysis, that neatly integrates with notebooks and comes with a rather user-friendly documentation. Some background information on lifelines may be found in this article by the delevoper. Alternatives for survival analysis in Python might be scikit-survival, pysurvival or statsmodels. The latter library is very powerful and extensive. It is quite useful for various statistical methods. However, to my mind, it is less intuitive to work with on survival analysis, which may be a matter of documentation rather than implementation.
The U.S. National Heart, Lung and Blood Institute provides a sanitized "teaching dataset" of the original Framingham Heart Study. On registration, I submitted a data request, which kindly was fulfilled:
The dataset is a subset of the study data collected from 1948 onwards and includes laboratory, clinic, questionnaire, and adjudicated event data on 4,434 participants. Data was collected during three examination periods, approximately 6 years apart, from roughly 1956 to 1968. Each participant was followed for a total of 24 years.
"Although the enclosed dataset contains Framingham data ‘as collected’ by Framingham investigators, specific methods were employed to ensure an anonymous dataset that protects patient confidentiality. Therefore, this dataset is inappropriate for publication purposes."
Acknowledged. See disclaimer.
The aim of this notebook is not to gain new insights from scientific data that have already been thoroughly analyzed decades ago. We will simply use the venerable dataset as a generic example to play around with. We will neither aim at a full descriptive nor formal analysis of the data.
The Framingham study is merely a starting point to try out some basic statistics stuff. Furthermore, this document is not a proper HowTo, but rather a walkthrough.
Like in any other Python project, we will first import some useful libraries:
# Import general data manipulation libraries
import pandas as pd
import numpy as np
# Import IFrame inline display
from IPython.display import IFrame
Then we will import the full dataset, and have a first look at the structure:
# Import full CSV dataset
survival_data = pd.read_csv("frmgham2_me_cohort-1.csv", sep=';')
print(survival_data)
RANDID SEX TOTCHOL AGE SYSBP DIABP CURSMOKE CIGPDAY BMI \ 0 2448 1 195.0 39 106 70 0 0.0 26,97 1 6238 2 250.0 46 121 81 0 0.0 28,73 2 9428 1 245.0 48 127,5 80 1 20.0 25,34 3 10552 2 225.0 61 150 95 1 30.0 28,58 4 11252 2 285.0 46 130 84 1 23.0 23,1 ... ... ... ... ... ... ... ... ... ... 4429 9990894 2 248.0 48 131 72 1 20.0 22 4430 9993179 2 210.0 44 126,5 87 1 15.0 19,16 4431 9995546 2 269.0 52 133,5 83 0 0.0 21,47 4432 9998212 1 185.0 40 141 98 0 0.0 25,6 4433 9999312 2 196.0 39 133 86 1 30.0 20,91 DIABETES ... CVD HYPERTEN TIMEAP TIMEMI TIMEMIFC TIMECHD \ 0 0 ... 1 0 8766 6438 6438 6438 1 0 ... 0 0 8766 8766 8766 8766 2 0 ... 0 0 8766 8766 8766 8766 3 0 ... 1 1 2956 2956 2956 2956 4 0 ... 0 1 8766 8766 8766 8766 ... ... ... ... ... ... ... ... ... 4429 0 ... 0 1 6433 6433 6433 6433 4430 0 ... 0 1 6729 6729 6729 6729 4431 0 ... 1 1 5939 8766 5209 5209 4432 0 ... 0 1 8766 8766 8766 8766 4433 0 ... 0 1 8766 8766 8766 8766 TIMESTRK TIMECVD TIMEDTH TIMEHYP 0 8766 6438 8766 8766 1 8766 8766 8766 8766 2 8766 8766 8766 8766 3 2089 2089 2956 0 4 8766 8766 8766 4285 ... ... ... ... ... 4429 6433 6433 6433 2219 4430 6729 6729 6729 4396 4431 8766 5209 8766 735 4432 8766 8766 8766 0 4433 8766 8766 8766 4201 [4434 rows x 39 columns]
For clarity, only a selection of rows and columns is displayed.
The entire table is best viewed in a dedicated spreadsheet program like Excel, Numbers or LibreOfficeCalc. For copyright and licensing reasons, I may not provide a download link to the full dataset that was supplied to me by NIH/NHLBI.
To get a better idea of the available variables, you could have a look at the embedded data reference document from NIH in the IFrame window below:
IFrame("documents/FraminghamDataDocumentation.pdf", width=750, height=400)
On mobile devices, PDF documents embedded in IFrames are not properly displayed. Most of the time, only the first page is rendered. If you encounter problems with scrolling through the document, you may open the Framingham Data Documentation in a new browser window by clicking the blue link in this textbox.
The Framingham Heart Study collected time-dependent event data for a number of clinical endpoints, amongst others, stroke, hospitalized myocardial infarction and combined "cardiovascular disease". Furthermore, some quite soft and subjective endpoints like "angina pectoris" were included in the dataset.
We have to bear in mind that, in the era of the study's inception, and during the initial cohort's follow-up, diagnostic tests and criteria were markedly different from today:
There were no Troponin tests, nor even CK laboratory tests, which became commercially available in the 1960ies. Echocardiography still was in it's infancy. Computed tomography (CT) scanning and coronary angiography did not enter the stage of medicine until the 1970ies, whereas Magnetic Resonance Imaging (MRI) began to become clinically useful in the 1980ies.
We will restrict most of our our analyses to the clinical endpoint of mortality, that is death from any cause.
The reasons for this are:
- Mortality is relevant.
- Mortality is unambiguous.
- All cause mortality definition remains unchanged over time.
Let's import the first tool from the lifelines library. We will use some more methods from there as we go along.
# Import Kaplan-Meier survival method
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
Let's prepare the timeline and outcome variables for our survival analysis:
# Choose time variable from full dataset
T = survival_data["TIMEDTH"]/365
# Choose event variable from full dataset
E = survival_data["DEATH"]
Professional grade graphs can be generated with the matplotlib library.
# Import plotting library
from matplotlib import pyplot as plt
import matplotlib.ticker as mtick
Executed from within a standard Python environment, each figure would normally be displayed in a separate window supplied by the graphics backend and the computer's operating system. In order to display the figures within our Jupyter notebook, we will specify one of IPython's so-called magic commands (these special commands are preceded by a '%').
%matplotlib inline
To get a first idea of the temporal distribution of our event data, we will plot a simple histogram.
# Define graph size
plt.figure(figsize=(10,5))
# Define plot
ax = plt.subplot(111)
# Additional markers
plt.ylabel("Number of Fatalities")
plt.xlabel("(Years)")
# Additional Infos
plt.title("Framingham-Study (1948 ff.): Mortality Events Over Time", color = "#4C9900")
plt.text(18,5, 'mathias.elsner: CC BY-NC-SA', fontsize = 7, color = 'grey', style = 'italic')
# Plot histogram
plt.hist(T, bins = 24, range = (0,24), histtype='step')
(array([ 31., 28., 30., 41., 46., 53., 49., 46., 48., 57., 65., 70., 60., 75., 78., 83., 82., 81., 72., 90., 107., 82., 85., 91.]), array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24.]), [<matplotlib.patches.Polygon at 0x12f7e0790>])
Like most survival data, ours is not normally distributed, but skewed.
We will then model a survival curve for our entire study population.
# Define graph size
plt.figure(figsize=(7,6))
# Define plot
ax = plt.subplot(111)
# Set percent formatter in order to display percent instead of float on y-axis
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0, None,'%'))
# Fit and plot model
kmf.fit(T, event_observed=E, label='surviving')
kmf.plot_survival_function(ax=ax)
plt.grid(visible=True, which='both', axis='both', color='#C0C0C0', linestyle='-', linewidth=0.25)
# Additional Infos
plt.text(0, 0.65, 'mathias.elsner: CC BY-NC-SA', fontsize = 7, color = 'grey', style = 'italic')
plt.xlabel("(Years)")
plt.title("Framingham-Study (1948 ff.): Overall Cohort Survival", color = "#4C9900")
Text(0.5, 1.0, 'Framingham-Study (1948 ff.): Overall Cohort Survival')
We will have a look at the influence of several 'covariates' ('predictor variables', 'independent variables') on survival.
Among the 4433 participants, 8.6% of men and 7.1% of women were diagnosed with diabetes mellitus, either at baseline, or during follow-up. Presence of Diabetes was defined by being treated as such, or by a casual glucose level of 200 mg/dl or more.
Let's plot Kaplan-Meier survival curves of diabetic and nondiabetic subjects:
# Define graph size
plt.figure(figsize=(7,6))
# Define plot
ax = plt.subplot(111)
# Set percent formatter in order to display percent instead of float on y-axis
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0, None,'%'))
# Iterate discriminator for diabetes status and plot respective survival curves
value_range = [0, 1]
for value in value_range:
disc = (survival_data["DIABETES"] == value)
kmf.fit(T[disc], event_observed=E[disc], label=value)
kmf.plot_survival_function(ax=ax)
# Add ticks, grid, title and labels
plt.tick_params(labeltop=False, labelright=True, right=True)
plt.xticks(np.arange(0, 25, step=1), fontsize = 10) # Set label locations.
plt.grid(visible=True, which='both', axis='both', color='#C0C0C0', linestyle='-', linewidth=0.25)
plt.title("Framingham-Study (1948 ff.): Survival Condition Diabetes mellitus", color = "#4C9900")
plt.xlabel("(Years)")
# Additional Info: license
plt.text(0, 0.2, 'mathias.elsner: CC BY-NC-SA', fontsize = 7, color = 'grey', style = 'italic')
# Set layout to fill canvas
plt.tight_layout()
Wow! The striking survival penalty for diabetic subjects most propably reflects the limited treatment options available at that time, as well as the relevance of Diabetes as a cardiovascular risk factor.
Let's explore, whether major cardiovascular events show a similiar pattern, which would corroborate the above notion:
Although called survival analysis, clinical endpoints other than mortality can be analized in the same fashion. When employed for non-lethal endpoints, we use the term event free survival to denote the time-course of endpoint-events. Let's prepare some more variables...
# Choose time variables from full dataset
TM = survival_data["TIMEMIFC"]/365
TS = survival_data["TIMESTRK"]/365
# Choose event variables from full dataset
EM = survival_data["MI_FCHD"]
ES = survival_data["STROKE"]
...and plot their eventfree survival curves:
# Define plot
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(9,6))
fig.suptitle("Event-Free Survival of Diabetic vs. Nondiabetic Subjects")
# Subplot 1:
# Set percent formatter in order to display percent instead of float on y-axis
ax1.yaxis.set_major_formatter(mtick.PercentFormatter(1.0, None,'%'))
# Iterate discriminator for diabetes status and plot respective survival curves
value_range = [0, 1]
for value in value_range:
disc = (survival_data["DIABETES"] == value)
kmf.fit(TM[disc], event_observed=EM[disc], label=value)
kmf.plot_survival_function(ax=ax1)
# Add ticks, grid, title and labels
ax1.tick_params(labeltop=False, labelright=False, right=False)
ax1.set_title("Myocardial Infarction", color = "#4C9900")
ax1.set(xlabel='(Years)')
# Set layout to fill canvas
plt.tight_layout()
# Subplot 2:
# Set percent formatter in order to display percent instead of float on y-axis
ax2.yaxis.set_major_formatter(mtick.PercentFormatter(1.0, None,'%'))
# Iterate discriminator for diabetes status and plot respective survival curves
value_range = [0, 1]
for value in value_range:
disc = (survival_data["DIABETES"] == value)
kmf.fit(TS[disc], event_observed=ES[disc], label=value)
kmf.plot_survival_function(ax=ax2)
# Add ticks, grid, title and labels
ax2.tick_params(labeltop=False, labelright=True, right=True)
ax2.set_title("Stroke", color = "#4C9900")
ax2.set(xlabel='(Years)')
# Additional Info: license
plt.text(8, 0.35, 'mathias.elsner: CC BY-NC-SA', fontsize = 7, color = 'grey', style = 'italic')
# Set layout to fill canvas
plt.tight_layout()
It seems quite plausible from our data, that the dire mortality outcome of diabetic subjects is related to cardiovascular events, which are the number one driver of mortality in the general population.
Since the pathophysiologic mechanisms and causal relationships between diabetes, cardiovascular disease and mortality have been extensively researched by our times, we will not explore this issue further.
We will, instead, turn our naïve attention to a risk factor, that is not amenable to medical therapy...
In the Framingham Study, educational status was stratified by four layers:
- 0-11 years (Elementary, Primary or Middle School)
- High School Diploma, GED
- Some College, Vocational School
- College (BS, BA) degree or more
So, turning back to all cause mortality, what is the influence of education? Let's plot:
# Define Figure
plt.figure(figsize=(8,7))
# Define survival plot
ax = plt.subplot(111)
# Set percent formatter in order to display percent instead of float
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0, None,'%'))
# Iterate discriminator
value_range = [1, 2, 3, 4]
for value in value_range:
disc = (survival_data["educ"] == value)
kmf.fit(T[disc], event_observed=E[disc], label=value)
kmf.plot_survival_function(ax=ax)
# Add ticks, grid, title and labels
plt.tick_params(labeltop=False, labelright=True, right=True)
plt.xticks(np.arange(0, 25, step=1), fontsize = 10) # Set label locations.
plt.grid(visible=True, which='both', axis='both', color='#C0C0C0', linestyle='-', linewidth=0.25)
plt.title("Framingham Study (1948 ff.): Survival Condition Educational Status", color = "#4C9900")
plt.xlabel("(Years)")
# Additional Info: license
plt.text(0, 0.55, 'mathias.elsner: CC BY-NC-SA', fontsize = 7, color = 'grey', style = 'italic')
# Additional Info: Discriminator legend
plt.text(12, 1.00, '1: Elementary / Primary / Middle School', fontsize = 9, color = 'grey', style = 'normal', weight = 'normal')
plt.text(12, 0.99, '2: High School Diploma', fontsize = 9, color = 'grey', style = 'normal', weight = 'normal')
plt.text(12, 0.98, '3: Undergraduate / Vocational School', fontsize = 9, color = 'grey', style = 'normal', weight = 'normal')
plt.text(12, 0.97, '4: College (BS, BA) degree or more', fontsize = 9, color = 'grey', style = 'normal', weight = 'normal')
# Set layout
plt.tight_layout()