Survival Analysis Kaplan-Meier(英)

 

Table of Contents

1. What is Survival Analysis?
2. Sample
3. Code (Experiment)
_3.1 Kaplan-Meier fitter
_3.2 Kaplan-Meier fitter Based on Different Groups.
_3.3 Log-Rank-Test

1. What is Survival Analysis?

Survival analysis is a branch of statistics for analyzing the expected duration of time until one or more events happen, such as a death in biological organisms and failure of mechanical systems.

For instance, how can Survival Analysis be useful to analyze the ongoing COVID-19 pandemic data?

・We can find the number of days until patients showed COVID-19 symptoms.

・We can find for which age group it is deadlier.

・We can find which treatment has the highest survival probability.

・We can find whether a person’s sex has a significant effect on their survival time?

・We can find the median number of days of survival for patients.

・We can find which factor has more impact on patients’ survival.

Survival time and type of events in cancer studies

Survival Time: It is usually referred to as an amount of time until when a subject is alive or actively participates in a survey.

There are three main types of events in survival analysis:

1) Relapse: Relapse is defined as a deterioration in the subject’s state of health after a temporary improvement.

2) Progression: Progression is defined as the process of developing or moving gradually towards a more advanced state. It basically means that the health of the subject under observation is improving.

3) Death: Death is defined as the destruction or the permanent end of something. In our case, death will be our event of interest.

Censoring

As we discussed above, survival analysis focuses on the occurrence of an event of interest. The event of interest can be anything like birth, death, or retirement. However, there is still a possibility that the event we are interested in does not occur. Such observations are known as censored observations.

There are three types of censoring:

Right Censoring: The subject under observation is still alive. In this case, we cannot have our timing when our event of interest (death) occurs.

Left Censoring: In this type of censoring, the event cannot be observed for any reason. It may also include the event that occurred before the experiment started, such as the number of days from birth when the kid started walking.

3) Interval Censoring: In this type of data censoring, we only have data for a specific interval, so it is possible that the event of interest does not occur during that time.

2. Use cases

Medical

・Survival timeline Analysis

・Recovery time Analysis

Production

・Machine life Analysis

・Production line Analysis

・Production critical Analysis

・Maintenance Analysis

Business

・ Analysis of warranty claim period

・ Leads to sales time analysis

・ Analysis of sales staff for the first sales analysis

3. Code (Experiment)

Environment:Google Colab

Library:lifelines

Survival analysis was originally developed and applied heavily by the actuarial and medical community. Its purpose was to answer why are events occurring now versus later under uncertainty (where events might refer to deaths, disease remission, etc.). This is great for researchers who are interested in measuring lifetimes: they can answer questions like what factors might influence deaths?

Reference:https://lifelines.readthedocs.io/en/latest/

Dataset:NCCTG Lung Cancer Data

The North Central Cancer Treatment Group (NCCTG) data set records the survival of patients with advanced lung cancer, together with assessments of the patients’ performance status measured either by the physician and by the patients themselves. The goal of the study was to determine whether patients’ self-assessment could provide prognostic information complementary to the physician’s assessment. The data set contains 228 patients, including 63 patients that are right censored (patients that left the study before their death).

Reference:http://www-eio.upc.edu/~pau/cms/rdata/doc/survival/lung.html

 

3.1 Kaplan-Meier

Library

!pip install lifelines

Import Library

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import urllib

from urllib import request

import lifelines

from lifelines import KaplanMeierFitter

 

print(‘lifelines version: ‘, lifelines.__version__)

lifelines version:  0.25.5

Download data

csv_src = “http://www-eio.upc.edu/~pau/cms/rdata/csv/survival/lung.csv”

csv_path = ‘lung.csv’

urllib.request.urlretrieve(csv_src, csv_path)

Create dataframe

df = pd.read_csv(csv_path)

 

#  Precessing

df.loc[df.status == 1, ‘dead’] = 0

df.loc[df.status == 2, ‘dead’] = 1

df = df.drop([‘Unnamed: 0’, ‘status’], axis=1)

 

df.head()

Get statistical info

df.describe()

Kaplan-Meier-Fitter

#  Kaplan-Meier-Fitter

kmf = KaplanMeierFitter()

kmf.fit(durations = df.time, event_observed = df.dead)

 

Event table:

  1. a) event_at: It stores the value of the timeline for our dataset. I.e., when was the patient observed in our experiment or when was the experiment conducted. It can be several minutes, days, months, years, and others. In our case, it is going to be for many days. It stores the value of survival days for the subjects.
  2. b) at_risk: It stores the number of current patients under observation. In the beginning, it will be the total number of patients we are going to observe in our experiment. If new patients are added at a particular time, then we have to increase their value accordingly. Therefore:

Figure 19: Analyzing at_risk variable. | Survival Analysis with Python Tutorial

Figure 19: at_risk variable formula.

  1. c) entrance: It stores the value of new patients in a given timeline. It is possible that while experimenting, other patients are also diagnosed with the disease. To account for that, we have the entrance column.
  2. d) censored: Our ultimate goal is to find the survival probability for a patient. At the end of the experiment, if the person is still alive, we will add him/her to the censored category. We have already discussed the types of censoring.
  3. e) observed: It stores the value of the number of subjects that died during the experiment. From a broad perspective, these are the people who met our event of interest.
  4. f) removed: It stores the values of patients that are no longer part of our experiment. If a person dies or is censored, then he/she falls into this category. In short, it is an addition of the data in the observed and censored category.

 

kmf.event_table

Calculating the probability of survival for individual timelines:

Let’s first see the formula for calculating the survival of a particular person at a given time.

Calculating the probability of survival for individual timelines:

Let’s first see the formula for calculating the survival of a particular person at a given time.

#  Calculate survival probability for given time

 

print(“Survival probablity for t=0: “, kmf.predict(0))

print(“Survival probablity for t=5: “, kmf.predict(5))

print(“Survival probablity for t=11: “, kmf.predict(11))

print(“Survival probablity for t=840: “, kmf.predict(840))

 

print(kmf.predict([0,5,11,840]))

Survival probablity for t=0:  1.0

Survival probablity for t=5:  0.9956140350877193

Survival probablity for t=11:  0.9824561403508766

Survival probablity for t=840:  0.06712742409441387

0      1.000000

5      0.995614

11     0.982456

840    0.067127

Name: KM_estimate, dtype: float64

 

survial full list

#  Get survial full list

print(kmf.survival_function_)

KM_estimate

timeline

0.0          1.000000

5.0          0.995614

11.0         0.982456

12.0         0.978070

13.0         0.969298

…               …

840.0        0.067127

883.0        0.050346

965.0        0.050346

1010.0       0.050346

1022.0       0.050346

 

[187 rows x 1 columns]

 

survival graph curve

#  Plot the survival graph

 

kmf.plot()

plt.title(“Kaplan-Meier”)

plt.xlabel(“Number o days”)

plt.ylabel(“Probablity of survival”)

plt.show()

 

survival time median

#  Get survival time median

 

survival_time_median = kmf.median_survival_time_

print(” The median survial time: “, survival_time_median)

The median survial time:  310.0

survival function with confidence interval

#  Plot survival function with confidence interval

 

survival_confidence_interval = kmf.confidence_interval_survival_function_

 

plt.plot(survival_confidence_interval[‘KM_estimate_lower_0.95’], label=”Lower”)

plt.plot(survival_confidence_interval[‘KM_estimate_upper_0.95’], label=”Upper”)

plt.title(“Survival Function with confidence interval”)

plt.xlabel(“Number of days”)

plt.ylabel(“Survial Probablity”)

plt.legend()

plt.show()

 

Cumulative survival

#  Cumulative survival

cumu_surv = kmf.cumulative_density_

 

#  Plot cumulative density

kmf.plot_cumulative_density()

plt.title(“Cumulative density”)

plt.xlabel(“Number of days”)

plt.ylabel(“Probablity of death”)

plt.show()

 

Median survival

# Median time left for event

 

median_time_to_event = kmf.conditional_time_to_event_

plt.plot(median_time_to_event, label = “Median  time left”)

plt.title(“Median time to event”)

plt.xlabel(“Total days”)

plt.ylabel(“Conditional median time to event”)

plt.legend()

plt.show()

3.2 Kaplan-Meier Estimator with groups

Until now, we saw how we could find the survival probability and hazard probability for all of our observations. Now it is time to perform some analysis on our data to determine whether there is any difference in survival probability if we divide our data into groups based on specific characteristics. Let’s divide our data into two groups based on sex: Male and Female. Our goal here is to check is there any significant difference in survival rate if we divide our dataset based on sex. Later in this tutorial, we will see on what basis do we divide the data into groups.

 

# KM for Male and Female

 

kmf_m = KaplanMeierFitter()

kmf_f = KaplanMeierFitter()

Male = df.query(“sex == 1”)

Female = df.query(“sex == 2″)

kmf_m.fit(durations=Male.time, event_observed=Male.dead, label=”Male”)

kmf_f.fit(durations=Female.time, event_observed=Female.dead, label=”Female”)

survival function comparison

# Plot survival function

 

kmf_m.plot()

kmf_f.plot()

plt.title(“KM Survival Probaility: Male vs Female”)

plt.xlabel(“Days”)

plt.ylabel(“Survival Probability”)

plt.show()

 

cumulative density comparison

#  Plot cumulative density

 

kmf_m.plot_cumulative_density()

kmf_f.plot_cumulative_density()

plt.title(“Cumulative density”)

plt.xlabel(“Number of days”)

plt.ylabel(“Probablity of death”)

plt.show()

 

3.3 Log-Rank Test:

The log-rank test is a hypothesis test that is used to compare the survival distribution of two samples.

Goal: Our goal is to see if there is any significant difference between the groups being compared.

Null Hypothesis: The null hypothesis states that there is no significant difference between the groups being studied. If there is a significant difference between those groups, then we have to reject our null hypothesis.

from lifelines.statistics import logrank_test

 

time_m = Male.time

event_m = Male.dead

 

time_f = Female.time

event_f = Female.dead

 

results = logrank_test(time_m, time_f, event_m, event_f)

results.print_summary()

print(“P-value: “, results.p_value)

 

Less than (5% = 0.05) P-value means there is a significant difference between the groups we compared. We can partition our groups based on their sex, age, race, treatment method, and others.

We have compared the survival distributions of two different groups using the famous statistical method, the Log-rank test. Here we can notice that the p-value is 0.00131 (<0.005) for our groups, which denotes that we must reject the null hypothesis and admit that the survival function for both groups is significantly different.

 

担当者:HM
香川県高松市出身 データ分析にて、博士(理学)を取得後、自動車メーカー会社にてデータ分析に関わる。その後コンサルティングファームでデータ分析プロジェクトを歴任後独立 気が付けばデータ分析プロジェクトだけで50以上担当
理化学研究所にて研究員を拝命中 応用数理学会所属