How to Plot a Kaplan-Meier Curve in R
Table of Contents
⚠️ Make Sure You Understand the Code Before Using It ⚠️
Introduction
The Kaplan-Meier curve is a popular tool for visualizing survival data. It is often used in medical research to compare the survival rates of different treatments or patient groups. In this tutorial, we will show you how to plot a Kaplan-Meier curve with ready-to-public style in R using the survival
and survminer
package. We will use sample data from the veteran
dataset, which is included in the survival package. For calculating the confidence intervals, R’s default is the log setting while Stata and SAS use the complimentary log-log method.
In this tutorial, we will demonstrate how to plot Kaplan-Meier curves for one or two groups in R, with options to display the median survival time, hazard ratio, and their corresponding 95% confidence intervals. You need to modify the functions a little bit before using. We assume you have prelimnary knowledge of R packages. We will comments on the lines if the functions of these lines are not straightforward enough.
Library, ggplot Theme, and Data Prep
The veteran
dataset contains survival data from a two-treatment, randomized trial for lung cancer. The variables in veteran
are:
trt
: 1=standard 2=testcelltype
: 1=squamous, 2=small cell, 3=adeno, 4=largetime
: survival time in daysstatus
: censoring status 0=censored, 1=deathkarno
: Karnofsky performance score (100=good)diagtime
: months from diagnosis to randomizationage
: in yearsprior
: prior therapy 0=no, 10=yes
library(survival)
library(survminer)
library(ggplot2)
library(dplyr)
custom_theme <- theme_cleantable() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks.x = element_blank(),
axis.line.x = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
plot.title = element_text(size=14) #change the fontsize of "Number at risk" title
)
data = veteran %>%
mutate(
trt = factor(trt, levels = c(1,2), labels = c("Control", "Treatment")),
time = time/30.437 # One month = 30.437 days
)
data_control = data %>%
filter(trt == "Control")
data_trt = data %>%
filter(trt == "Treatment")
One Group Plot
Input:
data
= Your cleaned-up survival dataset.outcomeTime
= Your outcome time variable, please make sure the unit is correct. The unit is Months by default.outcome
= The numeric outcome indicator, 0 = censored, 1 = event.outcomeName
= Name of outcome of interest, i.e. OS, PFS.xlab
= Label of x-axis.ylab
= Label of y-axis.title
= title of the plot.BreakBy
= Break x-axis every X units, it’s 6 months by default.text_x
= X-axis location of text lines, no default.text_y
= Y-axis location of text lines, input numbers from 0-1, by default it’s 0.9.
OneGroup_survPlot <- function(data = data,
outcomeTime = "time",
outcome = "status",
outcomeName = "OS",
xlab = "Months since The First Treatment",
ylab = "Probability of OS",
title = NULL,
risk_table_title="No. at risk",
BreakBy = 6, # Usually people want to see survival probability every 3-, 6-, 12-months.
text_x,
text_y = 0.9
){
survF = as.formula(paste0("Surv(",outcomeTime,",",outcome,") ~ 1"))
Sfit <- surv_fit(survF, data = data)
annoLine1 = paste0("Median ",outcomeName," (95% CI)")
annoLine2 = paste0(round(surv_median(Sfit)$median,1), " Months (" , round(surv_median(Sfit)$lower,1), ", ", round(surv_median(Sfit)$upper,1), ")")
g = ggsurvplot(Sfit, data = data, palette = "black",
xlab = xlab, ylab = ylab,
font.x = c(16), font.y = c(16), #Font size of x,y axis label
title = title,
break.time.by=BreakBy,
xlim=c(0,max(data[[outcomeTime]])+0.15*max(data[[outcomeTime]])), #Extend the x-axis to show all tick marks.
risk.table = TRUE,
risk.table.title=risk_table_title,
tables.theme = custom_theme,
tables.height = 0.15,
surv.plot.height = 0.8,
# axes.offset = F, # By uncommenting this line, the x- and y-axes will start at zero. However, the first value in the 'number at risk' table won't align well with the x-axis of the plot. I recommend using the base R method to generate the KM curve if you need this feature and don't like its performance here.
font.tickslab = c(18), # Make the tick marks bigger.
fontsize = 5.5, #Font size of risk table numbers
conf.int = FALSE) # TRUE, conf.int.style=c("step").
#if you want to confidence interval with dashed lines.
g$plot = g$plot + theme(legend.position = "none") +
annotate("text", x = text_x, na.rm = T, y = text_y, label = annoLine1, hjust = 0, size = 5) +
annotate("text", x = text_x,na.rm = T, y = text_y-0.05, label = annoLine2, hjust = 0, size = 5) # Modify size to adjust the text font size.
#### If your risk table is not aligning with your plot, force them to have the same margin. plot.margin = unit(c(top, right, bottom, left), "points")
# g$table <- g$table + theme(plot.margin = unit(c(5.5, 5.5, 5.5, 5.5), "points"))
# g$plot <- g$plot + theme(plot.margin = unit(c(5.5, 5.5, 5.5, 5.5), "points"))
return(g)
}
Sample KM Plot of Treatment Group
sample_onegroup = OneGroup_survPlot(data = data_trt, outcomeTime = "time",outcome = "status", text_x = 2)
## Warning: `select_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `select()` instead.
## ℹ The deprecated feature was likely used in the survminer package.
## Please report the issue at <https://github.com/kassambara/survminer/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Two Groups Plot
You can easily modify this function for >= 3 groups. PLEASE confirm that the labels of two groups are correct. By default, control group is the referece group, meaning that control is level=1 in a factor variable.
Input:
data
= Your cleaned-up survival datasetoutcomeTime
= Your outcome time variable, please make sure the unit is correct. The unit is Months by default.outcome
= Outcome status, 1 = event, 0 = censored.xlab
= Label of x-axisylab
= Label of y-axisVarName
= The variable that indicates groupsExpName
= legend title, i.e. ArmExpLabels
= legend labels, i.e. Treatment, ControlHRText
= show Cox model HR and 95% text if TRUE, not show if FALSE. The p-value will be Cox model p-value.MedianText
= show Median and 95% text if TRUE, not show if FALSE. The p-value will be Log-rank test p-value. Please do not set both HRText and MedianText = TRUE at the same time.BreakBy
= Break x-axis every X units, it’s 6 months by default.title
= title of the plot.risk_table_title
= title of risk table.text_x
= X-axis location of text lines, no default.text_y
= Y-axis location of text lines, input numbers from 0-1, by default it’s 0.9.group_levels_label
= Labels of your group levels.legend_location
= c(x-axis, y-axis) location of legend box.
- I suggest using the commented code at the end of the function to double-check your KM plot results, ensuring the correct reference group and labels are applied.
# Surv plot for two groups
# risk.table.title="My title"
TwoGroups_survPlot <- function(data,
outcomeTime, outcome,
xlab, ylab,
VarName,ExpName, ExpLabels = NULL,
HRText = F, MedianText=T, BreakBy=6, # Usually people want to see survival probability every 3-, 6-, 12-months.
title=NULL, risk_table_title="No. at risk",
text_x, text_y=0.9,
group_levels_label = c("Treatment","Control"), legend_location = c(0.9, 0.9)){
data$Group = data[[VarName]]
if (is.null(outcomeTime)) {
survF = as.formula(paste0("Surv(begin_a,end_a,",outcome,") ~ Group"))
} else {
survF = as.formula(paste0("Surv(",outcomeTime,",",outcome,") ~ Group"))
}
Sfit <- surv_fit(survF, data = data)
g = ggsurvplot(Sfit, data = data,
legend.title = ExpName, legend = legend_location,
legend.labs = ExpLabels,
xlab = xlab, ylab = ylab,
font.x = c(16), font.y = c(16), #Font size of x,y axis label
title=title,
xlim=c(0,max(data[[outcomeTime]])+0.15*max(data[[outcomeTime]])), #Extend the x-axis to show all tick marks.
break.time.by = BreakBy,
tables.height = 0.2,
surv.plot.height = 0.8,
# axes.offset = F, # By uncommenting this line, the x- and y-axes will start at zero. However, the first value in the 'number at risk' table won't align well with the x-axis of the plot. I recommend using the base R method to generate the KM curve if you need this feature and don't like its performance here.
tables.theme = custom_theme,
risk.table = TRUE,
risk.table.title=risk_table_title,
font.tickslab = c(18), #Font size of ticks
fontsize = 5.5) #Font size of risk table numbers
if (HRText){
sdf.cox = coxph(survF, data = data, ties="breslow")
HR = round(exp(sdf.cox$coefficients),2)
up95 = round(exp(confint(sdf.cox))[2],2)
low95 = round(exp(confint(sdf.cox))[1],2)
p.val = round(summary(sdf.cox)$waldtest[3],4)
annoLine1 = paste0(group_levels_label[1]," vs. ",group_levels_label[2])
annoLine2 = paste0("Cox Model HR (95% CI)")
annoLine3 = paste0(HR," (",low95,", ",up95, ")")
annoLine4 = paste0("p = ",p.val)
g$plot = g$plot +
annotate("text", x = text_x, y = text_y, label = annoLine1, hjust = 0,size=4) + # Edit y=0.xx to adjust the y-axis location of your text.
annotate("text", x = text_x, y = text_y-0.05, label = annoLine2, hjust = 0,size=5) +
annotate("text", x = text_x, y = text_y-0.1, label = annoLine3, hjust = 0,size=5) +
annotate("text", x = text_x, y = text_y-0.15, label = annoLine4, hjust = 0,size=5)+
theme(legend.text = element_text(size = 8)) # Modify the legend text font size.
}
if (MedianText) {
logrank = survdiff(survF, data = data)
median0 = round(surv_median(Sfit)$median[1],1)
median1 = round(surv_median(Sfit)$median[2],1)
median0low = round(surv_median(Sfit)$lower[1],1)
median0up = round(surv_median(Sfit)$upper[1],1)
median1low = round(surv_median(Sfit)$lower[2],1)
median1up = round(surv_median(Sfit)$upper[2],1)
p.val = round(logrank$pvalue,3)
annoLine1 = paste0("Median Survival (95% CI)")
annoLine2 = paste0(group_levels_label[1],": ",median1, " (", median1low,", ",median1up,") Months")
annoLine3 = paste0(group_levels_label[2],": ",median0, " (", median0low,", ",median0up,") Months")
annoLine4 = paste0("Log-rank P-Value = ",p.val)
g$plot = g$plot +
annotate("text", x = text_x, y = text_y, label = annoLine1, hjust = 0,size=5) +
annotate("text", x = text_x, y = text_y-0.05, label = annoLine2, hjust = 0,size=5) +
annotate("text", x = text_x, y = text_y-0.1, label = annoLine3, hjust = 0,size=5) +
annotate("text", x = text_x, y = text_y-0.15, label = annoLine4, hjust = 0,size=5)+
theme(legend.text = element_text(size = 8)) # Modify the legend text font size.
}
#### If your risk table is not aligning with your plot, force them to have the same margin. plot.margin = unit(c(top, right, bottom, left), "points")
# g$table <- g$table + theme(plot.margin = unit(c(5.5, 5.5, 5.5, 5.5), "points"))
# g$plot <- g$plot + theme(plot.margin = unit(c(5.5, 5.5, 5.5, 5.5), "points"))
return(g)
# surv_median(Sfit) ## To check if you have the correct reference group, when MedianText = T, comment return(g) and un-comment this line.
# summary(sdf.cox) ## To check if you have the correct reference group, when HRText = T, comment return(g) and un-comment this line.
}
Sample KM Plot of Two Groups
sample_twogroup = TwoGroups_survPlot(data = data, outcomeTime = "time", outcome = "status",
xlab = "Months since the first treatment", ylab = "Probability of OS",
VarName = "trt",
ExpName = "Arm", ExpLabels = c("Ctrl", "Trt"),
HRText = F, MedianText = T,
text_x = 8, title = "Sample KM Plot of Two Groups",
BreakBy = 6,
)
Basic R Method
The basic R method is not as straight forward as ggplot2 method. However, it’s useful when we want to customize certain parts of the plot. One example is when you have very long group names, basic R method will keep everything in your desire place.
The first part is a R function to general tables of survival estimates.
Inputs:
data
: your data in data.frame formatbyvar
: optional input of by-group comparisonstime
: survival timeevent
: indicator of event or censoringtimepoints
: vector of times at which numbers at risk are to be estimated
Output: Returns a list of 4 items:
- Data frame with a print out of survival estimates at each failure time (time, estimate, 95% CI)
- Number at risk for intervals specified by timepoints
- Survfit object, to be used to creating KM curves
- SurvF object, the formula of survival function
- You can use
coxph(output.obj[[4]])
to fit a proportional cox model or usesurvdiff(output.obj[[4]])
to run log-rank test.
generate.survival.objects <- function(data, time, event, subset=(1:nrow(data)), byvar=NULL, timepoints=seq(0,max(time),1),
labels=NULL, conftype="log-log", ## Default is log.
ratemult=365.25) { ## Enter 365.25 if your time unit is year, enter 30.437 if your time unit is month, enter 1 if your time unit is day.
data <- data[subset,]
time <- time[subset]
event <- event[subset]
if (is.null(byvar)) {
survF = Surv(time, event, type = "right")~1
fit.os <- survfit(formula = Surv(time, event, type = "right")~1, data = data, na.action = na.exclude, conf.int = 0.95, se.fit = T, type = "kaplan-meier", error = "greenwood", conf.type = conftype, conf.lower = "usual")
tab.os <- round(data.frame(fit.os$time, fit.os$n.risk, fit.os$surv, fit.os$lower, fit.os$upper), digits=3)[match(unique(fit.os$surv), fit.os$surv),]
names(tab.os) <- c("Time", "Number at risk", "Probability", "95%", "CI")
nrisk = nevents = rate = matrix(0, nrow=length(timepoints), ncol=1)
for (iii in 1:length(timepoints)) {
for (jjj in 1) {
nrisk[iii,jjj] <- length(time[time>=timepoints[iii] & !is.na(time)])
nevents[iii,jjj] <- length(time[time>=timepoints[iii] & time<timepoints[iii+1] & event==1 & !is.na(time)])
ratetime <- ifelse(time<=timepoints[iii], 0,
ifelse(time>timepoints[iii] & time<=timepoints[iii+1], time,
ifelse(time>timepoints[iii+1], timepoints[iii+1], NA)))
ratetime <- ratetime - timepoints[iii]
ratetime[ratetime<0] <- 0
rate[iii,jjj] <- round(nevents[iii,jjj]/sum(ratetime,na.rm=T) * ratemult,2)
}
}
}
if (!is.null(byvar)) {
byvar <- byvar[subset]
survF = Surv(time, event, type = "right")~factor(byvar)
fit.os <- survfit(formula = Surv(time, event, type = "right") ~ factor(byvar), data = data, na.action = na.exclude, conf.int = 0.95, se.fit = T, type = "kaplan-meier", error = "greenwood", conf.type = conftype, conf.lower = "usual")
tab.os <- round(data.frame(fit.os$time, fit.os$n.risk, fit.os$surv, fit.os$lower, fit.os$upper), digits=3)[match(unique(fit.os$surv), fit.os$surv),]
names(tab.os) <- c("Time", "Number at risk", "Probability", "95%", "CI")
iii <- rep(NA,nrow(tab.os))
iii[1] <- 1
for (ttt in 2:length(iii)) {iii[ttt] <- ifelse(tab.os$Time[ttt]>tab.os$Time[ttt-1],0,1)}
tab.temp <- tab.os
col1 <- rep("", nrow(tab.os))
if (!is.null(labels)) col1[which(iii==1)] <- labels
tab.os <- cbind(col1, tab.temp)
names(tab.os)[1] <- " "
nrisk = nevents = rate = matrix(0, nrow=length(timepoints), ncol=length(na.omit(unique(byvar))))
for (iii in 1:length(timepoints)) {
for (jjj in 1:length(na.omit(unique(byvar)))) {
nrisk[iii,jjj] <- length(time[time>=timepoints[iii] & !is.na(time) & !is.na(byvar) & byvar==sort(na.omit(unique(byvar)))[jjj]])
nevents[iii,jjj] <- length(time[time>=timepoints[iii] & time<timepoints[iii+1] & event==1 & !is.na(time) & byvar==sort(na.omit(unique(byvar)))[jjj]])
ratetime <- ifelse(time<=timepoints[iii], 0,
ifelse(time>timepoints[iii] & time<=timepoints[iii+1], time,
ifelse(time>timepoints[iii+1], timepoints[iii+1], NA)))
ratetime <- ratetime - timepoints[iii]
ratetime[ratetime<0] <- 0
rate[iii,jjj] <- round(nevents[iii,jjj]/sum(ratetime[byvar==sort(na.omit(unique(byvar)))[jjj]],na.rm=T) * ratemult,2)
}
}
}
return(list(tab.os, cbind(nrisk,nevents,rate), fit.os, survF))
}
Sample Basic R KM Plot
Then we need some basic R plot techniques.
- To change the font size, set cex < 1 or > 1, cex=1 by default.
os.obj <- generate.survival.objects(data=data, time=data$time,
event=data$status,
byvar=data$trt,
conftype = "log",
timepoints=seq(0,36,6),ratemult = 30.437)
basic_km_setting = function(ylim, xlim, xaxis_range, yaxis_range) {
ylim=ylim
xlim=xlim
xaxis_range=xaxis_range
yaxis_range=yaxis_range
return(list(ylim, xlim, xaxis_range, yaxis_range))
}
para = basic_km_setting(ylim = c(0,1), xlim = c(0,36), xaxis_range = seq(0,36,6), yaxis_range = seq(0,100,10)) # Set up the axis parameters of the plot
par(mar=c(11,10,3,3)) # adjust the size of the plot margins using the notation par(mar = c(bottom, left, top, right)
plot(os.obj[[3]], fun = function(x) x, # x if KM plot, 1-x if cumulative
xlab="Months since the first treatment",
ylab="Probability of OS (%)",
xaxt="n", bty="n", yaxt="n", cex.lab=1.4, ylim=para[[1]],xlim=para[[2]],
mark.time=F, conf.int=F, lwd=3, col=c('orange', 'cornflowerblue'))
axis(2, at=para[[4]]/100, labels = para[[4]],
las=2, cex.axis=1.4)
axis(1, at = para[[3]], labels = para[[3]], cex.axis=1.4)
legend(24, 1, legend=c("Ctrl", "Trt"),
col=c("orange", "cornflowerblue"), lty=1, cex=0.8, box.lty=0)
mtext(c("No. at risk"), side=1, line=5, at=c(-5), adj=c(0,0), cex=1, font=2)
mtext(c("Ctrl: Let's try \na very long \ngroup name", os.obj[[2]][,1]), side=1, line=8, at=c(-8,para[[3]]), adj=c(0,0), cex=1) # If your name is 3 line, line=5+3, at=c(-8) means -8 at the x-axis.
mtext(c("Trt ", os.obj[[2]][,2]), side=1, line=9, at=c(-6,para[[3]]), adj=c(0,0), cex=1) # line = previous line+1
text(18, .60, "Log-rank p = 0.928", cex=1, font=0.5) # text(7, .13, "arm name", cex=1.2, font=2) ## Add more text as needed
Session Info
sessionInfo()
## R version 4.4.0 (2024-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.1.3 survminer_0.4.9 ggpubr_0.6.0 ggplot2_3.4.4
## [5] survival_3.6-4
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.5 utf8_1.2.3 generics_0.1.3 tidyr_1.3.1
## [5] xml2_1.3.4 rstatix_0.7.2 blogdown_1.18 lattice_0.22-6
## [9] digest_0.6.33 magrittr_2.0.3 evaluate_0.22 grid_4.4.0
## [13] bookdown_0.35 fastmap_1.1.1 jsonlite_1.8.7 Matrix_1.7-0
## [17] backports_1.4.1 ggtext_0.1.2 gridExtra_2.3 purrr_1.0.1
## [21] fansi_1.0.4 scales_1.3.0 jquerylib_0.1.4 abind_1.4-5
## [25] cli_3.6.1 crayon_1.5.2 KMsurv_0.1-5 rlang_1.1.1
## [29] munsell_0.5.1 splines_4.4.0 withr_3.0.0 cachem_1.0.8
## [33] yaml_2.3.6 tools_4.4.0 ggsignif_0.6.4 colorspace_2.1-0
## [37] km.ci_0.5-6 broom_1.0.5 vctrs_0.6.3 R6_2.5.1
## [41] zoo_1.8-11 lifecycle_1.0.4 car_3.1-2 pkgconfig_2.0.3
## [45] pillar_1.9.0 bslib_0.5.1 gtable_0.3.5 Rcpp_1.0.10
## [49] data.table_1.14.8 glue_1.6.2 highr_0.10 xfun_0.43
## [53] tibble_3.2.1 tidyselect_1.2.1 rstudioapi_0.15.0 knitr_1.46
## [57] farver_2.1.2 xtable_1.8-4 survMisc_0.5.6 htmltools_0.5.4
## [61] labeling_0.4.3 rmarkdown_2.25 carData_3.0-5 compiler_4.4.0
## [65] gridtext_0.1.5