How to Plot a Kaplan-Meier Curve in R

HQ, AB

Last updated on Nov 5, 2024 13 min read Biostatistics

Table of Contents

⚠️ Make Sure You Understand the Code Before Using It ⚠️

Introduction

The Kaplan-Meier curve is a popular tool for visualizing survival data. It is often used in medical research to compare the survival rates of different treatments or patient groups. In this tutorial, we will show you how to plot a Kaplan-Meier curve with ready-to-public style in R using the survival and survminer package. We will use sample data from the veteran dataset, which is included in the survival package. For calculating the confidence intervals, R’s default is the log setting while Stata and SAS use the complimentary log-log method.

In this tutorial, we will demonstrate how to plot Kaplan-Meier curves for one or two groups in R, with options to display the median survival time, hazard ratio, and their corresponding 95% confidence intervals. You need to modify the functions a little bit before using. We assume you have prelimnary knowledge of R packages. We will comments on the lines if the functions of these lines are not straightforward enough.

Library, ggplot Theme, and Data Prep

The veteran dataset contains survival data from a two-treatment, randomized trial for lung cancer. The variables in veteran are:

trt: 1=standard 2=test
celltype: 1=squamous, 2=small cell, 3=adeno, 4=large
time: survival time in days
status: censoring status 0=censored, 1=death
karno: Karnofsky performance score (100=good)
diagtime: months from diagnosis to randomization
age: in years
prior: prior therapy 0=no, 10=yes

library(survival)
library(survminer)
library(ggplot2)
library(dplyr)

custom_theme <- theme_cleantable() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.x = element_blank(),
        axis.line.x = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        plot.title = element_text(size=14) #change the fontsize of "Number at risk" title
        ) 

data = veteran %>% 
  mutate(
    trt = factor(trt, levels = c(1,2), labels = c("Control", "Treatment")),
    time = time/30.437 # One month = 30.437 days
  )

data_control = data %>% 
  filter(trt == "Control")

data_trt = data %>% 
  filter(trt == "Treatment")

One Group Plot

Input:

data = Your cleaned-up survival dataset.
outcomeTime = Your outcome time variable, please make sure the unit is correct. The unit is Months by default.
outcome = The numeric outcome indicator, 0 = censored, 1 = event.
outcomeName = Name of outcome of interest, i.e. OS, PFS.
xlab = Label of x-axis.
ylab = Label of y-axis.
title = title of the plot.
BreakBy = Break x-axis every X units, it’s 6 months by default.
text_x = X-axis location of text lines, no default.
text_y = Y-axis location of text lines, input numbers from 0-1, by default it’s 0.9.

OneGroup_survPlot <- function(data = data,
                        outcomeTime = "time",
                        outcome = "status",
                        outcomeName = "OS",
                        xlab = "Months since The First Treatment",
                        ylab = "Probability of OS",
                        title = NULL,
                        risk_table_title="No. at risk",
                        BreakBy = 6, # Usually people want to see survival probability every 3-, 6-, 12-months.
                        text_x,
                        text_y = 0.9
                           ){
    survF = as.formula(paste0("Surv(",outcomeTime,",",outcome,") ~ 1"))
    Sfit <- surv_fit(survF, data = data)
    annoLine1 = paste0("Median ",outcomeName," (95% CI)")
    annoLine2 = paste0(round(surv_median(Sfit)$median,1), " Months (" , round(surv_median(Sfit)$lower,1), ", ", round(surv_median(Sfit)$upper,1), ")")  
    
    g = ggsurvplot(Sfit, data = data, palette = "black",
                   xlab = xlab, ylab = ylab,
                   font.x = c(16), font.y = c(16), #Font size of x,y axis label
                   title = title,
                   break.time.by=BreakBy, 
                   xlim=c(0,max(data[[outcomeTime]])+0.15*max(data[[outcomeTime]])), #Extend the x-axis to show all tick marks.
                   risk.table = TRUE, 
                   risk.table.title=risk_table_title,
                   tables.theme = custom_theme,
                   tables.height = 0.15,
                   surv.plot.height = 0.8,
                   # axes.offset = F, # By uncommenting this line, the x- and y-axes will start at zero. However, the first value in the 'number at risk' table won't align well with the x-axis of the plot. I recommend using the base R method to generate the KM curve if you need this feature and don't like its performance here.
                   font.tickslab = c(18), # Make the tick marks bigger.
                   fontsize = 5.5, #Font size of risk table numbers
                   conf.int = FALSE) # TRUE, conf.int.style=c("step").
    #if you want to confidence interval with dashed lines.

    
    g$plot = g$plot +  theme(legend.position = "none") + 
      annotate("text", x = text_x, na.rm = T, y = text_y, label = annoLine1, hjust = 0, size = 5) +
      annotate("text", x = text_x,na.rm = T, y = text_y-0.05, label = annoLine2, hjust = 0, size = 5) # Modify size to adjust the text font size.
  
    
 #### If your risk table is not aligning with your plot, force them to have the same margin.   plot.margin = unit(c(top, right, bottom, left), "points")
    # g$table <- g$table +   theme(plot.margin = unit(c(5.5, 5.5, 5.5, 5.5), "points"))
    # g$plot  <- g$plot +   theme(plot.margin = unit(c(5.5, 5.5, 5.5, 5.5), "points"))

    return(g)
}

Sample KM Plot of Treatment Group

sample_onegroup = OneGroup_survPlot(data = data_trt, outcomeTime = "time",outcome = "status", text_x = 2)

## Warning: `select_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `select()` instead.
## ℹ The deprecated feature was likely used in the survminer package.
##   Please report the issue at <https://github.com/kassambara/survminer/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Two Groups Plot

You can easily modify this function for >= 3 groups. PLEASE confirm that the labels of two groups are correct. By default, control group is the referece group, meaning that control is level=1 in a factor variable.

Input:

data = Your cleaned-up survival dataset
outcomeTime = Your outcome time variable, please make sure the unit is correct. The unit is Months by default.
outcome = Outcome status, 1 = event, 0 = censored.
xlab = Label of x-axis
ylab = Label of y-axis
VarName = The variable that indicates groups
ExpName = legend title, i.e. Arm
ExpLabels = legend labels, i.e. Treatment, Control
HRText = show Cox model HR and 95% text if TRUE, not show if FALSE. The p-value will be Cox model p-value.
MedianText = show Median and 95% text if TRUE, not show if FALSE. The p-value will be Log-rank test p-value. Please do not set both HRText and MedianText = TRUE at the same time.
BreakBy = Break x-axis every X units, it’s 6 months by default.
title = title of the plot.
risk_table_title = title of risk table.
text_x = X-axis location of text lines, no default.
text_y = Y-axis location of text lines, input numbers from 0-1, by default it’s 0.9.
group_levels_label = Labels of your group levels.
legend_location = c(x-axis, y-axis) location of legend box.

I suggest using the commented code at the end of the function to double-check your KM plot results, ensuring the correct reference group and labels are applied.

# Surv plot for two groups
# risk.table.title="My title"

TwoGroups_survPlot <- function(data, 
                               outcomeTime, outcome,
                               xlab, ylab,
                               VarName,ExpName, ExpLabels = NULL, 
                               HRText = F, MedianText=T, BreakBy=6, # Usually people want to see survival probability every 3-, 6-, 12-months. 
                               title=NULL, risk_table_title="No. at risk",
                               text_x, text_y=0.9,
                               group_levels_label = c("Treatment","Control"), legend_location = c(0.9, 0.9)){
  data$Group = data[[VarName]]
  if (is.null(outcomeTime)) {
    survF = as.formula(paste0("Surv(begin_a,end_a,",outcome,") ~ Group"))
    
  } else {
    survF = as.formula(paste0("Surv(",outcomeTime,",",outcome,") ~ Group"))
    
  }
  Sfit <- surv_fit(survF, data = data)
  g = ggsurvplot(Sfit, data = data, 
                 legend.title = ExpName, legend = legend_location, 
                 legend.labs = ExpLabels,
                 xlab = xlab, ylab = ylab,
                 font.x = c(16), font.y = c(16), #Font size of x,y axis label
                 title=title,
                 xlim=c(0,max(data[[outcomeTime]])+0.15*max(data[[outcomeTime]])), #Extend the x-axis to show all tick marks.
                 break.time.by = BreakBy,
                 tables.height = 0.2,
                 surv.plot.height = 0.8,
                 # axes.offset = F, # By uncommenting this line, the x- and y-axes will start at zero. However, the first value in the 'number at risk' table won't align well with the x-axis of the plot. I recommend using the base R method to generate the KM curve if you need this feature and don't like its performance here.
                 tables.theme = custom_theme,
                 risk.table = TRUE,
                 risk.table.title=risk_table_title,
                 font.tickslab = c(18), #Font size of ticks
                 fontsize = 5.5) #Font size of risk table numbers
  if (HRText){
        sdf.cox = coxph(survF, data = data, ties="breslow")
        HR = round(exp(sdf.cox$coefficients),2)
        up95 = round(exp(confint(sdf.cox))[2],2)
        low95 = round(exp(confint(sdf.cox))[1],2)
        p.val = round(summary(sdf.cox)$waldtest[3],4)
    annoLine1 = paste0(group_levels_label[1]," vs. ",group_levels_label[2])
    annoLine2 = paste0("Cox Model HR (95% CI)") 
    annoLine3 = paste0(HR," (",low95,", ",up95, ")")
    annoLine4 = paste0("p = ",p.val)
    g$plot = g$plot + 
      annotate("text", x = text_x, y = text_y, label = annoLine1, hjust = 0,size=4) + # Edit y=0.xx to adjust the y-axis location of your text.
      annotate("text", x = text_x, y = text_y-0.05, label = annoLine2, hjust = 0,size=5) +
      annotate("text", x = text_x, y = text_y-0.1, label = annoLine3, hjust = 0,size=5) +
      annotate("text", x = text_x, y = text_y-0.15, label = annoLine4, hjust = 0,size=5)+
      theme(legend.text = element_text(size = 8)) # Modify the legend text font size.
  }
  if (MedianText) {
    logrank = survdiff(survF, data = data)
    median0 = round(surv_median(Sfit)$median[1],1)
    median1 = round(surv_median(Sfit)$median[2],1)
    median0low = round(surv_median(Sfit)$lower[1],1)
    median0up = round(surv_median(Sfit)$upper[1],1)
    median1low = round(surv_median(Sfit)$lower[2],1)
    median1up = round(surv_median(Sfit)$upper[2],1)
    p.val = round(logrank$pvalue,3)
    
    annoLine1 = paste0("Median Survival (95% CI)")
    annoLine2 = paste0(group_levels_label[1],": ",median1, " (", median1low,", ",median1up,") Months") 
    annoLine3 = paste0(group_levels_label[2],": ",median0, " (", median0low,", ",median0up,") Months")
    annoLine4 = paste0("Log-rank P-Value = ",p.val)
    g$plot = g$plot + 
      annotate("text", x = text_x, y = text_y, label = annoLine1, hjust = 0,size=5) + 
      annotate("text", x = text_x, y = text_y-0.05, label = annoLine2, hjust = 0,size=5) +
      annotate("text", x = text_x, y = text_y-0.1, label = annoLine3, hjust = 0,size=5) +
      annotate("text", x = text_x, y = text_y-0.15, label = annoLine4, hjust = 0,size=5)+
      theme(legend.text = element_text(size = 8)) # Modify the legend text font size.
  }
  
   #### If your risk table is not aligning with your plot, force them to have the same margin.  plot.margin = unit(c(top, right, bottom, left), "points")
    # g$table <- g$table +   theme(plot.margin = unit(c(5.5, 5.5, 5.5, 5.5), "points"))
    # g$plot  <- g$plot +   theme(plot.margin = unit(c(5.5, 5.5, 5.5, 5.5), "points"))
  
  return(g)
  
  # surv_median(Sfit)       ## To check if you have the correct reference group, when MedianText = T, comment return(g) and un-comment this line.
  # summary(sdf.cox)      ## To check if you have the correct reference group, when HRText = T, comment return(g) and un-comment this line.
}

Sample KM Plot of Two Groups

sample_twogroup = TwoGroups_survPlot(data = data, outcomeTime = "time", outcome = "status", 
                   xlab = "Months since the first treatment", ylab = "Probability of OS", 
                   VarName = "trt", 
                   ExpName = "Arm", ExpLabels = c("Ctrl", "Trt"), 
                   HRText = F, MedianText = T, 
                   text_x = 8, title = "Sample KM Plot of Two Groups", 
                   BreakBy = 6,
                   )

Basic R Method

The basic R method is not as straight forward as ggplot2 method. However, it’s useful when we want to customize certain parts of the plot. One example is when you have very long group names, basic R method will keep everything in your desire place.

The first part is a R function to general tables of survival estimates.

Inputs:

data: your data in data.frame format
byvar: optional input of by-group comparisons
time: survival time
event: indicator of event or censoring
timepoints: vector of times at which numbers at risk are to be estimated

Output: Returns a list of 4 items:

Data frame with a print out of survival estimates at each failure time (time, estimate, 95% CI)
Number at risk for intervals specified by timepoints
Survfit object, to be used to creating KM curves
SurvF object, the formula of survival function

You can use coxph(output.obj[[4]]) to fit a proportional cox model or use survdiff(output.obj[[4]]) to run log-rank test.

generate.survival.objects <- function(data, time, event, subset=(1:nrow(data)), byvar=NULL, timepoints=seq(0,max(time),1), 
                                      labels=NULL, conftype="log-log", ## Default is log.
                                      ratemult=365.25) { ## Enter 365.25 if your time unit is year, enter 30.437 if your time unit is month, enter 1 if your time unit is day.

  data <- data[subset,]
  time <- time[subset]
  event <- event[subset]
  
  if (is.null(byvar)) {
    survF = Surv(time, event, type = "right")~1
    fit.os <- survfit(formula = Surv(time, event, type = "right")~1, data = data, na.action = na.exclude, conf.int = 0.95, se.fit = T, type = "kaplan-meier", error = "greenwood", conf.type = conftype, conf.lower = "usual")

    tab.os <- round(data.frame(fit.os$time, fit.os$n.risk, fit.os$surv, fit.os$lower, fit.os$upper), digits=3)[match(unique(fit.os$surv), fit.os$surv),]
    names(tab.os) <- c("Time", "Number at risk", "Probability", "95%", "CI")

    nrisk = nevents = rate = matrix(0, nrow=length(timepoints), ncol=1)
    for (iii in 1:length(timepoints)) {
      for (jjj in 1) {
        nrisk[iii,jjj] <- length(time[time>=timepoints[iii] & !is.na(time)])
        nevents[iii,jjj] <- length(time[time>=timepoints[iii] & time<timepoints[iii+1] & event==1 & !is.na(time)])
        ratetime <- ifelse(time<=timepoints[iii], 0,
                           ifelse(time>timepoints[iii] & time<=timepoints[iii+1], time,
                                  ifelse(time>timepoints[iii+1], timepoints[iii+1], NA)))
        ratetime <- ratetime - timepoints[iii]
        ratetime[ratetime<0] <- 0
        rate[iii,jjj] <- round(nevents[iii,jjj]/sum(ratetime,na.rm=T) * ratemult,2)
       }
    }
  }

  if (!is.null(byvar)) {

    byvar <- byvar[subset]
    survF = Surv(time, event, type = "right")~factor(byvar)
    fit.os <- survfit(formula = Surv(time, event, type = "right") ~ factor(byvar), data = data, na.action = na.exclude, conf.int = 0.95, se.fit = T, type = "kaplan-meier", error = "greenwood", conf.type = conftype, conf.lower = "usual")
    
    tab.os <- round(data.frame(fit.os$time, fit.os$n.risk, fit.os$surv, fit.os$lower, fit.os$upper), digits=3)[match(unique(fit.os$surv), fit.os$surv),]
    names(tab.os) <- c("Time", "Number at risk", "Probability", "95%", "CI")

    iii <- rep(NA,nrow(tab.os))
    iii[1] <- 1
    for (ttt in 2:length(iii)) {iii[ttt] <- ifelse(tab.os$Time[ttt]>tab.os$Time[ttt-1],0,1)}

    tab.temp <- tab.os
    col1 <- rep("", nrow(tab.os))
    if (!is.null(labels)) col1[which(iii==1)] <- labels
    tab.os <- cbind(col1, tab.temp)
    names(tab.os)[1] <- " "

    nrisk = nevents = rate = matrix(0, nrow=length(timepoints), ncol=length(na.omit(unique(byvar))))
    for (iii in 1:length(timepoints)) {
      for (jjj in 1:length(na.omit(unique(byvar)))) {
        nrisk[iii,jjj] <- length(time[time>=timepoints[iii] & !is.na(time) & !is.na(byvar) & byvar==sort(na.omit(unique(byvar)))[jjj]])
        nevents[iii,jjj] <- length(time[time>=timepoints[iii] & time<timepoints[iii+1] & event==1 & !is.na(time) & byvar==sort(na.omit(unique(byvar)))[jjj]])
        ratetime <- ifelse(time<=timepoints[iii], 0,
                           ifelse(time>timepoints[iii] & time<=timepoints[iii+1], time,
                                  ifelse(time>timepoints[iii+1], timepoints[iii+1], NA)))
        ratetime <- ratetime - timepoints[iii]
        ratetime[ratetime<0] <- 0
        rate[iii,jjj] <- round(nevents[iii,jjj]/sum(ratetime[byvar==sort(na.omit(unique(byvar)))[jjj]],na.rm=T) * ratemult,2)
       }
    }
  }

  return(list(tab.os, cbind(nrisk,nevents,rate), fit.os, survF))

}

Sample Basic R KM Plot

Then we need some basic R plot techniques.

To change the font size, set cex < 1 or > 1, cex=1 by default.

os.obj <- generate.survival.objects(data=data, time=data$time,
                                event=data$status,
                                byvar=data$trt,
                                conftype = "log",
                                timepoints=seq(0,36,6),ratemult = 30.437)

basic_km_setting = function(ylim, xlim, xaxis_range, yaxis_range) {
  ylim=ylim
  xlim=xlim
  xaxis_range=xaxis_range
  yaxis_range=yaxis_range
  
  return(list(ylim, xlim, xaxis_range, yaxis_range))
}

para = basic_km_setting(ylim = c(0,1), xlim = c(0,36), xaxis_range = seq(0,36,6), yaxis_range = seq(0,100,10)) # Set up the axis parameters of the plot

par(mar=c(11,10,3,3)) # adjust the size of the plot margins using the notation par(mar = c(bottom, left, top, right)

plot(os.obj[[3]], fun = function(x) x,  # x if KM plot, 1-x if cumulative
     xlab="Months since the first treatment", 
     ylab="Probability of OS (%)", 
     xaxt="n", bty="n", yaxt="n", cex.lab=1.4, ylim=para[[1]],xlim=para[[2]],
     mark.time=F, conf.int=F, lwd=3, col=c('orange', 'cornflowerblue'))

axis(2, at=para[[4]]/100, labels = para[[4]], 
     las=2, cex.axis=1.4)
axis(1, at = para[[3]], labels = para[[3]], cex.axis=1.4)
legend(24, 1, legend=c("Ctrl", "Trt"),
       col=c("orange", "cornflowerblue"), lty=1, cex=0.8, box.lty=0)
mtext(c("No. at risk"), side=1, line=5, at=c(-5), adj=c(0,0), cex=1, font=2)
mtext(c("Ctrl: Let's try \na very long \ngroup name", os.obj[[2]][,1]), side=1, line=8, at=c(-8,para[[3]]), adj=c(0,0), cex=1) # If your name is 3 line, line=5+3, at=c(-8) means -8 at the x-axis.
mtext(c("Trt  ", os.obj[[2]][,2]), side=1, line=9, at=c(-6,para[[3]]), adj=c(0,0), cex=1) # line = previous line+1 
text(18, .60, "Log-rank p = 0.928", cex=1, font=0.5) # text(7, .13, "arm name", cex=1.2, font=2) ## Add more text as needed

Session Info

sessionInfo()

## R version 4.4.0 (2024-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.1.3     survminer_0.4.9 ggpubr_0.6.0    ggplot2_3.4.4  
## [5] survival_3.6-4 
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.5        utf8_1.2.3        generics_0.1.3    tidyr_1.3.1      
##  [5] xml2_1.3.4        rstatix_0.7.2     blogdown_1.18     lattice_0.22-6   
##  [9] digest_0.6.33     magrittr_2.0.3    evaluate_0.22     grid_4.4.0       
## [13] bookdown_0.35     fastmap_1.1.1     jsonlite_1.8.7    Matrix_1.7-0     
## [17] backports_1.4.1   ggtext_0.1.2      gridExtra_2.3     purrr_1.0.1      
## [21] fansi_1.0.4       scales_1.3.0      jquerylib_0.1.4   abind_1.4-5      
## [25] cli_3.6.1         crayon_1.5.2      KMsurv_0.1-5      rlang_1.1.1      
## [29] munsell_0.5.1     splines_4.4.0     withr_3.0.0       cachem_1.0.8     
## [33] yaml_2.3.6        tools_4.4.0       ggsignif_0.6.4    colorspace_2.1-0 
## [37] km.ci_0.5-6       broom_1.0.5       vctrs_0.6.3       R6_2.5.1         
## [41] zoo_1.8-11        lifecycle_1.0.4   car_3.1-2         pkgconfig_2.0.3  
## [45] pillar_1.9.0      bslib_0.5.1       gtable_0.3.5      Rcpp_1.0.10      
## [49] data.table_1.14.8 glue_1.6.2        highr_0.10        xfun_0.43        
## [53] tibble_3.2.1      tidyselect_1.2.1  rstudioapi_0.15.0 knitr_1.46       
## [57] farver_2.1.2      xtable_1.8-4      survMisc_0.5.6    htmltools_0.5.4  
## [61] labeling_0.4.3    rmarkdown_2.25    carData_3.0-5     compiler_4.4.0   
## [65] gridtext_0.1.5

R Code