Forest Plot

Table of Contents

Introduction

Forest plots are a key tool for visualizing the results of analyses. They provide a concise graphical summary of the estimated effects and their confidence intervals, making it easier to compare results at a glance.

In this guide, we will walk through the process of extracting results in a ready-to-generate forest plot format from Linear Models (LM), Generalized Linear Models (GLM) for binary outcomes, and Cox proportional hazards (PH) models for survival outcomes. The scenarios include:

  1. Univariate models (the most common situation).
  2. Subset analysis.
  3. Multivariate models.

Following that, we’ll explore three methods to generate forest plots in R using univariate model results:

  1. Using the forestplot package package.
  2. Using forestploter package.
  3. Using basic R method.

By the end of this guide, you will be equipped with the knowledge to create clear and informative forest plots using these versatile approaches.

We recommend using packages in the analysis report and then using basic R when you need more customizing while preparing your manuscript.

Sample Datasets:

  1. diamonds dataset in the yarrr package for GLM.
  2. lung cancer data in the survival package for Cox model.

Loading Library

library(tidyverse)
library(yarrr)
library(survival)
library(forestplot)
library(forestploter)

data("diamonds")
data("lung")

Snapshot of the Data

head(diamonds)
##   weight clarity color value
## 1   9.35    0.88     4 182.5
## 2  11.10    1.05     5 191.2
## 3   8.65    0.85     6 175.7
## 4  10.43    1.15     5 195.2
## 5  10.62    0.92     5 181.6
## 6  12.35    0.44     4 182.9
head(lung)
##   inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1    3  306      2  74   1       1       90       100     1175      NA
## 2    3  455      2  68   1       0       90        90     1225      15
## 3    3 1010      1  56   1       0       90        90       NA      15
## 4    5  210      2  57   1       1       90        60     1150      11
## 5    1  883      2  60   1       0      100        90       NA       0
## 6   12 1022      1  74   1       1       50        80      513       0

Extracting Results from Univariate Models

LM

We will estimate the value of each diamond based on weight, clarity, and color using LMs.

diamonds_wrg = diamonds %>% 
  mutate(
    value.g190 = case_when(
      value > 190 ~ 1,
      TRUE ~ 0
    ),
    color = factor(color, levels = c("2","3","4","5","6","7","8"))
  )
# Variables in your preferred order
vars = c("weight","clarity","color")

# Function to extract results from a LM
extract_lm_results <- function(var, outcome, data) {
  formula <- as.formula(paste(outcome, "~", var))
  res <- lm(formula, data = data)
  mod.sum <- summary(res)
  
  # Extract the variable names from the model matrix to handle categorical variables
  var_names <- colnames(model.matrix(res))[-1] # Exclude the intercept
  
  # Extract coefficients, confidence intervals, and p-values for all levels
  results <- data.frame(
    Covariate = var_names,
    EST = coef(res)[-1], # Exclude the intercept
    CI_lower = confint(res)[-1, 1], # Exclude the intercept
    CI_upper = confint(res)[-1, 2], # Exclude the intercept
    p_value = mod.sum$coefficients[-1, "Pr(>|t|)"] # Exclude the intercept
  )
  
  return(results)
}

# Apply the function to all variables in the list and combine the results
all_lm_results <- do.call(rbind, lapply(vars, extract_lm_results, outcome = "value", data = diamonds_wrg))


# Optionally, arrange the results and prepare for forest plot
all_lm_results_arranged <- all_lm_results %>%
  arrange(desc(EST))

# Create formatted strings for forest plot annotations
forest_est <- sprintf(" %.2f", all_lm_results$EST)
forest_ci <- sprintf("   [%.2f - %.2f]", all_lm_results$CI_lower, all_lm_results$CI_upper)
forest_p <- sprintf(" %.3f", all_lm_results$p_value)

GLM of Binary Outcome

We will model the odds of having a value greater than 190 based on weight, clarity, and color using GLMs.

## Univariate GLM

# Variables in your preferred order
vars = c("weight","clarity","color")

# Function to extract results from a GLM
extract_glm_results <- function(var, outcome, data) {
  formula <- as.formula(paste(outcome, "~", var))
  res <- glm(formula, family = binomial, data = data)
  mod.sum <- summary(res)
  
  # Extract the variable names from the model matrix to handle categorical variables
  var_names <- colnames(model.matrix(res))[-1] # Exclude the intercept
  
  # Extract coefficients, confidence intervals, and p-values for all levels
  results <- data.frame(
    Covariate = var_names,
    OR = exp(coef(res))[-1], # Exclude the intercept
    CI_lower = exp(confint(res))[-1, 1], # Exclude the intercept
    CI_upper = exp(confint(res))[-1, 2], # Exclude the intercept
    p_value = mod.sum$coefficients[-1, "Pr(>|z|)"] # Exclude the intercept
  )
  
  return(results)
}

# Apply the function to all variables in the list and combine the results
all_glm_results <- do.call(rbind, lapply(vars, extract_glm_results, outcome = "value.g190", data = diamonds_wrg))


# Optionally, arrange the results and prepare for forest plot
all_glm_results_arranged <- all_glm_results %>%
  arrange(desc(OR))

# Create formatted strings for forest plot annotations
forest_or <- sprintf(" %.2f", all_glm_results$OR)
forest_ci <- sprintf("   [%.2f - %.2f]", all_glm_results$CI_lower, all_glm_results$CI_upper)
forest_p <- sprintf(" %.3f", all_glm_results$p_value)

Cox Model

We will model HR of overall survival based on age, sex, ph.ecog, and wt.loss using Cox PH models.

lung_wrg = lung %>% 
  mutate(
    status = status - 1,
    ph.ecog = factor(ph.ecog),
    sex = factor(sex,labels = c("Male","Female"))
  )
## Univariate Cox models

# Names of covariates
covariates <- c("age", "sex", "ph.ecog", "wt.loss")

# Edit your survival time variable name and censoring information.
univ_formulas_os <- sapply(covariates,
                             function(x) as.formula(paste('Surv(time, status)~', x)))

# Optionally, test the proportional hazards assumption for a Cox regression model fit (coxph).
proptest_univ_models_os <- lapply(univ_formulas_os, function(x){cox.zph(coxph(x, data = lung_wrg))})

# Fit each univariate model, edit your dataset name.
univ_models_os <- lapply(univ_formulas_os, function(x){coxph(x, data = lung_wrg)})

# Extract data function
extract_cox_results <- function(univ_models_os) {
  summary_model <- summary(univ_models_os)
  coef <- summary_model$coefficients
  conf_int <- summary_model$conf.int
  
  # Extract coefficients, HR, CI, and p-values
  covariates <- rownames(coef)
  HR <- conf_int[, "exp(coef)"]
  CI_lower <- conf_int[, "lower .95"]
  CI_upper <- conf_int[, "upper .95"]
  p_values <- coef[, "Pr(>|z|)"]
  
  # Combine results into a data frame
  results <- data.frame(
    Covariate = covariates,
    HR = HR,
    CI_lower = CI_lower,
    CI_upper = CI_upper,
    p_value = p_values
  )
  
  return(results)
}

# Apply the function to all models in the list and combine the results
all_cox_results <- do.call(rbind, lapply(univ_models_os, extract_cox_results))

# Optionally, arrange the results and prepare for forest plot
all_cox_results_arranged <- all_cox_results %>%
  arrange(desc(HR))

# Create formatted strings for forest plot annotations
forest_hr <- sprintf(" %.2f", all_cox_results$HR)
forest_ci <- sprintf("   [%.2f - %.2f]", all_cox_results$CI_lower, all_cox_results$CI_upper)
forest_p <- sprintf(" %.3f", all_cox_results$p_value)

forestplot Package

cochrane_from_rmeta <- structure(list(mean  = c(NA, all_cox_results$HR),
                                      lower = c(NA, all_cox_results$CI_lower),
                                      upper = c(NA, all_cox_results$CI_upper)),
                                 .Names = c("mean", "lower", "upper"),
                                 row.names = c(NA, -7L), # - (No. of row+1) L 
                                 class = "data.frame")

## Unfortunately, we'll need to manually enter our variable names and labels of comparisons, following the order of the results dataframe.
## Useful notes: <= sign is \u2264; >= sign is \u2265. https://www.compart.com/en/unicode 
tabletext <- cbind(c("Patient \nCharacteristics", # Title of first column in the table
                     "Age","Sex","ECOG Score","ECOG Score","ECOG Score", "Weight Loss"), # Var names

                   c(" ", "-", "Female vs. Male", "1 vs. 0",
                     "2 vs. 0", "3 vs. 0", "-"), # Levels vs. reference.
                   c("    HR", forest_hr),
                   c("95% CI", forest_ci),
                   c("P-value", forest_p))
# png("ForestPlot_Example1.png", width=11, height=8, units = 'in', pointsize=16, res = 300)
plot1 = forestplot(tabletext,
           mean = cochrane_from_rmeta$mean, ## HR or OR
           lower = cochrane_from_rmeta$lower, ## Lower CI
           upper = cochrane_from_rmeta$upper, ## Upper CI
           is.summary=c(TRUE,rep(FALSE,6), TRUE), ## rep(FALSE, No. of rows)
           boxsize = .2, ## Box size
           clip = c(0.125, 16), ## Lower and Upper limit of x-axis
           xticks = c(log(c(0.125,0.25,0.5,1,2,4,8,16))), ## x-axis in log-scale
           xlog=TRUE, ## Make the x-ticks evenly distributed in a log-scale
           line.margin = .01, ## Set the margin between rows
           lineheight=unit(2,'cm'), ## Height of the graph. By default this is auto and adjusts to the space that is left after adjusting for x-axis size and legend. 
           colgap = unit(1, "mm"), ## Sets the gap between columns, defaults to 6 mm.
           grid = 1, ## A discrete gray dashed grid at the level of the ticks that we specified.
           col=fpColors(box="black",line="black", summary="black"), ## Set all the colors that are used in the forest plot. See ??fpColors for more detail.
           xlab = "HR (95% CI)", ## X label
           txt_gp = fpTxtGp(ticks=gpar(cex=1),
                            xlab  = gpar(cex = 1)), ## Set fonts for all text elements that are used in the forest plot. See ??fpTxtGp for more detail.
           shapes_gp = fpShapesGp(lines = gpar(lwd = 3)) ## Sets graphical parameters (squares and lines widths, styles, etc.) of all shapes drawn (squares, lines, diamonds, etc.). This overrides col, lwd.xaxis, lwd.zero, lwd.ci and lty.ci.
)

# dev.off()

forestploter Package

TBA

Basic R

Preparing the data

forplot = all_cox_results # A dataframe forplot
forplot$samplesize = 228 # If there are varying sample sizes, display them in different size.
Nmat = data.frame(ptsz = seq(0.75, 1.75, 0.25), 
                  n = seq(500,2500,500))
forplot$ptsz=NA # ptsz = Point size

for (i in 1:nrow(forplot)) {
  forplot$ptsz[i] = Nmat$ptsz[min(which(Nmat$n >= as.numeric(forplot$samplesize[i])))]
} # Assign point size to each row.

forplot$HRprint =  paste(format(round(forplot$HR,2),nsmall = 2), " [", 
                         format(round(forplot$CI_lower,2),nsmall = 2), ", ", 
                         format(round(forplot$CI_upper,2),nsmall = 2), "]", sep="")

forplot$LL = forplot$CI_lower
forplot$UL = forplot$CI_upper

# Add spaces for the category labels to seperate different group/cohort
skiprow = rep(NA, dim(forplot)[2]) # An empty row
forplot = rbind(skiprow, forplot[1:nrow(forplot),])
# forplot = rbind(skiprow, forplot[1:4,], skiprow, skiprow,
#                 forplot[5:nrow(forplot),]) # Here's another example, we want to group 1-4 row, and 5-last row.


# Label of each row, leave blank " " if it's a break for group/cohort name.
forplot$label = c("Vairables", "Age","Sex","ECOG Score","ECOG Score","ECOG Score", "Weight Loss")
  
# Label of comparison levels
forplot$levels = c(" ","-", "Female vs. Male", 
                   "1 vs. 0", "2 vs. 0", "3 vs. 0", "-")

# The plot rows from top to bottom correspond to the dataframe rows from bottom to top.
forplot$rownum = nrow(forplot):1
forplot = forplot[order(forplot$rownum),]

Take a look at the ready-to-go data.

head(forplot)
##                  Covariate        HR  CI_lower   CI_upper      p_value
## wt.loss            wt.loss 1.0013201 0.9894608  1.0133214 8.281974e-01
## ph.ecog.ph.ecog3  ph.ecog3 9.0973236 1.2182118 67.9367026 3.136764e-02
## ph.ecog.ph.ecog2  ph.ecog2 2.5002408 1.6100739  3.8825571 4.483837e-05
## ph.ecog.ph.ecog1  ph.ecog1 1.4460563 0.9796877  2.1344343 6.335907e-02
## sex              sexFemale 0.5880028 0.4237178  0.8159848 1.491229e-03
## age                    age 1.0188965 1.0006903  1.0374339 4.185313e-02
##                  samplesize ptsz            HRprint        LL         UL
## wt.loss                 228 0.75 1.00 [0.99,  1.01] 0.9894608  1.0133214
## ph.ecog.ph.ecog3        228 0.75 9.10 [1.22, 67.94] 1.2182118 67.9367026
## ph.ecog.ph.ecog2        228 0.75 2.50 [1.61,  3.88] 1.6100739  3.8825571
## ph.ecog.ph.ecog1        228 0.75 1.45 [0.98,  2.13] 0.9796877  2.1344343
## sex                     228 0.75 0.59 [0.42,  0.82] 0.4237178  0.8159848
## age                     228 0.75 1.02 [1.00,  1.04] 1.0006903  1.0374339
##                        label          levels rownum
## wt.loss          Weight Loss               -      1
## ph.ecog.ph.ecog3  ECOG Score         3 vs. 0      2
## ph.ecog.ph.ecog2  ECOG Score         2 vs. 0      3
## ph.ecog.ph.ecog1  ECOG Score         1 vs. 0      4
## sex                      Sex Female vs. Male      5
## age                      Age               -      6

START PLOTS HERE.

par(mar=c(8, 15, 1, 11), xpd=NA) # Basic R plot margin setting.
plot(range(c(log(0.125),log(16))), # The x-axis range of the plot
     range(1,nrow(forplot)+1), type="n", xaxt="n", yaxt="n", xlab=" ", ylab="", bty="n") # The y-axis range of the plot

# polygon(x = c(-15,15,15,-15), y = c(0,0,6,6), col="azure2",
#         border='white',
#         xpd = TRUE)

# The points (solid black square) of plot, point size is proportion of sample size
points(log(forplot$HR),(1:nrow(forplot)), pch=15,
       cex=forplot$ptsz, 
       col="black")

# The reference line, the length of the line is nrow(forplot)+1.2, 0.6 is the starting point at y-axis.
segments(log(1), 0.6, log(1),nrow(forplot)+1.2, lty=2)

# Plot CIs
#### Use if statement to filter out the 'outliers', then use arrows() to draw the arrow.

#### Alternatively, add a break to x-axis.
segments(log(forplot$LL), (1:nrow(forplot)), log(forplot$UL), (1:nrow(forplot)), 
         lwd=2, col="black")

# X-axis ticks, here in log-ratio.
axis(1, at=log(c(0.125,0.25,0.5,1,2,4,8,16)), c(0.125,0.25,0.5,1,2,4,8,16), cex.axis=0.75) # Font size of x-axis ticks = 0.75.

# For the left part, add column names starting from the 2nd column. The first column was sorted by including empty rows in the dataset.
# mtext(side=2, at=nrow(forplot)+1, "HR (95% CI)", font=2, 
#       las = 2, line=6.25, adj=c(0,0), cex=0.95)
mtext(side=2, at=nrow(forplot), "Levels", font=2, 
      las = 2, line=4, adj=c(0,0), cex=0.95)


allrows = 1:nrow(forplot)
titles = c(7) # Location of group/cohort title
subrows = allrows[-titles] # rows except for title

## Optionally, if you have multiple columns on the left-hand side table.
# forplot$rate_in_men = format(forplot$rate_in_men, digits=4)
# forplot$rate_in_women = format(forplot$rate_in_women, digits=4)
# forplot$rate_in_men = gsub("NA","",forplot$rate_in_men)
# forplot$rate_in_women = gsub("NA","",forplot$rate_in_women)

mtext(side=2, at =(1:nrow(forplot)), forplot$levels,las=2, line=4, adj=c(0,0), cex=0.95)
# mtext(side=2, at =(1:nrow(forplot)), forplot$HRprint, las=2, line=1.5, adj=c(0,0), cex=0.95)

# at=... is the location of text at y-axis
mtext(side=2, at=(1:nrow(forplot))[titles],
      forplot$label[titles],
      las=2, font=2, line=13.5, cex=0.95, adj=c(0,0)
)

mtext(side=2, at=(1:nrow(forplot))[subrows],
      forplot$label[subrows],
      las=2, line=12.5, cex=0.95, adj=c(0,0)
)

mtext(side=4, at=1:nrow(forplot),
      forplot$HRprint,
      las=2, line=1, cex=0.95, adj=c(0,0))

mtext(side=4, at=1:nrow(forplot), 
      round(forplot$p_value, 3),
      las=2, line=8, cex=0.95, adj=c(0,0))

mtext(side=4, at=nrow(forplot)+0.5, c("Hazard Ratio"), las=2, line=1, 
      cex=0.95, adj=c(0,0), font=2)
mtext(side=4, at=nrow(forplot), c("[95% CI]"), las=2, line=1, 
      cex=0.95, adj=c(0,0), font=2)
mtext(side=4, at=nrow(forplot), "P", line=8,cex=0.95, adj=c(0,0), font=2, las=2)

# at=... is the location of text at x-axis.
mtext(side=1, at=log(1), "Hazard Ratio", line=3)

# mtext(side=1, at=log(2), "More Hazard", line=5)
# mtext(side=1, at=log(0.35), "More Hazard", line=5)
# mtext(side=1, at=log(2), "In Females", line=6)
# mtext(side=1, at=log(0.35), "In Males", line=6)
# arrows(log(1.25), -2, log(2), -2, xpd=TRUE, length=0.1)
# arrows(log(0.75), -2, log(0.1), -2, xpd=TRUE, length=0.1)

Extracting Results from Subset Analysis

TBA

Extracting Results from Multivariate Models

TBA

Session Info

sessionInfo()
## R version 4.4.0 (2024-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22000)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] forestploter_1.1.1     forestplot_3.1.3       abind_1.4-5           
##  [4] checkmate_2.2.0        survival_3.5-8         yarrr_0.1.5           
##  [7] circlize_0.4.16        BayesFactor_0.9.12-4.7 Matrix_1.7-0          
## [10] coda_0.19-4            jpeg_0.1-10            lubridate_1.9.0       
## [13] timechange_0.1.1       forcats_1.0.0          stringr_1.5.0         
## [16] dplyr_1.1.3            purrr_1.0.1            readr_2.1.4           
## [19] tidyr_1.3.0            tibble_3.2.1           ggplot2_3.4.3         
## [22] tidyverse_2.0.0       
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.4        shape_1.4.6         xfun_0.43          
##  [4] bslib_0.5.1         GlobalOptions_0.1.2 lattice_0.22-6     
##  [7] tzdb_0.3.0          vctrs_0.6.3         tools_4.4.0        
## [10] generics_0.1.3      parallel_4.4.0      fansi_1.0.4        
## [13] highr_0.10          pkgconfig_2.0.3     lifecycle_1.0.3    
## [16] compiler_4.4.0      MatrixModels_0.5-3  munsell_0.5.0      
## [19] htmltools_0.5.4     sass_0.4.5          yaml_2.3.6         
## [22] pillar_1.9.0        jquerylib_0.1.4     cachem_1.0.8       
## [25] tidyselect_1.2.0    digest_0.6.33       mvtnorm_1.2-3      
## [28] stringi_1.7.12      bookdown_0.35       splines_4.4.0      
## [31] fastmap_1.1.1       colorspace_2.1-0    cli_3.6.1          
## [34] magrittr_2.0.3      utf8_1.2.3          withr_2.5.1        
## [37] scales_1.2.1        backports_1.4.1     rmarkdown_2.25     
## [40] gridExtra_2.3       blogdown_1.18       hms_1.1.3          
## [43] pbapply_1.7-2       evaluate_0.22       knitr_1.46         
## [46] rlang_1.1.1         Rcpp_1.0.10         glue_1.6.2         
## [49] rstudioapi_0.15.0   jsonlite_1.8.7      R6_2.5.1