Forest Plot
Table of Contents
Introduction
Forest plots are a key tool for visualizing the results of analyses. They provide a concise graphical summary of the estimated effects and their confidence intervals, making it easier to compare results at a glance.
In this guide, we will walk through the process of extracting results in a ready-to-generate forest plot format from Linear Models (LM), Generalized Linear Models (GLM) for binary outcomes, and Cox proportional hazards (PH) models for survival outcomes. The scenarios include:
- Univariate models (the most common situation).
- Subset analysis.
- Multivariate models.
Following that, we’ll explore three methods to generate forest plots in R using univariate model results:
- Using the forestplot package package.
- Using forestploter package.
- Using basic R method.
By the end of this guide, you will be equipped with the knowledge to create clear and informative forest plots using these versatile approaches.
We recommend using packages in the analysis report and then using basic R when you need more customizing while preparing your manuscript.
Sample Datasets:
diamonds
dataset in theyarrr
package for GLM.lung
cancer data in thesurvival
package for Cox model.
Loading Library
library(tidyverse)
library(yarrr)
library(survival)
library(forestplot)
library(forestploter)
data("diamonds")
data("lung")
Snapshot of the Data
head(diamonds)
## weight clarity color value
## 1 9.35 0.88 4 182.5
## 2 11.10 1.05 5 191.2
## 3 8.65 0.85 6 175.7
## 4 10.43 1.15 5 195.2
## 5 10.62 0.92 5 181.6
## 6 12.35 0.44 4 182.9
head(lung)
## inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1 3 306 2 74 1 1 90 100 1175 NA
## 2 3 455 2 68 1 0 90 90 1225 15
## 3 3 1010 1 56 1 0 90 90 NA 15
## 4 5 210 2 57 1 1 90 60 1150 11
## 5 1 883 2 60 1 0 100 90 NA 0
## 6 12 1022 1 74 1 1 50 80 513 0
Extracting Results from Univariate Models
LM
We will estimate the value
of each diamond based on weight
, clarity
, and color
using LMs.
diamonds_wrg = diamonds %>%
mutate(
value.g190 = case_when(
value > 190 ~ 1,
TRUE ~ 0
),
color = factor(color, levels = c("2","3","4","5","6","7","8"))
)
# Variables in your preferred order
vars = c("weight","clarity","color")
# Function to extract results from a LM
extract_lm_results <- function(var, outcome, data) {
formula <- as.formula(paste(outcome, "~", var))
res <- lm(formula, data = data)
mod.sum <- summary(res)
# Extract the variable names from the model matrix to handle categorical variables
var_names <- colnames(model.matrix(res))[-1] # Exclude the intercept
# Extract coefficients, confidence intervals, and p-values for all levels
results <- data.frame(
Covariate = var_names,
EST = coef(res)[-1], # Exclude the intercept
CI_lower = confint(res)[-1, 1], # Exclude the intercept
CI_upper = confint(res)[-1, 2], # Exclude the intercept
p_value = mod.sum$coefficients[-1, "Pr(>|t|)"] # Exclude the intercept
)
return(results)
}
# Apply the function to all variables in the list and combine the results
all_lm_results <- do.call(rbind, lapply(vars, extract_lm_results, outcome = "value", data = diamonds_wrg))
# Optionally, arrange the results and prepare for forest plot
all_lm_results_arranged <- all_lm_results %>%
arrange(desc(EST))
# Create formatted strings for forest plot annotations
forest_est <- sprintf(" %.2f", all_lm_results$EST)
forest_ci <- sprintf(" [%.2f - %.2f]", all_lm_results$CI_lower, all_lm_results$CI_upper)
forest_p <- sprintf(" %.3f", all_lm_results$p_value)
GLM of Binary Outcome
We will model the odds of having a value greater than 190
based on weight
, clarity
, and color
using GLMs.
## Univariate GLM
# Variables in your preferred order
vars = c("weight","clarity","color")
# Function to extract results from a GLM
extract_glm_results <- function(var, outcome, data) {
formula <- as.formula(paste(outcome, "~", var))
res <- glm(formula, family = binomial, data = data)
mod.sum <- summary(res)
# Extract the variable names from the model matrix to handle categorical variables
var_names <- colnames(model.matrix(res))[-1] # Exclude the intercept
# Extract coefficients, confidence intervals, and p-values for all levels
results <- data.frame(
Covariate = var_names,
OR = exp(coef(res))[-1], # Exclude the intercept
CI_lower = exp(confint(res))[-1, 1], # Exclude the intercept
CI_upper = exp(confint(res))[-1, 2], # Exclude the intercept
p_value = mod.sum$coefficients[-1, "Pr(>|z|)"] # Exclude the intercept
)
return(results)
}
# Apply the function to all variables in the list and combine the results
all_glm_results <- do.call(rbind, lapply(vars, extract_glm_results, outcome = "value.g190", data = diamonds_wrg))
# Optionally, arrange the results and prepare for forest plot
all_glm_results_arranged <- all_glm_results %>%
arrange(desc(OR))
# Create formatted strings for forest plot annotations
forest_or <- sprintf(" %.2f", all_glm_results$OR)
forest_ci <- sprintf(" [%.2f - %.2f]", all_glm_results$CI_lower, all_glm_results$CI_upper)
forest_p <- sprintf(" %.3f", all_glm_results$p_value)
Cox Model
We will model HR of overall survival
based on age
, sex
, ph.ecog
, and wt.loss
using Cox PH models.
lung_wrg = lung %>%
mutate(
status = status - 1,
ph.ecog = factor(ph.ecog),
sex = factor(sex,labels = c("Male","Female"))
)
## Univariate Cox models
# Names of covariates
covariates <- c("age", "sex", "ph.ecog", "wt.loss")
# Edit your survival time variable name and censoring information.
univ_formulas_os <- sapply(covariates,
function(x) as.formula(paste('Surv(time, status)~', x)))
# Optionally, test the proportional hazards assumption for a Cox regression model fit (coxph).
proptest_univ_models_os <- lapply(univ_formulas_os, function(x){cox.zph(coxph(x, data = lung_wrg))})
# Fit each univariate model, edit your dataset name.
univ_models_os <- lapply(univ_formulas_os, function(x){coxph(x, data = lung_wrg)})
# Extract data function
extract_cox_results <- function(univ_models_os) {
summary_model <- summary(univ_models_os)
coef <- summary_model$coefficients
conf_int <- summary_model$conf.int
# Extract coefficients, HR, CI, and p-values
covariates <- rownames(coef)
HR <- conf_int[, "exp(coef)"]
CI_lower <- conf_int[, "lower .95"]
CI_upper <- conf_int[, "upper .95"]
p_values <- coef[, "Pr(>|z|)"]
# Combine results into a data frame
results <- data.frame(
Covariate = covariates,
HR = HR,
CI_lower = CI_lower,
CI_upper = CI_upper,
p_value = p_values
)
return(results)
}
# Apply the function to all models in the list and combine the results
all_cox_results <- do.call(rbind, lapply(univ_models_os, extract_cox_results))
# Optionally, arrange the results and prepare for forest plot
all_cox_results_arranged <- all_cox_results %>%
arrange(desc(HR))
# Create formatted strings for forest plot annotations
forest_hr <- sprintf(" %.2f", all_cox_results$HR)
forest_ci <- sprintf(" [%.2f - %.2f]", all_cox_results$CI_lower, all_cox_results$CI_upper)
forest_p <- sprintf(" %.3f", all_cox_results$p_value)
forestplot Package
cochrane_from_rmeta <- structure(list(mean = c(NA, all_cox_results$HR),
lower = c(NA, all_cox_results$CI_lower),
upper = c(NA, all_cox_results$CI_upper)),
.Names = c("mean", "lower", "upper"),
row.names = c(NA, -7L), # - (No. of row+1) L
class = "data.frame")
## Unfortunately, we'll need to manually enter our variable names and labels of comparisons, following the order of the results dataframe.
## Useful notes: <= sign is \u2264; >= sign is \u2265. https://www.compart.com/en/unicode
tabletext <- cbind(c("Patient \nCharacteristics", # Title of first column in the table
"Age","Sex","ECOG Score","ECOG Score","ECOG Score", "Weight Loss"), # Var names
c(" ", "-", "Female vs. Male", "1 vs. 0",
"2 vs. 0", "3 vs. 0", "-"), # Levels vs. reference.
c(" HR", forest_hr),
c("95% CI", forest_ci),
c("P-value", forest_p))
# png("ForestPlot_Example1.png", width=11, height=8, units = 'in', pointsize=16, res = 300)
plot1 = forestplot(tabletext,
mean = cochrane_from_rmeta$mean, ## HR or OR
lower = cochrane_from_rmeta$lower, ## Lower CI
upper = cochrane_from_rmeta$upper, ## Upper CI
is.summary=c(TRUE,rep(FALSE,6), TRUE), ## rep(FALSE, No. of rows)
boxsize = .2, ## Box size
clip = c(0.125, 16), ## Lower and Upper limit of x-axis
xticks = c(log(c(0.125,0.25,0.5,1,2,4,8,16))), ## x-axis in log-scale
xlog=TRUE, ## Make the x-ticks evenly distributed in a log-scale
line.margin = .01, ## Set the margin between rows
lineheight=unit(2,'cm'), ## Height of the graph. By default this is auto and adjusts to the space that is left after adjusting for x-axis size and legend.
colgap = unit(1, "mm"), ## Sets the gap between columns, defaults to 6 mm.
grid = 1, ## A discrete gray dashed grid at the level of the ticks that we specified.
col=fpColors(box="black",line="black", summary="black"), ## Set all the colors that are used in the forest plot. See ??fpColors for more detail.
xlab = "HR (95% CI)", ## X label
txt_gp = fpTxtGp(ticks=gpar(cex=1),
xlab = gpar(cex = 1)), ## Set fonts for all text elements that are used in the forest plot. See ??fpTxtGp for more detail.
shapes_gp = fpShapesGp(lines = gpar(lwd = 3)) ## Sets graphical parameters (squares and lines widths, styles, etc.) of all shapes drawn (squares, lines, diamonds, etc.). This overrides col, lwd.xaxis, lwd.zero, lwd.ci and lty.ci.
)
# dev.off()
forestploter Package
TBA
Basic R
Preparing the data
forplot = all_cox_results # A dataframe forplot
forplot$samplesize = 228 # If there are varying sample sizes, display them in different size.
Nmat = data.frame(ptsz = seq(0.75, 1.75, 0.25),
n = seq(500,2500,500))
forplot$ptsz=NA # ptsz = Point size
for (i in 1:nrow(forplot)) {
forplot$ptsz[i] = Nmat$ptsz[min(which(Nmat$n >= as.numeric(forplot$samplesize[i])))]
} # Assign point size to each row.
forplot$HRprint = paste(format(round(forplot$HR,2),nsmall = 2), " [",
format(round(forplot$CI_lower,2),nsmall = 2), ", ",
format(round(forplot$CI_upper,2),nsmall = 2), "]", sep="")
forplot$LL = forplot$CI_lower
forplot$UL = forplot$CI_upper
# Add spaces for the category labels to seperate different group/cohort
skiprow = rep(NA, dim(forplot)[2]) # An empty row
forplot = rbind(skiprow, forplot[1:nrow(forplot),])
# forplot = rbind(skiprow, forplot[1:4,], skiprow, skiprow,
# forplot[5:nrow(forplot),]) # Here's another example, we want to group 1-4 row, and 5-last row.
# Label of each row, leave blank " " if it's a break for group/cohort name.
forplot$label = c("Vairables", "Age","Sex","ECOG Score","ECOG Score","ECOG Score", "Weight Loss")
# Label of comparison levels
forplot$levels = c(" ","-", "Female vs. Male",
"1 vs. 0", "2 vs. 0", "3 vs. 0", "-")
# The plot rows from top to bottom correspond to the dataframe rows from bottom to top.
forplot$rownum = nrow(forplot):1
forplot = forplot[order(forplot$rownum),]
Take a look at the ready-to-go data.
head(forplot)
## Covariate HR CI_lower CI_upper p_value
## wt.loss wt.loss 1.0013201 0.9894608 1.0133214 8.281974e-01
## ph.ecog.ph.ecog3 ph.ecog3 9.0973236 1.2182118 67.9367026 3.136764e-02
## ph.ecog.ph.ecog2 ph.ecog2 2.5002408 1.6100739 3.8825571 4.483837e-05
## ph.ecog.ph.ecog1 ph.ecog1 1.4460563 0.9796877 2.1344343 6.335907e-02
## sex sexFemale 0.5880028 0.4237178 0.8159848 1.491229e-03
## age age 1.0188965 1.0006903 1.0374339 4.185313e-02
## samplesize ptsz HRprint LL UL
## wt.loss 228 0.75 1.00 [0.99, 1.01] 0.9894608 1.0133214
## ph.ecog.ph.ecog3 228 0.75 9.10 [1.22, 67.94] 1.2182118 67.9367026
## ph.ecog.ph.ecog2 228 0.75 2.50 [1.61, 3.88] 1.6100739 3.8825571
## ph.ecog.ph.ecog1 228 0.75 1.45 [0.98, 2.13] 0.9796877 2.1344343
## sex 228 0.75 0.59 [0.42, 0.82] 0.4237178 0.8159848
## age 228 0.75 1.02 [1.00, 1.04] 1.0006903 1.0374339
## label levels rownum
## wt.loss Weight Loss - 1
## ph.ecog.ph.ecog3 ECOG Score 3 vs. 0 2
## ph.ecog.ph.ecog2 ECOG Score 2 vs. 0 3
## ph.ecog.ph.ecog1 ECOG Score 1 vs. 0 4
## sex Sex Female vs. Male 5
## age Age - 6
START PLOTS HERE.
par(mar=c(8, 15, 1, 11), xpd=NA) # Basic R plot margin setting.
plot(range(c(log(0.125),log(16))), # The x-axis range of the plot
range(1,nrow(forplot)+1), type="n", xaxt="n", yaxt="n", xlab=" ", ylab="", bty="n") # The y-axis range of the plot
# polygon(x = c(-15,15,15,-15), y = c(0,0,6,6), col="azure2",
# border='white',
# xpd = TRUE)
# The points (solid black square) of plot, point size is proportion of sample size
points(log(forplot$HR),(1:nrow(forplot)), pch=15,
cex=forplot$ptsz,
col="black")
# The reference line, the length of the line is nrow(forplot)+1.2, 0.6 is the starting point at y-axis.
segments(log(1), 0.6, log(1),nrow(forplot)+1.2, lty=2)
# Plot CIs
#### Use if statement to filter out the 'outliers', then use arrows() to draw the arrow.
#### Alternatively, add a break to x-axis.
segments(log(forplot$LL), (1:nrow(forplot)), log(forplot$UL), (1:nrow(forplot)),
lwd=2, col="black")
# X-axis ticks, here in log-ratio.
axis(1, at=log(c(0.125,0.25,0.5,1,2,4,8,16)), c(0.125,0.25,0.5,1,2,4,8,16), cex.axis=0.75) # Font size of x-axis ticks = 0.75.
# For the left part, add column names starting from the 2nd column. The first column was sorted by including empty rows in the dataset.
# mtext(side=2, at=nrow(forplot)+1, "HR (95% CI)", font=2,
# las = 2, line=6.25, adj=c(0,0), cex=0.95)
mtext(side=2, at=nrow(forplot), "Levels", font=2,
las = 2, line=4, adj=c(0,0), cex=0.95)
allrows = 1:nrow(forplot)
titles = c(7) # Location of group/cohort title
subrows = allrows[-titles] # rows except for title
## Optionally, if you have multiple columns on the left-hand side table.
# forplot$rate_in_men = format(forplot$rate_in_men, digits=4)
# forplot$rate_in_women = format(forplot$rate_in_women, digits=4)
# forplot$rate_in_men = gsub("NA","",forplot$rate_in_men)
# forplot$rate_in_women = gsub("NA","",forplot$rate_in_women)
mtext(side=2, at =(1:nrow(forplot)), forplot$levels,las=2, line=4, adj=c(0,0), cex=0.95)
# mtext(side=2, at =(1:nrow(forplot)), forplot$HRprint, las=2, line=1.5, adj=c(0,0), cex=0.95)
# at=... is the location of text at y-axis
mtext(side=2, at=(1:nrow(forplot))[titles],
forplot$label[titles],
las=2, font=2, line=13.5, cex=0.95, adj=c(0,0)
)
mtext(side=2, at=(1:nrow(forplot))[subrows],
forplot$label[subrows],
las=2, line=12.5, cex=0.95, adj=c(0,0)
)
mtext(side=4, at=1:nrow(forplot),
forplot$HRprint,
las=2, line=1, cex=0.95, adj=c(0,0))
mtext(side=4, at=1:nrow(forplot),
round(forplot$p_value, 3),
las=2, line=8, cex=0.95, adj=c(0,0))
mtext(side=4, at=nrow(forplot)+0.5, c("Hazard Ratio"), las=2, line=1,
cex=0.95, adj=c(0,0), font=2)
mtext(side=4, at=nrow(forplot), c("[95% CI]"), las=2, line=1,
cex=0.95, adj=c(0,0), font=2)
mtext(side=4, at=nrow(forplot), "P", line=8,cex=0.95, adj=c(0,0), font=2, las=2)
# at=... is the location of text at x-axis.
mtext(side=1, at=log(1), "Hazard Ratio", line=3)
# mtext(side=1, at=log(2), "More Hazard", line=5)
# mtext(side=1, at=log(0.35), "More Hazard", line=5)
# mtext(side=1, at=log(2), "In Females", line=6)
# mtext(side=1, at=log(0.35), "In Males", line=6)
# arrows(log(1.25), -2, log(2), -2, xpd=TRUE, length=0.1)
# arrows(log(0.75), -2, log(0.1), -2, xpd=TRUE, length=0.1)
Extracting Results from Subset Analysis
TBA
Extracting Results from Multivariate Models
TBA
Session Info
sessionInfo()
## R version 4.4.0 (2024-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22000)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] forestploter_1.1.1 forestplot_3.1.3 abind_1.4-5
## [4] checkmate_2.2.0 survival_3.5-8 yarrr_0.1.5
## [7] circlize_0.4.16 BayesFactor_0.9.12-4.7 Matrix_1.7-0
## [10] coda_0.19-4 jpeg_0.1-10 lubridate_1.9.0
## [13] timechange_0.1.1 forcats_1.0.0 stringr_1.5.0
## [16] dplyr_1.1.3 purrr_1.0.1 readr_2.1.4
## [19] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.3
## [22] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.4 shape_1.4.6 xfun_0.43
## [4] bslib_0.5.1 GlobalOptions_0.1.2 lattice_0.22-6
## [7] tzdb_0.3.0 vctrs_0.6.3 tools_4.4.0
## [10] generics_0.1.3 parallel_4.4.0 fansi_1.0.4
## [13] highr_0.10 pkgconfig_2.0.3 lifecycle_1.0.3
## [16] compiler_4.4.0 MatrixModels_0.5-3 munsell_0.5.0
## [19] htmltools_0.5.4 sass_0.4.5 yaml_2.3.6
## [22] pillar_1.9.0 jquerylib_0.1.4 cachem_1.0.8
## [25] tidyselect_1.2.0 digest_0.6.33 mvtnorm_1.2-3
## [28] stringi_1.7.12 bookdown_0.35 splines_4.4.0
## [31] fastmap_1.1.1 colorspace_2.1-0 cli_3.6.1
## [34] magrittr_2.0.3 utf8_1.2.3 withr_2.5.1
## [37] scales_1.2.1 backports_1.4.1 rmarkdown_2.25
## [40] gridExtra_2.3 blogdown_1.18 hms_1.1.3
## [43] pbapply_1.7-2 evaluate_0.22 knitr_1.46
## [46] rlang_1.1.1 Rcpp_1.0.10 glue_1.6.2
## [49] rstudioapi_0.15.0 jsonlite_1.8.7 R6_2.5.1