Tips for the table1 R Package and Summary Tables

Last updated on Oct 14, 2024 10 min read Biostatistics

Introduction

In this post, I show how I use the table1 R package, which creates publication-ready tables summarizing descriptive statistics and baseline characteristics of a study population. The package offers a user-friendly interface for generating customizable tables that can include categorical and continuous variables, stratified analyses, and statistical tests for group differences. I also share some tips for working with it.

Beyond the examples provided here, if you need further customization, such as merging two tables with different sample sizes or running other statistical tests, you may need to write your own code in base R.
- More methods and render code will be updated, later… Hopefully!

Before we start

The package author has written an excellent tutorial. Please take a look at it to get a feel for the package: https://benjaminrich.github.io/table1/vignettes/table1-examples.html
I will use the same dataset as the package author in this post for illustration.
I recommend preparing the data for table1 in a separate dataset. For instance, name the raw dataset dat, the table1-specific dataset table1_dat, and the analysis dataset analysis_dat. This separation helps when we want to apply different reference levels, variable labels, etc., for different purposes:
- Set variables as factors, with levels ordered as they will appear in the table.
- Use the expss::apply_labels function to add labels to the column names
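
A minimal sketch of that separation in base R, assuming a toy raw dataset with hypothetical columns arm and bmi (not the data used in the rest of this post). The table copy orders the factor levels for display, while the analysis copy sets the reference level for modelling:

```r
# Toy raw data; the column names are illustrative assumptions
dat <- data.frame(
  arm = c("Treated", "Placebo", "Treated", "Placebo"),
  bmi = c(24.1, 27.3, 22.8, 30.2)
)

# Table copy: levels in the order they should appear in the table
table1_dat <- dat
table1_dat$arm <- factor(table1_dat$arm, levels = c("Treated", "Placebo"))

# Analysis copy: Placebo as the reference level for regression contrasts
analysis_dat <- dat
analysis_dat$arm <- relevel(factor(analysis_dat$arm), ref = "Placebo")

levels(table1_dat$arm)    # display order
levels(analysis_dat$arm)  # reference level first
```

Because the raw dat is never modified, either copy can be rebuilt later without affecting the other.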

Sample Datasets

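## Data from package
library(boot)    # provides the melanoma dataset
library(table1)  # table1(), units(), stats.default()
library(dplyr)   # %>% and mutate() used below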

melanoma2 <- melanoma
 
# Factor the basic variables that
# we're interested in
melanoma2$status <- 
  factor(melanoma2$status, 
         levels=c(2,1,3),
         labels=c("Alive", # Reference
                  "Melanoma death", 
                  "Non-melanoma death"))


## Simulated data
f <- function(x, n, ...) factor(sample(x, n, replace=TRUE, ...), levels=x)
set.seed(427)

n <- 146
dat <- data.frame(id=1:n)
dat$treat <- f(c("Placebo", "Treated"), n, prob=c(1, 2)) # 2:1 randomization
dat$age   <- sample(18:65, n, replace=TRUE)
dat$sex   <- f(c("Female", "Male"), n, prob=c(.6, .4))  # 60% female
dat$wt    <- round(exp(rnorm(n, log(70), 0.23)), 1)

# Add some missing data
dat$wt[sample.int(n, 5)] <- NA

dat = dat %>% 
  mutate(treat = as.factor(treat),
         sex = factor(sex, levels = c("Female", "Male")))


table1_dat = expss::apply_labels(dat,
                                 age="Age",
                                 sex="Sex",
                                 wt="Weight",
                                 treat="Treatment Group")


units(table1_dat$age)   <- "years"
units(table1_dat$wt)    <- "kg"

Render for Descriptive Statistics

Each render option (render.continuous, render.categorical, render.missing) takes a function that computes and formats the summary statistics for a variable. P-values are supplied separately, through functions passed to extra.col.
Setting render.missing=NULL removes the “Missing” rows from the table. Note, however, that the percentages do not change, so they will not add up to 100% if there are missing values.
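
The percentage caveat can be illustrated with a few lines of base R. This is only a sketch on a toy vector that mimics the denominator behaviour described above; it does not call table1 itself:

```r
# Toy categorical variable with one missing value (illustrative data)
x <- factor(c("Female", "Female", "Male", NA))

n_total <- length(x)   # denominator includes the missing value
freq    <- table(x)    # table() drops NA by default

pct <- 100 * as.vector(freq) / n_total
names(pct) <- names(freq)

pct        # percentage per level, out of all 4 observations
sum(pct)   # below 100 because the missing share is not displayed
```

With the “Missing” row hidden, a reader has no way to see where the remaining share went, so it is worth stating the denominator in a footnote.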

Render of Continuous Variables

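We will show mean (SD), min - max, and median [Q1, Q3]. An N line is included but commented out; swap it with the blank placeholder to display N. Uncomment the Missing line if you want to show the missing count and percentage.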

render.cont <- function(x, name, table1_data, ...) {
    # Summary statistics, ignoring missing values
    MIN    <- min(x, na.rm = TRUE)
    MAX    <- max(x, na.rm = TRUE)
    MEDIAN <- median(x, na.rm = TRUE)
    Q1     <- quantile(x, 0.25, na.rm = TRUE)
    Q3     <- quantile(x, 0.75, na.rm = TRUE)
    MEAN   <- mean(x, na.rm = TRUE)
    SD     <- sd(x, na.rm = TRUE)

    # Counts of non-missing and missing values
    nmiss <- sum(is.na(x))
    N     <- length(x) - nmiss
    miss  <- (nmiss / length(x)) * 100

    # The first element is a blank placeholder; swap in the commented line to show N
    out <- c(# "N" = paste0("(N=", N, ")"),
             "N" = " ",
             "Mean (SD)" = paste0(sprintf("%.1f", MEAN), " (", sprintf("%.1f", SD), ")"),
             "Min - Max" = paste0(sprintf("%.1f", MIN), " - ", sprintf("%.1f", MAX)),
             "Median [Q1, Q3]" = paste0(sprintf("%.1f", MEDIAN), " [", sprintf("%.1f", Q1), ", ", sprintf("%.1f", Q3), "]")
             # , "Missing" = paste0(sprintf("%.0f", nmiss), " (", sprintf("%.1f", miss), "%)")
             )

    out
}
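
Before passing a render function to table1, it can help to call it directly on a vector and inspect what comes back. The sketch below uses a shortened, hypothetical render_demo (not the function above) to show the shape of the return value: a named character vector whose first element is, as far as I understand the package's convention, a placeholder for the variable-label row.

```r
# Hypothetical, shortened render function for illustration only
render_demo <- function(x, ...) {
  c("",  # placeholder: assumed to line up with the variable-label row
    "Mean (SD)" = sprintf("%.1f (%.1f)", mean(x, na.rm = TRUE), sd(x, na.rm = TRUE)),
    "Min - Max" = sprintf("%.1f - %.1f", min(x, na.rm = TRUE), max(x, na.rm = TRUE)))
}

out <- render_demo(c(1, 2, 3, 4, NA))
out
names(out)  # row labels that would appear in the table
```

Each name becomes a row label and each value becomes the cell text, so any formatting (decimals, brackets, units) has to happen inside the function.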

Render of Categorical Variables

We will show the count and column % for each level.

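render.cat <- function(x, ...) {
  # Non-missing count, used only by the commented N line below
  N <- length(x) - sum(is.na(x))
  # stats.default() returns FREQ and PCT for each level of x
  FREQ_PCT <- c(sapply(stats.default(x),
                       function(y) with(y, sprintf("%d (%0.1f %%)", FREQ, PCT))))

  # The first element is a blank placeholder; swap in the commented line to show N
  out <- c(# "N" = paste0("(N=", N, ")"),
           "N" = " ",
           FREQ_PCT)
  out
}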

Render for Statistical Tests

Two-Sample Parametric Tests

P-values from parametric tests comparing two independent groups. If the variable is numeric, run a two-sample t-test (R's t.test defaults to Welch's unequal-variance version); if the variable is categorical, run a chi-square test.

pvalue_para <- function(x, ...) {
    # Construct vectors of data y, and groups (strata) g
    y <- unlist(x)
    g <- factor(rep(seq_along(x), times=lengths(x)))
    if (is.numeric(y)) {
        # For numeric variables, perform a 2-sample t-test (Welch by default)
        p <- t.test(y ~ g)$p.value
    } else {
      if (nrow(table(y,g)) !=1) {
        # For categorical variables, perform a chi-squared test of independence
        p <- chisq.test(table(y, g))$p.value
      }
      else {p <- ""} # If only one level has non-missing counts, skip the testing.
    }
    # Format the p-value, using an HTML entity for the less-than sign.
    # The initial empty string places the output on the line below the variable label.
    if (!p %in% "") {
    c("", sub("<", "&lt;", format.pval(p, digits=3, eps=0.001)))
    }
}
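
The first two lines of the function are easier to follow on a small example. This sketch assumes, based on the comments in the code above, that x arrives as a list with one vector per stratum; the numbers are made up:

```r
# x mimics what the p-value function receives: one vector per stratum
x <- list(Placebo = c(61.2, 70.5, 66.0),
          Treated = c(72.4, 68.9))

y <- unlist(x)                                      # all observations in one vector
g <- factor(rep(seq_along(x), times = lengths(x)))  # stratum index for each observation

data.frame(y = y, g = g)
```

The resulting y and g have the same length, which is what formula interfaces such as t.test(y ~ g) expect.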

Two-Sample Non-Parametric Tests

P-values from non-parametric tests comparing two independent groups. If the variable is numeric, run a Wilcoxon rank-sum test; if the variable is categorical, run Fisher’s exact test.

pvalue_nonpara <- function(x, ...) {
    # Construct vectors of data y, and groups (strata) g
    y <- unlist(x)
    g <- factor(rep(seq_along(x), times=lengths(x)))
    if (is.numeric(y)) {
        # For numeric variables, perform a Wilcoxon rank-sum test
        p <- wilcox.test(y ~ g)$p.value
    } else {
      if (nrow(table(y,g)) !=1) {
        # For categorical variables, perform Fisher's exact test
        p <- fisher.test(table(y, g))$p.value
      }
      else {p <- ""} # If only one level has non-missing counts, skip the testing.
    }
    # Format the p-value, using an HTML entity for the less-than sign.
    # The initial empty string places the output on the line below the variable label.
    if (!p %in% "") {
    c("", sub("<", "&lt;", format.pval(p, digits=3, eps=0.001)))
    }
}
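
The formatting step at the end of these functions relies on format.pval, which prints values below eps as a threshold, and on replacing the less-than sign with an HTML entity so it is not read as the start of a tag. A small sketch with made-up p-values:

```r
p_values <- c(0.0335, 0.00004)

# Format each p-value on its own, then escape "<" for HTML output
formatted <- vapply(
  p_values,
  function(p) sub("<", "&lt;", format.pval(p, digits = 3, eps = 0.001), fixed = TRUE),
  character(1)
)
formatted
```

Values at or above eps are printed with three significant digits, while anything smaller collapses to the escaped threshold string.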

Multi-Sample Parametric Tests

P-values from parametric tests comparing more than two groups. If the variable is numeric, run a one-way ANOVA to compare the group means; if the variable is categorical, run a chi-square test to assess whether the population proportions are equal.

pvalueANOVA <- function(x, ...) {
  # Construct vectors of data y, and groups (strata) g
  y <- unlist(x)
  g <- factor(rep(seq_along(x), times=lengths(x)))
  
  if (is.numeric(y)) {
    # For numeric variables, perform a one-way ANOVA
    ano <- aov(y ~ g)
    p <- summary(ano)[[1]][["Pr(>F)"]][1]
    
  } else {
    if (nrow(table(y,g)) >=2) {
    # For categorical variables, perform a chi-squared test of independence
    p <- chisq.test(table(y, g))$p.value
    }
    else {p <- ""} # If only one level has non-missing counts, skip the testing.
  }
  # Format the p-value, using an HTML entity for the less-than sign.
  # The initial empty string places the output on the line below the variable label.
  # Skip formatting when no test was run: format.pval() expects a numeric value.
  if (!p %in% "") {
    c("", sub("<", "&lt;", format.pval(p, digits=3, eps=0.001)))
  }
}
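
The p-value is pulled out of the ANOVA summary table by indexing into its "Pr(>F)" column, which is the fifth column. The sketch below uses the built-in PlantGrowth dataset (my choice for illustration, not data from this post) to show that selecting by position and by name give the same cell, and that the result agrees with oneway.test when equal variances are assumed:

```r
# PlantGrowth ships with base R: plant weight under three conditions
fit <- aov(weight ~ group, data = PlantGrowth)

p_by_position <- summary(fit)[[1]][[5]][1]          # 5th column, 1st row
p_by_name     <- summary(fit)[[1]][["Pr(>F)"]][1]   # same cell, selected by name

# Classic one-way ANOVA through oneway.test(), assuming equal variances
p_oneway <- oneway.test(weight ~ group, data = PlantGrowth, var.equal = TRUE)$p.value

c(p_by_position, p_by_name, p_oneway)
```

Selecting by name is a little more robust to reading errors, since the intent is visible in the code.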

Multi-Sample Non-Parametric Tests

P-values from non-parametric tests comparing more than two groups. If the variable is numeric, run a Kruskal-Wallis test, which compares the rank distributions of the groups rather than their means; if the variable is categorical, run Fisher’s exact test to assess whether the population proportions are equal.

pvalueKW <- function(x, ...) {
  # Construct vectors of data y, and groups (strata) g
  y <- unlist(x)
  g <- factor(rep(seq_along(x), times=lengths(x)))
  
  if (is.numeric(y)) {
    # For numeric variables, perform a Kruskal-Wallis test
    km <- kruskal.test(y ~ g)
    p <- km$p.value
    
  } else {
    if (nrow(table(y,g)) >=2) {
    # For categorical variables, perform Fisher's exact test
    p <- fisher.test(table(y, g))$p.value
    }
    else {p <- ""} # If only one level has non-missing counts, skip the testing.
  }
  # Format the p-value, using an HTML entity for the less-than sign.
  # The initial empty string places the output on the line below the variable label.
  # Skip formatting when no test was run: format.pval() expects a numeric value.
  if (!p %in% "") {
    c("", sub("<", "&lt;", format.pval(p, digits=3, eps=0.001)))
  }
}
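
One practical note when choosing between the two-sample and multi-sample functions: with exactly two groups, the Kruskal-Wallis test matches the Wilcoxon rank-sum test only when the latter uses the normal approximation without continuity correction. A default wilcox.test call may instead use an exact p-value or a continuity correction, so the two functions can report slightly different numbers for the same data. A sketch using the built-in mtcars dataset (chosen for illustration only):

```r
# mtcars ships with base R; am (0/1) defines two groups
kw <- kruskal.test(mpg ~ am, data = mtcars)$p.value

# Wilcoxon rank-sum test with the normal approximation and no continuity correction
wx <- wilcox.test(mpg ~ am, data = mtcars, exact = FALSE, correct = FALSE)$p.value

all.equal(kw, wx)  # the two p-values agree up to numerical tolerance
```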

Sample table1 Code with Overall

Setting render.missing=NULL removes the “Missing” rows from the table. Note, however, that the percentages do not change, so they will not add up to 100% if there are missing values.

table1_ex1 = table1(~ age + sex + wt | treat, data=table1_dat,
                   render.continuous=render.cont, # optional, your continuous variable render
                   render.categorical=render.cat, # optional, your categorical variable render
                   # render.missing=NULL, 
                   overall = "Modify Overall Column Name Here"
                   )

table1_ex1

	Placebo (N=52)	Treated (N=94)	Modify Overall Column Name Here (N=146)
Age (years)
Mean (SD)	39.2 (14.2)	40.1 (13.3)	39.8 (13.6)
Min - Max	18.0 - 65.0	18.0 - 65.0	18.0 - 65.0
Median [Q1, Q3]	37.5 [26.8, 50.5]	39.5 [30.0, 50.0]	39.0 [28.2, 50.0]
Sex
Female	34 (65.4 %)	53 (56.4 %)	87 (59.6 %)
Male	18 (34.6 %)	41 (43.6 %)	59 (40.4 %)
Weight (kg)
Mean (SD)	68.1 (16.3)	68.3 (16.7)	68.2 (16.5)
Min - Max	37.5 - 116.3	40.0 - 118.8	37.5 - 118.8
Median [Q1, Q3]	66.7 [57.2, 77.0]	64.9 [57.4, 75.9]	66.2 [57.3, 76.4]
Missing	2 (3.8%)	3 (3.2%)	5 (3.4%)

Sample table1 Code with Customized render and p-value Calculation

⚠️**To conduct tests, the ‘overall’ option must be set to FALSE.**⚠️

table1_ex2 = table1(~ age + sex + wt | treat, data=table1_dat,
                   render.continuous=render.cont, # optional, your continuous variable render
                   render.categorical=render.cat, # optional, your categorical variable render
                   overall = FALSE,
                   extra.col=list(`Parametric P-value`=pvalue_para,
                                  `Non-Parametric P-value`=pvalue_nonpara)
                   )

table1_ex2

	Placebo (N=52)	Treated (N=94)	Parametric P-value	Non-Parametric P-value
Age (years)
Mean (SD)	39.2 (14.2)	40.1 (13.3)	0.719	0.66
Min - Max	18.0 - 65.0	18.0 - 65.0
Median [Q1, Q3]	37.5 [26.8, 50.5]	39.5 [30.0, 50.0]
Sex
Female	34 (65.4 %)	53 (56.4 %)	0.376	0.379
Male	18 (34.6 %)	41 (43.6 %)
Weight (kg)
Mean (SD)	68.1 (16.3)	68.3 (16.7)	0.944	0.936
Min - Max	37.5 - 116.3	40.0 - 118.8
Median [Q1, Q3]	66.7 [57.2, 77.0]	64.9 [57.4, 75.9]
Missing	2 (3.8%)	3 (3.2%)

table1_ex3 = table1(~ factor(sex) + age + factor(ulcer) + thickness | status, data=melanoma2,
                   render.continuous=render.cont, # optional, your continuous variable render
                   render.categorical=render.cat, # optional, your categorical variable render
                   overall = FALSE,
                   extra.col=list(`Parametric P-value`=pvalueANOVA,
                                  `Non-Parametric P-value`=pvalueKW)
                   )

table1_ex3

	Alive (N=134)	Melanoma death (N=57)	Non-melanoma death (N=14)	Parametric P-value	Non-Parametric P-value
factor(sex)
0	91 (67.9 %)	28 (49.1 %)	7 (50.0 %)	0.0335	0.0325
1	43 (32.1 %)	29 (50.9 %)	7 (50.0 %)
age
Mean (SD)	50.0 (15.9)	55.1 (17.9)	65.3 (10.9)	0.0016	0.00148
Min - Max	4.0 - 84.0	14.0 - 95.0	49.0 - 86.0
Median [Q1, Q3]	52.0 [40.0, 61.8]	56.0 [44.0, 68.0]	65.0 [57.0, 71.8]
factor(ulcer)
0	92 (68.7 %)	16 (28.1 %)	7 (50.0 %)	<0.001	<0.001
1	42 (31.3 %)	41 (71.9 %)	7 (50.0 %)
thickness
Mean (SD)	2.2 (2.3)	4.3 (3.6)	3.7 (3.6)	<0.001	<0.001
Min - Max	0.1 - 12.9	0.3 - 17.4	0.2 - 12.6
Median [Q1, Q3]	1.4 [0.8, 2.9]	3.5 [2.2, 4.8]	2.3 [1.3, 5.8]

Knit to PDF or HTML

A table1 object can be converted to a kable or a flextable using built-in functions in the table1 package. Both make it easy to create tables for reports and publications.

I convert my table1 object to a flextable when I’m generating a PDF report. If the knit output is HTML, all three methods (table1, kable, flextable) perform equally well.

kable and kableExtra vignettes: https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
flextable book: https://ardata-fr.github.io/flextable-book/

Table1 to Kable

t1kable(your_table1_object)

Table1 to FlexTable

t1flex(your_table1_object)

library(flextable)  # line_spacing() and bg()

table1_ex1 %>% 
  t1flex() %>%
  line_spacing(space = 0, part = "body") %>% 
  # A white background displays the table properly in 'Dark' mode.
  # This step is unnecessary if you're generating your own report.
  bg(bg = "white", part = "all")

	Placebo (N=52)	Treated (N=94)	Modify Overall Column Name Here (N=146)
Age (years)
Mean (SD)	39.2 (14.2)	40.1 (13.3)	39.8 (13.6)
Min - Max	18.0 - 65.0	18.0 - 65.0	18.0 - 65.0
Median [Q1, Q3]	37.5 [26.8, 50.5]	39.5 [30.0, 50.0]	39.0 [28.2, 50.0]
Sex
Female	34 (65.4 %)	53 (56.4 %)	87 (59.6 %)
Male	18 (34.6 %)	41 (43.6 %)	59 (40.4 %)
Weight (kg)
Mean (SD)	68.1 (16.3)	68.3 (16.7)	68.2 (16.5)
Min - Max	37.5 - 116.3	40.0 - 118.8	37.5 - 118.8
Median [Q1, Q3]	66.7 [57.2, 77.0]	64.9 [57.4, 75.9]	66.2 [57.3, 76.4]
Missing	2 (3.8%)	3 (3.2%)	5 (3.4%)
