R Markdown and PDF File and Project Management

Last updated on Sep 5, 2024


Introduction to R Markdown

R Markdown is a popular tool for generating reports, documents, and presentations using R code and Markdown syntax. One of its key features is the ability to knit a document to a variety of formats, including PDF, HTML, and Word. R Markdown has many benefits, but there are also some potential drawbacks to consider:

Pros:

Easy to use: R Markdown is simple to learn and use, even for beginners.
Reproducible research: R Markdown allows you to easily reproduce your analyses and results, making it ideal for scientific research.
Flexible formatting: Markdown syntax is flexible and allows you to format text, tables, and code blocks to create a professional-looking document.
Automatic table of contents and numbered tables/figures: R Markdown can generate a table of contents for your document automatically, saving you time and effort. It can also number tables and figures automatically.

Cons:

Limited layout control: While R Markdown allows for flexible formatting, it can be challenging to customize the layout and design of your document beyond the default options.
When knitting to PDF: While R Markdown can knit to PDF, the resulting document may not always display tables and plots in the correct location, making it less suitable for some types of reports.
Steep learning curve: While R Markdown is easy to use for basic tasks, more advanced features such as customizing templates or adding LaTeX code can require a significant amount of time and effort to learn.

In this post, we introduce the basic settings and structure of an R Markdown file for knitting to PDF, along with some tips for common issues such as tables and plots not appearing in the correct location. The tables and figures here are rough examples; to create publication-ready tables and plots, please refer to KM plot post and [Link to table1 post].

These instructions are tailored to our everyday work setting. For broader guidelines, please refer to R Markdown: The Definitive Guide.

FYI, R for Clinical Study Reports and Submission provides R code to generate FDA-submission-ready tables. In this post, we will focus on publication-ready tables/figures.

Project Management

First, create your own R project in your project folder by following these instructions: Starting Your R Project.

Sessions 10-12 give a good idea of how to approach project management.

Design clean project/code architecture

Insert screenshot of a data analysis project

Normally, we will have 5-6 sub-folders for one project:

Necessary folders: a Protocol folder for protocols, a Data folder for data, an R Code folder for R code, a Report or Results folder for all report versions, and a Resources folder for user-written R functions.
Add other folders as needed.
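
As a rough sketch, the folder layout above could be created from the R console as follows. The folder names follow the list above; project_dir is a placeholder (a temporary folder here), so point it at your own project folder.

```r
# Sketch: create the standard sub-folders for a new project.
# project_dir is a placeholder; replace it with your real project folder.
project_dir <- file.path(tempdir(), "My Project")

sub_folders <- c("Protocol", "Data", "R Code", "Report", "Resources")

for (folder in sub_folders) {
  # recursive = TRUE also creates project_dir itself if it is missing;
  # showWarnings = FALSE keeps the call quiet when a folder already exists.
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# List what was created
list.files(project_dir)
```

Running it a second time is harmless, because existing folders are left untouched.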

The R code can become quite lengthy as the project progresses, so it is HIGHLY RECOMMENDED to follow these steps:

Utilize multiple .Rmd files, such as one for the analysis report, one for data cleaning, and another for exploratory analysis.
In the main .Rmd file (the one used to generate the analysis report), use the code xfun::Rscript_call(purl, list(input = 'Data Cleaning.Rmd', output = paste0(resourceDir,'Data Cleaning.R'))) to extract only the code chunks, leaving out the YAML header and the text (and any chunk, such as the setup chunk, that sets the chunk option purl = FALSE). The corresponding .R file is rewritten every time this line runs, so it stays in sync with the .Rmd file. Then, use the following code to read and execute the data cleaning code: source(paste0(resourceDir,"Data Cleaning.R"))
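
The extract-then-source step can be tried out on a throwaway file first. The sketch below is only an illustration under a few assumptions: the .Rmd content and chunk names are made up, the files live in a temporary folder, and knitr::purl is called directly for brevity, whereas the line above goes through xfun::Rscript_call so that purl runs in a separate R session while the main report is knitting.

```r
library(knitr)

fence <- strrep("`", 3)  # chunk delimiter, built here so the example is easy to embed

# A throwaway stand-in for 'Data Cleaning.Rmd' (hypothetical content)
rmd_file <- file.path(tempdir(), "Data Cleaning.Rmd")
writeLines(c(
  "---",
  "title: Data Cleaning",
  "---",
  "",
  "Narrative text that should not end up in the script.",
  "",
  paste0(fence, "{r setup, purl=FALSE}"),
  "library(knitr)",
  fence,
  "",
  paste0(fence, "{r clean}"),
  "cleaned_value <- 1 + 1",
  fence
), rmd_file)

# Extract the code chunks into an .R script; the YAML header, the text,
# and the chunk marked purl=FALSE are left out.
r_file <- file.path(tempdir(), "Data Cleaning.R")
purl(input = rmd_file, output = r_file, quiet = TRUE)

# Run the extracted script in its own environment and inspect the result
cleaning_env <- new.env()
source(r_file, local = cleaning_env)
cleaning_env$cleaned_value
```

Opening the generated .R file is a quick way to confirm which chunks were kept.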

GitHub Project

We can link our R project to GitHub to save everything within the project, but please DON’T upload any study-related code, data, or materials.

GitHub Project Board

Create a Project

Step 1: Create the Project Folder

Our R Project will be stored in the R Code folder. You might want to have other folders such as Data, R Functions, Protocol, and Reports.
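
Because the .Rproj file sits inside the R Code folder, the other folders are siblings of the project root, which is why the setup code later in this post goes one level up with dirname(). Here is a minimal sketch of that idea, using a temporary folder as a stand-in for the real project. In it, project_root plays the role of the value that rprojroot::find_rstudio_root_file() returns inside a real project, and demo.csv is a made-up file name.

```r
# Stand-in layout: <base_dir>/R Code holds the project, <base_dir>/Data holds the data.
base_dir <- file.path(tempdir(), "Demo Project")
project_root <- file.path(base_dir, "R Code")   # assumed location of the .Rproj file
dir.create(project_root, recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(base_dir, "Data"), showWarnings = FALSE)

# One level up from the project root, then down into the sibling folder
dataDir <- paste0(dirname(project_root), "/Data/")

# Write and read back a small file to confirm that the path resolves
write.csv(data.frame(id = 1:3), paste0(dataDir, "demo.csv"), row.names = FALSE)
demo <- read.csv(paste0(dataDir, "demo.csv"))
nrow(demo)
```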

Step 2: Create an R Project and tell RStudio where the working folder is.

Choose Existing Directory.

Browse to the R Code folder.

Step 3: Create an R Markdown file

Any of these settings can be changed easily later, so just select “OK” to create the R Markdown file.

Finally…!

The next time you double-click the .Rproj file to re-open the project, RStudio will use the project folder as the working directory, and any files that were left open will be restored in their tabs.

Starting Part

At the top of the R Markdown file are the YAML parameters of the document. You can change pdf_document to html_document or word_document if you want to knit to other formats. toc: yes displays a table of contents.

FYI… https://zsmith27.github.io/rmarkdown_crash-course/lesson-4-yaml-headers.html

---
title: "Your Title"
author: "Your Name"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  pdf_document:
    toc: yes
    toc_depth: 5
    number_sections: yes
    fig_width: 7  # OR 9.33
    fig_height: 6    # OR 8
    fig_crop: no
header-includes: 
- \usepackage{float}
---
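
The date field in the header above is an inline R expression, so the knit date is filled in each time the document is built. To preview what the format string produces, it can be run in the console. The sketch below uses a fixed date instead of Sys.time() so that the result is predictable; note that the month name follows the locale of your R session.

```r
# Preview the date format used in the YAML header with a fixed date.
# %B = full month name (locale-dependent), %d = zero-padded day, %Y = four-digit year
fixed_date <- as.Date("2024-09-05")
formatted <- format(fixed_date, "%B %d, %Y")
print(formatted)  # for example "September 05, 2024" in an English locale
```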

The KM plot code here is for demonstration only; please refer to https://oncologyqs.github.io/post/biostatistics/2023-03-24-km-plots/ for the most recently updated KM plot code.

Next, at the top of each R Markdown file, I prefer to load all my packages, load my own functions, and apply any other customized settings. I then read all data in a separate code chunk.

In RStudio, the setup code chunk is run automatically before any other chunk is run.

knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE, fig.pos = 'H')

library(tidyverse)
library(knitr)
library(readxl)
library(kableExtra)
library(table1)
library(expss)
library(flextable)
library(PropCIs)
library(survival)
library(survminer)
library(patchwork)
library(tibble)
library(forestplot)
library(survivalAnalysis)
library(stringr)
library(tidytidbits)
library(ggplot2)
library(ggpubr)
library(reshape2)
library(ggbreak)
library(finalfit)
library(cobalt)
library(consort)


dataDir = paste0(dirname(rprojroot::find_rstudio_root_file()),"/Data/")
resourceDir = paste0(dirname(rprojroot::find_rstudio_root_file()),"/Resources/")

# source(paste0(resourceDir,"OneGroup_survPlot.R"))
# source(paste0(resourceDir,"TwoGroups_survPlot.R"))

substrRight <- function(x, n){
  substr(x, nchar(x)-n+1, nchar(x))
}

# The origin date of an Excel file is 1899-12-30. The origin date of R is 1970-01-01.
# PLEASE carefully check the origin date of your data!
as.date <- function(x) as.Date(x, origin = "1899-12-30")

custom_theme <- theme_cleantable() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.x = element_blank(),
        axis.line.x = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        plot.title = element_text(size=12)) #change the fontsize of "Number at risk"

data = veteran %>% 
  mutate(
    trt = factor(trt, levels = c(1,2), labels = c("Control", "Treatment")),
    time = time/30.437 # One month = 30.437 days
  )

data_control = data %>% 
  filter(trt == "Control")

data_trt = data %>% 
  filter(trt == "Treatment")
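
To see why the origin date matters, here is a small sketch that converts one Excel serial number with both origins, followed by the days-to-months conversion used above. The serial number 45000 is a made-up example value, not taken from any study data.

```r
# Convert an Excel serial date using the Excel origin
as.date <- function(x) as.Date(x, origin = "1899-12-30")

excel_serial <- 45000              # hypothetical value as read from a spreadsheet
converted <- as.date(excel_serial)
print(converted)                   # "2023-03-15"

# The same number interpreted with R's own origin lands about 70 years too late
wrong <- as.Date(excel_serial, origin = "1970-01-01")
print(wrong)

# Days to months with the same divisor as above (365.25 / 12 = 30.4375)
days <- 365.25
months <- days / 30.437
round(months, 2)
```

A quick check like this on one or two known dates is usually enough to confirm that the right origin is in use.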

Note:

I will keep brief notes here for important information related to the report.

I will provide a separate Word document detailing where each variable can be found, how key variables were derived (according to the protocol), and the statistical methods used in each step. Below is an example:

Protocol version: xxx
Data version: xxx
Data cut-off date: xxx

The veteran dataset contains survival data from a two-treatment, randomized trial for lung cancer. The variables in veteran are:

trt (Column A): 1=standard 2=test
celltype (Column B): 1=squamous, 2=small cell, 3=adeno, 4=large
time (Column C): survival time in days
status (Column D): censoring status 0=censored, 1=death
karno (Column E): Karnofsky performance score (100=good)
diagtime (Column F): months from diagnosis to randomization
age (Column G): in years
prior (Column H): prior therapy 0=no, 10=yes

Data Wrangling Rmd

Do your data wrangling in one or multiple code chunks. I recommend doing this in a separate Rmd file; otherwise you will end up with a VERY LONG Rmd file.

data_table1_demo = data %>% 
  mutate(
    prior = case_when(prior == 0 ~ "No",
                      TRUE ~ "Yes"),
    prior = factor(prior, levels = c("Yes", "No")),
    trt = factor(trt, levels = c("Treatment", "Control"))
  )
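
After a recode like this, it is worth checking that every raw code ended up under the intended label. The sketch below is one possible check, not part of the original workflow: it uses base R ifelse() as a stand-in for case_when() to keep dependencies light, and it works on the raw veteran data from the survival package. The totals should agree with the Overall column of the baseline table in this post (40 Yes, 97 No).

```r
library(survival)  # ships with R and provides the veteran data set

raw_prior <- veteran$prior   # coded 0 = no, 10 = yes
prior_label <- factor(ifelse(raw_prior == 0, "No", "Yes"), levels = c("Yes", "No"))

# Cross-tabulate raw codes against the new labels: each raw code should map
# to exactly one label, and no NA column should appear.
recode_check <- table(raw = raw_prior, recoded = prior_label, useNA = "ifany")
print(recode_check)
```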

Use the following code to extract only the code chunks, leaving out the YAML header and the text (and any chunk, such as the setup chunk, that sets the chunk option purl = FALSE). The corresponding .R file is rewritten every time this line runs.

xfun::Rscript_call(purl, list(input = 'Data Cleaning.Rmd', output = paste0(resourceDir,'Data Cleaning.R')))

Then, use the following code to read and execute the data cleaning code:

source(paste0(resourceDir,"Data Cleaning.R"))

Analysis Part

I will show some tables and figures.

Please remember to use the following line to force a new section to start on a new page. Note that this line is not R code; it goes in the text part of the document.

\newpage

Baseline Characteristics Table

When knitting to PDF, the table may not stay where it is called. To keep the table in position, convert it to a kable (Option 1) or to a flextable (Option 2).

# Option 1: kable. HOLD_position needs the LaTeX float package loaded in the YAML header.
table1(~ age + celltype + prior | trt,
       data = data_table1_demo,
       caption = "Table 1 Sample") %>%
  t1kable() %>%
  kable_styling(latex_options = c("HOLD_position"))

# Option 2: flextable. Replace the last two lines above with:
#   t1flex() %>%
#   line_spacing(space = 0, part = "body")

Table 1 Sample
	Treatment (N=68)	Control (N=69)	Overall (N=137)
age
Mean (SD)	59.1 (10.3)	57.5 (10.8)	58.3 (10.5)
Median [Min, Max]	62.0 [35.0, 81.0]	62.0 [34.0, 81.0]	62.0 [34.0, 81.0]
celltype
squamous	20 (29.4%)	15 (21.7%)	35 (25.5%)
smallcell	18 (26.5%)	30 (43.5%)	48 (35.0%)
adeno	18 (26.5%)	9 (13.0%)	27 (19.7%)
large	12 (17.6%)	15 (21.7%)	27 (19.7%)
prior
Yes	19 (27.9%)	21 (30.4%)	40 (29.2%)
No	49 (72.1%)	48 (69.6%)	97 (70.8%)

KM Plot of Treatment Group

Unlike tables, plots usually do not cause any trouble. If a plot shows up before its subtitle, add some descriptive text between the subtitle and the code chunk, as done here.

sample_onegroup = OneGroup_survPlot(data = data_trt, outcomeTime = "time",outcome = "status")

KM Plot of Two Groups

sample_twogroups = TwoGroups_survPlot(data = data, outcomeTime = "time", outcome = "status", 
                   xlab = "Months since the first treatment", ylab = "Probability of OS", 
                   VarName = "trt", 
                   ExpName = "Arm", ExpLabels = c("Ctrl", "Trt"), 
                   HRText = FALSE, MedianText = TRUE, 
                   labelX = 8, title = "Sample KM Plot of Two Groups", 
                   BreakBy = 6)
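
TwoGroups_survPlot is a user-written function whose source is not shown in this post. As a rough stand-in for readers who want to check the numbers behind such a plot, the Kaplan-Meier estimates and the median survival per arm can be computed directly with the survival package. The object names below are made up for this sketch.

```r
library(survival)

km_data <- veteran
km_data$trt <- factor(km_data$trt, levels = c(1, 2), labels = c("Control", "Treatment"))
km_data$time <- km_data$time / 30.437   # days to months, as in the setup chunk

# Kaplan-Meier estimate by treatment arm
km_fit <- survfit(Surv(time, status) ~ trt, data = km_data)

# Median survival in months per arm, with 95% confidence limits
km_medians <- summary(km_fit)$table[, c("median", "0.95LCL", "0.95UCL")]
print(round(km_medians, 1))
```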

Cox Prop Model Table

CoxTable(data=data, VarName = "trt", ExpLabels = c("Control", "Treatment"))

Group	HR (95% CI)	P-value
Control	REF
Treatment	1.02 [0.71, 1.45]	0.928
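
CoxTable is also a user-written function that is not shown here. The sketch below is not its implementation; it only illustrates how a table of this shape could be assembled from a univariable Cox model fitted with survival::coxph, with Control as the reference level. The hazard ratio and interval should be close to the table above. The P-value may differ slightly, depending on which test CoxTable reports (the sketch uses the Wald test).

```r
library(survival)

cox_data <- veteran
cox_data$trt <- factor(cox_data$trt, levels = c(1, 2), labels = c("Control", "Treatment"))

# Univariable Cox model; the first factor level (Control) is the reference
cox_fit <- coxph(Surv(time, status) ~ trt, data = cox_data)
cox_sum <- summary(cox_fit)

hr      <- cox_sum$conf.int[1, "exp(coef)"]
ci_low  <- cox_sum$conf.int[1, "lower .95"]
ci_high <- cox_sum$conf.int[1, "upper .95"]
p_value <- cox_sum$coefficients[1, "Pr(>|z|)"]   # Wald test

# Assemble a small display table with one row per group
cox_table <- data.frame(
  Group = c("Control", "Treatment"),
  HR = c("REF", sprintf("%.2f [%.2f, %.2f]", hr, ci_low, ci_high)),
  P = c("", sprintf("%.3f", p_value))
)
print(cox_table)
```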

Output PDF File

Session Info

sessionInfo()

## R version 4.4.0 (2024-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 22631)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] consort_1.2.1          cobalt_4.5.5.9000      finalfit_1.0.6        
##  [4] ggbreak_0.1.2          reshape2_1.4.4         tidytidbits_0.3.2     
##  [7] survivalAnalysis_0.3.0 forestplot_3.1.3       abind_1.4-5           
## [10] checkmate_2.2.0        patchwork_1.1.3        survminer_0.4.9       
## [13] ggpubr_0.6.0           survival_3.6-4         PropCIs_0.3-0         
## [16] flextable_0.9.3        expss_0.11.6           maditr_0.8.3          
## [19] table1_1.4.3           kableExtra_1.4.0       readxl_1.4.2          
## [22] knitr_1.46             lubridate_1.9.0        timechange_0.1.1      
## [25] forcats_1.0.0          stringr_1.5.1          dplyr_1.1.3           
## [28] purrr_1.0.1            readr_2.1.4            tidyr_1.3.1           
## [31] tibble_3.2.1           ggplot2_3.4.4          tidyverse_2.0.0       
## 
## loaded via a namespace (and not attached):
##   [1] shape_1.4.6             rstudioapi_0.15.0       jsonlite_1.8.7         
##   [4] magrittr_2.0.3          jomo_2.7-6              farver_2.1.2           
##   [7] nloptr_2.0.3            rmarkdown_2.25          fs_1.6.3               
##  [10] ragg_1.2.5              vctrs_0.6.3             minqa_1.2.6            
##  [13] memoise_2.0.1           askpass_1.2.0           rstatix_0.7.2          
##  [16] blogdown_1.18           htmltools_0.5.4         curl_5.0.2             
##  [19] broom_1.0.5             cellranger_1.1.0        Formula_1.2-5          
##  [22] gridGraphics_0.5-1      mitml_0.4-5             sass_0.4.5             
##  [25] bslib_0.5.1             htmlwidgets_1.6.2       plyr_1.8.9             
##  [28] zoo_1.8-11              cachem_1.0.8            uuid_1.1-1             
##  [31] iterators_1.0.14        mime_0.12               lifecycle_1.0.4        
##  [34] pkgconfig_2.0.3         Matrix_1.7-0            R6_2.5.1               
##  [37] fastmap_1.1.1           shiny_1.7.5             digest_0.6.33          
##  [40] aplot_0.2.1             colorspace_2.1-0        rprojroot_2.0.3        
##  [43] textshaping_0.3.6       labeling_0.4.3          fansi_1.0.4            
##  [46] km.ci_0.5-6             compiler_4.4.0          fontquiver_0.2.1       
##  [49] withr_3.0.0             htmlTable_2.4.1         backports_1.4.1        
##  [52] carData_3.0-5           highr_0.10              pan_1.9                
##  [55] ggsignif_0.6.4          MASS_7.3-60.2           openssl_2.1.1          
##  [58] gfonts_0.2.0            tools_4.4.0             zip_2.3.0              
##  [61] httpuv_1.6.11           nnet_7.3-19             glue_1.6.2             
##  [64] nlme_3.1-164            promises_1.2.1          gridtext_0.1.5         
##  [67] generics_0.1.3          gtable_0.3.5            tzdb_0.3.0             
##  [70] KMsurv_0.1-5            data.table_1.14.8       hms_1.1.3              
##  [73] xml2_1.3.4              car_3.1-2               utf8_1.2.3             
##  [76] foreach_1.5.2           pillar_1.9.0            yulab.utils_0.1.0      
##  [79] later_1.3.1             splines_4.4.0           ggtext_0.1.2           
##  [82] lattice_0.22-6          tidyselect_1.2.1        fontLiberation_0.1.0   
##  [85] fontBitstreamVera_0.1.1 gridExtra_2.3           bookdown_0.35          
##  [88] svglite_2.1.1           crul_1.4.0              xfun_0.43              
##  [91] matrixStats_0.63.0      stringi_1.7.12          boot_1.3-30            
##  [94] ggfun_0.1.3             yaml_2.3.6              codetools_0.2-20       
##  [97] evaluate_0.22           httpcode_0.3.0          officer_0.6.6          
## [100] gdtools_0.3.3           ggplotify_0.1.2         cli_3.6.1              
## [103] rpart_4.1.23            xtable_1.8-4            systemfonts_1.0.4      
## [106] munsell_0.5.1           jquerylib_0.1.4         survMisc_0.5.6         
## [109] Rcpp_1.0.10             ellipsis_0.3.2          lme4_1.1-34            
## [112] glmnet_4.1-8            viridisLite_0.4.2       scales_1.3.0           
## [115] crayon_1.5.2            rlang_1.1.1             cowplot_1.1.1          
## [118] mice_3.16.0

R Code Markdown