R Markdown and PDF File and Project Management
Introduction to R Markdown
R Markdown is a popular tool for generating reports, documents, and presentations using R code and markdown syntax. One of its key features is the ability to knit documents to a variety of formats, including PDF, HTML, and Word. While R Markdown has many benefits, there are also some potential drawbacks to consider:
Pros:
- Easy to use: R Markdown is simple to learn and use, even for beginners.
- Reproducible research: R Markdown allows you to easily reproduce your analyses and results, making it ideal for scientific research.
- Flexible formatting: Markdown syntax is flexible and allows you to format text, tables, and code blocks to create a professional-looking document.
- Automatic table of contents and numbered tables/figures: R Markdown can generate a table of contents for your document automatically, saving you time and effort. It can also number tables and figures automatically.
Cons:
- Limited layout control: While R Markdown allows for flexible formatting, it can be challenging to customize the layout and design of your document beyond the default options.
- When knitting to PDF: While R Markdown can knit to PDF, the resulting document may not always display tables and plots in the correct location, making it less suitable for some types of reports.
- Steep learning curve: While R Markdown is easy to use for basic tasks, more advanced features such as customizing templates or adding LaTeX code can require a significant amount of time and effort to learn.
In this post, we will introduce the basic settings and structure of R Markdown for knitting to a PDF file. Additionally, we will provide some helpful tips for overcoming potential issues, such as tables and plots not being displayed in the correct location. The tables and figures here are rough examples; to create publication-ready tables and plots, please refer to the KM plot post and [Link to table1 post].
These instructions are tailored to our everyday work setting. For broader guidelines, please refer to R Markdown: The Definitive Guide.
FYI, R for Clinical Study Reports and Submission provides R code to generate FDA submission-ready tables. In this post, we will focus on publication-ready tables/figures.
Project Management
First of all, create your own R project in your project folder by following these instructions: Starting Your R Project.
Sessions 10-12 give a good overview of how to do project management.
Design clean project/code architecture
Insert screenshot of a data analysis project
Normally, we will have 5-6 sub-folders for one project:
- Necessary folders:
  - Protocol: folder for protocols
  - Data: folder for data
  - R Code: folder for R code
  - Report or Results: folder to save all report versions
  - Resources: folder for user-written R functions
- Add other folders as needed.
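The folder layout above can be created in one go from R. The following is a minimal sketch, assuming the folder names listed in this post; the function name `create_project_folders` and the project name `MyProject` are hypothetical and only for illustration. It builds the skeleton in a temporary directory so that nothing on disk is touched until you point it at a real location.

```r
# Create the standard sub-folders for one project (hypothetical helper).
create_project_folders <- function(root,
                                   folders = c("Protocol", "Data", "R Code",
                                               "Report", "Resources")) {
  paths <- file.path(root, folders)
  for (p in paths) {
    # recursive = TRUE also creates the project root if it does not exist yet
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}

# Demonstration in a temporary directory; replace with your own project path.
demo_root <- file.path(tempdir(), "MyProject")
created <- create_project_folders(demo_root)
print(list.dirs(demo_root, full.names = FALSE, recursive = FALSE))
```

Extra folders can be added by passing a longer `folders` vector.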
The R code can become quite lengthy as the project progresses, so it is HIGHLY RECOMMENDED to follow these steps:
- Utilize multiple .Rmd files, such as one for the analysis report, one for data cleaning, and another for exploratory analysis.
- In the main .Rmd file (the one used to generate the analysis report), use the code
xfun::Rscript_call(knitr::purl, list(input = 'Data Cleaning.Rmd', output = paste0(resourceDir, 'Data Cleaning.R')))
to extract only the code chunks, excluding YAML settings and text, as well as any chunks marked with purl = FALSE (for example, the setup chunk). This automatically updates the corresponding .R file for your desired segment. Then, use the following code to read and execute the data cleaning code:
source(paste0(resourceDir, "Data Cleaning.R"))
GitHub Project
We can link our R project to GitHub to save everything within the project, but please DON’T upload any study-related code, data, or materials.
Create a Project
Step 1: Create the Project Folder
Our R Project will be stored in the R Code folder. You might want to have other folders such as Data, R Functions, Protocol, and Reports.
Step 2: Create an R Project and tell RStudio where the working folder is.
Choose Existing Directory.
Browse to the R Code folder.
Step 3: Create an R Markdown file
Any of these settings can be changed easily later, so just select “OK” to create the R Markdown file.
Finally…!
The next time you double-click the .Rproj file to reopen the project, RStudio will use the project folder as the working directory, and any files left open in the previous session will reappear in their tabs.
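Because the .Rproj file marks the project root, file paths can be built relative to it instead of being typed out in full. The sketch below assumes the layout used in this post, where the .Rproj file lives in the R Code folder and Data and Resources are sibling folders one level up. The `tryCatch` fallback to `getwd()` and the file name `my_data.xlsx` are assumptions added only so the example runs outside an RStudio project.

```r
# Locate the folder that contains the .Rproj file.
# Falls back to the current working directory if no project is found
# (or if rprojroot is not installed), purely so this sketch always runs.
project_root <- tryCatch(
  rprojroot::find_rstudio_root_file(),
  error = function(e) getwd()
)

# The project file sits in "R Code", so sibling folders are one level up.
data_dir     <- file.path(dirname(project_root), "Data")
resource_dir <- file.path(dirname(project_root), "Resources")

# file.path() inserts the separator, so no trailing slash is needed.
example_file <- file.path(data_dir, "my_data.xlsx")  # hypothetical file name
print(example_file)
```

This keeps the code portable: the same script works on another computer as long as the folder structure is the same.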
Starting Part
At the top of the R Markdown file are the YAML parameters of the document. You may change pdf_document to html_document or word_document if you want to knit to other formats. toc: yes means that a table of contents is displayed.
FYI… https://zsmith27.github.io/rmarkdown_crash-course/lesson-4-yaml-headers.html
---
title: "Your Title"
author: "Your Name"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  pdf_document:
    toc: yes
    toc_depth: 5
    number_sections: yes
    fig_width: 7 # OR 9.33
    fig_height: 6 # OR 8
    fig_crop: no
header-includes:
  - \usepackage{float}
---
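Besides editing the YAML header, the output format can also be chosen when knitting from the console with `rmarkdown::render()`. The wrapper below is a small sketch; the function name `knit_report` and the file name in the commented call are hypothetical. The function is only defined here, not run, because knitting needs pandoc (and a LaTeX installation for PDF).

```r
# Knit an R Markdown file to one of the three formats mentioned above.
# knit_report is a hypothetical convenience wrapper around rmarkdown::render().
knit_report <- function(input,
                        format = c("pdf_document", "html_document",
                                   "word_document"),
                        out_dir = dirname(input)) {
  format <- match.arg(format)  # defaults to the first choice, pdf_document
  rmarkdown::render(input = input,
                    output_format = format,
                    output_dir = out_dir)
}

# Example call (not run; the file name is an assumption):
# knit_report("Analysis Report.Rmd", format = "html_document")
```

When the format is given by name like this, any options written for that format in the YAML header are still applied.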
- The KM plot code here is for demonstration only; please refer to https://oncologyqs.github.io/post/biostatistics/2023-03-24-km-plots/ for the most up-to-date KM plot code.
Next, at the top of each R Markdown file, I prefer to load all my packages, source my own functions, and apply any other custom settings in the setup chunk. I then read all the data in a separate code chunk.
The setup code chunk runs automatically before any other chunk you run.
knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE, fig.pos = 'H')
library(tidyverse)
library(knitr)
library(readxl)
library(kableExtra)
library(table1)
library(expss)
library(flextable)
library(PropCIs)
library(survival)
library(survminer)
library(patchwork)
library(tibble)
library(forestplot)
library(survivalAnalysis)
library(stringr)
library(tidytidbits)
library(ggplot2)
library(ggpubr)
library(reshape2)
library(ggbreak)
library(finalfit)
library(cobalt)
library(consort)
dataDir = paste0(dirname(rprojroot::find_rstudio_root_file()),"/Data/")
resourceDir = paste0(dirname(rprojroot::find_rstudio_root_file()),"/Resources/")
# source(paste0(resourceDir,"OneGroup_survPlot.R"))
# source(paste0(resourceDir,"TwoGroups_survPlot.R"))
# Return the last n characters of a string
substrRight <- function(x, n){
  substr(x, nchar(x)-n+1, nchar(x))
}
# The origin date of Excel files is 1899-12-30. The origin date of R is 1970-01-01.
# PLEASE carefully check the origin date of your data!
as.date <- function(x) as.Date(x, origin = '1899-12-30')
custom_theme <- theme_cleantable() +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.x = element_blank(),
        axis.line.x = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        plot.title = element_text(size = 12)) # change the font size of "Number at risk"
data = veteran %>%
  mutate(
    trt = factor(trt, levels = c(1,2), labels = c("Control", "Treatment")),
    time = time/30.437 # One month = 30.437 days
  )
data_control = data %>%
  filter(trt == "Control")
data_trt = data %>%
  filter(trt == "Treatment")
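The two small helpers defined in the setup chunk are easy to check on made-up input. The sketch below repeats their definitions so that it runs on its own; the ID string and the serial number 43831 are illustrative values, not study data. It also shows why the origin warning matters: the same number read with R's own origin gives a completely different date.

```r
# Same helpers as in the setup chunk, repeated so this block is self-contained.
substrRight <- function(x, n) {
  substr(x, nchar(x) - n + 1, nchar(x))
}
as.date <- function(x) as.Date(x, origin = "1899-12-30")

# Last four characters of an ID (made-up value)
id_suffix <- substrRight("SITE-0123", 4)
print(id_suffix)     # "0123"

# An Excel serial date converted with the Excel origin
excel_date <- as.date(43831)
print(excel_date)    # "2020-01-01"

# The same number with R's default origin is off by about 70 years
wrong_date <- as.Date(43831, origin = "1970-01-01")
print(wrong_date)
```

If a date column comes back decades away from what you expect, the origin is the first thing to check.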
Note:
I will keep brief notes here for important information related to the report.
I will provide a separate Word document detailing where each variable can be found, how key variables were derived (according to the protocol), and the statistical methods used in each step. Below is an example:
Protocol version: xxx
Data version: xxx
Data cut-off date: xxx
The veteran dataset contains survival data from a two-treatment, randomized trial for lung cancer. The variables in veteran are:
- trt (Column A): 1=standard, 2=test
- celltype (Column B): 1=squamous, 2=small cell, 3=adeno, 4=large
- time (Column C): survival time in days
- status (Column D): censoring status 0=censored, 1=death
- karno (Column E): Karnofsky performance score (100=good)
- diagtime (Column F): months from diagnosis to randomization
- age (Column G): in years
- prior (Column H): prior therapy 0=no, 10=yes
Data Wrangling Rmd
Do your data wrangling in one or multiple code chunks. I recommend doing this in a separate Rmd file. Otherwise, you will end up with a VERY LONG Rmd file.
data_table1_demo = data %>%
  mutate(
    prior = case_when(prior == 0 ~ "No",
                      TRUE ~ "Yes"),
    prior = factor(prior, levels = c("Yes", "No")),
    trt = factor(trt, levels = c("Treatment", "Control"))
  )
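After a recode like the one above, it is worth cross-tabulating the original codes against the new labels. The sketch below is one way to do that, assuming the veteran data from the survival package; the object names `raw` and `recoded` and the column name `prior_label` are made up for this example. Note that the `TRUE ~ "Yes"` catch-all would also label missing values as "Yes", so the sketch checks for missing values explicitly.

```r
library(dplyr)

raw <- survival::veteran

recoded <- raw %>%
  mutate(
    prior_label = case_when(prior == 0 ~ "No",
                            TRUE ~ "Yes"),
    prior_label = factor(prior_label, levels = c("Yes", "No"))
  )

# Rows: original codes (0 / 10); columns: new labels.
# Every count should fall in exactly one cell per row.
print(table(raw$prior, recoded$prior_label, useNA = "ifany"))

# The catch-all TRUE branch would turn NA into "Yes", so confirm there are none.
stopifnot(!anyNA(raw$prior))
```

If the original variable can be missing, an explicit branch such as `is.na(prior) ~ NA_character_` placed before the catch-all keeps missing values missing.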
The following code extracts only the code chunks from the data cleaning file, excluding YAML settings and text, as well as any chunks marked with purl = FALSE (for example, the setup chunk). It automatically updates the corresponding .R file for your desired segment.
xfun::Rscript_call(knitr::purl, list(input = 'Data Cleaning.Rmd', output = paste0(resourceDir, 'Data Cleaning.R')))
Then, use the following code to read and execute the data cleaning code:
source(paste0(resourceDir,"Data Cleaning.R"))
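To see what the extract-then-source step does, the sketch below writes a tiny throwaway .Rmd file to a temporary folder, extracts its code with `knitr::purl()`, and sources the result. The file content and the variable `cleaned_value` are invented for the demonstration. It calls `knitr::purl()` directly because it runs from the console; wrapping the call in `xfun::Rscript_call()`, as this post does, runs it in a separate R session, which is presumably what makes it safe to use while the main report is being knitted.

```r
# Work in a temporary folder so no project files are touched.
work_dir <- tempfile("purl_demo_")
dir.create(work_dir)
rmd_path <- file.path(work_dir, "Data Cleaning.Rmd")
r_path   <- file.path(work_dir, "Data Cleaning.R")

# Build a minimal .Rmd: YAML header, one line of text, one code chunk.
fence <- strrep("`", 3)  # three backticks, built here to keep this block readable
writeLines(c(
  "---",
  "title: \"Data Cleaning\"",
  "---",
  "",
  "Narrative text that should not end up in the script.",
  "",
  paste0(fence, "{r clean}"),
  "cleaned_value <- 1 + 1",
  fence
), rmd_path)

# Extract only the code; documentation = 0 drops chunk headers and text.
knitr::purl(input = rmd_path, output = r_path,
            documentation = 0, quiet = TRUE)

# Run the extracted script; it defines cleaned_value in the workspace.
source(r_path)
print(readLines(r_path))
print(cleaned_value)
```

The extracted file contains only the assignment line, so sourcing it is equivalent to running the code chunks of the original file in order.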
Analysis Part
I will show some tables and figures.
Please remember to use the following line to force a new section to start on a new page. Note that this line is not R code; it belongs in the text part of the document, outside any code chunk.
\newpage
Baseline Characteristics Table
Knitting to PDF does not reliably keep the table where it appears in the document. To hold the table in position, convert it to a kable using the first two commented lines, or to a flextable using the last two commented lines.
table1(~ age + celltype + prior | trt,
       data = data_table1_demo,
       caption = "Table 1 Sample")
| | Treatment (N=68) | Control (N=69) | Overall (N=137) |
|---|---|---|---|
| age | | | |
| Mean (SD) | 59.1 (10.3) | 57.5 (10.8) | 58.3 (10.5) |
| Median [Min, Max] | 62.0 [35.0, 81.0] | 62.0 [34.0, 81.0] | 62.0 [34.0, 81.0] |
| celltype | | | |
| squamous | 20 (29.4%) | 15 (21.7%) | 35 (25.5%) |
| smallcell | 18 (26.5%) | 30 (43.5%) | 48 (35.0%) |
| adeno | 18 (26.5%) | 9 (13.0%) | 27 (19.7%) |
| large | 12 (17.6%) | 15 (21.7%) | 27 (19.7%) |
| prior | | | |
| Yes | 19 (27.9%) | 21 (30.4%) | 40 (29.2%) |
| No | 49 (72.1%) | 48 (69.6%) | 97 (70.8%) |
# t1kable() %>%
# kable_styling(latex_options = c("HOLD_position"))
# t1flex() %>%
# line_spacing(space = 0, part = "body")
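Written out in full, the kable route from the commented lines looks roughly like the sketch below. It assumes the table1, kableExtra, and survival packages are installed, uses the raw veteran data rather than the wrangled data set from this post, and nests the calls instead of piping so that it does not depend on which package provides the pipe. The object names `tbl` and `tbl_fixed` are made up for this example.

```r
library(table1)
library(kableExtra)

dat <- survival::veteran
dat$trt <- factor(dat$trt, levels = c(1, 2),
                  labels = c("Control", "Treatment"))

# Build the descriptive table as usual.
tbl <- table1(~ age + celltype | trt,
              data = dat,
              caption = "Table 1 Sample")

# Convert to a kable and ask LaTeX to hold the table where it is placed.
# HOLD_position only has an effect when the document is knitted to PDF.
tbl_fixed <- kable_styling(t1kable(tbl),
                           latex_options = "HOLD_position")
```

In the report itself, the last expression would be the final line of the code chunk so that the table is printed at that spot.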
KM Plot of Treatment Group
Unlike tables, plots usually do not cause any trouble. If a plot shows up before its subsection title, add a short description between the title and the code chunk, like this.
sample_onegroup = OneGroup_survPlot(data = data_trt, outcomeTime = "time",outcome = "status")
KM Plot of Two Groups
sample_twogroups = TwoGroups_survPlot(data = data, outcomeTime = "time", outcome = "status",
                                      xlab = "Months since the first treatment", ylab = "Probability of OS",
                                      VarName = "trt",
                                      ExpName = "Arm", ExpLabels = c("Ctrl", "Trt"),
                                      HRText = FALSE, MedianText = TRUE,
                                      labelX = 8, title = "Sample KM Plot of Two Groups",
                                      BreakBy = 6)
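OneGroup_survPlot and TwoGroups_survPlot are user-written functions whose code is not shown in this post. For readers who do not have them, the sketch below draws a comparable two-group KM plot with `survival::survfit()` and `survminer::ggsurvplot()`. It is a stand-in under that assumption, not the implementation behind the functions above, and it does not reproduce their median or HR annotations.

```r
library(survival)
library(survminer)

# Same preparation as in the setup chunk: label the arms, convert days to months.
dat <- survival::veteran
dat$trt  <- factor(dat$trt, levels = c(1, 2),
                   labels = c("Control", "Treatment"))
dat$time <- dat$time / 30.437  # One month = 30.437 days

# Kaplan-Meier estimate by treatment arm
fit <- survfit(Surv(time, status) ~ trt, data = dat)

# KM curves with a number-at-risk table and x-axis breaks every 6 months
km_plot <- ggsurvplot(fit, data = dat,
                      risk.table = TRUE,
                      break.time.by = 6,
                      xlab = "Months since the first treatment",
                      ylab = "Probability of OS",
                      legend.title = "Arm",
                      legend.labs = c("Ctrl", "Trt"),
                      title = "Sample KM Plot of Two Groups")
print(km_plot)
```

The arguments mirror the ones passed to TwoGroups_survPlot above (axis labels, arm labels, title, and break interval).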
Cox Prop Model Table
CoxTable(data=data, VarName = "trt", ExpLabels = c("Control", "Treatment"))
Group | HR (95% CI) | P-value |
---|---|---|
Control | REF | |
Treatment | 1.02 [0.71, 1.45] | 0.928 |
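CoxTable is also a user-written function that is not shown here. A table with the same layout can be assembled from `survival::coxph()`, as in the sketch below. This is an assumption about how such a table might be built, not the code behind CoxTable, so the figures it prints may not match the table above exactly (for example, if a different test is used for the p-value).

```r
library(survival)

dat <- survival::veteran
dat$trt <- factor(dat$trt, levels = c(1, 2),
                  labels = c("Control", "Treatment"))

# Cox proportional hazards model with Control as the reference group
fit <- coxph(Surv(time, status) ~ trt, data = dat)
s <- summary(fit)

# Hazard ratio with 95% CI, and the Wald test p-value
hr <- s$conf.int[1, c("exp(coef)", "lower .95", "upper .95")]
p  <- s$coefficients[1, "Pr(>|z|)"]

cox_tbl <- data.frame(
  Group = c("Control", "Treatment"),
  `HR (95% CI)` = c("REF",
                    sprintf("%.2f [%.2f, %.2f]", hr[1], hr[2], hr[3])),
  `P-value` = c("", sprintf("%.3f", p)),
  check.names = FALSE  # keep the column names exactly as written
)
print(cox_tbl)
```

The resulting data frame can be passed to `knitr::kable()` to print it in the report.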