R Workflow¶
Nowadays, there seems to be an R package for anything and everything. While this makes starting a project in R seem quick and easy, there are considerations to take into account that will make your life easier in the long run.
Working with OnDemand, not against¶
Having multiple R OnDemand sessions running that utilise the same library may cause issues. It is best to ensure that all idle sessions have been exited properly and the jobs have been deleted from the Active Jobs page, which can be found under the Jobs tab in OnDemand. If multiple sessions are needed, try to ensure that they don't interfere with each other.
Understanding your environment¶
When beginning a project, keep the scope in mind and build your environment accordingly. Trying to build a universal environment that holds all the packages you could ever need will lead to incompatibilities at some point in the future.
Getting comfortable clearing your environment (not just sweeping your session variables) is essential to having a functional library. This is done with the following command:
rm -rf ~/R/x86_64-pc-linux-gnu-library
Note that this removes all R libraries. Specific libraries can be removed by following the method documented in a previous blog post.
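For example, a single package can be removed from within R (this is a general approach, not necessarily the method described in that post):
# remove a single package (dplyr used here as an example) rather than wiping the whole
# library; by default this targets the first library in .libPaths(), i.e. your personal library
remove.packages("dplyr")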
Simplify the installation¶
Once you've cleared your environment, creating an installation script for a given environment will reduce your installation time, and ensure a clean slate.
Below demonstrates installing an R environment within an Apocrita job script, using 4 cores:
#!/bin/bash
#$ -cwd
#$ -pe smp 4
#$ -l h_rt=1:0:0
#$ -l h_vmem=1G
#$ -j y
mkdir -p /data/home/$USER/R/x86_64-pc-linux-gnu-library/4.2
module load R
Rscript 01_install_env.R
The mkdir command only needs to be run once, after your environment has been cleared. Also note that the above mkdir command creates an R 4.2 library, which should match the R version you are working with.
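If you are unsure which version the loaded R module provides, you can check from an R prompt using base R:
# the personal library directory should match the major.minor part of this version (e.g. 4.2)
getRversion()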
Below shows the contents of the 01_install_env.R file:
install.packages("dplyr")
install.packages("DESeq2")
install.packages("ggplot2")
install.packages("pheatmap")
Defaults for quick and consistent installation have already been specified for you, such as setting the mirror to cloud and using multiple cores.
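For reference, a minimal sketch of what the equivalent settings would look like in a personal ~/.Rprofile on a system without these defaults (the values here are illustrative, not the exact Apocrita configuration):
# set a fixed CRAN mirror and let install.packages() use several cores
options(
  repos = c(CRAN = "https://cloud.r-project.org"),  # the "cloud" mirror
  Ncpus = 4                                         # match the cores requested in the job
)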
Save to file frequently¶
As natural breakpoints in your project arise, take the opportunity to save the relevant datasets and/or plots to file.
Reading from file is much more efficient and reproducible than re-running a series of commands time and time again.
If you are concerned about space, you can write to and read from a gzipped file using write.csv() and read.table() respectively.
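For example, a short sketch using a toy data frame (the file path is hypothetical):
results_df <- data.frame(gene = c("geneA", "geneB"), log2FC = c(1.4, -0.7))  # toy data

# writing through a gzfile() connection produces a compressed CSV
write.csv(results_df, gzfile("processed/results.csv.gz"), row.names = FALSE)

# R decompresses .gz files transparently when reading back in
results_df <- read.table("processed/results.csv.gz", header = TRUE, sep = ",")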
This will result in more intermediate files being created, so keeping your raw and processed data directories organised is essential.
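One way to keep things tidy is to create the directory structure once at the start of the project (the directory names below are only a suggestion):
# create the project directories; showWarnings = FALSE makes this safe to re-run
dir.create("raw", showWarnings = FALSE)
dir.create("processed", showWarnings = FALSE)
dir.create("plots", showWarnings = FALSE)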
Modularise your scripts¶
Instead of creating a script that is a collection of processes, we recommend creating shorter scripts, each with a specific purpose. Starting each script name with a number makes it easier to keep track of the workflow. For example, a possible differential expression workflow:
01_install_env.R
02_preprocess_data.R
03_DE_analysis.R
04a_plot_volcano.R
04b_plot_heatmaps.R
From there, only load the packages that are relevant to that script. In this example, DESeq2 is only loaded in script #3, and ggplot2 (or a similar plotting package) is only loaded in scripts #4a and #4b.
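As a sketch of what this looks like in practice, the top of 03_DE_analysis.R might resemble the following (the object and file names are assumptions, not taken from a real project):
# 03_DE_analysis.R -- load only the package this step needs
library(DESeq2)

# read the object written to file by 02_preprocess_data.R (hypothetical file name)
dds <- readRDS("processed/dds.rds")

# run the differential expression analysis and extract the results table
dds <- DESeq(dds)
res <- results(dds)

# write to file so the plotting scripts (04a, 04b) can read from it
saveRDS(res, "processed/de_results.rds")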
Start with small data¶
Build your scripts in OnDemand or an interactive session with only a portion of the data you will ultimately be analysing. This way, you can test and tweak functions and visualisations.
Before moving on to the entire dataset, it is helpful to understand all the data types using Exploratory Data Analysis (EDA) techniques (table() etc.) to predict possible errors.
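A minimal EDA pass on a small subset might look like this (the file and column names are hypothetical):
# load a small slice of the data and inspect it before scaling up
metadata <- read.csv("raw/metadata_subset.csv")

str(metadata)               # the type of every column
summary(metadata)           # ranges and missing values at a glance
table(metadata$condition)   # expected groups and how many samples fall in each
anyNA(metadata)             # any NAs that could trip up downstream functions?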
Example full workflow¶
Once the scripts are finalised and the data are properly documented with data dictionaries or similar, running a qsub job script on the entirety of the data will be the most reproducible way to finalise the project. I would always recommend doing this before moving on to another script or project. To help with debugging, ensure that logs are being written to file.
Each script writes its output to file and the next reads from it. All that is needed to run the entire project is:
#!/bin/bash
#$ -cwd
#$ -pe smp 1
#$ -l h_rt=1:0:0
#$ -l h_vmem=1G
#$ -j y
module load R
Rscript 02_preprocess_data.R
Rscript 03_DE_analysis.R
Rscript 04a_plot_volcano.R
Rscript 04b_plot_heatmaps.R
Note that only one core is requested for the full analysis, as these scripts are not written to use more than one core.
Exit intentionally¶
While idle RStudio sessions will no longer time out, it is best to quit each session rather than letting it run out of time. To do this, save your work, remove session variables that are no longer needed, select "Quit Session...", and close the browser tab. This can be done even if your code is still running but you don't plan on interacting with it. When you are ready to work on it again and your job still has resources remaining, you can get back to your RStudio session from the My Interactive Sessions tab.
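Clearing out objects you no longer need before quitting can be as simple as the following (the object names here are hypothetical):
# keep only the objects you still need and release the memory they held
rm(list = setdiff(ls(), c("dds", "res")))
gc()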
Extra: Global Options¶
With the recommendations listed above, resuming work in R (interactively or OnDemand) should be painless. To avoid any unpredictable hiccups, it is best to keep your Global Options as shown below:
Note that any "Restore" option should be unchecked. This ensures RStudio does not attempt to load previous data or session configurations (which has been known to cause problems).
However, if you are more comfortable maintaining old sessions (and settings), be sure to clear any extraneous variables before ending each session. You will not be able to start a new session if the stored variables exceed the amount of resources requested.
Title image: Cris DiNoto on Unsplash