Commit ea6410f1 authored by Dominik Brilhaus

add and update reader

parent 689b3a7e
# RNAseq_workshop
Protocol and other material for RNAseq workshop (Illumina + some long read).
## why latex
we now have both a) syntax highlighting and b) the ability to copy and
paste from code blocks and get valid commands.
We'd give participants a printed version only until they are through the Linux section;
after that they can have the pdf as well.
## to make the pdf (needs texlive, and probably texlive-extra-utils)
pdflatex RNAseqWorkshop.tex
### support files
RNAseqWorkshop.tex imports everything in sections. So for instance,
all of the content of the biological data extraction section is
organized in "sections/03_biological_data_extraction.tex".
As this section has a lot of code, it currently embeds scripts from "R/".
But that can definitely be changed if desired.
## support scripts for the course
The course uses a few of our scripts; the .py scripts under "python/"
should be placed in the $PATH.
The exception to this is join_r_scripts.py in the main directory,
which just creates "musings.R". The way the script is written, this file
should be provided in the RNAdata directory (although since we can now
copy and paste from the pdf when latex/listings is used,
we have the theoretical option to drop this).
\documentclass[]{article}
\usepackage{amssymb,amsmath}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}%required
\usepackage{textcomp}
\usepackage{listings} % syntax highlighting
% \usepackage{markdown}
\usepackage{color}
\usepackage{pxfonts}
\usepackage{graphicx}
\usepackage{wrapfig} % wrapping text around figures
\graphicspath{{figures/}}
\usepackage{booktabs} % tables
\usepackage{hyperref} % for links
% title page formatting
%\usepackage{fancyhdr}
\usepackage[bottom]{footmisc} % foot note
%\usepackage{color} %Colorbox?
% xcolor-hypersetup for making links not ugly
\usepackage{xcolor}
\hypersetup{
colorlinks,
linkcolor={red!50!black},
citecolor={blue!50!black},
urlcolor={blue!80!black}
}
\usepackage[parfill]{parskip} % spaces, not indentation
\usepackage{geometry} % normal sized margins
\geometry{
a4paper,
total={160mm,237mm},
left=25mm,
top=35mm,
}
% most from: http://latexcolor.com/
\definecolor{mygray}{gray}{0.4}
\definecolor{lightgray}{gray}{0.9}
\definecolor{aliceblue}{rgb}{0.94, 0.97, 1.0}
\definecolor{bananamania}{rgb}{0.98, 0.91, 0.71}
\definecolor{cerulean}{rgb}{0.0, 0.48, 0.65}
\definecolor{seashell}{rgb}{1.0, 0.96, 0.93}
\lstdefinestyle{basestyle}{
showstringspaces=false,
columns=fullflexible,
literate={*}{\char42}1,% {-}{\char45}1,
upquote=true,
%literate={`}{\char18}1 {'}{\char13}1, % \char18 = ` and \char13 = '
frame=
}
% setting up style sets
\lstdefinestyle{bashstyle}{language=bash,
style=basestyle,
commentstyle=\color{mygray},
basicstyle=\ttfamily,
backgroundcolor=\color{lightgray},
keywordstyle=\color{cerulean}\bfseries,
stringstyle=\color{magenta}
}
\lstdefinestyle{Rstyle}{language=R,
style=basestyle,
commentstyle=\color{mygray},
basicstyle=\ttfamily,
backgroundcolor=\color{aliceblue},
keywordstyle=\bfseries,
stringstyle=\color{magenta},
}
\newcommand{\il}[1]{\colorbox{seashell}{\lstinline[columns=fixed]{#1}}}
% title page prep
\newcommand{\HRule}[1]{\rule{\linewidth}{#1}}
\setcounter{tocdepth}{2} % table of content depth to just subsection
%-------------------------------------------------------------------------------
% HEADER & FOOTER
%-------------------------------------------------------------------------------
%\pagestyle{fancy}
%\fancyhf{}
%\setlength\headheight{15pt}
%\fancyhead[L]{RNAseq Workshop}
%\fancyfoot[R]{Page \thepage\ of \pageref{LastPage}}
%\title{RNAseq Workshop 2022}
\title{ \normalsize \textsc{Heinrich Heine University, Düsseldorf}
\\ [1.0cm]
\textsc{Cluster of Excellence on Plant Sciences}
\\ [4.0cm]
\HRule{0.5pt} \\
\LARGE \textbf{RNAseq Workshop 2022}
\HRule{2pt} \\ [0.5cm]
\normalsize \today \vspace*{5\baselineskip}
\\ [1.0cm]
}
\author{Alisandra Denton \\ Dominik Brilhaus}
\date{}
\begin{document}
\maketitle
\newpage
\tableofcontents
\section{License}
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License. To view a
copy of this license, visit
\url{https://creativecommons.org/licenses/by-nc-sa/4.0/}.
In non-legal terminology: you are welcome to use, distribute,
and modify this material for non-commercial purposes.
\section{Acknowledgements}
The authors would like to thank Juliane Schmid and the CEPLAS team
for organizing.
This course builds directly or indirectly on a series of workshops that
we've been involved with for over a decade, particularly the 2018 iteration
under the same name. Even though we do not have a list of contributors
to the various workshops, we'd like to thank all of you all the same,
and acknowledge that this wasn't put together in a vacuum.
\section{Preface}
Here we want to jump-start your big data and RNAseq analyses. We dive
head first into doing this on the command line in Linux (via Docker).
While there may be more ``user-friendly" options out there to complete
the same tasks (google ``galaxy rnaseq" if you're interested!), and there
are certainly more precise and detailed deeper levels (e.g. algorithms
and source code), we choose the command line for several reasons:
\begin{itemize}
\item it is customizable and flexible
\item we work at a good level to gain a conceptual understanding
\item you might as well try the command line in a setting where you can
ask for help, you can always come back to galaxy \& co. later.
\end{itemize}
At this point, we will not strive for the most elegant, nor the most
efficient code; but rather for simplicity and understandability for
beginners. We may write ugly, redundant, copy-paste code; but it will
get the job done and allow us to analyze our own data following a more
or less standardized pipeline.
There are many good resources out there; insofar as you find them
useful, we recommend some of them, such as \url{https://linuxsurvival.com/},
to help cement what you have learned in the course.
We also provide a set of handouts that may be useful to you during
or after the workshop, in the ARC under \il{_handouts/}.
The Docker setup should also allow you to take the material home and
practice anywhere you can get Docker installed: \url{https://www.docker.com/}
\include{sections/00_section_arc}
\include{sections/01_section_linux}
\include{sections/02_section_illumina}
\include{sections/03_biological_data_extraction}
\include{sections/04_section_longreads}
\section{Questions and answers}
\paragraph{Congratulations, you've made it through the workshop material!}
\paragraph{Do you have questions on what we've covered?}
\paragraph{Do you have questions on transferring the information here to your own data?}
\end{document}
head(df0)
# now we will learn how ggplot makes figures
# the minimal information is: where the data is, what the x variable is,
# what the y variable is, and what we want to plot
# let's try
# this will make and save a plot object
baseplot <- ggplot(data = df0, aes(x = hp, y = gas_mileage)) +
geom_point()
# the first line indicates the data, what x is and what y is
# the second line says we want points
# this will display the plot object
baseplot
# now we can customize that plot and make it prettier
# first off, different style, we do that by using theme
baseplot + theme_minimal() # display plot object + modification
# or alternatively, if you prefer
baseplot + theme_bw()
# now we'll save theme_bw with the object for later use
baseplot <- baseplot + theme_bw()
# now we want colored dots, simply red ones
# google "r colors" for an idea of the named colors available in R
# you can choose them by name, number or hexcode; we like names
baseplot + geom_point(color = "firebrick")
# and a different size and shape
baseplot + geom_point(color = "firebrick", shape = 18, size = 4)
# or choose colors based on something else, like the number of cylinders
# note that while many things can just be 'added' to a ggplot,
# a second ggplot object is not one of them, so we'll start over,
# as it's normally best to specify dynamic color as an
# argument to the aesthetics (called aes)
ggplot(data = df0, aes(x = hp, y = gas_mileage, color = cyl)) +
geom_point(shape = 18, size = 4) +
theme_bw()
# you can also make your color changes abrupt, not continuous,
# by defining cyl as a factor rather than a number
# first check what they are
class(df0$cyl)
# now make them a factor
baseplot <- ggplot(
data = df0,
aes(x = hp, y = gas_mileage, color = as.factor(cyl))
) +
geom_point(shape = 18, size = 4) +
theme_bw()
baseplot
# not the best colors, but you can modify them using scale_color_manual
# to do that you need to know which levels you have as your factors
levels(as.factor(df0$cyl))
# and now you can look at colors and decide what is what
baseplot <- baseplot +
scale_color_manual(
values = c(
"4" = "black",
"6" = "orange",
"8" = "red3"
),
name = "cylinders"
)
baseplot
# finally, a little more polishing
baseplot +
xlab("gross horsepower") +
ylab("miles per gallon") +
theme(text = element_text(size = 16))
"" "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
"Mazda RX4" 21 6 160 110 3.9 2.62 16.46 0 1 4 4
"Mazda RX4 Wag" 21 6 160 110 3.9 2.875 17.02 0 1 4 4
"Datsun 710" 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
"Hornet 4 Drive" 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
"Hornet Sportabout" 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
"Valiant" 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
"Duster 360" 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
"Merc 240D" 24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
"Merc 230" 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
"Merc 280" 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
"Merc 280C" 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
"Merc 450SE" 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
"Merc 450SL" 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
"Merc 450SLC" 15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
"Cadillac Fleetwood" 10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
"Lincoln Continental" 10.4 8 460 215 3 5.424 17.82 0 0 3 4
"Chrysler Imperial" 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
"Fiat 128" 32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1
"Honda Civic" 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
"Toyota Corolla" 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
"Toyota Corona" 21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
"Dodge Challenger" 15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
"AMC Javelin" 15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
"Camaro Z28" 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
"Pontiac Firebird" 19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
"Fiat X1-9" 27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
"Porsche 914-2" 26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
"Lotus Europa" 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
"Ford Pantera L" 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
"Ferrari Dino" 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
"Maserati Bora" 15 8 301 335 3.54 3.57 14.6 0 1 5 8
"Volvo 142E" 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
\section{Course material}
This workshop is organized based on an ARC (Annotated Research Context).
\subsection{Annotated Research Context}
The ARC stores all required input data, backup data, scripts, presentations as well as this reader in one research data package.
\subsubsection{The basic ARC structure summarized}
\begin{itemize}
\item \il{studies} \verb| -> | external data and sample metadata
\item \il{assays} \verb| -> | measurement data (i.e. "raw" sequence data) and metadata
\item \il{workflows} \verb| -> | computational analyses (i.e. "scripts")
\item \il{runs} \verb| -> | outputs of workflow (i.e. "results")
\end{itemize}
\subsubsection{Material added to the ARC for this course}
\begin{itemize}
\item \il{_reader} \verb| -> | The basis to this reader (written in \LaTeX)
\item \il{_slides} \verb| -> | Slides presented during the workshop
\item \il{_handouts} \verb| -> | Cheat sheets and additional materials
\item \il{runs/_backups} \verb| -> | Backups for runs, in case a workflow unexpectedly did not work or takes too long.
\begin{itemize}
\item If a workflow would output \il{blat_results}, but failed, copy \il{runs/_backup/blat_results} to \il{runs/blat_results}.
\end{itemize}
\end{itemize}
\subsubsection{Disclaimer}
For this workshop, we do not take full advantage of all ARC features. This is in part because some developments are still work in progress.
Moreover, we could easily spend two more days exploring those features, which are simply out of scope for a three-day workshop focusing on RNA-Seq data analysis. Besides, the ARC will in future circumvent some of the annoyances that we run into for training purposes during this course.
If you want to learn more, check out the \href{https://nfdi4plants.org}{DataPLANT website}.
\subsection{During the workshop}
For the in-person workshop, we have already stored the ARC in the ``Raumlaufwerke"
folder under our room number (25.41.00.41). Please copy the folder ``rnaseq-workshop"
into your own home folder, so that everyone has their own copy.
We have written all code and commands so that they can be "executed from" the root of the ARC.
As a check now, please open a terminal (applications \verb| -> | terminal) and run:
\lstset{language=bash, style=bashstyle}
\begin{lstlisting}
cd $HOME/rnaseq-workshop
\end{lstlisting}
This is also a very good thing to try throughout the workshop: if you're getting
any sort of 'file not found' error, make sure you're in the right place.
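As a quick sanity check (the folder names below follow the ARC layout described above):
\begin{lstlisting}
# the working directory printed here should end in rnaseq-workshop
pwd
# and the ARC folders (studies, assays, workflows, runs) should be listed
ls
\end{lstlisting}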
%% TODO: Software dependencies.
\subsection{Docker}
We rely heavily on Docker for this workshop, which allows us to provide an 'image'
with all the software you will need for the workshop installed and ready to use.
There are two major reasons we chose Docker: 1) convenience for development, as it
works the same on our normal work machines and on the ZIM teaching machines we
use for the workshop, and, more importantly, 2) portability: you can all take the
Docker image home, work once through the standardized instructions for installing
Docker: \url{https://docs.docker.com/} (or ask your admin to), and then you can
readily reproduce the work from the course.
As a brief disclaimer however, none of this is intended to demonstrate
good practice with Docker, particularly not for other purposes.
The Docker image ``rnaseq\_docker.tar" is available on Sciebo at: %% TODO!
\url{https://uni-duesseldorf.sciebo.de/s/53pA9W9TbOKTgGQ}.
The password is written on the board, or you have received it by e-mail.
The image can also be found in the ``Raumlaufwerke" folder under our room (25.41.00.41).
\subsubsection{load image}
To load the image, run:
\begin{lstlisting}
docker image load -i </path/to/>rnaseq_docker.tar
\end{lstlisting}
Where \il{</path/to/>} is replaced by e.g. ``Downloads"
or the path to the ``Raumlaufwerke" folder as appropriate, e.g.
\begin{lstlisting}
docker image load -i Downloads/rnaseq_docker.tar
\end{lstlisting}
\subsubsection{run image as container}
To start a writable `container' from the image where you can work
interactively, \emph{from your home directory} (\il{cd $HOME})
run:
\begin{lstlisting}
docker run -it --name rnalive -p 8889:8889 --mount \
type=bind,source="$(pwd)"/rnaseq-workshop,target=/home/ZIM-gast/rnaseq-workshop \
rnaseq:latest
\end{lstlisting}
\subsubsection{restart container}
\begin{lstlisting}
docker start -i rnalive
\end{lstlisting}
\subsubsection{exiting}
You can exit a container with Ctrl+D or by typing \il{exit}.
You won't need to do this much in the workshop, however.
\subsubsection{Then what?}
You should now have a terminal that looks a lot like the one before,
except now you know that you're in the 'container', where all the necessary
software is installed.
\begin{lstlisting}
# the workshop directory is shared
cd $HOME/rnaseq-workshop
# you should see a list of all the files, which you can check,
# e.g. by opening the folder in a file manager
ls
\end{lstlisting}
\subsubsection{What to do when}
At home:
\begin{itemize}
\item{load: once per machine, or after an update to the image}
\item{run: once after each load, on first use}
\item{start: whenever you want to use it}
\end{itemize}
For the workshop, however, the computers are wiped clean each night,
so you will have to run:
\begin{itemize}
\item{load: each morning}
\item{run: after running load each morning}
\item{start: as needed, should you exit the container}
\end{itemize}
\textbf{Important:} keep any data you wish to save within the folder
rnaseq-workshop, and make sure the contents of this folder are copied
to either the ``Raumlaufwerke" folder or to e.g. your USB flash drive or
your sciebo account before you log out of the computer!
\fbox{\begin{minipage}{45em}
If you have any trouble whatsoever with Docker, please ask!!
Properly understanding Docker is beyond the scope of this workshop,
so don't worry if the above ``doesn't make sense",
but you will absolutely need it to be running as expected for
the other parts to work, so just ask :-)
\end{minipage}}
\subsection{After the workshop}
At least for the next half year, the ARC to this workshop is shared publicly under a CC-BY 4.0 license at \url{https://git.nfdi4plants.org/brilator/rnaseq-workshop.git}.
Feel free to download and unarchive the whole ARC as a zip / tar archive, or \il{git clone} (or fork) the ARC. Just like Docker, explaining the full usage of `git` is beyond the scope of this course.
To ensure that all code works as used in this reader, make sure to store the ARC as \il{rnaseq-workshop} in your \$HOME directory.
Similarly, the built Docker image will remain available on Sciebo. The built image is not included
in the ARC, as we did not have the time to check the licenses of the included software regarding distribution.
\section{Biological data extraction}
\subsection{Rstudio via second Docker file}
Please see the subsection ``Rstudio via second Docker file" from ``Basics"
on the first day to restart the working environment.
\subsection{The Plan}
Now, we use R in earnest. Rstudio will let us see what is happening immediately when we do it, so it is a
good environment for beginners.
\fbox{\begin{minipage}{45em}
Note: To allow for the level of customization that is often necessary for each project, this
section is a little lower-level and relies less on pre-made tools than before. Since
there is a lot of R code, and retyping it would be rough on the schedule,
we'll be using a different paradigm than we did with the bash scripts. You don't have to retype anything;
rather, you're encouraged to copy from the .pdf. It is then your responsibility to go through it
slowly and make sure you understand what you are doing. For every section, to make sure you have
understood it, we recommend that you vary something: look at a different column, change the
plot color, etc., and then save your changes into a script.
If for some reason we haven't given you the .pdf yet, please notify us.
\end{minipage}}
We begin by getting our data from the kallisto results into a data.frame in
R. Open a GUI which allows you to see the data structure – all results are in folders with different names,
but the files themselves all have the same name.
\subsection{Import}
\lstset{language=R, style=Rstyle}
\lstinputlisting{../workflows/kallisto_data_import.R}
The results from both methods are identical. The first, step-by-step method is more typing, but safe for
beginners since you can change the names by hand. The loop is more challenging to produce but certainly
faster to write. We will use the laborious way for the rest of the training course.
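As a rough sketch of the loop-based idea (the paths here are illustrative assumptions, not the exact ones from the script above; kallisto writes one folder per sample, each containing an \il{abundance.tsv}):
\begin{lstlisting}
# hypothetical sketch: import kallisto results in a loop
result_dirs <- list.dirs("runs/kallisto", recursive = FALSE)
counts <- NULL
for (d in result_dirs) {
  ab <- read.table(file.path(d, "abundance.tsv"),
                   header = TRUE, sep = "\t")
  counts <- cbind(counts, ab$est_counts)
}
colnames(counts) <- basename(result_dirs)
rownames(counts) <- ab$target_id
counts <- as.data.frame(counts)
\end{lstlisting}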
\subsection{Basic data.frame / matrix calculations}
Before we do high level analyses, let's calculate means for the replicates and fold-changes.
\lstinputlisting{../workflows/kallisto_data_frames.R}
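The heart of such calculations is plain column arithmetic; a minimal sketch (the column names are assumptions, not those of the script above):
\begin{lstlisting}
# hypothetical sketch: replicate means and log2 fold-changes,
# assuming columns ctrl_1..ctrl_3 and treat_1..treat_3
ctrl_mean  <- rowMeans(counts[, c("ctrl_1", "ctrl_2", "ctrl_3")])
treat_mean <- rowMeans(counts[, c("treat_1", "treat_2", "treat_3")])
# add a pseudocount of 1 to avoid taking the log of zero
log2fc <- log2((treat_mean + 1) / (ctrl_mean + 1))
\end{lstlisting}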
\subsection{PCA and HCL}
Before we start, let's think about what to expect. We have an experiment with a single factor, the
treatment. So when we look at how the samples group together, how many groups with information do we
expect? Yes, right, only one. So for our principal component analysis we expect that we have one
dimension reflecting the treatment variation. As we have no other variables, the other dimensions of the
PCA should show "noise" or biological variation due to random variables outside of our control. Ideally,
we want to see the treatment variation in the first dimension (meaning most of the variation is due to
treatment) and the noise based variation should occupy the lower dimensions.
Now we start with our first analysis. We ask whether the replicates are more similar to each other than to the
other samples. To that end, we use principal component analysis and hierarchical clustering of the samples.
\lstinputlisting{../workflows/clustering_PCA_HCL.R}
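In base R, the core calls look roughly like this (a sketch with assumed object names, not the exact script above):
\begin{lstlisting}
# hypothetical sketch: PCA and hierarchical clustering of samples
pca <- prcomp(t(counts))        # transpose: samples become rows
summary(pca)                    # variance explained per component
plot(pca$x[, 1], pca$x[, 2])    # samples in the PC1/PC2 plane
hc <- hclust(dist(t(counts)))   # Euclidean distances between samples
plot(hc)                        # dendrogram of sample similarity
\end{lstlisting}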
\subsection{Differential Expression}
One of the most typical questions in an RNA-seq analysis is: what is different between two conditions? To
answer that question we use sleuth. There is some debate about which tool is best for detecting differential
expression; other possibilities include DESeq2, edgeR, and Cufflinks.
Most meta-analysis papers agree on Cufflinks
being less suitable. DESeq2 and edgeR are both well established and work well with counts
per gene, assuming a negative binomial distribution, but appear to be less appropriate for the
'estimated counts' produced by kallisto. sleuth is developed by the same people as kallisto and is
supposedly most suitable for use with its results.
Note: In order to use sleuth, kallisto has to be run with the bootstrapping option.
The following has been adapted from this on-line tutorial:
\url{https://pachterlab.github.io/sleuth_walkthroughs/trapnell/analysis.html}
\lstinputlisting{../workflows/sleuth_differential_expression.R}
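For orientation, a sleuth analysis generally boils down to a few calls (a sketch following the walkthrough linked above; the sample-to-covariates table \il{s2c} is an assumption):
\begin{lstlisting}
# hypothetical sketch of the sleuth workflow
library(sleuth)
so <- sleuth_prep(s2c)                    # reads kallisto bootstraps
so <- sleuth_fit(so, ~condition, "full")  # full model
so <- sleuth_fit(so, ~1, "reduced")       # null model
so <- sleuth_lrt(so, "reduced", "full")   # likelihood ratio test
res <- sleuth_results(so, "reduced:full", test_type = "lrt")
\end{lstlisting}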
\subsection{GO term enrichment}
We have about 1k significantly changed genes, too many to go through by hand quickly. We now check
if some functional categories are overrepresented. The idea behind this is as follows:
we have 1k genes changed of 28k genes in the genome, or 3.5\%. When we now look through the
categories, they should all have 3.5\% of their members changed if the changes were random. The
statistics check whether the difference from 3.5\% we see is significant. To do so, we use the package
topGO.
This analysis looks at the significantly different genes.
The biggest impediment to the use of topGO will be getting the gene-to-GO-term assignment in the way
the package wants it. To this end, look at the file Athid2go.map in the folder. This is one format the package
accepts. If you have a different species and a different way of GO term annotation, you need to reformat
the file to match this one (this can be done in R, but will be different for each case).
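The \il{.map} format expected by topGO's \il{readMappings()} is one gene per line, followed by a tab and a comma-separated list of GO terms; a reformatting sketch (file and column names are assumptions):
\begin{lstlisting}
# hypothetical sketch: collapse a two-column gene/GO table
# (one row per gene-GO pair) into the topGO .map format
anno <- read.table("my_go_annotation.tsv", header = TRUE, sep = "\t")
collapsed <- tapply(anno$go_id, anno$gene_id,
                    function(x) paste(x, collapse = ", "))
writeLines(paste(names(collapsed), collapsed, sep = "\t"),
           "my_species2go.map")
\end{lstlisting}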
\lstinputlisting{../workflows/functional_GO.R}
GO term lists are notoriously difficult to show in nice figures and also difficult to interpret.
Generally, you want many similar terms to come up before being convinced something is real. Revigo (an on-line tool) can
help you summarize the GO terms, but essentially you need your biologist's knowledge to understand
them.
We chose a rather simple example. Among the terms enriched in the upregulated genes you can
always observe the words defense and immune. You have to know that systemic acquired resistance also
refers to the immune system, as does response to biotic stimulus. Clearly, the treatment induces the
defense response.
Among the enriched terms in down-regulated genes appears photosynthesis, which occurs in the
plastid, and involves glyceraldehyde-3-phosphate, pigments and NADP. So clearly, although the names
are different, what they describe is similar. There is no way but simply knowing this. There are thousands
of GO terms described. The figures which were also produced are sometimes helpful. The GO terms
connected by lines are the ones defined as in parent:daughter relationships.
\subsection{MapMan}
MapMan is an alternative way of looking at the overall patterns. Usually, you load all genes with
their fold-changes, not just the significant ones. In a perfect world, MapMan and GO should give you
similar results. One caveat is that the annotation behind GO and MapMan was made by different people
and therefore different areas of plant biology are covered differently.
Before starting with MapMan, we will make use of the size of the class to install MapMan in
parallel on all course computers. MapMan's provided install script can be found both on Sciebo
and in the Raumlaufwerke folder. Run
the following with Java, and follow the instructions of the installation wizard.
\lstset{language=bash, style=bashstyle}
\begin{lstlisting}
java -jar </path/to/>/MapManInst-3_6_0RC1.jar
\end{lstlisting}
We only need to export our fold-change table from R. Again, we have to make sure that our gene
identifiers match the ones used in the MapMan tool.
Open MapMan, click on mappings and look at the Ath\_AGI mapping. Click through until you see AGI
codes and observe how they are formatted: again, no transcript suffix, only the locus.
Back to R.
\lstset{language=R, style=Rstyle}
\lstinputlisting{../workflows/functional_mapman.R}
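Stripping the transcript suffix from such IDs is a one-liner worth remembering (a sketch with made-up example IDs):
\begin{lstlisting}
# hypothetical sketch: AT1G01010.1 -> AT1G01010
ids <- c("AT1G01010.1", "AT3G12580.2")
sub("\\.[0-9]+$", "", ids)  # drop the trailing .<number>
\end{lstlisting}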
Now back to MapMan. Right-click on Experiments and choose add data. Select your exported table
named forMapmanloading.txt. Now you can choose whether there is a header (yes!), whether you have a decimal
point or decimal comma, and so on and so forth. The default works for us. Click okay.
Now click on Pathways, then on Overview, and choose Metabolism\_overview by double click.
You will have to choose a mapping; we use Ath\_AGI. Click okay.
If nothing is displayed, click on your file shown in Experiments. I always change the color code to
+red and -blue and the scale to whatever fits my dataset (for this one I'd use 3).
Clearly, photosynthesis, the CBB cycle and photorespiration are down. If you would like to test
that statistically, MapMan does that for you. Look at the panel below the figure and pull it up. You will see
the Wilcoxon rank-sum test scores. Correct the p-values for multiple hypothesis testing and sort the table
by probability. Some of the significantly changed pathways we can see; most are not displayed. Look at
the other options for visualization to see if any of them fits your needs.
You can make your own pathway figures and map the MapMan bins onto them. Refer to the
MapMan manual for that!
\subsection{Adding information from public data sources}
Combining information from multiple sources is often helpful, or even necessary to fully
understand one's results. This becomes harder to write a standardized protocol for, however, as one
researcher might simply be interested in including the TAIR annotation with their "mother table" for ease
of looking through the gene descriptions, and another might be interested in knowing whether there's a
significant overlap between their study and a particular paper. That said, there is frequently a similar line
of attack. First, get the desired comparative data in an organized format such as a list of gene IDs or a
table (csv/tsv/xls) with gene IDs and associated values. Second, make sure the gene IDs are comparable
(are they from the same genome/annotation release, do you have transcript IDs when you wanted gene
IDs, are they both upper/lower case). Third, combine the data with yours (e.g. merge). Finally, in many cases
you will visually and statistically evaluate whether there is an overlap or correlation between the studies.
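Step three usually amounts to a single \il{merge()} call; a minimal sketch (file and column names are assumptions):
\begin{lstlisting}
# hypothetical sketch: merge external annotation onto our table
external <- read.table("other_study.tsv", header = TRUE, sep = "\t")
combined <- merge(mother_table, external,
                  by = "gene_id",  # shared, case-matched gene IDs
                  all.x = TRUE)    # keep all of our genes regardless
\end{lstlisting}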
\subsubsection{Including gene descriptions from TAIR}
First things first: we need to get the descriptions, and not by copying them one by one from the
website. Most biological databases have a bulk download page if you look, and TAIR is on the easy side.
From \url{https://arabidopsis.org} you simply need to go to Download:Genes, select the annotation (TAIR10), and you
already have a list of what is available for download. We'll take "TAIR10\_functional\_descriptions", which
is a tab-separated text file. Save, copy or move this file to studies/AthalianaReferences/resources for ease of access. The
ID format matches, so adding the descriptions to our major data.frame in R is very simple.
\lstinputlisting{../workflows/additional_TAIR.R}
\subsubsection{Including MapMan annotations}
% TODO, honestly we should move this to mapman4
Frequently you want to have more flexibility working with a dataset than is possible with the provided
GUI programs. We'll now import the MapMan annotations into R, and look briefly at what else one could
do with them. For a handful of plant species, MapMan annotations are provided in the 'store' at
\url{https://mapman.gabipd.org/web/guest/mapmanstore}.
More commonly, they can be created for a species
using the Mercator webtool:
\url{http://www.plabipd.de/portal/web/guest/mercator-sequence-annotation}.
Assigning
MapMan annotations with Mercator normally runs in less than 15 minutes and avoids any annotation version
trouble. For today, the results are provided in the file mapmanTAIR10.txt.
\lstinputlisting{../workflows/additional_MapMan.R}
You can, of course, automate a Fisher's exact test or Wilcoxon test for every category in R, visualize many
samples at once by different bin levels, or whatever suits your purposes.
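As an illustration of that automation, a Fisher's exact test per bin might look like this (a sketch; the objects \il{mapman_bins} and \il{mother_table} are assumptions):
\begin{lstlisting}
# hypothetical sketch: Fisher's exact test for each MapMan bin
for (b in unique(mapman_bins$bin)) {
  in_bin <- mother_table$gene_id %in%
    mapman_bins$gene_id[mapman_bins$bin == b]
  # 2x2 table: in bin vs. not, significant vs. not
  tab <- table(in_bin, mother_table$significant)
  if (all(dim(tab) == c(2, 2))) {
    cat(b, fisher.test(tab)$p.value, "\n")
  }
}
\end{lstlisting}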
\subsubsection{Special list - from a review paper}
Frequently, whole-genome annotations have not been updated with the very latest information, or
you are interested in a more obscure list, so you want to compare data with another paper directly. This
often makes the first step of getting the data into an organized table harder. Let's look at the review paper
\url{https://doi.org/10.1007/s11427-016-0048-4}
``Diverse roles of SERK family genes in
plant growth, development and defense response". In table 1, it has a list of SERK interacting genes,
which we might be interested in, but they are only available in pdf format. There are many ways to
approach this. You can for instance copy the whole table, and clean up any formatting issues by hand.
Often you can get the formatting to work by pasting it into either a spreadsheet program or a text file.
However, as we've worked with \emph{A. thaliana} a lot before, we already had a handy script to find all of the
AGIs out of a text file and will use this.
First, copy all of table 1 (or even the whole paper) into a text file and save it as serk\_interacting.txt
\lstset{language=bash, style=bashstyle}
\begin{lstlisting}
# now run the following in bash
workflows/agi_finder.py -i studies/Fan2016_SERK_review/resources/serk_interacting.txt \
> studies/Fan2016_SERK_review/resources/serk_interacting.agi
less studies/Fan2016_SERK_review/resources/serk_interacting.agi
\end{lstlisting}
For today, we only needed the AGIs and not the additional information, so this will do.
\lstset{language=R, style=Rstyle}
\lstinputlisting{../workflows/additional_SERK.R}
OK, we'll never be able to cover all possible comparative data and data formats you might want to
look at, but hopefully you now have some ideas on where to start. Take a look around; if you are ahead of
your neighbors, this would be a great time to try importing contextual data you would actually be
interested in for your own studies.
\subsection{Further Clustering}
\subsubsection{K-means}
\lstinputlisting{../workflows/clustering_kmeans.R}
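The heart of k-means in base R is a single call (a sketch; object names are assumptions, not those of the script above):
\begin{lstlisting}
# hypothetical sketch: k-means on scaled expression values
set.seed(42)  # k-means starting points are random
km <- kmeans(scale(counts), centers = 6, nstart = 25)
table(km$cluster)  # number of genes per cluster
\end{lstlisting}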
\subsubsection{Heatmaps - gene set}
We've seen heatmaps at a large scale when displaying the hierarchical clustering above.
Sometimes, however, they are useful just for looking at dozens of genes at once. Let's look at the MapMan
category stress.biotic.signalling as an example.
\lstinputlisting{../workflows/miniheatmaps.R}