Commit ea6410f1 authored by Dominik Brilhaus

add and update reader

parent 689b3a7e
# RNAseq_workshop
Protocol and other material for RNAseq workshop (Illumina + some long read).
## why latex
we now have both a) syntax highlighting and b) the ability to copy and
paste from code blocks and get valid commands.
We'd give participants a printed version only until they are through the Linux section;
after that they can have the pdf as well.
## to make the pdf (needs texlive, and probably texlive-extra-utils)
pdflatex RNAseqWorkshop.tex
### support files
RNAseqWorkshop.tex imports everything in sections. So for instance,
all of the content of the biological data extraction section is
organized in "sections/03_biological_data_extraction.tex".
As this section has a lot of code, it currently embeds scripts from "R/".
But that can definitely be changed if desired.
## support scripts for the course
The course uses a few of our scripts; the .py scripts under "python/"
should be placed in the $PATH.
The exception to this is join_r_scripts.py in the main directory,
which just creates "musings.R". The way the script is written, this file
should be provided in the RNAdata directory (although since we can now
copy and paste from the pdf when latex/listings is used,
we have the theoretical option to drop this).
\documentclass[]{article}
\usepackage{amssymb,amsmath}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}%required
\usepackage{textcomp}
\usepackage{listings} % syntax highlighting
% \usepackage{markdown}
\usepackage{color}
\usepackage{pxfonts}
\usepackage{graphicx}
\usepackage{wrapfig} % wrapping text around figures
\graphicspath{{figures/}}
\usepackage{booktabs} % tables
\usepackage{hyperref} % for links
% title page formatting
%\usepackage{fancyhdr}
\usepackage[bottom]{footmisc} % foot note
%\usepackage{color} %Colorbox?
% xcolor-hypersetup for making links not ugly
\usepackage{xcolor}
\hypersetup{
colorlinks,
linkcolor={red!50!black},
citecolor={blue!50!black},
urlcolor={blue!80!black}
}
\usepackage[parfill]{parskip} % spaces, not indentation
\usepackage{geometry} % normal sized margins
\geometry{
a4paper,
total={160mm,237mm},
left=25mm,
top=35mm,
}
% most from: http://latexcolor.com/
\definecolor{mygray}{gray}{0.4}
\definecolor{lightgray}{gray}{0.9}
\definecolor{aliceblue}{rgb}{0.94, 0.97, 1.0}
\definecolor{bananamania}{rgb}{0.98, 0.91, 0.71}
\definecolor{cerulean}{rgb}{0.0, 0.48, 0.65}
\definecolor{seashell}{rgb}{1.0, 0.96, 0.93}
\lstdefinestyle{basestyle}{
showstringspaces=false,
columns=fullflexible,
literate={*}{\char42}1,% {-}{\char45}1,
upquote=true,
%literate={`}{\char18}1 {'}{\char13}1, % \char18 = ` and \char13 = '
frame=
}
% setting up style sets
\lstdefinestyle{bashstyle}{language=bash,
style=basestyle,
commentstyle=\color{mygray},
basicstyle=\ttfamily,
backgroundcolor=\color{lightgray},
keywordstyle=\color{cerulean}\bfseries,
stringstyle=\color{magenta}
}
\lstdefinestyle{Rstyle}{language=R,
style=basestyle,
commentstyle=\color{mygray},
basicstyle=\ttfamily,
backgroundcolor=\color{aliceblue},
keywordstyle=\bfseries,
stringstyle=\color{magenta},
}
\newcommand{\il}[1]{\colorbox{seashell}{\lstinline[columns=fixed]{#1}}}
% title page prep
\newcommand{\HRule}[1]{\rule{\linewidth}{#1}}
\setcounter{tocdepth}{2} % table of content depth to just subsection
%-------------------------------------------------------------------------------
% HEADER & FOOTER
%-------------------------------------------------------------------------------
%\pagestyle{fancy}
%\fancyhf{}
%\setlength\headheight{15pt}
%\fancyhead[L]{RNAseq Workshop}
%\fancyfoot[R]{Page \thepage\ of \pageref{LastPage}}
%\title{RNAseq Workshop 2022}
\title{ \normalsize \textsc{Heinrich Heine University, Düsseldorf}
\\ [1.0cm]
\textsc{Cluster of Excellence on Plant Sciences}
\\ [4.0cm]
\HRule{0.5pt} \\
\LARGE \textbf{RNAseq Workshop 2022}
\HRule{2pt} \\ [0.5cm]
\normalsize \today \vspace*{5\baselineskip}
\\ [1.0cm]
}
\author{Alisandra Denton \\ Dominik Brilhaus}
\date{}
\begin{document}
\maketitle
\newpage
\tableofcontents
\section{License}
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License. To view a
copy of this license, visit
\url{https://creativecommons.org/licenses/by-nc-sa/4.0/}.
In non-legal terminology: you are welcome to use, distribute,
and modify this material for non-commercial purposes.
\section{Acknowledgements}
The authors would like to thank Juliane Schmid and the CEPLAS team
for organizing.
This course builds directly or indirectly on a series of workshops that
we've been involved with for over a decade, particularly the 2018 iteration
under the same name. Even though we do not have a list of contributors
to the various workshops, we'd like to thank all of you all the same,
and acknowledge that this wasn't put together in a vacuum.
\section{Preface}
Here we want to jump-start your big data and RNAseq analyses. We dive
head first into doing this on the command line in Linux (via Docker).
While there may be more ``user-friendly" options out there to complete
the same tasks (google ``galaxy rnaseq" if you're interested!), and there
are certainly more precise and detailed deeper levels (e.g. algorithms
and source code), we choose the command line for several reasons:
\begin{itemize}
\item it is customizable and flexible
\item we work at a good level to gain a conceptual understanding
\item you might as well try the command line in a setting where you can
ask for help, you can always come back to galaxy \& co. later.
\end{itemize}
At this point, we will not strive for the most elegant, nor the most
efficient code; but rather for simplicity and understandability for
beginners. We may write ugly, redundant, copy-paste code; but it will
get the job done and allow us to analyze our own data following a more
or less standardized pipeline.
There are many good resources out there; insofar as you find them
useful, we recommend some of them, such as \url{https://linuxsurvival.com/},
to help cement what you have learned in the course.
We also provide a set of handouts that may be useful to you during
or after the workshop, in the ARC under \il{_handouts/}.
The Docker setup should also allow you to take the material home and
practice anywhere you can get Docker installed: \url{https://www.docker.com/}
\include{sections/00_section_arc}
\include{sections/01_section_linux}
\include{sections/02_section_illumina}
\include{sections/03_biological_data_extraction}
\include{sections/04_section_longreads}
\section{Questions and answers}
\paragraph{Congratulations, you've made it through the workshop material!}
\paragraph{Do you have questions on what we've covered?}
\paragraph{Do you have questions on transferring the information here to your own data?}
\end{document}
head(df0)
# now we will learn how ggplot makes figures
# the minimal information is: where the data is, what the x variable is,
# what the y variable is, and what we want to plot
# let's try
# this will make and save a plot object
baseplot <- ggplot(data = df0, aes(x = hp, y = gas_mileage)) +
geom_point()
# the first line indicates the data, what x is and what y is
# the second line says we want points
# this will display the plot object
baseplot
# now we can customize that plot and make it prettier
# first off, different style, we do that by using theme
baseplot + theme_minimal() # display plot object + modification
# or alternatively, if you prefer
baseplot + theme_bw()
# now we'll save theme_bw with the object for later use
baseplot <- baseplot + theme_bw()
# now we want colored dots, simply red ones
# google "r colors" for an idea of the named colors available in R
# you can choose them by name, number or hexcode; we like names
baseplot + geom_point(color = "firebrick")
# and a different size and shape
baseplot + geom_point(color = "firebrick", shape = 18, size = 4)
# or choose colors based on something else, like the number of cylinders
# note that while many things can just be 'added' to a ggplot,
# a second ggplot object is not one of them, so we'll start over,
# as it's normally best to specify dynamic color as an
# argument to the aesthetics (called aes)
ggplot(data = df0, aes(x = hp, y = gas_mileage, color = cyl)) +
geom_point(shape = 18, size = 4) +
theme_bw()
# you can also make your color changes abrupt, not continuous,
# by defining cyl as a factor rather than a number
# first check what they are
class(df0$cyl)
# now make them a factor
baseplot <- ggplot(
data = df0,
aes(x = hp, y = gas_mileage, color = as.factor(cyl))
) +
geom_point(shape = 18, size = 4) +
theme_bw()
baseplot
# not the best colors, but you can modify them using scale_color_manual
# to do that you need to know which levels you have as your factors
levels(as.factor(df0$cyl))
# and now you can look at colors and decide what is what
baseplot <- baseplot +
scale_color_manual(
values = c(
"4" = "black",
"6" = "orange",
"8" = "red3"
),
name = "cylinders"
)
baseplot
# finally, a little more polishing
baseplot +
xlab("gross horsepower") +
ylab("miles per gallon") +
theme(text = element_text(size = 16))
"" "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
"Mazda RX4" 21 6 160 110 3.9 2.62 16.46 0 1 4 4
"Mazda RX4 Wag" 21 6 160 110 3.9 2.875 17.02 0 1 4 4
"Datsun 710" 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
"Hornet 4 Drive" 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
"Hornet Sportabout" 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
"Valiant" 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
"Duster 360" 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
"Merc 240D" 24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
"Merc 230" 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
"Merc 280" 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
"Merc 280C" 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
"Merc 450SE" 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
"Merc 450SL" 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
"Merc 450SLC" 15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
"Cadillac Fleetwood" 10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
"Lincoln Continental" 10.4 8 460 215 3 5.424 17.82 0 0 3 4
"Chrysler Imperial" 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
"Fiat 128" 32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1
"Honda Civic" 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
"Toyota Corolla" 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
"Toyota Corona" 21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
"Dodge Challenger" 15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
"AMC Javelin" 15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
"Camaro Z28" 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
"Pontiac Firebird" 19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
"Fiat X1-9" 27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
"Porsche 914-2" 26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
"Lotus Europa" 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
"Ford Pantera L" 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
"Ferrari Dino" 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
"Maserati Bora" 15 8 301 335 3.54 3.57 14.6 0 1 5 8
"Volvo 142E" 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
\section{Course material}
This workshop is organized based on an ARC (Annotated Research Context).
\subsection{Annotated Research Context}
The ARC stores all required input data, backup data, scripts, presentations as well as this reader in one research data package.
\subsubsection{The basic ARC structure summarized}
\begin{itemize}
\item \il{studies} \verb| -> | external data and sample metadata
\item \il{assays} \verb| -> | measurement data (i.e. "raw" sequence data) and metadata
\item \il{workflows} \verb| -> | computational analyses (i.e. "scripts")
\item \il{runs} \verb| -> | outputs of workflow (i.e. "results")
\end{itemize}
\subsubsection{Material added to the ARC for this course}
\begin{itemize}
\item \il{_reader} \verb| -> | The basis to this reader (written in \LaTeX)
\item \il{_slides} \verb| -> | Slides presented during the workshop
\item \il{_handouts} \verb| -> | Cheat sheets and additional materials
\item \il{runs/_backups} \verb| -> | Backups for runs, in case a workflow unexpectedly did not work or takes too long.
\begin{itemize}
\item If a workflow would output \il{blat_results}, but failed, copy \il{runs/_backup/blat_results} to \il{runs/blat_results}.
\end{itemize}
\end{itemize}
\subsubsection{Disclaimer}
For this workshop, we do not take full advantage of all ARC features. This is in part because some developments are still work in progress.
Moreover, we could easily spend two more days exploring those features, which are simply out of scope for a three-day workshop focusing on RNA-Seq data analysis. Besides, the ARC will in future circumvent some of the annoyances that we run into for training purposes during this course.
If you want to learn more, check out the \href{https://nfdi4plants.org}{DataPLANT website}.
\subsection{During the workshop}
For the in-person workshop, we have already stored the ARC in the ``Raumlaufwerke"
folder under our room number (25.41.00.41). Please copy the folder ``rnaseq-workshop"
into your own home folder, so that everyone has their own copy.
We have written all code and commands so that they can be "executed from" the root of the ARC.
As a check now, please open a terminal (applications \verb| -> | terminal) and run:
\lstset{language=bash, style=bashstyle}
\begin{lstlisting}
cd $HOME/rnaseq-workshop
\end{lstlisting}
This is also a very good thing to try throughout the workshop: if you're getting
any sort of 'file not found' error, make sure you're in the right place.
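As a quick sanity check (the folder names below follow the ARC layout described above):
\begin{lstlisting}
# the working directory printed here should end in rnaseq-workshop
pwd
# and the ARC folders (studies, assays, workflows, runs) should be listed
ls
\end{lstlisting}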
%% TODO: Software dependencies.
\subsection{Docker}
We rely heavily on Docker for this workshop, which allows us to provide an 'image'
with all the software you will need for the workshop installed and ready to use.
There are two major reasons we chose Docker: 1) convenience for development, as it
works the same on our normal work machines and on the ZIM teaching machines we
use for the workshop, and, more importantly, 2) portability: you can all take the
Docker image home, work once through the standardized instructions for installing
Docker: \url{https://docs.docker.com/} (or ask your admin to), and then you can
readily reproduce the work from the course.
As a brief disclaimer however, none of this is intended to demonstrate
good practice with Docker, particularly not for other purposes.
The Docker image ``rnaseq\_docker.tar" is available on Sciebo at: %% TODO!
\url{https://uni-duesseldorf.sciebo.de/s/53pA9W9TbOKTgGQ}.
The password is written on the board, or you have received it by e-mail.
The image can also be found in the ``Raumlaufwerke" folder under our room (25.41.00.41).
\subsubsection{load image}
To load the image, run:
\begin{lstlisting}
docker image load -i </path/to/>rnaseq_docker.tar
\end{lstlisting}
Where \il{</path/to/>} is replaced by e.g. ``Downloads"
or the path to the ``Raumlaufwerke" folder as appropriate, e.g.
\begin{lstlisting}
docker image load -i Downloads/rnaseq_docker.tar
\end{lstlisting}
\subsubsection{run image as container}
To start a writable `container' from the image where you can work
interactively, \emph{from your home directory} (\il{cd $HOME})
run:
\begin{lstlisting}
docker run -it --name rnalive -p 8889:8889 --mount \
type=bind,source="$(pwd)"/rnaseq-workshop,target=/home/ZIM-gast/rnaseq-workshop \
rnaseq:latest
\end{lstlisting}
\subsubsection{restart container}
\begin{lstlisting}
docker start -i rnalive
\end{lstlisting}
\subsubsection{exiting}
You can exit a container with Ctrl+D or by typing \il{exit}.
You won't need to do this much in the workshop, however.
\subsubsection{Then what?}
You should now have a terminal that looks a lot like the one before,
except now you know that you're in the 'container', where all the necessary
software is installed.
\begin{lstlisting}
# the workshop directory is shared
cd $HOME/rnaseq-workshop
# you should see a list of all the files, which you can check,
# e.g. by opening the folder in a file manager
ls
\end{lstlisting}
\subsubsection{What to do when}
At home:
\begin{itemize}
\item{load: once per machine, or after an update to the image}
\item{run: once after each load, on first use}
\item{start: whenever you want to use it}
\end{itemize}
For the workshop, however, the computers are wiped clean each night,
so you will have to run:
\begin{itemize}
\item{load: each morning}
\item{run: after running load each morning}
\item{start: as needed, should you exit the container}
\end{itemize}
\textbf{Important:} keep any data you wish to save within the folder
rnaseq-workshop, and make sure the contents of this folder are copied
to either the ``Raumlaufwerke" folder or to e.g. your USB flash drive or
your sciebo account before you log out of the computer!
\fbox{\begin{minipage}{45em}
If you have any trouble whatsoever with Docker, please ask!!
Properly understanding Docker is beyond the scope of this workshop,
so don't worry if the above ``doesn't make sense",
but you will absolutely need it to be running as expected for
the other parts to work, so just ask :-)
\end{minipage}}
\subsection{After the workshop}
At least for the next half year, the ARC to this workshop is shared publicly under a CC-BY 4.0 license at \url{https://git.nfdi4plants.org/brilator/rnaseq-workshop.git}.
Feel free to download and unarchive the whole ARC as a zip / tar archive, or \il{git clone} (or fork) the ARC. Just like Docker, explaining the full usage of `git` is beyond the scope of this course.
To ensure that all code works as used in this reader, make sure to store the ARC as \il{rnaseq-workshop} in your \$HOME directory.
Similarly, the built Docker image will remain available on Sciebo. The built image is not included
in the ARC, as we did not have the time to check the licenses of the included software regarding distribution.
\section{Biological data extraction}
\subsection{Rstudio via second Docker file}
Please see the subsection ``Rstudio via second Docker file" from ``Basics"
on the first day to restart the working environment.
\subsection{The Plan}
Now, we use R in earnest. Rstudio will let us see what is happening immediately when we do it, so it is a
good environment for beginners.
\fbox{\begin{minipage}{45em}
Note: To allow for the level of customization that is often necessary for each project, this
section is a little lower-level and relies less on pre-made tools than before. Since
there is a lot of R code, and retyping it would be rough on the schedule,
we'll be using a different paradigm than we did with the bash scripts. You don't have to retype anything;
rather, you're encouraged to copy from the .pdf. It is then your responsibility to go through it
slowly and make sure you understand what you are doing. For every section, to make sure you have
understood it, we recommend that you vary something: look at a different column, change the
plot color, etc., and then save your changes into a script.
If for some reason we haven't given you the .pdf yet, please notify us.
\end{minipage}}
We begin by getting our data from the kallisto results into a data.frame in
R. Open a GUI which allows you to see the data structure – all results are in folders with different names,
but the files themselves all have the same name.
\subsection{Import}
\lstset{language=R, style=Rstyle}
\lstinputlisting{../workflows/kallisto_data_import.R}
The results from both methods are identical. The first, step-by-step method is more typing, but safe for
beginners since you can change the names by hand. The loop is more challenging to produce but certainly
faster to write. We will use the laborious way for the rest of the training course.
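As a rough sketch of the loop-based idea (the paths here are illustrative assumptions, not the exact ones from the script above; kallisto writes one folder per sample, each containing an \il{abundance.tsv}):
\begin{lstlisting}
# hypothetical sketch: import kallisto results in a loop
result_dirs <- list.dirs("runs/kallisto", recursive = FALSE)
counts <- NULL
for (d in result_dirs) {
  ab <- read.table(file.path(d, "abundance.tsv"),
                   header = TRUE, sep = "\t")
  counts <- cbind(counts, ab$est_counts)
}
colnames(counts) <- basename(result_dirs)
rownames(counts) <- ab$target_id
counts <- as.data.frame(counts)
\end{lstlisting}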
\subsection{Basic data.frame / matrix calculations}
Before we do high level analyses, let's calculate means for the replicates and fold-changes.
\lstinputlisting{../workflows/kallisto_data_frames.R}
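The heart of such calculations is plain column arithmetic; a minimal sketch (the column names are assumptions, not those of the script above):
\begin{lstlisting}
# hypothetical sketch: replicate means and log2 fold-changes,
# assuming columns ctrl_1..ctrl_3 and treat_1..treat_3
ctrl_mean  <- rowMeans(counts[, c("ctrl_1", "ctrl_2", "ctrl_3")])
treat_mean <- rowMeans(counts[, c("treat_1", "treat_2", "treat_3")])
# add a pseudocount of 1 to avoid taking the log of zero
log2fc <- log2((treat_mean + 1) / (ctrl_mean + 1))
\end{lstlisting}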
\subsection{PCA and HCL}
Before we start, let's think about what to expect. We have an experiment with a single factor, the
treatment. So when we look at how the samples group together, how many groups with information do we
expect? Yes, right, only one. So for our principal component analysis we expect that we have one
dimension reflecting the treatment variation. As we have no other variables, the other dimensions of the
PCA should show "noise" or biological variation due to random variables outside of our control. Ideally,
we want to see the treatment variation in the first dimension (meaning most of the variation is due to
treatment) and the noise based variation should occupy the lower dimensions.
Now we start with our first analysis. We ask whether the replicates are more similar to each other than to the
other samples. To that end, we use principal component analysis and hierarchical clustering of the samples.
\lstinputlisting{../workflows/clustering_PCA_HCL.R}
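In base R, the core calls look roughly like this (a sketch with assumed object names, not the exact script above):
\begin{lstlisting}
# hypothetical sketch: PCA and hierarchical clustering of samples
pca <- prcomp(t(counts))        # transpose: samples become rows
summary(pca)                    # variance explained per component
plot(pca$x[, 1], pca$x[, 2])    # samples in the PC1/PC2 plane
hc <- hclust(dist(t(counts)))   # Euclidean distances between samples
plot(hc)                        # dendrogram of sample similarity
\end{lstlisting}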
\subsection{Differential Expression}
One of the most typical questions in an RNA-seq analysis is: what is different between two conditions? To
answer that question we use sleuth. There is some debate about which tool is best for detecting differential
expression; other possibilities include DESeq2, edgeR, and Cufflinks.
Most meta-analysis papers agree on Cufflinks
being less suitable. DESeq2 and edgeR are both well established and work well with counts
per gene, assuming a negative binomial distribution, but appear to be less appropriate for the
'estimated counts' produced by kallisto. sleuth is developed by the same people as kallisto and is
supposedly most suitable for use with its results.
Note: In order to use sleuth, kallisto has to be run with the bootstrapping option.
The following has been adapted from this on-line tutorial:
\url{https://pachterlab.github.io/sleuth_walkthroughs/trapnell/analysis.html}
\lstinputlisting{../workflows/sleuth_differential_expression.R}
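For orientation, a sleuth analysis generally boils down to a few calls (a sketch following the walkthrough linked above; the sample-to-covariates table \il{s2c} is an assumption):
\begin{lstlisting}
# hypothetical sketch of the sleuth workflow
library(sleuth)
so <- sleuth_prep(s2c)                    # reads kallisto bootstraps
so <- sleuth_fit(so, ~condition, "full")  # full model
so <- sleuth_fit(so, ~1, "reduced")       # null model
so <- sleuth_lrt(so, "reduced", "full")   # likelihood ratio test
res <- sleuth_results(so, "reduced:full", test_type = "lrt")
\end{lstlisting}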
\subsection{GO term enrichment}
We have about 1k significantly changed genes, too many to go through by hand quickly. We now check
if some functional categories are overrepresented. The idea behind this is as follows:
we have 1k genes changed of 28k genes in the genome, or 3.5\%. When we now look through the
categories, they should all have 3.5\% of their members changed if the changes were random. The
statistics check whether the difference from 3.5\% we see is significant. To do so, we use the package
topGO.
This analysis looks at the significantly different genes.
The biggest impediment to the use of topGO will be getting the gene-to-GO-term assignment in the way
the package wants it. To this end, look at the file Athid2go.map in the folder. This is one format the package
accepts. If you have a different species and a different way of GO term annotation, you need to reformat
the file to match this one (this can be done in R, but will be different for each case).
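The \il{.map} format expected by topGO's \il{readMappings()} is one gene per line, followed by a tab and a comma-separated list of GO terms; a reformatting sketch (file and column names are assumptions):
\begin{lstlisting}
# hypothetical sketch: collapse a two-column gene/GO table
# (one row per gene-GO pair) into the topGO .map format
anno <- read.table("my_go_annotation.tsv", header = TRUE, sep = "\t")
collapsed <- tapply(anno$go_id, anno$gene_id,
                    function(x) paste(x, collapse = ", "))
writeLines(paste(names(collapsed), collapsed, sep = "\t"),
           "my_species2go.map")
\end{lstlisting}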
\lstinputlisting{../workflows/functional_GO.R}
GO term lists are notoriously difficult to show in nice figures and also difficult to interpret.
Generally, you want many similar terms to come up before being convinced something is real. Revigo (an on-line tool) can
help you summarize the GO terms, but essentially you need your biologist's knowledge to understand
them.
We chose a rather simple example. Among the terms enriched in the upregulated genes you can
always observe the words defense and immune. You have to know that systemic acquired resistance also
refers to the immune system, as does response to biotic stimulus. Clearly, the treatment induces the
defense response.
Among the enriched terms in down-regulated genes appears photosynthesis, which occurs in the
plastid, and involves glyceraldehyde-3-phosphate, pigments and NADP. So clearly, although the names
are different, what they describe is similar. There is no way but simply knowing this. There are thousands
of GO terms described. The figures which were also produced are sometimes helpful. The GO terms
connected by lines are the ones defined as in parent:daughter relationships.
\subsection{MapMan}
MapMan is an alternative way of looking at the overall patterns. Usually, you load all genes with
their fold-changes, not just the significant ones. In a perfect world, MapMan and GO should give you
similar results. One caveat is that the annotation behind GO and MapMan was made by different people
and therefore different areas of plant biology are covered differently.
Before starting with MapMan, we will make use of the size of the class to install MapMan in
parallel on all course computers. MapMan's provided install script can be found both on Sciebo
and in the Raumlaufwerke folder. Run
the following with Java, and follow the instructions of the installation wizard.
\lstset{language=bash, style=bashstyle}
\begin{lstlisting}
java -jar </path/to/>/MapManInst-3_6_0RC1.jar
\end{lstlisting}
We only need to export our fold-change table from R. Again, we have to make sure that our gene
identifiers match the ones used in the MapMan tool.
Open MapMan, click on mappings and look at the Ath\_AGI mapping. Click through until you see AGI
codes and observe how they are formatted: again, no transcript suffix, only the locus.
Back to R.
\lstset{language=R, style=Rstyle}
\lstinputlisting{../workflows/functional_mapman.R}
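Stripping the transcript suffix from such IDs is a one-liner worth remembering (a sketch with made-up example IDs):
\begin{lstlisting}
# hypothetical sketch: AT1G01010.1 -> AT1G01010
ids <- c("AT1G01010.1", "AT3G12580.2")
sub("\\.[0-9]+$", "", ids)  # drop the trailing .<number>
\end{lstlisting}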
Now back to MapMan. Right-click on Experiments and choose add data. Select your exported table
named forMapmanloading.txt. Now you can choose whether there is a header (yes!), whether you have a decimal
point or decimal comma, and so on and so forth. The default works for us. Click okay.
Now click on Pathways, then on Overview, and choose Metabolism\_overview by double click.
You will have to choose a mapping; we use Ath\_AGI. Click okay.
If nothing is displayed, click on your file shown in Experiments. I always change the color code to
+red and -blue and the scale to whatever fits my dataset (for this one I'd use 3).
Clearly, photosynthesis, the CBB cycle and photorespiration are down. If you would like to test
that statistically, MapMan does that for you. Look at the panel below the figure and pull it up. You will see
the Wilcoxon rank-sum test scores. Correct the p-values for multiple hypothesis testing and sort the table
by probability. Some of the significantly changed pathways we can see; most are not displayed. Look at
the other options for visualization to see if any of them fits your needs.
You can make your own pathway figures and map the MapMan bins onto them. Refer to the
MapMan manual for that!
\subsection{Adding information from public data sources}
Combining information from multiple sources is often helpful, or even necessary to fully
understand one's results. This becomes harder to write a standardized protocol for, however, as one
researcher might simply be interested in including the TAIR annotation with their "mother table" for ease
of looking through the gene descriptions, and another might be interested in knowing whether there's a
significant overlap between their study and a particular paper. That said, there is frequently a similar line
of attack. First, get the desired comparative data in an organized format such as a list of gene IDs or a
table (csv/tsv/xls) with gene IDs and associated values. Second, make sure the gene IDs are comparable
(are they from the same genome/annotation release, do you have transcript IDs when you wanted gene
IDs, are they both upper/lower case). Third, combine the data with yours (e.g. merge). Finally, in many cases
you will visually and statistically evaluate whether there is an overlap or correlation between the studies.
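Step three usually amounts to a single \il{merge()} call; a minimal sketch (file and column names are assumptions):
\begin{lstlisting}
# hypothetical sketch: merge external annotation onto our table
external <- read.table("other_study.tsv", header = TRUE, sep = "\t")
combined <- merge(mother_table, external,
                  by = "gene_id",  # shared, case-matched gene IDs
                  all.x = TRUE)    # keep all of our genes regardless
\end{lstlisting}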
\subsubsection{Including gene descriptions from TAIR}
First things first: we need to get the descriptions, and not by copying them one by one from the
website. Most biological databases have a bulk download page if you look, and TAIR is on the easy side.
From \url{https://arabidopsis.org} you simply need to go to Download:Genes, select the annotation (TAIR10), and you
already have a list of what is available for download. We'll take "TAIR10\_functional\_descriptions", which
is a tab-separated text file. Save, copy or move this file to studies/AthalianaReferences/resources for ease of access. The
ID format matches, so adding the descriptions to our major data.frame in R is very simple.
\lstinputlisting{../workflows/additional_TAIR.R}
\subsubsection{Including MapMan annotations}
% TODO, honestly we should move this to mapman4
Frequently you want to have more flexibility working with a dataset than is possible with the provided
GUI programs. We'll now import the MapMan annotations into R, and look briefly at what else one could
do with them. For a handful of plant species, MapMan annotations are provided in the 'store' at
\url{https://mapman.gabipd.org/web/guest/mapmanstore}.
More commonly, they can be created for a species
using the Mercator webtool:
\url{http://www.plabipd.de/portal/web/guest/mercator-sequence-annotation}.
Assigning
MapMan annotations with Mercator normally runs in less than 15 minutes and avoids any annotation version
trouble. For today, the results are provided in the file mapmanTAIR10.txt.
\lstinputlisting{../workflows/additional_MapMan.R}
You can, of course, automate a Fisher's exact test or Wilcoxon test for every category in R, visualize many
samples at once by different bin levels, or whatever suits your purposes.
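As an illustration of that automation, a Fisher's exact test per bin might look like this (a sketch; the objects \il{mapman_bins} and \il{mother_table} are assumptions):
\begin{lstlisting}
# hypothetical sketch: Fisher's exact test for each MapMan bin
for (b in unique(mapman_bins$bin)) {
  in_bin <- mother_table$gene_id %in%
    mapman_bins$gene_id[mapman_bins$bin == b]
  # 2x2 table: in bin vs. not, significant vs. not
  tab <- table(in_bin, mother_table$significant)
  if (all(dim(tab) == c(2, 2))) {
    cat(b, fisher.test(tab)$p.value, "\n")
  }
}
\end{lstlisting}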
\subsubsection{Special list - from a review paper}
Frequently, whole-genome annotations have not been updated with the very latest information, or
you are interested in a more obscure list, so you want to compare data with another paper directly. This
often makes the first step of getting the data into an organized table harder. Let's look at the review paper
\url{https://doi.org/10.1007/s11427-016-0048-4}
``Diverse roles of SERK family genes in
plant growth, development and defense response". In table 1, it has a list of SERK interacting genes,
which we might be interested in, but they are only available in pdf format. There are many ways to
approach this. You can for instance copy the whole table, and clean up any formatting issues by hand.
Often you can get the formatting to work by pasting it into either a spreadsheet program or a text file.
However, as we've worked with \emph{A. thaliana} a lot before, we already had a handy script to find all of the
AGIs out of a text file and will use this.
First, copy all of table 1 (or even the whole paper) into a text file and save it as serk\_interacting.txt
\lstset{language=bash, style=bashstyle}
\begin{lstlisting}
# now run the following in bash
workflows/agi_finder.py -i studies/Fan2016_SERK_review/resources/serk_interacting.txt \
> studies/Fan2016_SERK_review/resources/serk_interacting.agi
less studies/Fan2016_SERK_review/resources/serk_interacting.agi
\end{lstlisting}
For today, we only needed the AGIs and not the additional information, so this will do.
\lstset{language=R, style=Rstyle}
\lstinputlisting{../workflows/additional_SERK.R}
OK, we'll never be able to cover all possible comparative data and data formats you might want to
look at, but hopefully you now have some ideas on where to start. Take a look around; if you are ahead of
your neighbors, this would be a great time to try importing contextual data you would actually be
interested in for your own studies.
\subsection{Further Clustering}
\subsubsection{K-means}
\lstinputlisting{../workflows/clustering_kmeans.R}
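The heart of k-means in base R is a single call (a sketch; object names are assumptions, not those of the script above):
\begin{lstlisting}
# hypothetical sketch: k-means on scaled expression values
set.seed(42)  # k-means starting points are random
km <- kmeans(scale(counts), centers = 6, nstart = 25)
table(km$cluster)  # number of genes per cluster
\end{lstlisting}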
\subsubsection{Heatmaps - gene set}
We've seen heatmaps at a large scale when displaying the hierarchical clustering above.
Sometimes, however, they are useful just for looking at dozens of genes at once. Let's look at the MapMan
category stress.biotic.signalling as an example.
\lstinputlisting{../workflows/miniheatmaps.R}