diff --git a/_reader/README.md b/_reader/README.md new file mode 100644 index 0000000000000000000000000000000000000000..bb75081700801e1dc46338be303721ffe2b7b3a9 --- /dev/null +++ b/_reader/README.md @@ -0,0 +1,33 @@ +# RNAseq_workshop + +Protocol and other material for RNAseq workshop (Illumina + some long read). + +## why latex +we now have both a) syntax highlighting and b) the ability to copy +paste from code blocks and get valid commands. + +I'd give them a printed version only until they are through the linux section, +and after that they can have the pdf as well. + +## to make the pdf (needs texlive, and probably texlive-extra-utils) +pdflatex RNAseqWorkshop.tex + +### support files + +RNAseqWorkshop.tex imports everything in sections. So for instance, +all of the content of the biological data extraction section is +organized in "sections/03_biological_data_extraction.tex". + +As this section had +a lot of code, it currently imbeds scripts from "R/". But that +can definitely be changed if desired. + +## support scripts for the course +The course uses a few scripts from us that should be put in the +$PATH for the .py scripts under "python/". + +The exception to this is the join_r_scripts.py in the main directory, +which just creates "musings.R", which should be provided in the RNAdata +directory the way the script is written (although since we can now +copy paste from the pdf when latex/listings is used, +we have the theoretical option to drop this). diff --git a/_reader/RNAseqWorkshop.pdf b/_reader/RNAseqWorkshop.pdf index 544750320fdcf56222147c804cce8f585612ee67..fc61e85050fa86d9901987a8072d30b921d723df 100644 Binary files a/_reader/RNAseqWorkshop.pdf and b/_reader/RNAseqWorkshop.pdf differ diff --git a/_reader/RNAseqWorkshop.tex b/_reader/RNAseqWorkshop.tex new file mode 100644 index 0000000000000000000000000000000000000000..7540d79bc0d35501a7d034d02b6d9908f2035011 --- /dev/null +++ b/_reader/RNAseqWorkshop.tex @@ -0,0 +1,173 @@ +\documentclass[]{article} +\usepackage{amssymb,amsmath} +\usepackage[utf8]{inputenc} +\usepackage[T1]{fontenc}%required +\usepackage{textcomp} +\usepackage{listings} % syntax highlighting +% \usepackage{markdown} +\usepackage{color} +\usepackage{pxfonts} +\usepackage{graphicx} +\usepackage{wrapfig} % wrapping text around figures +\graphicspath{{figures/}} +\usepackage{booktabs} % tables +\usepackage{hyperref} % for links +% title page formatting +%\usepackage{fancyhdr} +\usepackage[bottom]{footmisc} % foot note +%\usepackage{color} %Colorbox? +% xcolor-hypersetup for making links not ugly +\usepackage{xcolor} +\hypersetup{ + colorlinks, + linkcolor={red!50!black}, + citecolor={blue!50!black}, + urlcolor={blue!80!black} +} +\usepackage[parfill]{parskip} % spaces, not indentation +\usepackage{geometry} % normal sized margins + \geometry{ + a4paper, + total={160mm,237mm}, + left=25mm, + top=35mm, + } + + +% most from: http://latexcolor.com/ +\definecolor{mygray}{gray}{0.4} +\definecolor{lightgray}{gray}{0.9} +\definecolor{aliceblue}{rgb}{0.94, 0.97, 1.0} +\definecolor{bananamania}{rgb}{0.98, 0.91, 0.71} +\definecolor{cerulean}{rgb}{0.0, 0.48, 0.65} +\definecolor{seashell}{rgb}{1.0, 0.96, 0.93} + +\lstdefinestyle{basestyle}{ + showstringspaces=false, + columns=fullflexible, + literate={*}{\char42}1,% {-}{\char45}1, + upquote=true, + %literate={`}{\char18}1 {'}{\char13}1, % \char18 = ` and \char13 = ' + frame= +} + +% setting up style sets +\lstdefinestyle{bashstyle}{language=bash, + style=basestyle, + commentstyle=\color{mygray}, + basicstyle=\ttfamily, + backgroundcolor=\color{lightgray}, + keywordstyle=\color{cerulean}\bfseries, + stringstyle=\color{magenta} +} + + +\lstdefinestyle{Rstyle}{language=R, + style=basestyle, + commentstyle=\color{mygray}, + basicstyle=\ttfamily, + backgroundcolor=\color{aliceblue}, + keywordstyle=\bfseries, + stringstyle=\color{magenta}, +} +\newcommand{\il}[1]{\colorbox{seashell}{\lstinline[columns=fixed]{#1}}} + +% title page prep +\newcommand{\HRule}[1]{\rule{\linewidth}{#1}} + +\setcounter{tocdepth}{2} % table of content depth to just subsection +%------------------------------------------------------------------------------- +% HEADER & FOOTER +%------------------------------------------------------------------------------- +%\pagestyle{fancy} +%\fancyhf{} +%\setlength\headheight{15pt} +%\fancyhead[L]{RNAseq Workshop} +%\fancyfoot[R]{Page \thepage\ of \pageref{LastPage}} + +%\title{RNAseq Workshop 2022} + +\title{ \normalsize \textsc{Heinrich Heine University, Düsseldorf} + \\ [1.0cm] + \textsc{Cluster of Excellence on Plant Sciences} + \\ [4.0cm] + \HRule{0.5pt} \\ + \LARGE \textbf{RNAseq Workshop 2022} + \HRule{2pt} \\ [0.5cm] + \normalsize \today \vspace*{5\baselineskip} + \\ [1.0cm] +} + +\author{Alisandra Denton \\ Dominik Brilhaus} +\date{} + +\begin{document} +\maketitle + +\newpage + +\tableofcontents +\section{License} + +This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License. To view a +copy of this license, visit + +\url{https://creativecommons.org/licenses/by-nc-sa/4.0/} + +In non-legal terminology: You are welcome to use, distribute, +and modify this material for non- commercial purposes. + +\section{Acknowledgements} +The authors would like to thank Juliane Schmid and the CEPLAS team +for organizing. + +This course builds directly or indirectly on a series of workshops that +we've been involved with for over a decade. Particularly the 2018 iteration +under the same name. Even though we do not have a list of contributors +to the various workshops, we'd like to thank all of you all the same, +and acknowledge this wasn't put together in a vacuum. + +\section{Preface} + +Here we want to jump start your big data and RNAseq analyses. We dive +head first into doing this on the command line in linux (via Docker). + +While there maybe more "user-friendly" options out there to complete +the same tasks (google "galaxy rnaseq" if you're interested!), and there +are certainly more precise and detailed deeper levels (e.g. algorithms +and source code), we choose the command line for several reasons: + +\begin{itemize} +\item it is customizable and flexible +\item we work at a good level to gain a conceptual understanding +\item you might as well try the command line in a setting where you can + ask for help, you can always come back to galaxy \& co. later. +\end{itemize} + +At this point, we will not strive for the most elegant, nor the most +efficient code; but rather for simplicity and understandability for +beginners. We may write ugly, redundant, copy-paste code; but it will +get the job done and allow us to analyze our own data following a more +or less standardized pipeline. + +There are many good resources out there, in so far as you find it +useful we recommend using some of them, such as \url{https://linuxsurvival.com/}, +to help cement what you have learned in the course. +We also provide a set of handouts that may be useful to you during +or after the workshop in the ARC under \il{_handouts/} + +The Docker setup should also allow you to take home and practice +the material anywhere you can get docker installed: \url{https://www.docker.com/} + +\include{sections/00_section_arc} +\include{sections/01_section_linux} +\include{sections/02_section_illumina} +\include{sections/03_biological_data_extraction} +\include{sections/04_section_longreads} + +\section{Questions and answers} + +\paragraph{Congratulations, you've made it through the workshop material!} +\paragraph{Do you have questions on what we've covered?} +\paragraph{Do you have questions on transferring the information here to your own data?} +\end{document} diff --git a/_reader/figures/align_n_quant.pdf b/_reader/figures/align_n_quant.pdf new file mode 100644 index 0000000000000000000000000000000000000000..56ddf660d370ae600e1dc0e601f7c30a22d8a777 Binary files /dev/null and b/_reader/figures/align_n_quant.pdf differ diff --git a/_reader/figures/align_n_quant.svg b/_reader/figures/align_n_quant.svg new file mode 100644 index 0000000000000000000000000000000000000000..fa35e0bc90cafaf04adc519170effc5fc4bbf1fa --- /dev/null +++ b/_reader/figures/align_n_quant.svg @@ -0,0 +1,692 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!-- Created with Inkscape (http://www.inkscape.org/) --> + +<svg + xmlns:dc="http://purl.org/dc/elements/1.1/" + xmlns:cc="http://creativecommons.org/ns#" + xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" + xmlns:svg="http://www.w3.org/2000/svg" + xmlns="http://www.w3.org/2000/svg" + xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" + xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" + width="187.95236mm" + height="132.43103mm" + viewBox="0 0 665.97293 469.2438" + id="svg2" + version="1.1" + inkscape:version="0.91 r13725" + sodipodi:docname="align_n_quant.svg"> + <defs + id="defs4"> + <marker + inkscape:isstock="true" + style="overflow:visible" + id="marker5238" + refX="0" + refY="0" + orient="auto" + inkscape:stockid="Arrow1Mend"> + <path + transform="matrix(-0.4,0,0,-0.4,-4,0)" + style="fill:#ff0000;fill-opacity:1;fill-rule:evenodd;stroke:#ff0000;stroke-width:1pt;stroke-opacity:1" + d="M 0,0 5,-5 -12.5,0 5,5 0,0 Z" + id="path5240" + inkscape:connector-curvature="0" /> + </marker> + <marker + inkscape:stockid="Arrow1Mend" + orient="auto" + refY="0" + refX="0" + id="Arrow1Mend" + style="overflow:visible" + inkscape:isstock="true" + inkscape:collect="always"> + <path + id="path4563" + d="M 0,0 5,-5 -12.5,0 5,5 0,0 Z" + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1pt;stroke-opacity:1" + transform="matrix(-0.4,0,0,-0.4,-4,0)" + inkscape:connector-curvature="0" /> + </marker> + <marker + inkscape:stockid="Arrow1Mstart" + orient="auto" + refY="0" + refX="0" + id="Arrow1Mstart" + style="overflow:visible" + inkscape:isstock="true"> + <path + id="path4560" + d="M 0,0 5,-5 -12.5,0 5,5 0,0 Z" + style="fill:#000000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1pt;stroke-opacity:1" + transform="matrix(0.4,0,0,0.4,4,0)" + inkscape:connector-curvature="0" /> + </marker> + </defs> + <sodipodi:namedview + id="base" + pagecolor="#ffffff" + bordercolor="#666666" + borderopacity="1.0" + inkscape:pageopacity="0.0" + inkscape:pageshadow="2" + inkscape:zoom="0.98994949" + inkscape:cx="323.02001" + inkscape:cy="160.97131" + inkscape:document-units="px" + inkscape:current-layer="layer1" + showgrid="false" + showguides="true" + inkscape:guide-bbox="true" + fit-margin-top="1" + fit-margin-left="1" + fit-margin-right="1" + fit-margin-bottom="1" + inkscape:window-width="1855" + inkscape:window-height="1176" + inkscape:window-x="1985" + inkscape:window-y="24" + inkscape:window-maximized="1"> + <sodipodi:guide + position="116.46858,220.82034" + orientation="1,0" + id="guide4158" /> + <sodipodi:guide + position="3.0001897,572.49187" + orientation="1,0" + id="guide4166" /> + <sodipodi:guide + position="216.47368,262.2366" + orientation="1,0" + id="guide4176" /> + <sodipodi:guide + position="269.68046,206.81666" + orientation="1,0" + id="guide4182" /> + <sodipodi:guide + position="584.84803,464.40556" + orientation="0,1" + id="guide5030" /> + <sodipodi:guide + position="431.30484,3.7759992" + orientation="0,1" + id="guide5032" /> + </sodipodi:namedview> + <metadata + id="metadata7"> + <rdf:RDF> + <cc:Work + rdf:about=""> + <dc:format>image/svg+xml</dc:format> + <dc:type + rdf:resource="http://purl.org/dc/dcmitype/StillImage" /> + <dc:title></dc:title> + </cc:Work> + </rdf:RDF> + </metadata> + <g + inkscape:label="Layer 1" + inkscape:groupmode="layer" + id="layer1" + transform="translate(173.03712,-119.0553)"> + <ellipse + style="fill:#4d4d4d;fill-opacity:0.2871287;stroke:none;stroke-width:5;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="path4142" + cx="-122.03661" + cy="382.77521" + rx="47.457199" + ry="43.43401" /> + <rect + style="fill:#b3b3b3;fill-opacity:1;stroke:none;stroke-width:5;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136" + width="153.50325" + height="425.14413" + x="-44.451809" + y="159.37892" /> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="35.304344" + y="134.7782" + id="text4152" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4154" + x="37.692039" + y="134.7782" + style="text-align:center;text-anchor:middle">Reference sequence </tspan><tspan + sodipodi:role="line" + x="35.304344" + y="153.5282" + id="tspan4156" + style="text-align:center;text-anchor:middle">used</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="-120.38283" + y="371.60571" + id="text4144" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4146" + x="-120.38283" + y="371.60571" + style="text-align:center;text-anchor:middle">Illuimina</tspan><tspan + sodipodi:role="line" + x="-120.38283" + y="390.35571" + id="tspan4150" + style="text-align:center;text-anchor:middle">RNAseq</tspan><tspan + sodipodi:role="line" + x="-120.38283" + y="409.10571" + id="tspan4148" + style="text-align:center;text-anchor:middle">reads</tspan></text> + <path + style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" + d="m -6.39223,278.58534 0,47.47717 49.82879,0 0,0 52.527932,1.01015" + id="path4180" + inkscape:connector-curvature="0" + sodipodi:nodetypes="ccccc" /> + <path + style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" + d="m 96.64334,191.47225 -53.20678,0 0,77.11309 53.20678,0" + id="path4186" + inkscape:connector-curvature="0" + sodipodi:nodetypes="cccc" /> + <path + style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" + d="m -56.56854,372.71737 0,102.00879 50.17631,0 0,40.52794 103.03557,0" + id="path4194" + inkscape:connector-curvature="0" + sodipodi:nodetypes="ccccc" /> + <path + style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" + d="m -6.39223,474.72616 0,-35.47717 103.03557,0" + id="path4196" + inkscape:connector-curvature="0" + sodipodi:nodetypes="ccc" /> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="-35.355335" + y="180.38087" + id="text4198" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4200" + x="-35.355335" + y="180.38087">Transcriptome</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="-27.605433" + y="345.27939" + id="text4202" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4204" + x="-27.605433" + y="345.27939">Genome</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="-29.625746" + y="411.97485" + id="text4206" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4208" + x="-29.625746" + y="411.97485">Transcriptome</tspan><tspan + sodipodi:role="line" + x="-29.625746" + y="430.72485" + id="tspan4210">close relative</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="6.4079494" + y="543.47748" + id="text4212" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + x="6.4079494" + y="543.47748" + id="tspan4216"><tspan + id="tspan4538" + style="font-style:italic;text-align:center;text-anchor:middle">de novo</tspan></tspan><tspan + sodipodi:role="line" + x="6.4079494" + y="562.22748" + id="tspan4540">Assembly</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="86.541801" + y="182.40117" + id="text4218" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4220" + x="86.541801" + y="182.40117">1.</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="83.511345" + y="345.11694" + id="text4222" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4224" + x="83.511345" + y="345.11694">2.</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="84.521507" + y="433.18805" + id="text4226" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4228" + x="84.521507" + y="433.18805">3.</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="84.521515" + y="544.4267" + id="text4230" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4232" + x="84.521515" + y="544.4267">4.</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:100%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="167.81126" + y="182.38087" + id="text4234" + sodipodi:linespacing="100%"><tspan + sodipodi:role="line" + id="tspan4236" + x="170.19896" + y="182.38087" + style="line-height:100%;text-align:center;writing-mode:lr-tb;text-anchor:middle">"nice" </tspan><tspan + sodipodi:role="line" + x="167.81126" + y="197.38087" + style="line-height:100%;text-align:center;writing-mode:lr-tb;text-anchor:middle" + id="tspan4317">transcriptome</tspan><tspan + sodipodi:role="line" + x="167.81126" + y="212.38087" + id="tspan4238" + style="line-height:100%;text-align:center;writing-mode:lr-tb;text-anchor:middle">count k-mers</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:100%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="162.21339" + y="326.10309" + id="text4240" + sodipodi:linespacing="100%"><tspan + sodipodi:role="line" + id="tspan4242" + x="162.21339" + y="326.10309" + style="line-height:100%;text-align:center;writing-mode:lr-tb;text-anchor:middle">splice-aware</tspan><tspan + sodipodi:role="line" + x="164.60109" + y="341.10309" + id="tspan4244" + style="line-height:100%;text-align:center;writing-mode:lr-tb;text-anchor:middle">fast read </tspan><tspan + sodipodi:role="line" + x="162.21339" + y="356.10309" + style="line-height:100%;text-align:center;writing-mode:lr-tb;text-anchor:middle" + id="tspan4319">alignment</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:100%;font-family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="162.0696" + y="434.27942" + id="text4246" + sodipodi:linespacing="100%"><tspan + sodipodi:role="line" + id="tspan4248" + x="164.45729" + y="434.27942" + style="line-height:100%;text-align:center;writing-mode:lr-tb;text-anchor:middle">slower read </tspan><tspan + sodipodi:role="line" + x="162.0696" + y="449.27942" + id="tspan4321" + style="line-height:100%;text-align:center;writing-mode:lr-tb;text-anchor:middle">alignment</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:100%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="160.27658" + y="486.06116" + id="text4250" + sodipodi:linespacing="100%"><tspan + sodipodi:role="line" + id="tspan4252" + x="162.66428" + y="486.06116" + style="line-height:100%;text-align:center;writing-mode:lr-tb;text-anchor:middle">Assembly </tspan><tspan + sodipodi:role="line" + x="160.27658" + y="501.06116" + style="line-height:100%;text-align:center;writing-mode:lr-tb;text-anchor:middle" + id="tspan4323">followed</tspan><tspan + sodipodi:role="line" + x="162.66428" + y="516.06116" + id="tspan4254" + style="line-height:100%;text-align:center;writing-mode:lr-tb;text-anchor:middle">by fast read </tspan><tspan + sodipodi:role="line" + x="162.66428" + y="531.06116" + id="tspan4256" + style="line-height:100%;text-align:center;writing-mode:lr-tb;text-anchor:middle">alignment </tspan><tspan + sodipodi:role="line" + x="160.27658" + y="546.06116" + style="line-height:100%;text-align:center;writing-mode:lr-tb;text-anchor:middle" + id="tspan4331">(as above)</tspan></text> + <rect + style="fill:#999999;fill-opacity:1;stroke:none;stroke-width:5;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-3" + width="194.91951" + height="285.74313" + x="226.69382" + y="171.43575" /> + <rect + style="fill:#aaccff;fill-opacity:1;stroke:none;stroke-width:4;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4273" + width="103.03556" + height="24.243666" + x="225.66365" + y="147.64987" /> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="247.92761" + y="165.83263" + id="text4275" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4277" + x="247.92761" + y="165.83263">Position</tspan></text> + <rect + style="fill:#e9afaf;fill-opacity:1;stroke:none;stroke-width:4;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4273-6" + width="96.974648" + height="24.243666" + x="323.46573" + y="147.64989" /> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="347.79062" + y="165.7514" + id="text4294" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4296" + x="347.79062" + y="165.7514">Quant.</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:100%;font-family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="153.49268" + y="268.58533" + id="text4325" + sodipodi:linespacing="100%"><tspan + sodipodi:role="line" + id="tspan4327" + x="155.88037" + y="268.58533" + style="line-height:100%;text-align:center;writing-mode:lr-tb;text-anchor:middle">fast read </tspan><tspan + sodipodi:role="line" + x="153.49268" + y="283.58533" + id="tspan4329" + style="line-height:100%;text-align:center;writing-mode:lr-tb;text-anchor:middle">alignment</tspan></text> + <rect + style="fill:#aaccff;fill-opacity:1;stroke:none;stroke-width:4;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4371-3" + width="82.832504" + height="227.28433" + x="236.06468" + y="223.39357" /> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="257.27789" + y="438.11191" + id="text4355" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4357" + x="257.27789" + y="438.11191">BLAT</tspan></text> + <rect + style="fill:#ffdd55;fill-opacity:1;stroke:none;stroke-width:4;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4273-6-6" + width="96.974648" + height="24.243666" + x="274.43018" + y="472.37585" /> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="289.58249" + y="491.56873" + id="text4359" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4361" + x="289.58249" + y="491.56873">Assembly</tspan></text> + <rect + style="fill:#d6d0ff;fill-opacity:1;stroke:none;stroke-width:4;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4273-6-3" + width="170.71579" + height="24.243666" + x="237.57991" + y="182.48239" /> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="296.65353" + y="199.65498" + id="text4367" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4369" + x="296.65353" + y="199.65498">Kallisto</tspan></text> + <rect + style="fill:#e9afaf;fill-opacity:1;stroke:none;stroke-width:4;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4371" + width="82.832504" + height="141.42136" + x="327.88705" + y="223.89865" /> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="370.61859" + y="329.96463" + id="text4349" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4351" + x="370.61859" + y="329.96463" + style="text-align:center;text-anchor:middle">HTSeq</tspan><tspan + sodipodi:role="line" + x="370.61859" + y="348.71463" + id="tspan4353" + style="text-align:center;text-anchor:middle">StringTie</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="278.64844" + y="267.33524" + id="text4333" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4335" + x="278.64844" + y="267.33524" + style="text-align:center;text-anchor:middle">bowtie2</tspan><tspan + sodipodi:role="line" + x="278.64844" + y="286.08524" + id="tspan4337" + style="text-align:center;text-anchor:middle">BWA</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="277.08066" + y="328.9545" + id="text4343" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4345" + x="277.08066" + y="328.9545" + style="text-align:center;text-anchor:middle">Hisat2</tspan><tspan + sodipodi:role="line" + x="277.08066" + y="347.7045" + id="tspan4347" + style="text-align:center;text-anchor:middle">STAR</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="342.09012" + y="271.37582" + id="text4339" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4341" + x="342.09012" + y="271.37582">eXpress</tspan></text> + <path + style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" + d="m -74.072444,384.71737 17.503904,0 0,-106.13203 50.17631,0 0,-49.63592 49.82879,0" + id="path4178" + inkscape:connector-curvature="0" + sodipodi:nodetypes="cccccc" /> + <rect + style="fill:#999999;fill-opacity:1;stroke:none;stroke-width:5;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-3-8" + width="193.90936" + height="53.668118" + x="226.69382" + y="-550.36475" + transform="scale(1,-1)" /> + <rect + style="fill:#ffdd55;fill-opacity:1;stroke:none;stroke-width:4;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4273-6-6-0" + width="174.75639" + height="25.253807" + x="236.00375" + y="509.89368" /> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="297.62314" + y="527.08655" + id="text4363" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4365" + x="297.62314" + y="527.08655">Trinity</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="253.21698" + y="133.99507" + id="text4448" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4450" + x="253.21698" + y="133.99507">Example software</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="435.97336" + y="135.99507" + id="text4454" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4456" + x="435.97336" + y="135.99507" + style="font-weight:bold">Easier</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="431.00381" + y="584.5434" + id="text4458" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4460" + x="431.00381" + y="584.5434" + style="font-weight:bold">Harder</tspan></text> + <path + style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:5.97104597;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-start:url(#Arrow1Mstart);marker-end:url(#Arrow1Mend)" + d="m 459.88338,159.14944 0,395.16854" + id="path4462" + inkscape:connector-curvature="0" /> + <g + id="g4546" + transform="translate(299.00515,16.162441)"> + <ellipse + ry="17.172594" + rx="16.667517" + cy="500.68045" + cx="-257.41513" + id="path4542" + style="fill:none;fill-opacity:1;stroke:#ff0000;stroke-width:4;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <path + inkscape:connector-curvature="0" + id="path4544" + d="m -269.03188,512.80228 22.22335,-22.22335" + style="fill:none;fill-rule:evenodd;stroke:#ff0000;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" /> + </g> + <path + style="fill:none;fill-rule:evenodd;stroke:#ff0000;stroke-width:2;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#marker5238)" + d="m 52.182746,543.10686 c 12.365497,31.31473 18.408531,31.31473 43.450442,29.29443" + id="path5226" + inkscape:connector-curvature="0" + sodipodi:nodetypes="cc" /> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="117.81595" + y="578.50281" + id="text5228" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan5230" + x="117.81595" + y="578.50281" + style="fill:#ff0000">Get long reads.</tspan></text> + </g> +</svg> diff --git a/_reader/figures/dun_merge.pdf b/_reader/figures/dun_merge.pdf new file mode 100644 index 0000000000000000000000000000000000000000..be634624679c5eb41f5f1a501cdf06722b5041d0 Binary files /dev/null and b/_reader/figures/dun_merge.pdf differ diff --git a/_reader/figures/dun_merge.svg b/_reader/figures/dun_merge.svg new file mode 100644 index 0000000000000000000000000000000000000000..aee12711ab0d8cd7b51b9f9b93b729b871de4859 --- /dev/null +++ b/_reader/figures/dun_merge.svg @@ -0,0 +1,642 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!-- Created with Inkscape (http://www.inkscape.org/) --> + +<svg + xmlns:dc="http://purl.org/dc/elements/1.1/" + xmlns:cc="http://creativecommons.org/ns#" + xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" + xmlns:svg="http://www.w3.org/2000/svg" + xmlns="http://www.w3.org/2000/svg" + xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" + xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" + width="144.60693mm" + height="92.354301mm" + viewBox="0 0 512.38677 327.23965" + id="svg2" + version="1.1" + inkscape:version="0.91 r13725" + sodipodi:docname="dun_merge.svg"> + <defs + id="defs4" /> + <sodipodi:namedview + id="base" + pagecolor="#ffffff" + bordercolor="#666666" + borderopacity="1.0" + inkscape:pageopacity="0.0" + inkscape:pageshadow="2" + inkscape:zoom="1.4" + inkscape:cx="285.01794" + inkscape:cy="141.65781" + inkscape:document-units="px" + inkscape:current-layer="g4396-8" + showgrid="false" + showguides="true" + inkscape:guide-bbox="true" + inkscape:window-width="1855" + inkscape:window-height="1176" + inkscape:window-x="1985" + inkscape:window-y="24" + inkscape:window-maximized="1" + fit-margin-top="0" + fit-margin-left="0" + fit-margin-right="0" + fit-margin-bottom="0" /> + <metadata + id="metadata7"> + <rdf:RDF> + <cc:Work + rdf:about=""> + <dc:format>image/svg+xml</dc:format> + <dc:type + rdf:resource="http://purl.org/dc/dcmitype/StillImage" /> + <dc:title></dc:title> + </cc:Work> + </rdf:RDF> + </metadata> + <g + inkscape:label="Layer 1" + inkscape:groupmode="layer" + id="layer1" + transform="translate(-92.954338,-150.47261)"> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="98.994949" + y="171.00412" + id="text4252" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4254" + x="98.994949" + y="171.00412">5'</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="494.46964" + y="171.97366" + id="text4256" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4258" + x="494.46964" + y="171.97366">3'</tspan></text> + <g + id="g4396"> + <rect + y="185.61624" + x="107.54617" + height="0.57523108" + width="373.3215" + id="rect4349" + style="fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:2.91969275;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="209.54118" + x="214.54729" + height="0.72535276" + width="266.39545" + id="rect4349-4" + style="fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:2.76957107;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="231.40164" + x="325.52454" + height="1.0044436" + width="155.55775" + id="rect4349-4-8" + style="fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:2.49048018;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="253.57932" + x="166.60318" + height="0.64910936" + width="314.30145" + id="rect4349-4-8-8" + style="fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:2.84581447;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <g + id="g4260"> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:4;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136" + width="147.48227" + height="10.101525" + x="94.954338" + y="181.61072" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:2.51830149;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4" + width="50.979172" + height="11.583224" + x="298.2135" + y="180.86987" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:3.05596852;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-3" + width="78.725777" + height="11.045557" + x="420.48233" + y="181.1387" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:1.32415295;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-3-7" + width="12.777371" + height="12.777372" + x="381.61642" + y="180.2728" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:2.10028768;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-5" + width="34.22459" + height="12.001238" + x="209.16187" + y="204.66086" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:2.51830149;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-9" + width="50.979172" + height="11.583224" + x="298.2135" + y="204.86987" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:3.05596852;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-3-6" + width="78.725777" + height="11.045557" + x="420.48233" + y="205.1387" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:1.32415295;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-3-7-2" + width="12.777371" + height="12.777372" + x="381.61642" + y="204.2728" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:2.20774055;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-9-7" + width="38.157749" + height="11.893785" + x="311.19022" + y="226.7146" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:3.05596852;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-3-6-8" + width="78.725777" + height="11.045557" + x="420.48233" + y="227.1387" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:1.32415295;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-3-7-2-5" + width="12.777371" + height="12.777372" + x="381.61642" + y="226.2728" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:3.51366901;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-7" + width="108.57265" + height="10.587856" + x="134.10712" + y="249.36755" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:2.51830149;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-4" + width="50.979172" + height="11.583224" + x="298.2135" + y="248.86987" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:3.05596852;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-3-1" + width="78.725777" + height="11.045557" + x="420.48233" + y="249.1387" /> + </g> + <rect + y="181.61072" + x="94.954338" + height="10.101525" + width="147.48227" + id="rect4136-59" + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:4;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="180.86987" + x="298.2135" + height="11.583224" + width="50.979172" + id="rect4136-4-7" + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:2.51830149;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="181.1387" + x="420.48233" + height="11.045557" + width="78.725777" + id="rect4136-4-3-5" + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:3.05596852;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="180.2728" + x="381.61642" + height="12.777372" + width="12.777371" + id="rect4136-4-3-7-3" + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:1.32415295;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="204.66086" + x="209.16187" + height="12.001238" + width="34.22459" + id="rect4136-5-8" + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:2.10028768;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="204.86987" + x="298.2135" + height="11.583224" + width="50.979172" + id="rect4136-4-9-8" + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:2.51830149;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="205.1387" + x="420.48233" + height="11.045557" + width="78.725777" + id="rect4136-4-3-6-3" + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:3.05596852;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="204.2728" + x="381.61642" + height="12.777372" + width="12.777371" + id="rect4136-4-3-7-2-1" + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:1.32415295;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="226.7146" + x="311.19022" + height="11.893785" + width="38.157749" + id="rect4136-4-9-7-8" + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:2.20774055;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="227.1387" + x="420.48233" + height="11.045557" + width="78.725777" + id="rect4136-4-3-6-8-9" + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:3.05596852;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="226.2728" + x="381.61642" + height="12.777372" + width="12.777371" + id="rect4136-4-3-7-2-5-6" + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:1.32415295;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="249.36755" + x="134.10712" + height="10.587856" + width="108.57265" + id="rect4136-7-4" + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:3.51366901;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="248.86987" + x="298.2135" + height="11.583224" + width="50.979172" + id="rect4136-4-4-3" + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:2.51830149;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="249.1387" + x="420.48233" + height="11.045557" + width="78.725777" + id="rect4136-4-3-1-3" + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:3.05596852;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + </g> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="297.99503" + y="161.40767" + id="text4431" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4433" + x="297.99503" + y="161.40767">Input/sam</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:36.84271622px;line-height:125%;font-family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="737.42059" + y="156.09688" + id="text4441" + sodipodi:linespacing="125%" + transform="scale(0.69109067,1.4469881)"><tspan + sodipodi:role="line" + id="tspan4443" + x="737.42059" + y="156.09688">}</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="561.64484" + y="209.89497" + id="text4445" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4447" + x="564.03253" + y="209.89497">Different </tspan><tspan + sodipodi:role="line" + x="561.64484" + y="228.64497" + id="tspan4449">Transcripts?</tspan></text> + <g + transform="translate(-7.3730462e-8,144)" + id="g4396-6"> + <rect + y="185.61624" + x="107.54617" + height="0.57523108" + width="373.3215" + id="rect4349-43" + style="fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:2.91969275;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="207.57932" + x="166.60318" + height="0.64910936" + width="314.30145" + id="rect4349-4-8-8-0" + style="fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:2.84581447;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <g + id="g4260-9"> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:4;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-2" + width="147.48227" + height="10.101525" + x="94.954338" + y="181.61072" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:2.51830149;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-5" + width="50.979172" + height="11.583224" + x="298.2135" + y="180.86987" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:3.05596852;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-3-4" + width="78.725777" + height="11.045557" + x="420.48233" + y="181.1387" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:1.32415295;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-3-7-0" + width="12.777371" + height="12.777372" + x="381.61642" + y="180.2728" /> + </g> + <rect + y="181.61072" + x="94.954338" + height="10.101525" + width="147.48227" + id="rect4136-59-4" + style="fill:#008080;fill-opacity:1;stroke:#008080;stroke-width:4;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="180.86987" + x="298.2135" + height="11.583224" + width="50.979172" + id="rect4136-4-7-8" + style="fill:#008080;fill-opacity:1;stroke:#008080;stroke-width:2.51830149;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="181.1387" + x="420.48233" + height="11.045557" + width="78.725777" + id="rect4136-4-3-5-1" + style="fill:#008080;fill-opacity:1;stroke:#008080;stroke-width:3.05596852;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="180.2728" + x="381.61642" + height="12.777372" + width="12.777371" + id="rect4136-4-3-7-3-2" + style="fill:#008080;fill-opacity:1;stroke:#008080;stroke-width:1.32415295;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="203.36755" + x="134.10712" + height="10.587856" + width="108.57265" + id="rect4136-7-4-1" + style="fill:#008080;fill-opacity:1;stroke:#008080;stroke-width:3.51366901;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="202.86987" + x="298.2135" + height="11.583224" + width="50.979172" + id="rect4136-4-4-3-0" + style="fill:#008080;fill-opacity:1;stroke:#008080;stroke-width:2.51830149;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="203.1387" + x="420.48233" + height="11.045557" + width="78.725777" + id="rect4136-4-3-1-3-5" + style="fill:#008080;fill-opacity:1;stroke:#008080;stroke-width:3.05596852;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + </g> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="297.99503" + y="305.40765" + id="text4431-1" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4433-1" + x="297.99503" + y="305.40765">Default collapse</tspan></text> + <g + transform="translate(-7.3730462e-8,236)" + id="g4396-8"> + <rect + y="189.61624" + x="107.54617" + height="0.57523108" + width="373.3215" + id="rect4349-6" + style="fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:2.91969275;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="211.40164" + x="325.52454" + height="1.0044436" + width="155.55775" + id="rect4349-4-8-84" + style="fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:2.49048018;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="233.57932" + x="166.60318" + height="0.64910936" + width="314.30145" + id="rect4349-4-8-8-7" + style="fill:#000000;fill-opacity:1;stroke:#000000;stroke-width:2.84581447;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <g + id="g4260-2" + transform="translate(0,4)"> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:4;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-40" + width="147.48227" + height="10.101525" + x="94.954338" + y="181.61072" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:2.51830149;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-6" + width="50.979172" + height="11.583224" + x="298.2135" + y="180.86987" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:3.05596852;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-3-2" + width="78.725777" + height="11.045557" + x="420.48233" + y="181.1387" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:1.32415295;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-3-7-9" + width="12.777371" + height="12.777372" + x="381.61642" + y="180.2728" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:2.20774055;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-9-7-1" + width="38.157749" + height="11.893785" + x="311.19022" + y="202.7146" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:3.05596852;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-3-6-8-1" + width="78.725777" + height="11.045557" + x="420.48233" + y="203.1387" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:1.32415295;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-3-7-2-5-0" + width="12.777371" + height="12.777372" + x="381.61642" + y="202.2728" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:3.51366901;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-7-3" + width="108.57265" + height="10.587856" + x="134.10712" + y="225.36755" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:2.51830149;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-4-4" + width="50.979172" + height="11.583224" + x="298.2135" + y="224.86987" /> + <rect + style="fill:#aca793;fill-opacity:1;stroke:#aca793;stroke-width:3.05596852;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4136-4-3-1-0" + width="78.725777" + height="11.045557" + x="420.48233" + y="225.1387" /> + </g> + <rect + y="185.61072" + x="94.954338" + height="10.101525" + width="147.48227" + id="rect4136-59-3" + style="fill:#67536c;fill-opacity:1;stroke:#67536c;stroke-width:4;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="184.86987" + x="298.2135" + height="11.583224" + width="50.979172" + id="rect4136-4-7-9" + style="fill:#67536c;fill-opacity:1;stroke:#67536c;stroke-width:2.51830149;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="185.1387" + x="420.48233" + height="11.045557" + width="78.725777" + id="rect4136-4-3-5-19" + style="fill:#67536c;fill-opacity:1;stroke:#67536c;stroke-width:3.05596852;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="184.2728" + x="381.61642" + height="12.777372" + width="12.777371" + id="rect4136-4-3-7-3-6" + style="fill:#67536c;fill-opacity:1;stroke:#67536c;stroke-width:1.32415295;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="206.7146" + x="311.19022" + height="11.893785" + width="38.157749" + id="rect4136-4-9-7-8-0" + style="fill:#67536c;fill-opacity:1;stroke:#67536c;stroke-width:2.20774055;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="207.1387" + x="420.48233" + height="11.045557" + width="78.725777" + id="rect4136-4-3-6-8-9-5" + style="fill:#67536c;fill-opacity:1;stroke:#67536c;stroke-width:3.05596852;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="206.2728" + x="381.61642" + height="12.777372" + width="12.777371" + id="rect4136-4-3-7-2-5-6-6" + style="fill:#67536c;fill-opacity:1;stroke:#67536c;stroke-width:1.32415295;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="229.36755" + x="134.10712" + height="10.587856" + width="108.57265" + id="rect4136-7-4-6" + style="fill:#67536c;fill-opacity:1;stroke:#67536c;stroke-width:3.51366901;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="228.86987" + x="298.2135" + height="11.583224" + width="50.979172" + id="rect4136-4-4-3-4" + style="fill:#67536c;fill-opacity:1;stroke:#67536c;stroke-width:2.51830149;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + <rect + y="229.1387" + x="420.48233" + height="11.045557" + width="78.725777" + id="rect4136-4-3-1-3-0" + style="fill:#67536c;fill-opacity:1;stroke:#67536c;stroke-width:3.05596852;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + </g> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="297.99503" + y="401.40765" + id="text4431-0" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4433-4" + x="297.99503" + y="401.40765">--dun-merge-5-shorter collapse</tspan></text> + </g> +</svg> diff --git a/_reader/figures/longread_workflow.pdf b/_reader/figures/longread_workflow.pdf new file mode 100644 index 0000000000000000000000000000000000000000..7ad50902a9de876e6b95de4be83584b4cdbc18a3 Binary files /dev/null and b/_reader/figures/longread_workflow.pdf differ diff --git a/_reader/figures/longread_workflow.svg b/_reader/figures/longread_workflow.svg new file mode 100644 index 0000000000000000000000000000000000000000..bea7de1c9f01d77b9192bab180f9da3166124d14 --- /dev/null +++ b/_reader/figures/longread_workflow.svg @@ -0,0 +1,847 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!-- Created with Inkscape (http://www.inkscape.org/) --> + +<svg + xmlns:dc="http://purl.org/dc/elements/1.1/" + xmlns:cc="http://creativecommons.org/ns#" + xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" + xmlns:svg="http://www.w3.org/2000/svg" + xmlns="http://www.w3.org/2000/svg" + xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd" + xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape" + width="130.59871mm" + height="165.21391mm" + viewBox="0 0 462.75134 585.40362" + id="svg2" + version="1.1" + inkscape:version="0.91 r13725" + sodipodi:docname="longread_workflow.svg"> + <defs + id="defs4"> + <marker + inkscape:stockid="Arrow1Send" + orient="auto" + refY="0" + refX="0" + id="Arrow1Send" + style="overflow:visible" + inkscape:isstock="true"> + <path + id="path4409" + d="M 0,0 5,-5 -12.5,0 5,5 0,0 Z" + style="fill:#aca793;fill-opacity:1;fill-rule:evenodd;stroke:#aca793;stroke-width:1pt;stroke-opacity:1" + transform="matrix(-0.2,0,0,-0.2,-1.2,0)" + inkscape:connector-curvature="0" /> + </marker> + <marker + inkscape:stockid="Arrow1Send" + orient="auto" + refY="0" + refX="0" + id="Arrow1Send-8" + style="overflow:visible" + inkscape:isstock="true"> + <path + inkscape:connector-curvature="0" + id="path4409-5" + d="M 0,0 5,-5 -12.5,0 5,5 0,0 Z" + style="fill:#aca793;fill-opacity:1;fill-rule:evenodd;stroke:#aca793;stroke-width:1pt;stroke-opacity:1" + transform="matrix(-0.2,0,0,-0.2,-1.2,0)" /> + </marker> + <marker + inkscape:stockid="Arrow1Send" + orient="auto" + refY="0" + refX="0" + id="Arrow1Send-8-7" + style="overflow:visible" + inkscape:isstock="true"> + <path + inkscape:connector-curvature="0" + id="path4409-5-5" + d="M 0,0 5,-5 -12.5,0 5,5 0,0 Z" + style="fill:#aca793;fill-opacity:1;fill-rule:evenodd;stroke:#aca793;stroke-width:1pt;stroke-opacity:1" + transform="matrix(-0.2,0,0,-0.2,-1.2,0)" /> + </marker> + <marker + inkscape:stockid="Arrow1Send" + orient="auto" + refY="0" + refX="0" + id="Arrow1Send-8-7-8" + style="overflow:visible" + inkscape:isstock="true"> + <path + inkscape:connector-curvature="0" + id="path4409-5-5-8" + d="M 0,0 5,-5 -12.5,0 5,5 0,0 Z" + style="fill:#aca793;fill-opacity:1;fill-rule:evenodd;stroke:#aca793;stroke-width:1pt;stroke-opacity:1" + transform="matrix(-0.2,0,0,-0.2,-1.2,0)" /> + </marker> + <marker + inkscape:stockid="Arrow1Send" + orient="auto" + refY="0" + refX="0" + id="Arrow1Send-8-7-8-3" + style="overflow:visible" + inkscape:isstock="true"> + <path + inkscape:connector-curvature="0" + id="path4409-5-5-8-3" + d="M 0,0 5,-5 -12.5,0 5,5 0,0 Z" + style="fill:#aca793;fill-opacity:1;fill-rule:evenodd;stroke:#aca793;stroke-width:1pt;stroke-opacity:1" + transform="matrix(-0.2,0,0,-0.2,-1.2,0)" /> + </marker> + <marker + inkscape:stockid="Arrow1Send" + orient="auto" + refY="0" + refX="0" + id="Arrow1Send-8-8" + style="overflow:visible" + inkscape:isstock="true"> + <path + inkscape:connector-curvature="0" + id="path4409-5-6" + d="M 0,0 5,-5 -12.5,0 5,5 0,0 Z" + style="fill:#aca793;fill-opacity:1;fill-rule:evenodd;stroke:#aca793;stroke-width:1pt;stroke-opacity:1" + transform="matrix(-0.2,0,0,-0.2,-1.2,0)" /> + </marker> + <marker + inkscape:stockid="Arrow1Send" + orient="auto" + refY="0" + refX="0" + id="Arrow1Send-8-4" + style="overflow:visible" + inkscape:isstock="true"> + <path + inkscape:connector-curvature="0" + id="path4409-5-8" + d="M 0,0 5,-5 -12.5,0 5,5 0,0 Z" + style="fill:#aca793;fill-opacity:1;fill-rule:evenodd;stroke:#aca793;stroke-width:1pt;stroke-opacity:1" + transform="matrix(-0.2,0,0,-0.2,-1.2,0)" /> + </marker> + <marker + inkscape:stockid="Arrow1Send" + orient="auto" + refY="0" + refX="0" + id="Arrow1Send-8-89" + style="overflow:visible" + inkscape:isstock="true"> + <path + inkscape:connector-curvature="0" + id="path4409-5-7" + d="M 0,0 5,-5 -12.5,0 5,5 0,0 Z" + style="fill:#aca793;fill-opacity:1;fill-rule:evenodd;stroke:#aca793;stroke-width:1pt;stroke-opacity:1" + transform="matrix(-0.2,0,0,-0.2,-1.2,0)" /> + </marker> + <marker + inkscape:stockid="Arrow1Send" + orient="auto" + refY="0" + refX="0" + id="Arrow1Send-8-6" + style="overflow:visible" + inkscape:isstock="true"> + <path + inkscape:connector-curvature="0" + id="path4409-5-4" + d="M 0,0 5,-5 -12.5,0 5,5 0,0 Z" + style="fill:#aca793;fill-opacity:1;fill-rule:evenodd;stroke:#aca793;stroke-width:1pt;stroke-opacity:1" + transform="matrix(-0.2,0,0,-0.2,-1.2,0)" /> + </marker> + <marker + inkscape:stockid="Arrow1Send" + orient="auto" + refY="0" + refX="0" + id="Arrow1Send-8-3" + style="overflow:visible" + inkscape:isstock="true"> + <path + inkscape:connector-curvature="0" + id="path4409-5-0" + d="M 0,0 5,-5 -12.5,0 5,5 0,0 Z" + style="fill:#aca793;fill-opacity:1;fill-rule:evenodd;stroke:#aca793;stroke-width:1pt;stroke-opacity:1" + transform="matrix(-0.2,0,0,-0.2,-1.2,0)" /> + </marker> + <marker + inkscape:stockid="Arrow1Send" + orient="auto" + refY="0" + refX="0" + id="Arrow1Send-8-7-8-2" + style="overflow:visible" + inkscape:isstock="true"> + <path + inkscape:connector-curvature="0" + id="path4409-5-5-8-5" + d="M 0,0 5,-5 -12.5,0 5,5 0,0 Z" + style="fill:#aca793;fill-opacity:1;fill-rule:evenodd;stroke:#aca793;stroke-width:1pt;stroke-opacity:1" + transform="matrix(-0.2,0,0,-0.2,-1.2,0)" /> + </marker> + <marker + inkscape:stockid="Arrow1Send" + orient="auto" + refY="0" + refX="0" + id="Arrow1Send-8-7-8-0" + style="overflow:visible" + inkscape:isstock="true"> + <path + inkscape:connector-curvature="0" + id="path4409-5-5-8-59" + d="M 0,0 5,-5 -12.5,0 5,5 0,0 Z" + style="fill:#aca793;fill-opacity:1;fill-rule:evenodd;stroke:#aca793;stroke-width:1pt;stroke-opacity:1" + transform="matrix(-0.2,0,0,-0.2,-1.2,0)" /> + </marker> + <marker + inkscape:stockid="Arrow1Send" + orient="auto" + refY="0" + refX="0" + id="Arrow1Send-8-7-8-0-9" + style="overflow:visible" + inkscape:isstock="true"> + <path + inkscape:connector-curvature="0" + id="path4409-5-5-8-59-2" + d="M 0,0 5,-5 -12.5,0 5,5 0,0 Z" + style="fill:#aca793;fill-opacity:1;fill-rule:evenodd;stroke:#aca793;stroke-width:1pt;stroke-opacity:1" + transform="matrix(-0.2,0,0,-0.2,-1.2,0)" /> + </marker> + </defs> + <sodipodi:namedview + id="base" + pagecolor="#ffffff" + bordercolor="#666666" + borderopacity="1.0" + inkscape:pageopacity="0.0" + inkscape:pageshadow="2" + inkscape:zoom="0.98994949" + inkscape:cx="241.19547" + inkscape:cy="109.05528" + inkscape:document-units="px" + inkscape:current-layer="layer1" + showgrid="false" + inkscape:window-width="1855" + inkscape:window-height="1176" + inkscape:window-x="1985" + inkscape:window-y="24" + inkscape:window-maximized="1" + showguides="true" + inkscape:guide-bbox="true" + fit-margin-top="0" + fit-margin-left="0" + fit-margin-right="0" + fit-margin-bottom="0" /> + <metadata + id="metadata7"> + <rdf:RDF> + <cc:Work + rdf:about=""> + <dc:format>image/svg+xml</dc:format> + <dc:type + rdf:resource="http://purl.org/dc/dcmitype/StillImage" /> + <dc:title></dc:title> + </cc:Work> + </rdf:RDF> + </metadata> + <g + inkscape:label="Layer 1" + inkscape:groupmode="layer" + id="layer1" + transform="translate(65.761436,-15.905114)"> + <path + style="fill:none;fill-opacity:1;fill-rule:evenodd;stroke:#aca793;stroke-width:4.00000048;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" + d="m 82.62947,599.30874 -0.556924,-53.3915 78.482974,-20.34963 75.59684,20.34963 -2,53.3915 z" + id="path10920" + inkscape:connector-curvature="0" + sodipodi:nodetypes="cccccc" /> + <path + style="fill:none;fill-rule:evenodd;stroke:#aca793;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Arrow1Send-8-7-8-2)" + d="m 192.03051,301.84438 79.28422,53.02025 43.95439,70.21836" + id="path4388-9-3-3-4" + inkscape:connector-curvature="0" + sodipodi:nodetypes="ccc" /> + <path + style="fill:none;fill-rule:evenodd;stroke:#aca793;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Arrow1Send-8-3)" + d="m 317.33003,457.38757 0,23.23351" + id="path4388-9-9" + inkscape:connector-curvature="0" + sodipodi:nodetypes="cc" /> + <path + style="fill:none;fill-rule:evenodd;stroke:#aca793;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Arrow1Send-8-6)" + d="m 28.08122,458.39772 0,23.23351" + id="path4388-9-30" + inkscape:connector-curvature="0" + sodipodi:nodetypes="cc" /> + <path + style="fill:none;fill-rule:evenodd;stroke:#aca793;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Arrow1Send-8-89)" + d="m 128.55333,457.40787 0,23.23351" + id="path4388-9-7" + inkscape:connector-curvature="0" + sodipodi:nodetypes="cc" /> + <path + style="fill:none;fill-rule:evenodd;stroke:#aca793;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Arrow1Send-8-8)" + d="m -45.91878,395.78857 0,23.23351" + id="path4388-9-0" + inkscape:connector-curvature="0" + sodipodi:nodetypes="cc" /> + <path + style="fill:none;fill-rule:evenodd;stroke:#aca793;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Arrow1Send-8-4)" + d="m 195.56349,394.77842 0,23.23351" + id="path4388-9-8" + inkscape:connector-curvature="0" + sodipodi:nodetypes="cc" /> + <path + style="fill:none;fill-rule:evenodd;stroke:#aca793;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Arrow1Send-8-7-8-3)" + d="m 13.91878,247.29614 0,112.12693" + id="path4388-9-3-3-3" + inkscape:connector-curvature="0" + sodipodi:nodetypes="cc" /> + <path + style="fill:none;fill-rule:evenodd;stroke:#aca793;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Arrow1Send-8-7-8)" + d="m 140.67517,296.83423 0,63.63961" + id="path4388-9-3-3" + inkscape:connector-curvature="0" + sodipodi:nodetypes="cc" /> + <path + style="fill:none;fill-rule:evenodd;stroke:#aca793;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Arrow1Send-8-7)" + d="m 162.65486,242.18447 0,23.23351" + id="path4388-9-3" + inkscape:connector-curvature="0" + sodipodi:nodetypes="cc" /> + <rect + style="fill:#cdde87;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200-6-2-5-7" + width="119.198" + height="27.274122" + x="-56.690384" + y="491.22244" /> + <rect + style="fill:#cdde87;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200-6-2-5-7-4" + width="104.04572" + height="27.274122" + x="105.15738" + y="490.21231" /> + <rect + style="fill:#cdde87;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200-6-2-5-7-1" + width="55.558392" + height="27.274122" + x="289.99503" + y="490.2326" /> + <rect + style="fill:#ffe680;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200-6-2" + width="117.17769" + height="27.274122" + x="-57.700523" + y="431.62344" /> + <rect + style="fill:#ffe680;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200-6-2-8" + width="108.08631" + height="27.274122" + x="106.16754" + y="431.62347" /> + <rect + style="fill:#ffe680;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200-6-2-5" + width="141.42136" + height="27.274122" + x="244.55844" + y="431.62347" /> + <rect + style="fill:#ffe680;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200-6-9" + width="101.01526" + height="27.274122" + x="104.14725" + y="367.98386" /> + <rect + style="fill:#ffccaa;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200-6-6" + width="117.17769" + height="27.274122" + x="-56.670067" + y="368.68942" /> + <rect + style="fill:#ffccaa;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200-6-1" + width="131.31982" + height="27.274122" + x="-65.761436" + y="221.51173" /> + <rect + style="fill:#ffe680;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200-6-7" + width="96.97464" + height="27.274122" + x="104.14725" + y="221.51173" /> + <rect + style="fill:#ffe680;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200-6-5" + width="87.88327" + height="27.274122" + x="110.20816" + y="275.99905" /> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="111.85931" + y="239.56741" + id="text4148" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4150" + x="111.85931" + y="239.56741">unpolished</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="-58.977531" + y="240.8024" + id="text4152" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4154" + x="-58.977531" + y="240.8024">unpolished flnc</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="122.05293" + y="294.3136" + id="text4156" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4158" + x="122.05293" + y="294.3136">polished</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="112.49309" + y="449.25998" + id="text4180" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4182" + x="112.49309" + y="449.25998">collapsed hq</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="252.07471" + y="450.51523" + id="text4184" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4186" + x="252.07471" + y="450.51523">collapsed cogent</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="-43.914452" + y="388.01703" + id="text4188" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4190" + x="-43.914452" + y="388.01703">mapped flnc</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="114.29804" + y="386.30127" + id="text4192" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4194" + x="114.29804" + y="386.30127">mapped hq</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="-50.148678" + y="449.6066" + id="text4196" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4198" + x="-50.148678" + y="449.6066">collapsed flnc</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="-47.598999" + y="508.90012" + id="text4314" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4316" + x="-47.598999" + y="508.90012">augustus flnc</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="112.22845" + y="508.85953" + id="text4318" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4320" + x="112.22845" + y="508.85953">augustus hq</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="297.0864" + y="507.88995" + id="text4322" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4324" + x="297.0864" + y="507.88995">angel</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="30.263966" + y="197.83408" + id="text4372" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4374" + x="30.263966" + y="197.83408">isoseq3 cluster</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="173.82745" + y="266.46353" + id="text4376" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4378" + x="173.82745" + y="266.46353">isoseq3 polish</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="78.55864" + y="325.59549" + id="text4384" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4386" + x="78.55864" + y="325.59549">gmap</tspan><tspan + sodipodi:role="line" + x="78.55864" + y="344.34549" + id="tspan4796">(-> genome)</tspan></text> + <g + id="g5825" + transform="translate(0,4)"> + <g + transform="translate(468.95444,20.162441)" + id="g5807"> + <rect + style="fill:#cdde87;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200-1-8-6-4" + width="176.77669" + height="27.274122" + x="-288.90363" + y="75.542107" /> + <rect + style="fill:#ffe680;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200-1-8-6" + width="213.14218" + height="27.274122" + x="-288.90363" + y="49.542107" /> + <rect + style="fill:#ffccaa;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200-1-8" + width="175.76654" + height="27.274122" + x="-288.90363" + y="23.542141" /> + <rect + style="fill:#ffaaaa;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200-1" + width="96.97464" + height="27.274122" + x="-288.90399" + y="-2.742131" /> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="-273.27554" + y="16.541292" + id="text4160" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4162" + x="-273.27554" + y="16.541292">subreads</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="-274.94925" + y="43.148224" + id="text4164" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4166" + x="-274.94925" + y="43.148224">consensus full reads</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="-275.36768" + y="67.611588" + id="text4168" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4170" + x="-275.36768" + y="67.611588">reconstructed transcripts</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="-276.40753" + y="94.475937" + id="text4176" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4178" + x="-276.40753" + y="94.475937">coding gene models</tspan></text> + </g> + <rect + y="13.905127" + x="176.797" + height="112.12689" + width="218.19295" + id="rect5823" + style="fill:none;fill-opacity:1;stroke:#aca793;stroke-width:4;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> + </g> + <path + style="fill:none;fill-rule:evenodd;stroke:#aca793;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Arrow1Send-8-7-8-0)" + d="M 59.538084,158.40272 28.17026,177.52818 14.08122,215.98141" + id="path4388-9-3-3-46" + inkscape:connector-curvature="0" + sodipodi:nodetypes="ccc" /> + <path + style="fill:none;fill-rule:evenodd;stroke:#aca793;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Arrow1Send-8-7-8-0-9)" + d="m 115.03642,157.4911 31.36782,19.12546 14.08904,38.45323" + id="path4388-9-3-3-46-2" + inkscape:connector-curvature="0" + sodipodi:nodetypes="ccc" /> + <path + style="fill:none;fill-rule:evenodd;stroke:#aca793;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Arrow1Send)" + d="m 89.781746,40.148744 0,23.233508" + id="path4388" + inkscape:connector-curvature="0" + sodipodi:nodetypes="cc" /> + <path + style="fill:none;fill-rule:evenodd;stroke:#aca793;stroke-width:4;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1;marker-end:url(#Arrow1Send-8)" + d="m 89.802051,102.84433 0,23.23351" + id="path4388-9" + inkscape:connector-curvature="0" + sodipodi:nodetypes="cc" /> + <rect + style="fill:#ffccaa;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200-6-3" + width="78.791893" + height="27.274122" + x="48.365494" + y="133.58786" /> + <rect + style="fill:#ffaaaa;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200" + width="96.97464" + height="27.274122" + x="40.284271" + y="15.905084" /> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="55.550404" + y="35.753052" + id="text4136" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4138" + x="55.550404" + y="35.753052">subreads</tspan></text> + <rect + style="fill:#ffccaa;fill-opacity:1;stroke:none;stroke-width:2;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" + id="rect4200-6" + width="46.467014" + height="27.274122" + x="65.538086" + y="76.009155" /> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="76.416977" + y="92.895905" + id="text4140" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4142" + x="76.416977" + y="92.895905">ccs</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="56.213924" + y="152.4057" + id="text4144" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4146" + x="56.213924" + y="152.4057">trimmed</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="96.852814" + y="61.36195" + id="text4364" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4366" + x="96.852814" + y="61.36195">ccs</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="96.852814" + y="122.98125" + id="text4368" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan4370" + x="96.852814" + y="122.98125">lima</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="238.49754" + y="326.0625" + id="text10272" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan10274" + x="238.49754" + y="326.0625">cogent</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="329.49249" + y="477.52448" + id="text10276" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan10278" + x="329.49249" + y="477.52448">angel</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="-37.335033" + y="416.97626" + id="text10334" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan10336" + x="-37.335033" + y="416.97626">collapse_isoforms_by_sam.py</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="43.436562" + y="477.58539" + id="text10800" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan10802" + x="43.436562" + y="477.58539">augustus</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="285.79193" + y="570.80109" + id="text10894" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + x="285.79193" + y="570.80109" + id="tspan10898" + style="font-style:italic">de novo</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="157.87939" + y="551.79095" + id="text10902" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan10904" + x="157.87939" + y="551.79095">ref based,</tspan><tspan + sodipodi:role="line" + x="160.26709" + y="570.54095" + id="tspan10906">mapped high </tspan><tspan + sodipodi:role="line" + x="157.87939" + y="589.29095" + id="tspan10908">quality transcripts</tspan></text> + <text + xml:space="preserve" + style="font-style:normal;font-weight:normal;font-size:15px;line-height:125%;font-family:sans-serif;text-align:center;letter-spacing:0px;word-spacing:0px;text-anchor:middle;fill:#000000;fill-opacity:1;stroke:none;stroke-width:1px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" + x="0.37531292" + y="552.80115" + id="text10910" + sodipodi:linespacing="125%"><tspan + sodipodi:role="line" + id="tspan10912" + x="0.37531292" + y="552.80115">ref based,</tspan><tspan + sodipodi:role="line" + x="2.7630081" + y="571.55115" + id="tspan10914">mapped </tspan><tspan + sodipodi:role="line" + x="0.37531292" + y="590.30115" + id="tspan10922">reads</tspan></text> + <path + style="fill:none;fill-opacity:1;fill-rule:evenodd;stroke:#aca793;stroke-width:4.00000048;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" + d="m 256.51983,599.30874 -0.55692,-53.3915 62.48298,-20.34963 59.59684,20.34963 -2,53.3915 z" + id="path10920-4" + inkscape:connector-curvature="0" + sodipodi:nodetypes="cccccc" /> + <path + style="fill:none;fill-opacity:1;fill-rule:evenodd;stroke:#aca793;stroke-width:4.00000048;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" + d="m -59.37053,599.30874 -0.556924,-53.3915 62.4829736,-20.34963 59.5968494,20.34963 -2,53.3915 z" + id="path10920-7" + inkscape:connector-curvature="0" + sodipodi:nodetypes="cccccc" /> + </g> +</svg> diff --git a/_reader/intro_scripts_data/R/ggplot.R b/_reader/intro_scripts_data/R/ggplot.R new file mode 100644 index 0000000000000000000000000000000000000000..c53fa0a17697ca15f3e92093f6a2a0b25f6083ad --- /dev/null +++ b/_reader/intro_scripts_data/R/ggplot.R @@ -0,0 +1,72 @@ +head(df0) +# now we will learn about how ggplot makes figures +# the minimal information is where is the data, what is the x variable +# and what is the y variable and what we want to plot +# let's try + +# this will make and save plot object +baseplot <- ggplot(data = df0, aes(x = hp, y = gas_mileage)) + + geom_point() +# the first line indicates the data, what x is and what y is +# the second line says we want points + +# this will display plot object +baseplot + +# now we can customize that plot and make it prettier +# first off, different style, we do that by using theme +baseplot + theme_minimal() # display plot object + modification +# or alternatively, if you prefer +baseplot + theme_bw() +# now we'll save theme_bw with the object for later use +baseplot <- baseplot + theme_bw() + +# now we want colored dots, simply red ones +# google ``r colors" for an idea of the named colors available in R +# you can choose them by name, number or hexcode; we like names +baseplot + geom_point(color = "firebrick") + +# and a different size and shape +baseplot + geom_point(color = "firebrick", shape = 18, size = 4) + +# or choose colors based on something else, like the number of cylinders + +# note, that while many things can just be 'added' to a ggplot, +# a second ggplot object is not one of them, so we'll start over +# as it's normally best to specify dynamic color as an +# argument to aesthetics (called aes). +ggplot(data = df0, aes(x = hp, y = gas_mileage, color = cyl)) + + geom_point(shape = 18, size = 4) + + theme_bw() + +# you can also make your color changes abrupt, not continuous, +# by defining cyl as a factor rather than a number +# first check what they are +class(df0$cyl) +# now make them a factor +baseplot <- ggplot( + data = df0, + aes(x = hp, y = gas_mileage, color = as.factor(cyl)) +) + + geom_point(shape = 18, size = 4) + + theme_bw() +baseplot +# not the best colors, but you can modify them using scale_color_manual +# to do that you need to know which levels you have as your factors +levels(as.factor(df0$cyl)) +# and now you can look at colors and decide what is what +baseplot <- baseplot + + scale_color_manual( + values = c( + "4" = "black", + "6" = "orange", + "8" = "red3" + ), + name = "cylinders" + ) +baseplot +# finally, a little more polishing +baseplot + + xlab("gross horsepower") + + ylab("miles per gallon") + + theme(text = element_text(size = 16)) diff --git a/_reader/intro_scripts_data/example_table.txt b/_reader/intro_scripts_data/example_table.txt new file mode 100644 index 0000000000000000000000000000000000000000..ec459db60809864c530b55983342f9c3b811dc0c --- /dev/null +++ b/_reader/intro_scripts_data/example_table.txt @@ -0,0 +1,33 @@ +"" "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb" +"Mazda RX4" 21 6 160 110 3.9 2.62 16.46 0 1 4 4 +"Mazda RX4 Wag" 21 6 160 110 3.9 2.875 17.02 0 1 4 4 +"Datsun 710" 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 +"Hornet 4 Drive" 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 +"Hornet Sportabout" 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 +"Valiant" 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 +"Duster 360" 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 +"Merc 240D" 24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 +"Merc 230" 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 +"Merc 280" 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 +"Merc 280C" 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 +"Merc 450SE" 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 +"Merc 450SL" 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 +"Merc 450SLC" 15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 +"Cadillac Fleetwood" 10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 +"Lincoln Continental" 10.4 8 460 215 3 5.424 17.82 0 0 3 4 +"Chrysler Imperial" 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 +"Fiat 128" 32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1 +"Honda Civic" 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 +"Toyota Corolla" 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1 +"Toyota Corona" 21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1 +"Dodge Challenger" 15.5 8 318 150 2.76 3.52 16.87 0 0 3 2 +"AMC Javelin" 15.2 8 304 150 3.15 3.435 17.3 0 0 3 2 +"Camaro Z28" 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4 +"Pontiac Firebird" 19.2 8 400 175 3.08 3.845 17.05 0 0 3 2 +"Fiat X1-9" 27.3 4 79 66 4.08 1.935 18.9 1 1 4 1 +"Porsche 914-2" 26 4 120.3 91 4.43 2.14 16.7 0 1 5 2 +"Lotus Europa" 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2 +"Ford Pantera L" 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4 +"Ferrari Dino" 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6 +"Maserati Bora" 15 8 301 335 3.54 3.57 14.6 0 1 5 8 +"Volvo 142E" 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 diff --git a/_reader/sections/00_section_arc.tex b/_reader/sections/00_section_arc.tex new file mode 100644 index 0000000000000000000000000000000000000000..b78a4c9d9f474acb87281a8fbe26d77e93bebbf3 --- /dev/null +++ b/_reader/sections/00_section_arc.tex @@ -0,0 +1,164 @@ +\section{Course material} + +This workshop is organized based on an ARC (Annotated Research Context). + +\subsection{Annotated Research Context} + +The ARC stores all required input data, backup data, scripts, presentations as well as this reader in one research data package. + +\subsubsection{The basic ARC structure summarized} + +\begin{itemize} + \item \il{studies} \verb| -> | external data and sample metadata + \item \il{assays} \verb| -> | measurement data (i.e. "raw" sequence data) and metadata + \item \il{workflows} \verb| -> | computational analyses (i.e. "scripts") + \item \il{runs} \verb| -> | outputs of workflow (i.e. "results") +\end{itemize} + + +\subsubsection{Material added to the ARC for this course} + +\begin{itemize} + \item \il{_reader} \verb| -> | The basis to this reader (written in \LaTeX) + \item \il{_slides} \verb| -> | Slides presented during the workshop + \item \il{_handouts} \verb| -> | Cheat sheets and additional materials + \item \il{runs/_backups} \verb| -> | Backups for runs, in case a workflow unexpectedly did not work or takes too long. + \begin{itemize} + \item If a workflow would output \il{blat_results}, but failed, copy \il{runs/_backup/blat_results} to \il{runs/blat_results}. + \end{itemize} +\end{itemize} + + +\subsubsection{Disclaimer} + +For this workshop we do not take full advantage of all ARC features. This is in part due to the fact that some developments are work in progress. +More so, we could easily spend two more days exploring those features, which are simply out of scope for a three-day workshop focusing on RNA-Seq data analysis. And besides, the ARC will in the future circumvent some annoyances that we run into for training purposes during this course. + +If you want to learn more, check out the \href{https://nfdi4plants.org}{DataPLANT website}. + +\subsection{During the workshop} + +For the in-person workshop, we have already stored the ARC to the "Raumlaufwerke" +folder under our room number (25.41.00.41). Please copy the folder "rnaseq-workshop" +into your own home folder, so that everyone has their own copy. +We have written all code and commands so that they can be "executed from" the root of the ARC. + +As a check now, please open a terminal (applications \verb| -> | terminal) and run: + +\lstset{language=bash, style=bashstyle} +\begin{lstlisting} +cd $HOME/rnaseq-workshop +\end{lstlisting} + +This is also a very good thing to try throughout the workshop if you're getting +any sort of 'file not found' error, make sure you're in the right place. + +%% TODO: Software dependencies. +\subsection{Docker} +We rely heavily on Docker for this workshop, which allows us to provide an 'image' +with all the software you will need for the workshop installed and ready to use. +There are two major reasons we chose Docker. 1) Convenience for development, as it +works the same on our normal work machines, and on the ZIM teaching machines we +use for the workshop, and more importantly 2) Portability, you can all take the +Docker file home, work once through standardized instructions for installing +Docker: \url{https://docs.docker.com/} (or ask your admin to), and then you can +readily reproduce the work from the course. + +As a brief disclaimer however, none of this is intended to demonstrate +good practice with Docker, particularly not for other purposes. + +The Docker image ``rnaseq\_docker.tar" is available on Sciebo at: %% TODO! +\url{https://uni-duesseldorf.sciebo.de/s/53pA9W9TbOKTgGQ}. +the password is written on the board, or you have received it by e-mail. + +The image can also be found in the ``Raumlaufwerke" folder under our room (25.41.00.41). + +\subsubsection{load image} +To load the image, run: + +\begin{lstlisting} +docker image load -i </path/to/>rnaseq_docker.tar +\end{lstlisting} + +Where \il{</path/to/>} is replaced by e.g. ``Downloads" +or the path to the ``Raumlaufwerke" folder as appropriate, e.g. + +\begin{lstlisting} +docker image load -i Downloads/rnaseq_docker.tar.gz +\end{lstlisting} + +\subsubsection{run image as container} +To start a writable `container' from the image where you can work +interactively, \emph{from your home directory} (\il{cd $HOME}) +run: + +\begin{lstlisting} +docker run -it --name rnalive -p 8889:8889 --mount \ + type=bind,source="$(pwd)"/rnaseq-workshop,target=/home/ZIM-gast/rnaseq-workshop \ + rnaseq:latest +\end{lstlisting} + +\subsubsection{restart container} +\begin{lstlisting} +docker start -i rnalive +\end{lstlisting} + +\subsubsection{exiting} +You can exit a container with Ctr+D or by typing \il{exit}. + +You won't need to do this much in the workshop, however. + +\subsubsection{Then what?} +You should now have a terminal, that looks a lot like the one before, +except now know that you're in the 'container', where all the necessary +software is installed. + +\begin{lstlisting} +# the workshop directory is shared +cd $HOME/rnaseq-workshop +# you should be able to see a list of all the files, that you can, +# e.g. by opening the folder in a file manager +ls +\end{lstlisting} + +\subsubsection{What to do when} +At home: +\begin{itemize} +\item{load: once per machine or update to the image} +\item{run: once after each load, on first use} +\item{start: whenever you want to use it} +\end{itemize} + +For the workshop however, the computers are wiped clean each night. +So you will have to run : + +\begin{itemize} +\item{load: each morning} +\item{run: after running load each morning} +\item{start: as needed, should you exit the container} +\end{itemize} + +\textbf{Important:} keep any data you wish to save within the folder +rnaseq-workshop, and make sure the contents of this folder are copied +to either the ``Raumlaufwerke" folder or to e.g. your USB flash drive or +your sciebo account before you logout of the computer!!! + +\fbox{\begin{minipage}{45em} +If you have any trouble whatsoever with Docker, please ask!! + +Properly understanding docker is beyond the scope of this workshop, +so don't worry if the above ``doesn't make sense", +but you will absolutely need it to be running as expected for +the other parts to work, so just ask :-) +\end{minipage}} + + +\subsection{After the workshop} + +At least for the next half year, the ARC to this workshop is shared publicly under a CC-BY 4.0 license at \url{https://git.nfdi4plants.org/brilator/rnaseq-workshop.git}. +Feel free to download and unarchive the whole ARC as a zip / tar archive or \il{git clone} or \il{git fork} the ARC. Just like docker, explaining the full usage of `git` is beyond the scope of this course. + +To ensure, that all code works as used in this reader, make sure to store the ARC as \il{rnaseq-workshop} in your \$HOME directory. + +Similarly, the built docker image will remain available on sciebo. The built image is not included +in the ARC as we did not have the time to check licenses for included software, regarding distribution. diff --git a/_reader/sections/01_section_linux.tex b/_reader/sections/01_section_linux.tex new file mode 100644 index 0000000000000000000000000000000000000000..be7e146824582c1ad2b575e683d0e8ac529abc69 --- /dev/null +++ b/_reader/sections/01_section_linux.tex @@ -0,0 +1,699 @@ +\section{Basics} +\subsection{Linux \& Bash} +\subsubsection{Command-line basics} +We type our instruction for the computer into the terminal, open one by +typing (Ctrl + Alt + T) or by navigating to one using the GUI. In all +languages used in the workshop, the \texttt{\#} character indicates +comments, and what follows will not be executed by the computer, but is +there for the user. +\lstset{language=bash, style=bashstyle} + +\begin{lstlisting} +# list the contents of the current directory +ls +# move to the home directory +cd +# list the contents +ls +# change to the directory with the data to be used +cd $HOME/rnaseq-workshop +# change to one directory up +cd .. +# make a directory and change into it +mkdir a_dummy_directory +cd a_dummy_directory +\end{lstlisting} + +Alright, we can change between folders, look at their contents, etc... but +we hardly needed to learn to use a command line to accomplish this. Let's +very briefly look at some things that are nice on the command line, like +variables, loops, and wild cards. + +\begin{lstlisting} +## variables +# set one variable +species="Arabidopsis_thaliana_Col0" + +# use (here simply display) the variable by adding a '$' before the name +echo $species + +# that might already save some typing, or make it easy to perform +# a similar analysis on multiple different species. But variables are +# even more useful when it comes to loops + +## loops +# this will loop through the numbers 1 - 20 +for i in {01..20}; + +# the keyword 'do' designates the start of each iteration +do + # 'touch' will create an empty text file with the given name + touch my_sample_${i}.txt + + # the keyword 'done' is used to designate the end of each iteration +done + +# look at the result +ls + +## wild cards +# we need one more file for the next example +touch keep_me_for_now.md + +# note any differences in the results of the following commands + +ls +ls * +ls *.md +ls my_sample_* + +# now we can selectively clean up all those files we made +rm my_sample_* +ls # check result +# be careful with wild cards though, and when possible restrict +# their scope. Think about what would happen if you ran +# rm * +# in the wrong directory + +## more on variables +# if you were wondering why we used ${i} and not $i in the loop +# the answer is it's best practice when the variable is used +# within words. Compare the results + +echo $i +echo $isample.txt # '$isample' is not defined +echo ${i}sample.txt + +# we'll also run into (and use) built-in variables +# for instance +echo $HOME # home directory +echo $PWD # working (current) directory + +# we can return to our main directory this way +cd $HOME/rnaseq-workshop +\end{lstlisting} + +Now we will write our first shell script. We will try everything out before we actually make the +script. There are five steps to making a script: (i) define what you want to happen, in our case, we want the +computer to print the current time and date, (ii) test out the code by writing it out piece by piece, (iii) +actually write the script using a text editor, (iv) make the script executable and (v) run the script. + +First, test the code: + +\begin{lstlisting} +# make the computer repeat what you wrote +echo "My 1st script prints the date and time." +echo $(date +%F_%T) +\end{lstlisting} + +And now we will produce the script (or program if you will). Open the text editor, gedit, +on the host machine either +by double clicking a text file from the GUI file manager or e.g. by opening applications +( \vdots \vdots \vdots~, lower left) and searching for 'gedit'. + +Then write the following in the text editor: + + +\begin{lstlisting} +#!/bin/bash +echo "My 1st script prints the date and time." +echo $(date +%F_%T) +exit +\end{lstlisting} +Save the file as \il{myfirstprogram.sh} in the \il{rnaseq-workshop} directory. + +This program cannot yet be executed (ran) as the computer has not been told that it is an executable +program. But we can set the permissions. We'll start just by checking the current +permissions. + +\begin{lstlisting} +# from the rnaseq-workshop directory where you saved myfirstprogram.sh +ls +# note the colors +ls -l +# note the information you see on the screen +chmod u+x myfirstprogram.sh +# this command tells the computer to add (+) the +# execute permission (x) to the current user (u) +# if you omit the u, you will give permission to everyone +ls +ls -l +# compare what changed to output from above +\end{lstlisting} + +\fbox{\begin{minipage}{45em} +As a side note, there's a lot more to permissions control, both in \emph{what} can be done +and also in \emph{how} it can be done. It's more than we have time or need to get into here +but there's a lot of nice resources out there to explain it if you're curious, e.g. +\url{https://www.pluralsight.com/blog/it-ops/linux-file-permissions}. +\end{minipage}} + +You are ready to run the script. + +\begin{lstlisting} +# execute the script in the current directory (./) +./myfirstprogram.sh +# we can also store the output +./myfirstprogram.sh > text.txt +# look in the current directory in the GUI to find +# the file text.txt and open it by double clicking. + +# further, view the output in the terminal +less text.txt +# enter 'q' to quit + +# extra exercise: +# try appending output to the existing text.txt file +# by using '>>' instead of '>' +\end{lstlisting} + +Writing shell scripts is +easy, you just put in the program that you would write in the terminal and the computer will execute it +line by line. The first line tells the computer what to use to read the program (in our case +\il{bash}) and the last line tells it to exit from executing the +program. Writing these small shell +scripts becomes important when you want to run your read mapping overnight or over a weekend. We will +revisit them when we do the read mapping. + +You can also use commands to make analyses in a file. We will use a transcriptome (you +will assemble your own later) which is in .fasta format + +As necessary {\Large\textbf c}hange into the +{\large\textbf d}irectory \\ +\il{$HOME/rnaseq-workshop/studies/AthalianaReferences/resources/}. + +\begin{lstlisting} +# look at the file using the command line +less Athaliana_primaryTranscripts.fa +# you can move through the document by pressing Enter +# you can leave the file by pressing q +# if you only want to look at the first few lines +head Athaliana_primaryTranscripts.fa +# if you want to look at the last few lines +tail Athaliana_primaryTranscripts.fa +# view the whole file +cat Athaliana_primaryTranscripts.fa +# now you know why the 'head' and 'tail' commands +# are so important. To interrupt this, you'll +# want the shortcut Ctrl + C (or Strg + C) +\end{lstlisting} + +With these large files, we often get our first look and first stats in bash. + +\begin{lstlisting} +# count the number of sequences, more specifically: +# count the headers, which start with '>' +# the grep command, is a type of search, we'll look for '>' +# the '|' tells the computer to pass the output of one command +# to another, and wc is the command for "word count" +grep ">" Athaliana_primaryTranscripts.fa | wc --lines + +# the number you see, is the number of transcripts +# you can also estimate* how many bases there are in total +# by counting all characters in the file with the +# exception of the fasta headers. +# you do that by inverting grep (-v), producing all but +# lines starting with ">", and then counting characters +grep ">" Athaliana_primaryTranscripts.fa -v | wc --chars +# *estimate, because the end-of-line character is counted. +\end{lstlisting} + +Linux has many more such small useful commands. It is frequently helpful to google or to look at +Linux tutorials and see what you can scavenge for your purpose. + +\subsubsection{Installing software} +\fbox{\begin{minipage}{45em} +If you're running behind, this is a good-to-know sub-section that we won't reference or +need later in the course, so feel free to skip. You can always come back to it later +on your own time. +\end{minipage}} + +We've pre-installed everything for this course, and we'll be sending you home with a +copy of the Linux image. That said, we would be remiss if we didn't briefly cover how to install additional +software. + +In general, a lot of software can be installed through the package manager, which is best +when the program is available there, up-to-date, and we have administrative privileges. + +Let's try and install a program, "TopHat", which is a now-retired deprecated short-read aligner +that we don't need, but which has a fairly typical installation process for a bioinformatics tool. + +\begin{lstlisting} +# search for a program +apt search tophat +# you can see (in green) exact program names matching the search +# now we know exactly what to install +sudo apt install tophat +# Hmmmm.... now we run into the problem that we don't have +# administrative rights on these computers +# That's OK though, you'll normally either +# a) have administrative rights yourself, or +# b) have a systems administrator around that you can ask for help + +\end{lstlisting} + +Note that the package-manager is distribution specific +so the above 'apt' commands work for Ubuntu. While other systems +have their own package-managers (e.g. Fedora uses 'dnf'). +There are much more extensive resources on these elsewhere (e.g. +\url{https://www.digitalocean.com/community/tutorials/package-management-basics-apt-yum-dnf-pkg}). + +But now let's look at one fall-back alternative when installation with the package +manager isn't working, basically how almost any compiled program can be locally +"installed" for one user. This occurs a lot for bioinformatic programs, most often when packages +aren't available via the package manager or they are only available in old versions (e.g. +"FastQC", "samtools", "Trimmomatic"). That said, it can also work to install software on e.g. +a shared cluster, for instance when it's just a specific program that only you will use +and you don't want to bother the administrators to install it. + +Google "cufflinks rnaseq github" and your first hit is presumably +\url{https://github.com/cole-trapnell-lab/cufflinks}. + +Scroll down to ``Installing a pre-compiled binary release", and click ``here" + +If you then right click the link for 2.2.1, Linux +and then click ``Copy Link" you get the path shown below. Or you could +just left click it and download it normally of course, it +would just slightly change the instructions from the demo. + +You can further unpack the downloaded file by double clicking in the +file manager GUI, or as shown below. + +\begin{lstlisting} +# change to a directory we can store the files +cd $HOME/rnaseq-workshop/workflows + +# download and unpack the files +wget http://cole-trapnell-lab.github.io/cufflinks/assets/\ +downloads/cufflinks-2.2.1.Linux_x86_64.tar.gz + +tar -xvf cufflinks-2.2.1.Linux_x86_64.tar.gz +# look +ls +\end{lstlisting} +At this point we have downloaded and unpacked the file, but we can't just run +it yet. The next thing one should almost always do is look for some sort of "Read me" file, +(e.g. README.txt, Readme.txt, Readme.md) or installation instructions file (e.g. "INSTALL.txt"). + +Unfortunately Tophat's README doesn't do much but point you towards +the website. But if you search there for 'install' or 'quick start' you should +find some information. Even then, it's not exceptionally beginner friendly, +so see below, if and when you want further help. + + +\begin{lstlisting} + +# change into the new cufflinks directory +cd cufflinks-2.2.1.Linux_x86_64 +# look at what's there +ls -l +# you should see a lot of executable files +# check that tophat works +./cufflinks -h +\end{lstlisting} + +Hopefully you got the help function and it's working. That's great, but this only works +from this directory or if you supply the full path. e.g. + +\il{$HOME/rnaseq-workshop/workflows/cufflinks-2.2.1.Linux_x86_64/cufflinks}. + +That'd be a pain to remember. So how do we tell Linux to look for this executable +no matter which directory we are currently in? We can do this through the \il{$PATH} variable. + +This \il{$PATH} variable, stores all the directories where Linux will look +for executables, and we can append directories to it. +\begin{lstlisting} + +# the syntax of the 'mv' AKA "move" command is 'mv source destination' +# now add the file to our path +export PATH=$PATH:$HOME/rnaseq-workshop/workflows/cufflinks-2.2.1.Linux_x86_64/ +# to break this down +# PATH= is just setting the variable +# $PATH:$HOME/rnaseq-workshop/workflows/cufflinks-2.2.1.Linux_x86_64/ is just the +# old path with our custom bit appended (old):(custom) +# and 'export' makes it available to subprocesses of this shell + +# test that it works +cufflinks -h +# test that something else still works +ls +# at this point, if you want this line to always be available, add +# the line "export PATH=$PATH:$HOME/Documents/tophat-2.1.1.Linux_x86_64" +# to the file $HOME/.bashrc and it will run every time you open a terminal +# e.g. using a text editor like 'gedit' as above + +\end{lstlisting} + +\subsubsection{Getting help} + +Most programs and often even simple scripts will come with usage instructions. + +\begin{lstlisting} +# for instance, here are several ways to get documentation for 'grep' +man grep +grep -h +grep --help +# while the above commands are standard, you'll occasionally see things +# like -help or -?, and simply typing the name of the program without +# any parameters can also be a good bet + +### challenge assignment ### +# grep can actually do a lot more than we showed here. From it's usage +# function, can you figure out how to accomplish the results of the +# command above: `grep ">" Athaliana_primaryTranscripts.fa | wc --lines` +# using just `grep`, and not `wc`? +\end{lstlisting} + +If this doesn't help (enough) or perhaps you don't know quite which tool +to use yet: Google it, duck-duck-go-it, your-search-engine-of-choice it. + +Searching for any error messages you encounter can also be extremely +helpful during trouble shooting. + +% todo, this needs a cheat sheet + +\subsection{R} +We will run a brief introduction to R, including installing packages, getting your data into R, exporting data from R and how +to make a figure using the package "ggplot2". We will use RStudio to write and execute R code since it makes our life +easy. As a rule, Interactive Development Environments (IDEs), like RStudio, make the programmer's life easier, +particularly at the start. PyCharm is a nice equivalent for python. + +You can open Rstudio, either by entering \il{rstudio} +on the command line, or from the applications +menu under `Development'. + +\subsubsection{Rstudio via second Docker file} +Normally, you would open R by entering \il{rstudio} +on the command line, or from the applications +menu under `Development'. However, for the duration +of the course we'll use a second Docker container for +this. + +\begin{itemize} +\item acquire the \il{rstudio_docker.tar} file from sciebo or the ``Raumlaufwereke" folder (as before) +\item on the \emph{host} machine, open a new terminal +\item run the following +\end{itemize} +\begin{lstlisting} +docker image load -i </path/to/>rstudio_docker.tar +docker run --rm -p 8787:8787 -e PASSWORD=rstudio --mount \ + type=bind,source=${PWD}/rnaseq-workshop,target=/home/ZIM-gast/rnaseq-workshop \ + rstudiotest +\end{lstlisting} +\begin{itemize} +\item on the \emph{host} machine go to the browser and enter 'localhost:8787' +\item enter 'rstudio' as the username and 'rstudio' as the password +\item have fun learing R \& Rstudio! +\end{itemize} + +We will use the basic R functions that come with R when you install it. We will also use R +packages, little or large things other people wrote for us and those need to be installed. To install one we +type in the Rstudio script window and execute by typing (Ctrl + Enter). We type only the lines without the +preceding "\#" and we execute each line after typing it by pressing (Ctrl + Enter). + + +\lstset{language=R, style=Rstyle} + +Within Rstudio, the first thing we'll want to do is navigate to the +usual directory. + +\begin{lstlisting} +setwd('/home/ZIM-gast/rnaseq-workshop/') +\end{lstlisting} + +\subsubsection{basics in R and Rstudio} + +\begin{lstlisting} +install.packages("ggplot2") +# this installs the packages, you can see some logging information in the +# console about what R is doing. If you get an error about permissions, +# don't worry, it's already installed +library(ggplot2) +# you only need to install the package once, afterwards you only need the +# library() command. +\end{lstlisting} + +Let's assume you have an Excel file and want to do something to it using R. First, always, always +use tab-delimited text files to move from program to program, no matter which programs you use. Look at +the small tab-delimited txt file in the \$HOME/rnaseq-workshop/\_WorkshopReader/intro\_scripts\_data called "example\_table.txt" +using a text editor (e.g. `gedit') or LibreOffice Calc (which is just like Excel +except free, open source, and Linux compatible). Next we'll import this into R +\begin{lstlisting} +# set your working directory to the directory with the example_table.txt +# file using the [Session > set working directory > choose directory] GUI, +# and look what happens in the console +df0 <- read.delim("example_table.txt") +head(df0) +# the first command reads in a tab-delimited text file, our go-to format, +# it stores the table under the name 'df0' +# the second command, much like 'head' in bash shows the first rows +\end{lstlisting} + +The standard format for biologists applications in R is the \il{data.frame} +which is essentially what we +used to call a table. It is organized in columns and rows with column names and row names (not present in +our table, that is why \il{head(df0)} gives us numbers. +Now we will briefly look at how to get particular +columns out, and how to get particular rows out. We will always look at the data.frame, do something and +look again. We will also look a bit at classes (or types) of data because that is a frequent cause for errors +when writing scripts. + +\begin{lstlisting} +# let's remind ourselves what the data.frame looks like +head(df0) +# now we will extract the first column +df1 <- df0[, 1] +head(df1) +# now let's look at classes +class(df0) +class(df1) +# when we extracted the first column only, we no longer have a data.frame +# we now have a factor +# let's extract the first two columns +df2 <- df0[, c(1, 2)] +# this syntax means: take columns 1 and 2 from the data.frame 'df0' +# subsetting data frames works with [rows, columns] +# and c(1, 2) indicates which columns +head(df2) +df3 <- df0[, c("mpg", "cyl", "disp", "hp")] +head(df3) +# this time we have used the column names instead of the column number +# now let's look at rows +df4 <- df0[1:5, ] +# notice how the comma is now on the other side +# this extracts the first five rows, the column names are not a row, they +# are column names +df4 +# notice how I no longer ask for head(), but show everything. After all, +# I now only have five rows to display +# now some advanced magic, extract all rows in which there +# are cars with at least six cylinders +df5 <- df0[(df0$cyl >= 6), ] +# this evaluates the stuff in the bracket which says check if column 'cyl' +# of the data.frame df0 is equal or larger than 6 +# then it returns all rows (note the trailing comma) where this is TRUE +# how many? +dim(df5) +dim(df0) +df5 +\end{lstlisting} + +Let's assume you want to rename the columns but do not yet know how. If that happened in the lab, +you'd ask a colleague or ask google. Same here: google "rename columns r". The best result is not always +the first. To speed things up, we'll tell you the one we like, which is ``How to rename a single column in a +data.frame in R? - Stack Overflow" and there the 2nd solution which is: +\lstset{language=R, style=Rstyle, frame=trbl, rulecolor=\color{red}} + +\begin{lstlisting} +# df = dataframe +# old.var.name = The name you don't like anymore +# new.var.name = The name you want to get +names(df)[names(df) == "old.var.name"] <- "new.var.name" +\end{lstlisting} + +Now we can try renaming a column! + + +\lstset{language=R, style=Rstyle} + +\begin{lstlisting} +# look at current column names +colnames(df0) +# we tell R to look at the names in the data.frame df and find the one for +# which the name = mpg +# and then we re-assign that name to gas_mileage +# avoid spaces or number at the beginning of a name which are both bad in R +names(df0)[names(df0) == "mpg"] <- "gas_mileage" +# and look at after +colnames(df0) +\end{lstlisting} + +In R, we can make plots and make prettier plots. There are three major options +for plotting in R, ggplot2, lattice and the basic plot function. +We will use the package ggplot2 which was installed and loaded earlier today. +Our example plot will come from the data.frame we have worked +with all the time. We want to visualize whether horsepower of a car +and gas mileage have something to do with each other. Let's look at the +data.frame again: + +\lstinputlisting{intro_scripts_data/R/ggplot.R} +You can spend days and weeks over figures in ggplot2. Refer to the cheat sheet +provided by Rstudio for the most important functions. + +\url{https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf} + +We will cover common RNAseq-related examples as needed. + +\subsection{Perl/Java/others} +\lstset{language=bash, style=bashstyle} +While they're common starting points and the most critical for this workshop, +bash and R are just a tiny sampling of available languages. At the least, +you'll want to know how to execute scripts in other languages +This is a quick primer on how to do that. Very generally, you start +by checking whether the language is installed and you +can do that either by typing the name and seeing what happens +or by typing the name followed by +\il{-v} or \il{--version} +which usually checks the version or by typing the name followed +by \il{-h} which will call the help pages. + + +\begin{lstlisting} + +perl +# nothing seems to happen, use the interrupt (Ctrl + C) +perl -v +# you should see the version information +perl -h +# shows included help function, -h, -help, and --help are your +# first "goto"s, the second is google +python3 +# ups, you can now directly write code and execute it, we need +# to get out of here +quit +# this is not working but python tells you what will work +quit() +# and now you made it out. +# apparently both are installed +\end{lstlisting} +Google for "count\_fasta.pl". This will give you a script. +Look at the first line, apparently it is in perl (also standard for a .pl ending). +Copy the script into the text editor and save it in the same folder as +\il{Athaliana_primaryTranscripts.fa} +naming it \il{count_fasta.pl}. Now it needs permission to be executed + +\begin{lstlisting} +ls +# the filename is white +chmod u+x count_fasta.pl +ls +# the filename is green +# find out how it works +./count_fasta.pl -h +# apparently, you can specify something using -i but you do +# not have to do it and you need a fasta file. +# Use the transcriptome file from earlier +./count_fasta.pl Athaliana_primaryTranscripts.fa +# this will compute for a moment and then give you statistics +# about the fasta file. If you wanted that as a text file, +# you can redirect the output into file using > +./count_fasta.pl Athaliana_primaryTranscripts.fa > ATstats.txt +# you can look at the resulting file using the GUI or by typing +less ATstats.txt +\end{lstlisting} + +You can find scripts by googling what you want to do or what previous +authors say they have used (or made) in papers. Hosting services such +as \url{https://github.com} provide a way for all of us to backup, track and share our +code. Many simple scripts stand alone and can be executed similar to +above, while bundles of scripts or 'packages' may require installation. +Just check whatever documentation comes with the code you want to use +(it should contain instructions for any required installation) +and give it a try. We will use Java later when we use trimmomatic and +a lot of the analyses are performed with python scripts and packages. + +\subsection{Python} +\subsubsection{Python virtual environments} +As more and more software is provided in python, it's more and more +important to be at least partially familiar with installing things +in python and with python virtual environments (yes including conda). Often, and +particularly when installing less-stable bioinformatics software, there +are good reasons to install it within a virtual environment. Basically +it temporarily adds things to your PATH, while protecting you from any +damage to your system that could be caused by, say, two executables +having the same name. + +We won't use a virtual environment for the primary software in +this course, because we're using Docker. + +However, a quick demo is worth the effort. + +Virtual environments are also a great way to install things \emph{with} +a package manager \emph{without} needing administrative rights. +Pip is the general python package manager. + +First, we'll setup a virtual environment +\begin{lstlisting} +# setup environment, 'venv' can be any name or valid path of your choice +virtualenv venv +# activate or 'turn on' +source venv/bin/activate +\end{lstlisting} + + +Now we can use the virtual environment, and install +things with a package manager. + +\begin{lstlisting} +# first, we'll try and run what we're about to install +jupyter +# no surprise there +pip install jupyter +# see all those individual progress bars whizzing past, +# those are all the dependencies that pip is taking care +# of for us + +# start jupyter so that it's accessible from host +jupyter notebook --port 8889 --ip 0.0.0.0 +# you can use 'Ctr + c', and then 'y' to leave +# but leave it on until you've finished python basics +\end{lstlisting} + +You can find much information on virtual environments +at \url{https://docs.python-guide.org/dev/virtualenvs/}, or if you +are doing a lot of Python work under windows or run into installation +instructions that require \il{conda} you might be interested in Anaconda-based +virtual environments \url{https://conda.io/docs/user-guide/index.html}. + +Notes from a installation mini-workshop can also be found here +\url{https://github.com/weberlab-hhu/reproducbility-collection/blob/main/installation_intro/installation_intro.md} + +\subsubsection{python basics} +While it probably won't be critical for the rest of the workshop, +we would recommend learning some python basics to everyone. It's a very +approachable and widely used language. + +If you have time, we recommend the following. + +Above, when you ran `jupyter notebook` it should have opened +a server and listed a link to access it. Maybe something like +\il{http://127.0.0.1:8889/?token=8f463d2...}, the one with 127.0.0.1 +will be easiest. + +Copy this link and paste it into a browser on your host machine. + +Click on the 'new' button, and then make a new 'python3' notebook. +Work through the examples here: + +\url{https://jckantor.github.io/CBE30338/01.01-Getting-Started-with-Python-and-Jupyter-Notebooks.html} + +Jupyter is also great for mixing code with documentation, +visualization, or interpretation of results. A demo of what +jupyter can do can be found here: + +\url{https://github.com/weberlab-hhu/reproducbility-collection/blob/main/jupyter_intro/jupyter_demo.ipynb} + +just as an example resources, there's much more. + +When you're done with the above, hit Ctr+c (Strg + c), followed by 'y' +in the terminal to close the jupyter server. diff --git a/_reader/sections/02_section_illumina.tex b/_reader/sections/02_section_illumina.tex new file mode 100644 index 0000000000000000000000000000000000000000..1a647ebe567b4194678dad802ea017720237f8af --- /dev/null +++ b/_reader/sections/02_section_illumina.tex @@ -0,0 +1,505 @@ +\section{Example short read RNAseq analysis} +\subsection{Description of the datasets you have been given to work on} + +% \lstinputlisting{../studies/Bernsdorff2016_Arabidopsis_SAR/README.md} + +The data you see is part of an experiment to test signaling during systemic +acquired resistance (SAR) in \emph{A. thaliana.} During this experiment +you challenge a leaf with either mock or pathogen solution. +Two days later, you harvest a different leaf for RNA-seq and see if the +"we are under attack" signal has arrived and how it transforms the +transcriptome. Your data is only from wild type \emph{Arabidopsis} treated with +mock and pathogen solution. The original experiment also contains the analysis +of mutants with defects in different aspects of signaling, that through salicylic +acid and that through pipecolic acid. The complete +experiment can be found here: \url{http://www.plantcell.org/content/28/1/102.short} + +\lstset{language=bash, style=bashstyle} +\subsection{First look at reads} +Many, but not all, bioinformatics tools support compressed data as input. +If you do need to extract it, any GUI archive tool or quickly googling the +ending of the file name (e.g. *.gz) with "extract Linux" should find an easy solution. +For the workshop files, run +\begin{lstlisting} +gunzip assays/Bernsdorff2016_Illumina/dataset/Col0treatment1.fastq.gz +\end{lstlisting} + +Take a quick look at the fasta file +\begin{lstlisting} +head assays/Bernsdorff2016_Illumina/dataset/Col0treatment1.fastq +\end{lstlisting} + +Every four lines represents a read. +\begin{verbatim} +1: @ID +2: Sequence[ATCGN] +3: +ID +4: Phred quality scores +\end{verbatim} + +Note that if this was paired end data, each sample would have two files with +both having matching sorting and read IDs with all the forward reads in one +file and reverse reads in the other. The quality scores generally encode the +numbers from 0-40 that are -10 log 10 p, where p=probability that the base call was incorrect. So 30 is a 1 in 1000, and 20 a 1 in 100 chance of a miss-called +base. Currently, the most common encoding is Phred+33, and looks like this: +\begin{verbatim} +!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI +| | | | | +0..3.....................26...31........41 +\end{verbatim} + +No one wants to look at fastq files by hand, instead, +the program FastQC will create a full quality report for you. +\begin{lstlisting} +mkdir runs/fastqc_results +fastqc assays/Bernsdorff2016_Illumina/dataset/Col0treatment1.fastq \ + --outdir runs/fastqc_results +\end{lstlisting} + +You can open the result in a browser. Either from a file manager, +or from the terminal. \emph{This has to be done from the host machine}. +\begin{lstlisting} +firefox runs/fastqc_results/Col0treatment1_fastqc.html +\end{lstlisting} + +FastQC uses a simple code of green for "good" or "normal" data and +yellow then red for more questionable data. +Go through the report items and compare them to good Illumina data: + +\url{https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html} + +and to bad Illumina data: + +\url{https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html} + +How do they measure up? + +\subsection{Trimming} +You've seen from the FastQC reports that the data both a) has lower quality towards the end of +reads, and b) contains some adapter sequences. Both of these can lead to problems in down stream +analyses, and the solution of choice is generally trimming the reads. Trimmomatic is a java program that +can be used to perform both adapter and trimming steps. The command looks complicated, but for that it +is a very powerful and flexible tool. + +Breaking it down into pieces: + +\begin{tabular}{r l} + \toprule + \parbox[t][][t]{85mm}{SE} & + \parbox[t][][t]{65mm}{indicates single end reads} \\ \midrule + + \parbox[t][][t]{85mm}{Col0treatment1.fq} & + \parbox[t][][t]{65mm}{your file to trim} \\ \midrule + + \parbox[t][][t]{85mm}{Col0treatment1.trimmed.fq} & + \parbox[t][][t]{65mm}{name of output file} \\ \midrule + + \parbox[t][][t]{85mm}{ILLUMINACLIP:\textless adapters \textgreater:\textless seed mismatches\textgreater : \\ + \textless palindrome clip threshold\textgreater :\textless simple clip threshold\textgreater } & + \parbox[t][][t]{65mm}{this trims the adapters} \\ \midrule + + \parbox[t][][t]{85mm}{MAXINFO:\textless targetLength\textgreater : \\ + \textless strictness (0-1 for longer-stricter)\textgreater } & + \parbox[t][][t]{65mm}{this trims low quality bases from the 3' end} \\ \midrule + + \parbox[t][][t]{85mm}{MINLEN:\textless min\textgreater } & + \parbox[t][][t]{65mm}{drop reads below this length} \\ + + \bottomrule +\end{tabular} + +\begin{lstlisting} +# run trimmomatic +# note that the '\' in the following command is there to 'escape' +# the end-of-line character. If you write this on one line, +# skip the '\' character + +mkdir runs/trimmed_fastq/ + +# $HOME/sw is just where trimmomatic is located in our Docker container +# and would need to match the machine in question +java -jar $HOME/sw/Trimmomatic-0.39/trimmomatic-0.39.jar SE \ + assays/Bernsdorff2016_Illumina/dataset/Col0treatment1.fastq \ + runs/trimmed_fastq/Col0treatment1.trimmed.fastq \ + ILLUMINACLIP:$HOME/sw/Trimmomatic-0.39/adapters/TruSeq3-SE.fa:2:30:10 \ + MAXINFO:50:0.8 MINLEN:36 + +\end{lstlisting} + +It is normally best to check the quality of the data after trimming. This way you are confident the right +adapters were removed. + +\begin{lstlisting} +fastqc runs/trimmed_fastq/Col0treatment1.trimmed.fastq --outdir runs/fastqc_results + +# and to view it +firefox runs/fastqc_results/Col0treatment1.trimmed_fastqc.html + +\end{lstlisting} + +\subsection{From Reads to Quantified Transcripts} +A file with 5 million reads, high quality or not, is still very far away from human legible data, let +alone a biological conclusion. The next big step is to get from reads to transcripts and their abundance; +however, what this looks like will depend heavily on what species you are using and how good of a +genome and/or transcriptome annotation is already available. + +Some major options are shown in the following figure. Basically, the less data is already available +as a reference sequence, and the lower quality the available annotation there of, the harder this will +become. If you have a nice genome and a well annotated transcriptome with only modest duplication, like +\emph{A. thaliana} used here. +You can just align the reads to the transcriptome or even just count the k-mers in the +reads. However, if you are working with a non-sequenced species, you will either have to align the reads +to a closely related species, or make your own \emph{de novo} assembly. + +All of the options come with their own caveats, and what is appropriate depends on your reference and +your data. + +\includegraphics[width=\textwidth]{align_n_quant.pdf} + +For instance: +\begin{itemize} +\item If your DNAse treatment was incomplete, you don't want the DNA reads mapping to the most- +similar transcript, you want to align the reads directly to the genome (including organellar +genomes). +\item If you have a species where the genome is more complete than the corresponding transcriptome +annotation, your data will be easier to interpret if you align it directly to the genome. +\item If your species of interest is not sequenced, and not phylogenetically close to a sequenced species, +you will have little option but performing a \emph{de novo} assembly, but should be aware of biases such +as splitting highly abundant transcripts, chimeric transcripts, partial or absent low-abundance +transcripts, and detection of transcripts from other species (e.g. biotrophic fungi). Getting long reads +will help with some, but not all of the above. +\item If everything has gone well and you have a good genome available, a k-mer counting strategy will +run very quickly and very easily and give you data you can work with. +\end{itemize} + +We will focus on option 1. from the figure using Kallisto for k-mer counting, as we are working with +clean, easy, A. thaliana data. However, we'll also run a brief example for branches 2. - 4., in case this more +accurately reflects your data. + +\subsubsection{Counting k-mers with kallisto} +Classically reads have been aligned to the reference sequence. While it can be sped up by efficient +indexing, this is fundamentally a computationally intensive (and therefore slow) process, because the +reference sequences are very large and there are many reads to search against them. Recently, options such +as Kallisto have been developed, that do not actually align the reads, but count the k-mers they are made +up of and assign these counts to transcripts with these k-mers. A k-mer is a stretch of bases from a read +long enough to generally be unique, short enough to rarely contain errors and to be computationally +feasible. + +Example sequence and 6-mers that can be derived from it: +\begin{verbatim} +TCTTCCGGAGGTGGAGGAAAACCGACGATT +TCTTCC + CTTCCG + TTCCGG + TCCGGA + CCGGAG + . + . + . +\end{verbatim} + +While much faster than read alignment, this will sometimes be less accurate for assigning reads to the +correct transcripts as it does not fully utilize all the read information. + +When available, plant reference sequences can be obtained from major databases like Phytozome +(phytozome.jgi.doe.gov) or Ensembl (plants.ensembl.org), which provides some naming consistency. +Many more species are available through species specific resources. + +Here we have pre-downloaded the \emph{A. thaliana} information from Phytozome. + +\begin{lstlisting} +# first kallisto needs to create an index (de Bruijn graph) +# of the reference sequence +mkdir runs/kallisto_index/ + +kallisto index -i runs/kallisto_index/Ath_kallisto_index \ +studies/AthalianaReferences/resources/Athaliana_transcripts.fa + +\end{lstlisting} + +The indexing parameters are: + +\begin{tabular}{r l} + \toprule + \parbox[t][][t]{55mm}{index} & + \parbox[t][][t]{65mm}{tells kallisto to make an index} \\ \midrule + + \parbox[t][][t]{55mm}{-i} & + \parbox[t][][t]{65mm}{assign the index name} \\ \midrule + + \parbox[t][][t]{55mm}{Athaliana\_transcripts.fa} & + \parbox[t][][t]{65mm}{the downloaded reference transcriptome} \\ + + \bottomrule +\end{tabular} + +\begin{lstlisting} +# we'll want to keep our output in a separate folder here +mkdir runs/kallisto_results + +# now kallisto can quantify how many reads (appear to) come +# from each transcript + +kallisto quant -b 30 -i runs/kallisto_index/Ath_kallisto_index \ +-o runs/kallisto_results/treatment1 --single \ +-l 190 -s 20 runs/trimmed_fastq/Col0treatment1.trimmed.fastq +\end{lstlisting} + +The general quantification parameters are: + +\begin{tabular}{r l} + \toprule + \parbox[t][][t]{55mm}{quant} & + \parbox[t][][t]{65mm}{tells kallisto to run quantification} \\ \midrule + + \parbox[t][][t]{55mm}{-b} & + \parbox[t][][t]{65mm}{number of bootstraps (optional, for use with sleuth)} \\ \midrule + + \parbox[t][][t]{55mm}{-i} & + \parbox[t][][t]{65mm}{index name assigned above} \\ \midrule + + \parbox[t][][t]{55mm}{-o} & + \parbox[t][][t]{65mm}{output directory name} \\ + + \bottomrule +\end{tabular} + +And a few extra parameters are required for single end reads: + +\begin{tabular}{r l} + \toprule + \parbox[t][][t]{55mm}{\il{--single}} & + \parbox[t][][t]{65mm}{} \\ \midrule + + \parbox[t][][t]{55mm}{-l} & + \parbox[t][][t]{65mm}{mean fragment length} \\ \midrule + + \parbox[t][][t]{55mm}{-s} & + \parbox[t][][t]{65mm}{standard deviation of fragment length} \\ + + \bottomrule +\end{tabular} + +\begin{lstlisting} +# kallisto created a directory with three files +ls runs/kallisto_results/treatment1 + +# we just need one of them, take a look +less runs/kallisto_results/treatment1/abundance.tsv +\end{lstlisting} + +You should see tab separated values, most importantly showing us the transcript ID (target\_id), the +estimated read counts (est\_counts), and the normalized abundance estimate Transcripts Per Million (tpm). + +\begin{verbatim} +target_id length eff_length est_counts tpm +ATCG00020.1 1062 873 995 293.647 +ATCG00040.1 1581 1392 9 1.66579 +ATCG00050.1 240 51.331 0 0 +ATCG00065.1 114 5.2474 0 0 +\end{verbatim} + +We will later use the TPM for graphs and the counts for statistics. However, you may have noticed that +we've so far ignored five of our six read files. You can save all the commands individually to a file for +each of the other samples and run it. + +\begin{lstlisting} +#!/bin/bash +# trimmomatic + +java -jar $HOME/sw/Trimmomatic-0.39/trimmomatic-0.39.jar SE \ +assays/Bernsdorff2016_Illumina/dataset/Col0treatment2.fastq.gz \ +runs/trimmed_fastq/Col0treatment2.trimmed.fastq \ +ILLUMINACLIP:$HOME/sw/Trimmomatic-0.39/adapters/TruSeq3-SE.fa:2:30:10 \ +MAXINFO:50:0.8 MINLEN:36 + +java -jar $HOME/sw/Trimmomatic-0.39/trimmomatic-0.39.jar SE \ +assays/Bernsdorff2016_Illumina/dataset/Col0treatment2.fastq.gz \ +runs/trimmed_fastq/Col0treatment3.trimmed.fastq \ +ILLUMINACLIP:$HOME/sw/Trimmomatic-0.39/adapters/TruSeq3-SE.fa:2:30:10 \ +MAXINFO:50:0.8 MINLEN:36 + +# ... and so on +# kallisto +kallisto quant -i runs/kallisto_index/Ath_kallisto_index \ + -o runs/kallisto_results/treatment2 --single \ + -b 30 -l 190 -s 20 runs/trimmed_fastq/Col0treatment2.trimmed.fastq + +kallisto quant -i runs/kallisto_index/Ath_kallisto_index \ + -o runs/kallisto_results/treatment3 --single \ + -b 30 -l 190 -s 20 runs/trimmed_fastq/Col0treatment3.trimmed.fastq +# ... and so on. +\end{lstlisting} + +However, if you recall loops from the bash introduction, we can use loops, which will almost certainly be easier +in the long run. Particularly if you +have to change a parameter later. +Here is a very simple loop to catch all five of the other samples up with treatment1 + +\lstset{language=bash, style=bashstyle} +\lstinputlisting{../workflows/kallisto_loop.sh} + +Ideally, you will save the loop to a file, this is extremely helpful when it comes to repeating your work, +taking on the next study, or writing the methods section. + +Now we have six directories with six abundance files. We'll put these together into organized tables later, +but first lets take a look at some of the other options for when k-mer counts is not applicable. + +\subsubsection{Aligning reads to the genome} +This is a very robust method, even in combination with a good reference it can clean up the results +compared to using a primary transcriptome. For instance, it allows inclusion of splice variants without +causing problems with ambiguous alignments for reads that map to shared exons. Also, in reality most +RNAseq will have a lot of reads that properly map to intergenic regions, be they from DNA +contamination, unannotated genes, rRNA, or less-than perfect transcriptional regulation. That and reads +from introns that have not yet been spliced. Including the genomic background in the reference allows +reads to find their best mapping and doesn't bias these towards the next-closest gene. This all comes at a +cost of run time, but in the worst case it comes down to the computers time vs your time trying to interpret +the data later. + +Recent developments in indexing have also greatly sped up the alignment stages. In particular Hisat2 +makes a 10-fold gain on speed, while producing highly similar results to its predecessor, TopHat. + +In contrast to k-mer counting with Kallisto, alignment and quantification is a two step process, so we'll +first align reads with Hisat2, and then count the reads uniquely mapping to each locus with HTSeq. + +\begin{lstlisting} +# make the index +mkdir runs/hisat_index/ + +hisat2-build studies/AthalianaReferences/resources/Athaliana_167_TAIR9.fa \ + runs/hisat_index/Ath_hisat2_index + +# align the reads +mkdir runs/hisat_results/ + +hisat2 -x runs/hisat_index/Ath_hisat2_index \ + -U runs/trimmed_fastq/Col0treatment1.trimmed.fastq \ + -S runs/hisat_results/treatment1.hisat.sam + +# quantify the alignments +htseq-count runs/hisat_results/treatment1.hisat.sam \ +studies/AthalianaReferences/resources/Athaliana_167_TAIR10.gene_exons.gtf \ +-s no > runs/hisat_results/treatment1.hisat.htseq + + # check output +less runs/hisat_results/treatment1.hisat.htseq +\end{lstlisting} + +For information on the various parameters, check the help function of the individual programs. + +If you wanted to use this method, you could run a script with a loop in a very similar way to what we did +with Kallisto. + + +\subsubsection{Aligning reads to the transcriptome of a related species} +Sometimes your species of interest hasn't been sequenced, but one that is phylogenetically close has. +Because \emph{de novo} transcriptome assemblies can be very hard to work with, cross species alignment can be +a very appealing option. Bear in mind, you won't get the same resolution between paralogs that you would +have mapping to the actual genome, and the alignment will take much longer. + +We will demo BLAT (BLAST-like Alignment Tool), which falls in between BLAST and a short-read +aligner like Bowtie2 on the sensitivity vs speed scale. + +\begin{lstlisting} +# Still, on these computers we'll be using a smaller read set +# (minidata.fastq), and we'll have to convert it to fasta +mkdir runs/miniexample/ + +awk 'NR % 4 == 1 {print ">" $0 } NR % 4 == 2 {print $0}' \ +studies/BLATexample/resources/minidata.fastq > runs/miniexample/minidata.fasta +\end{lstlisting} + +You might be wondering what \il{awk} is and why that command looked so much more +\emph{complicated} and less \emph{human readable} than most? +Basically, \il{awk} makes it easy to perform simple, custom, text manipulations. +It's very useful when what you want to do is a bit too complicated for +some combination of what we've seen before such as find (e.g. \il{grep}) +or find and replace (e.g. \il{sed}), but is not yet complicated enough +that one wants to write a python script. + +As a brief explanation, the command above checks the remainder when the +number of rows (NR) is divided by $4$ in order to to determine if it's on an ID or +sequence row in a fastq file. It appends '>' to the ID and just prints +the whole sequence, producing fasta format. + +A proper introduction to \il{awk} is more than we have time for here, but +there's a lot of nice resources available on-line, e.g. +\url{https://www.tutorialspoint.com/awk/}. In any case, you can use the +commands here, or simply search for and use \il{awk} code for what you want to +accomplish, without already having mastered \il{awk}. + +\begin{lstlisting} +## now we set up a BLAT database (not required, but faster) +# we will map reads to Brassica rapa transcripts +mkdir runs/blat_db/ + +faToTwoBit studies/BrapaReferences/resources/Brapa_primaryTranscripts.fa \ + runs/blat_db/Brapa.2bit +## run blat (with 6-frame translation) +mkdir runs/blat_results/ + +blat runs/blat_db/Brapa.2bit runs/miniexample/minidata.fasta -out=blast8 \ +-q=dnax -t=dnax runs/blat_results/treatment1-Brapa.tsv +## count up the results +# count_blat.py is a mini-script to count the +# best blat/blast hits to each target sequence. +workflows/count_blat.py -i runs/blat_results/treatment1-Brapa.tsv > \ + runs/blat_results/treatment1-Brapa.counts +# check output +less runs/blat_results/treatment1-Brapa.counts +\end{lstlisting} + +% \subsubsection{\emph{de novo} transcriptome assembly} +% \fbox{\begin{minipage}{45em} + +% This section is entirely optional. \\ + +% Honestly, the end of \emph{de novo} assembly of short read data is in +% sight, as long read sequencing makes the assembly puzzle fundamentally +% both easier and more possible to solve. If you do not already have the +% data, \textbf{do not plan on a \emph{de novo} assembly of Illumina or other +% short read data}, not as a first choice, particularly not for RNAseq. \\ + +% That said, sometimes the data is already there, and at the very least +% preliminary results are required before more funding for long read +% sequencing can be found. \\ + +% So we aren't deleting this section yet. +% \end{minipage}} + + +% If you have the first sequencing data for a species, and cross species alignment isn't sufficient you can try +% \emph{de novo} assembly. This makes sense when, for instance, you are interested in novel genes for your species, +% when the next closest species is dozens of millions of years of evolution away, or when you are interested +% in the sequences themselves for anything from primer design to evolutionary analyses). + +% The leading current short read \emph{de novo} assembly options for transcriptomes (and genomes) work with a de Bruijn +% graph. That is, they have nodes, or k-mers, with the sequence data. and they have edges connecting them. +% This allows for a much smaller way to look for shared regions between reads than say all-on-all +% comparisons. But it still needs a lot of RAM and makes sub-optimal use of the read information. +% Therefore, make sure to run a thorough quality control once you are done, that all key genes of interest are +% present, that you have a sufficient number of genes that are full length, and that you filter out contigs +% coming from, for instance, biotrophic fungi or insects. + +% \begin{lstlisting} +% # We'll run a de novo assembly of the subsetted data with Trinity +% Trinity --seqType fq --single minidata.fastq --max_memory 1G +% less trinity_out_dir/Trinity.fasta +% # early quality control +% ./count_fasta.pl trinity_out_dir/Trinity.fasta +% \end{lstlisting} + +% \textbf{If you are planning to run a de novo assembly you should definitely use paired-end, stranded +% sequencing.} This gives you a lot more information to work with. If you had paired-end, stranded Illumina +% data you would run Trinity with something like the following parameters. + + +% \begin{lstlisting} +% # --left my_reads_1.fq --right my_reads_2.fq +% # --seqType fq --SS_lib_type RF +% \end{lstlisting} + +% \begin{lstlisting} + +% \end{lstlisting} + diff --git a/_reader/sections/03_biological_data_extraction.tex b/_reader/sections/03_biological_data_extraction.tex new file mode 100644 index 0000000000000000000000000000000000000000..c7a25b4b953b4c5cdb44a807e3c153250988607e --- /dev/null +++ b/_reader/sections/03_biological_data_extraction.tex @@ -0,0 +1,233 @@ +\section{Biological data extraction} +\subsection{Rstudio via second Docker file} +Please see sub section "Rstudio via second Docker file" from "Basics" +on the first day to restart working environment. + +\subsection{The Plan} +Now, we use R in earnest. Rstudio will let us see what is happening immediately when we do it, so it is a +good environment for beginners. + +\fbox{\begin{minipage}{45em} +Note: To allow for the level of customization that is often necessary for each project, this +section is a little lower-level and relies less on pre-made tools than before. In as much +there is a lot of R code, and retyping it would be rough on the schedule, so +we'll be using a different paradigm than we did with the bash scripts. You don't have to retype anything, +rather you're encouraged to copy from the .pdf. It is then your responsibility to go through it +slowly and make sure you understand what you are doing. We would recommend, for every section to +make sure you have understood it, that you vary something(s). Look at a different column, change the +plot color, etc... and then save your changes into a script. + +If for some reason we haven't given you the .pdf yet, please notify us. +\end{minipage}} + + We begin by getting our data from the kallisto results into a data.frame in +R. Open a GUI which allows you to see the data structure – all results are in folders with different names, +but the files themselves all have the same name. + + +\subsection{Import} +\lstset{language=R, style=Rstyle} + +\lstinputlisting{../workflows/kallisto_data_import.R} +The results from both methods are identical. This first step by step method is more to write, but safe for +beginners since you can change the names by hand. The loop is more challenging to produce but certainly +faster to write. We will use the laborious way for the rest of the training course. + +\subsection{Basic data.frame / matrix calculations} + +Before we do high level analyses, let's calculate means for the replicates and fold-changes. + +\lstinputlisting{../workflows/kallisto_data_frames.R} + +\subsection{PCA and HCL} +Before we start, let's think about what to expect. We have an experiment with a single factor, the +treatment. So when we look at how the samples group together, how many groups with information do we +expect? Yes, right, only one. So for our principal component analysis we expect that we have one +dimension reflecting the treatment variation. As we have no other variables, the other dimensions of the +PCA should show "noise" or biological variation due to random variables outside of our control. Ideally, +we want to see the treatment variation in the first dimension (meaning most of the variation is due to +treatment) and the noise based variation should occupy the lower dimensions. +Now we start with our first analysis. We ask whether the replicates are more similar to each other than the +samples. To that we use principal component analysis and hierarchical clustering of the samples. + +\lstinputlisting{../workflows/clustering_PCA_HCL.R} + +\subsection{Differential Expression} +One of the most typical questions for an RNA-seq analysis is what is different between two samples? To +answer that question we use sleuth. There is some debate about which tool is the best to detect differential +expression. Here we will use Sleuth, while other possibilities are deseq2, edgeR, and cufflinks. +Most meta analysis papers agree on cufflinks +being less suitable. DESeq2 and edgeR are both well established and work well with counts +per gene and assuming a negative binomial distribution, but appear to be less appropriate for the +'estimated counts' produced by Kallisto. Sleuth is developed by the same people as kallisto and is +supposedly most suitable for use with the results thereof. +Note: In order to use sleuth, kallisto has to be run with the bootstrapping option. + +The following has been adapted from this on-line tutorial: +\url{https://pachterlab.github.io/sleuth_walkthroughs/trapnell/analysis.html} + + +\lstinputlisting{../workflows/sleuth_differential_expression.R} + +\subsection{GO term enrichment} +We have about 1k significantly changed genes, too many to go through by hand quickly. We now check +if some functional categories are overrepresented. The idea behind this is as follows: +we have 1k genes changed of 28k genes in the genome, or 3.5\%. When we now look through the +categories, they should all have 3.5\% of their members changed if the changes were random. The +statistics check whether the difference from 3.5\% we see is significant. To do so, we use the package +TopGO. + +This analysis looks at the significantly different genes. + +The biggest impediment to the use of TopGO will be to have the gene to GO term assignment in the way +the package wants it. To this end, look at the file Athid2go.map in the folder. This is one way the package +accepts. If you have a different species and a different way of GO term annotation, you need to reformat +the file to match this one (can be done in R, but will be different for each case). + +\lstinputlisting{../workflows/functional_GO.R} +GO term lists are notoriously difficult to show in nice figures and also difficult to interpret. +Generally, you want many similar terms up to be convinced something is real. Revigo (on-line tool) can +help you to summarize the GO terms but essentially you need your biologist knowledge to understand +them. + +We chose a rather simple example. Among the terms enriched in the upregulated genes you can +always observe the words defense and immune. You have to know that systemic acquired resistance also +refers to the immune system, as does response to biotic stimulus. Clearly, the treatment induces the +defense response. + +Among the enriched terms in down-regulated genes appears photosynthesis, which occurs in the +plastid, and involves glyceraldehyde-3-phosphate, pigments and NADP. So clearly, although the names +are different, what they describe is similar. There is no way but simply knowing this. There are thousands +of GO terms described. The figures which were also produced are sometimes helpful. The GO terms +connected by lines are the ones defined as in parent:daughter relationships. + +\subsection{MapMan} +MapMan is an alternative way of looking at the overall patterns. Usually, you load all genes with +their fold-changes, not just the significant ones. In a perfect world, MapMan and GO should give you +similar results. One caveat is that the annotation behind GO and MapMan was made by different people +and therefore different areas of plant biology are covered differently. + +Before starting with MapMan we will be making use of the size of the class to install MapMan in +parallel on all course computers. MapMan's provided install script can be found both on sciebo, +and in the Raumlaufwerke folder. Run +the following with java, and follow the instructions of the installation wizard. + +\lstset{language=bash, style=bashstyle} + +\begin{lstlisting} +java -jar </path/to/>/MapManInst-3_6_0RC1.jar +\end{lstlisting} +We only need to export our fold-change table from R. Again, we have to make sure that our gene +identifiers match the ones used in the MapMan tool. + +Open MapMan, click on mappings and look at the Ath\_AGI mapping. Click on until you see AGI +codes and observe how they are formatted--again, no transcript suffix, only the locus. +Back to R. +\lstset{language=R, style=Rstyle} + +\lstinputlisting{../workflows/functional_mapman.R} + +Now back to MapMan. Right click on Experiments, choose add data. Select your exported table +named forMapmanloading.txt. Now you can choose if there is a header (yes!), whether you have decimal +point or decimal comma, and so on and so forth. The default works for us. Click okay + +Now click on Pathways, on Overview and choose Metabolism\_overview by double click. +You will have to choose a mapping; we use Ath\_AGI. Click okay. + +If nothing is displayed, click on your file shown in Experiments. I always change the color code to ++red and-blue and the scale to whatever fits my dataset (for this one I'd use 3). + +Clearly, photosynthesis, the CBB cycle and photorespiration are down. If you would like to test +that statistically, Mapman does that for you. Look at the stuff below the figure and pull it up. You will see +the Wilcoxon Rank Sum Test scores. Correct the p-values for multiple hypothesis testing and sort the table +by probability. Some of the significantly changed pathways we can see, most are not displayed. Look at +the other options for visualization to see, if anyone fits your needs. + +You can make your own pathway figures and map the MapMan bins onto them. Refer to the +MapMan manual for that! + +\subsection{Adding information from public data sources} +Combining information from multiple sources is often helpful, or even necessary to fully +understand one's results. This becomes harder to write a standardized protocol for, however, as one +researcher might simply be interested in including the TAIR annotation with their "mother table" for ease +of looking through the gene descriptions, and another might be interested in knowing whether there's a +significant overlap between their study and a particular paper. That said, there is frequently a similar line +of attack. First, get the desired comparative data in an organized format such as a list of gene IDs or a +table (csv/tsv/xls) with gene IDs and associated values. Second, make sure the gene IDs are comparable +(are they from the same genome/annotation release, do you have transcript IDs when you wanted gene +IDs, are they both upper/lower case). Third, combine data with yours (e.g. merge). Finally, in many cases +you will visually and statistically evaluate whether there is a overlap or correlation between the studies. + +\subsubsection{Including gene descriptions from TAIR} +First step first, we need to get the descriptions, and not by copying them one by one from the +website. Most biological databases have a bulk download page if you look, and TAIR is on the easy side. +From \url{https://arabidopsis.org} you simply need to go to Download:Genes, select the annotation (TAIR10), and you +already have a list of what is available for download. We'll take "TAIR10\_functional\_descriptions", which +is a tab-separated text file. Save, copy or move this file to studies/AthalianaReferences/resources for ease of access. The +ID format matches, so adding the descriptions to our major data.frame in R is very simple. + +\lstinputlisting{../workflows/additional_TAIR.R} + +\subsubsection{Including MapMan annotations} +% TODO, honestly we should move this to mapman4 +Frequently you want to have more flexibility working with a dataset than is possible with the provided +GUI programs. We'll now import the MapMan annotations into R, and look briefly at what else one could +do with them. For a handful of plant species, MapMan annotations are provided in the 'store' at +\url{https://mapman.gabipd.org/web/guest/mapmanstore}. +More commonly, they can be created for a species +using the Mercator webtool +\url{http://www.plabipd.de/portal/web/guest/mercator-sequence-annotation} +Assigning +MapMan annotations with Mercator normally runs in less than 15min and avoids any annotation version +trouble. For today, the results are provided in the file mapmanTAIR10.txt. + +\lstinputlisting{../workflows/additional_MapMan.R} +You can, of course, automate a Fisher's exact test or Wilcoxon test for every category in R, visualize many +samples at once by different bin levels, or whatever suits your purposes. + +\subsubsection{Special list - from a review paper} +Frequently, whole genome annotations have not been updated with the very latest information or +you are interested in a more obscure list, so you want to compare data with another paper directly. This +often makes the first step of getting the data in an organized table harder. Let's look at the review paper +\url{https://doi.org/10.1007/s11427-016-0048-4} +``Diverse roles of SERK family genes in +plant growth, development and defense response". In table 1, it has a list of SERK interacting genes, +which we might be interested in, but they are only available in pdf format. There are many ways to +approach this. You can for instance copy the whole table, and clean up any formatting issues by hand. +Often you can get the formatting to work by pasting it into either a spreadsheet program or a text file. +However, as we've worked with \emph{A. thaliana} a lot before, we already had a handy script to find all of the +AGIs out of a text file and will use this. + +First, copy all of table 1 (or even the whole paper) into a text file and save it as serk\_interacting.txt + +\lstset{language=bash, style=bashstyle} + +\begin{lstlisting} +# now run the following in bash +workflows/agi_finder.py -i studies/Fan2016_SERK_review/resources/serk_interacting.txt \ + > studies/Fan2016_SERK_review/resources/serk_interacting.agi +less studies/Fan2016_SERK_review/resources/serk_interacting.agi +\end{lstlisting} + +For today, we only needed the AGIs and not the additional information, so this will do. + +\lstset{language=R, style=Rstyle} +\lstinputlisting{../workflows/additional_SERK.R} + +OK, we'll never be able to cover all possible comparative data and data formats you might want to +look at, but hopefully you have some ideas on where to start now. Take a look around, if you are ahead of +your neighbors this would be a great time to try importing contextual data you would actually be +interested in for your studies. + +\subsection{Further Clustering} +\subsubsection{K-means} +\lstinputlisting{../workflows/clustering_kmeans.R} +\subsubsection{Heatmaps - gene set} +We've seen heatmaps at a large scale when displaying the hierarchical clustering above. +Sometimes however, they are useful just for looking at dozens of genes at once. Let's look at the MapMan +category, stress.biotic.signalling as an example. +\lstinputlisting{../workflows/miniheatmaps.R} + +\begin{lstlisting} +\end{lstlisting} + diff --git a/_reader/sections/04_section_longreads.tex b/_reader/sections/04_section_longreads.tex new file mode 100644 index 0000000000000000000000000000000000000000..7fd0c0ad786ded8c7a913b42dc3d7048f2ee5bbe --- /dev/null +++ b/_reader/sections/04_section_longreads.tex @@ -0,0 +1,1137 @@ +\section{Long Read Sequencing} + +\fbox{\begin{minipage}{45em} +Important qualifier: while we've worked a lot with RNAseq, we've so far only +had our own projects working with long reads of \emph{DNA} (and more Nanopore than PacBio). +So writing this section was learning by doing for us, too. We'd rather look +forward than look back, but please take everything with a grain of salt / don't +think that the following represents stable repeatedly-tested pipelines. +\end{minipage}} + + +Recent advances in long read sequencing are making drastic changes to what +is possible in terms of RNA sequencing. While we've covered the technologies +more thoroughly in a lecture, here is a drastic over simplification. + +Pacific Bioscience's Iso-Seq provides a way to repeatedly sequence potentially full length +cDNA of transcripts and their reverse complement, and the sequencing will proceed +in a circle until the polymerase falls off. Thus the preliminary base called +read produced requires more processing then some basic trimming, but can have very +high quality after consensus calling. + +Oxford Nanopore Technology's direct RNAseq measures the current +across a membrane while the RNA molecule itself (and in some cases +it's single cDNA strand) proceeds through a pore across the membrane. While the +error rate is high, the output appears better than PacBio, and there are some +exciting possibilities in detecting RNA modifications. + +\subsection{The Plan} + +\begin{wrapfigure}{r}{0.5\textwidth} +\includegraphics[width=0.5\textwidth]{longread_workflow.pdf} +\end{wrapfigure} + +While the throughput of these technologies is improving, we would argue +that as the sheer read count is \emph{currently} (at least) an order of magnitude lower +than for Illumina, if the goal is simply quantification, use Illumina. That +said, if the goal is establishing coding gene models (with or without a reference) +or something more specific, like quantifying alternative splicing, the long read +technologies can be extremely helpful. Here we'll look at three options for reconstructing +coding gene models with example Iso-Seq data. As with the Illumina RNAseq analyses +exactly which analysis one chooses will depend upon the exact data one has and +what sort of reference is available. + +In the left-most option, we'll map the preprocessed reads directly to a reference, +collapse them to transcripts, and use the transcripts to support gene calling. +This is also the most similar to a tractable approach for ONT direct RNAseq data. +In the middle option, we will cluster the preprocessed reads into draft transcripts +and then map these to the reference genome and perform gene calling as in the first option. +In the right-most option, we will look at a pipeline to make gene models without any +reference or with a very poor reference. + +\subsection{Resources} +Many of the commands are based on the following resources, and even if I were to +repeat this on similar data in just three months I would check back here to see if +anything major has changed. + +\textbf{The main Iso-Seq pipeline (trimming-polishing):} + +Main organizational page: + +\url{https://github.com/PacificBiosciences/IsoSeq_SA3nUP/wiki} + +In particular: + +\href{https://github.com/PacificBiosciences/isoseq3}{isoseq3} + +\href{https://github.com/PacificBiosciences/IsoSeq_SA3nUP/wiki/Tutorial:-Installing-and-Running-Iso-Seq-3-using-Conda} +{Tutorial:-Installing-and-Running-Iso-Seq-3-using-Conda} + +\href{https://github.com/Magdoll/cDNA_Cupcake/wiki/Cupcake-ToFU:-supporting-scripts-for-Iso-Seq-after-clustering-step} +{Cupcake ToFU: supporting scripts for Iso Seq after clustering step} + +The \emph{de novo} options: + +\href{https://github.com/Magdoll/Cogent/wiki/Running-Cogent}{Running-Cogent} + +\href{https://github.com/PacificBiosciences/ANGEL}{Angel} + +\subsection{Data description} + +You can find the data for this section here: +\lstset{language=bash, style=bashstyle} + +% TODO double-check directory +\begin{lstlisting} +ls assays/Zhu2017_IsoSeq/dataset +\end{lstlisting} + + +This is a fairly early Iso-Seq dataset designed to investigate any role of alternative +splicing in the abscisic acid response in Arabidopsis seedlings +(\url{https://www.ncbi.nlm.nih.gov/pubmed/28407323}). This study also relied heavily +on Illumina data, which we won't be using here. +Just a brief caveat that as Iso-Seq develops (and becomes cheaper) much deeper sequencing and more replicates +will almost certainly be expected than is seen here. But it's plenty for the workshop; +we'll even only be using one of their two conditions. + +While the raw .hd5 files were downloaded from SRA, we've done some of the initial +pre-processing for you a) for the sake of time and b) because the early steps +in particular are very specific to a given technology and can often be provided by a sequencing +center. + +Still, it looked like this: + +\lstset{language=bash, style=bashstyle, frame=trbl, rulecolor=\color{red}} + +% TODO raw/m16*bax.h5 missing +\begin{lstlisting} +# convert from Hierarchical Data Format files to bam +# (sequel anyways directly outputs bam for newer data) +bax2bam raw/m16*bax.h5 +# organize things by moving output into it's own directory +mkdir -p runs/isoseq/subreads runs/isoseq/ccs +mv m16* subreads/ +# take the raw subreads and convert them to the 'circular consensus sequence' +bam=m161031_124550_42199_c100941222550000001823217706101620_s1_X0.subreads.bam +ccs --numThreads=4 --noPolish --minLength=50 \ + --maxLength=15000 --minPasses=1 --minPredictedAccuracy=0.8 \ + --reportFile=runs/isoseq/ccs/keep.ccs.report --minSnr=3.75 --minZScore=-999 \ + --maxDropFraction=0.8 runs/isoseq/subreads/$bam runs/isoseq/ccs/m16.ccs.bam +# we also truncated the output file name here for the workshop, although +# we'd generally encourage you too keep the naming as consistent as possible +\end{lstlisting} + +\subsection{First Look} + +So first thing first, take a look at the data. + +\lstset{language=bash, style=bashstyle} + +%TODO add `cd` or rephrase +\begin{lstlisting} +# check that you're in the IsoSeqData directory and +ls -R runs/isoseq/ +# OK, this is new, as you can see there are neither fasta nor fastq sequencing files +# Instead the .bam files are the main sequence containers, +# we've seen .sam/.bam files before for an _alignment_, hmmm... +# further, the following won't get us very far +less runs/isoseq/subreads/m16*.subreads.bam +# note that throughout this whole section using the full file name instead +# of m16* is always OK. Just hit tab. +\end{lstlisting} + +So what's going on here? While they aren't aligned to a \emph{reference}, +one of PacBio's raw reads is well represented aligned back to \emph{itself} +as the read is basically going in circles. The circular consensus sequence +is no longer aligned to anything, but it avoids a trivial format conversion. +It also allows a little more flexibility for PacBio to sneak some additional +data in. + +We're going to need some ways to view .bam files now. They have a text-based +(and much more hard-drive-hungry) sister format, .sam. Luckily interconversion +is easy. + +\begin{lstlisting} +# convert to the text format, and pipe the first few lines to a file +# so we can look at them in more detail +samtools view runs/isoseq/subreads/m16*.bam|head > runs/isoseq/subreads/head_subreads.sam +samtools view runs/isoseq/ccs/m16*.bam|head > runs/isoseq/ccs/head_ccs.sam +\end{lstlisting} + +Now take a look at the files with either \il{less} or \il{gedit}. +We'd also encourage you to use \il{cut -f N <file>} to look just at a +specific column (N) when appropriate. +There are a few things to notice. +\begin{itemize} +\item The basic format: google 'sam format' or checkout +\url{https://samtools.github.io/hts-specs/SAMv1.pdf} for a full description, +but the basic idea of the format is that it has different columns for: the sequence, +its quality information, where and how it maps to a reference (or not), +and some. +\item OK, it's not quite normal .bam/.sam format, PacBio has expanded it some, +if you're curious see: \url{https://pacbiofileformats.readthedocs.io/en/3.0/BAM.html} +\item Let's look at the sequence itself (column 10), note the large stretches +of AAAA's and TTTT's, these are your poly-A tail being read as the circle +turns round. +\item Judging just off of the poly-A tail, do you notice the +quality difference between the subreads and the ccs? +\item Look at the sequence identifiers in column 1. For the subreads you see +that after taking the top few lines with \il{head} +we only actually see sub reads from one read (.../8/) +while sub-coordinates are given in the final field. In contrast we see +different reads in the ccs and the final field just says 'ccs'. +\item Can you find the primer sequences (see \il{clonetech_SMARTer.fa}) +in the reads? Hint: you can orient where to look with the poly-A tail. +\end{itemize} + +Unfortunately there really is currently no equivalent of FastQC or any at-the-start +quality control for PacBio data. The quality can be estimated by mapping +to a reference, but we're getting there in good time anyways; so for now +just try and get a feel for the sort of sequence we're looking at. The subreads +go in circles, the ccs is the consensus of the subreads from individual full-reads. +Both still have adapters and poly-A tails. + +\subsection{Trimming, Clustering and Polishing} +\subsubsection{Leveling up relative to day one} +On the first day with the Illumina RNAseq data, we always just put results +in our main directory, and it was pretty full. While this made things a +bit easier at the beginning, it becomes very confusing as a project grows. +Both because today's +work has the tendency to produce multiple and complicated output files +and because it is generally better form, we will try and use sub directories +to structure the project a bit more. + +Similarly, on the first day we wrote out, at least for the Kallisto pipeline, +almost every parameter. Today we'll just take the basic explanations from each +tool's usage function, and only stop to explain parameters that are particularly +confusing or where we've deviated from the 'standard analysis' pipeline. But +please feel free to ask questions (or point out anything confusing)! + +\subsubsection{Adapter trimming} + +We'll trim off the adapters with \il{lima}. If the run contained multiple +multiplexed libraries, we could also perform demultiplexing with \il{lima}. + +\begin{lstlisting} +mkdir -p runs/isoseq/trimmed/ +lima runs/isoseq/ccs/m16.ccs.bam workflows/pacbiosciences_lima/clonetech_SMARTer.fa \ + runs/isoseq/trimmed/m16.ccs.bam --no-pbi --isoseq +# The first three arguments are obvious from the usage +# Usage: lima [options] INPUT BARCODES OUTPUT +# --no-pbi since we won't need the .pbi output +# --isoseq is necessary to tell it what to expect for the primers + +# let's check the output +ls runs/isoseq/trimmed/ +# in particular, the summary file might be helpful +less runs/isoseq/trimmed/m16.ccs.lima.summary +# In addition to removing adapters +# we see lima removed any reads with mismatched adpaters +# and now let's convert some of the .bam output to .sam as above +samtools view runs/isoseq/trimmed/m16.ccs.primer_5p--primer_3p.bam | head \ + > runs/isoseq/trimmed/head_trimmed.sam +less runs/isoseq/trimmed/head_trimmed.sam +# notice that _most_ of the adapters are now gone and that +# _most_ of the poly-A tails are at the end of the sequence now +\end{lstlisting} + +\subsubsection{Clustering} +So at this point we have some reads that are \emph{in some ways} similar +to that which we had after running Trimmomatic on the Illumina data. But +not quite. We just saw the remaining poly-As and adapters that were in the +middle of some reads, which may be caused by concatemers. Further, +as the reads have a poly-A in them, we essentially have strand information. + +The next step, \il{isoseq3 cluster}, cleans up any concatemers and orients +reads while clustering different reads that \emph{appear to} represent the same transcript. + +Note that the primary goal of this step is the reconstruction of the +gene models, which should be done with all data, not sample by sample. +So if we were using more samples, we'd want to run \il{dataset create} +first, which would link files in a way that they could all be processed together. + +\begin{lstlisting} +mkdir -p runs/isoseq/unpolished/ +# from the help function: +# isoseq3 cluster [options] input output +isoseq3 cluster runs/isoseq/trimmed/m16.ccs.primer_5p--primer_3p.bam \ + runs/isoseq/unpolished/m16.unpolished.bam --verbose --require-polya +# again, we have a lot of output +ls -sh runs/isoseq/unpolished/ +# there's two files we really care about, and both are .bam +# *.unpolished.flnc.bam contains the fully cleaned, AKA +# "full-length non-chimeric" reads. Finally. +# *.unpolished.bam has draft reconstructed transcripts + +# we can also get a report on how many flnc reads were used +# for each draft transcript +isoseq3 summarize runs/isoseq/unpolished/m16.unpolished.bam \ + runs/isoseq/unpolished/m16.summary.csv +less runs/isoseq/unpolished/m16.summary.csv +\end{lstlisting} + +After this, the pipelines we're looking at today start to diverge, +with the first "ref based, mapped reads" option continuing with +the flnc reads and the others continuing to polishing, below. + +\paragraph{Optional challenge: } +We got to this step, compared it to the reported results +from the paper to see if the basics (e.g. number of transcripts produced) +seemed similar, and freaked out a bit. + +\begin{lstlisting} +# how many draft reconstructed transcripts did we get? +samtools view runs/isoseq/unpolished/m16.unpolished.bam|wc --lines +\end{lstlisting} + +Can you see why we were worried? Can you also figure out why we decided +the difference was nothing to worry about, just a reason to read papers with care? + +\subsubsection{Polishing} +In terms of quantity, each draft transcript was constructed +from an average of around 10 full-length non-chimeric reads, which sounds +OK at first. but if you glanced at \il{runs/isoseq/unpolished/m16.summary.csv} +or maybe even made a histogram, +you know that they are definitely not evenly distributed. +Indeed, well over half are constructed from just 2-3 flnc reads. +But at the very start of this pipeline we had a lot more +data, right? The subreads. + +The next step is to take the draft `unpolished' transcripts and carefully polish +them with all that subread data we had at the very very start. This will allow +any reads that were dropped (e.g. because they were not full length) to still +contribute to the quality of the final sequences, and further allow for a +high quality ccs read from 10+ subreads to essentially be trusted more than +a ccs read with just a single subread. + +%%%%%%%%% +%TODO : running into error: Missing .pbi + +\begin{lstlisting} +mkdir -p runs/isoseq/polished/ +# isoseq3 polish -h # uncomment for explanation +isoseq3 polish runs/isoseq/unpolished/m16.unpolished.bam \ + runs/isoseq/subreads/m16*.subreads.bam \ + runs/isoseq/polished/m16.polished.bam +\end{lstlisting} + +OK, this step is going to take a long time (maybe an hour?). It's working with 5GB of reads after all +But we actually want to be able to compare the pipelines in the end, so we didn't +want to unnecessarily subset this data. + +In the mean time we can continue with the "reference based, mapped reads" /"left most" +option. + +%%%%%%%%% +%TODO : adapt back-up routine +% Maybe add a back-up folder to runs? + +If something goes wrong or if you get through the "Mapping FLNC reads" part +before it finishes. There is a copy of the expected output at \il{runs/_BackUpData/isoseq/polished/} +Feel free to copy the data from there if need be. +\lstset{language=bash, style=bashstyle, frame=trbl, rulecolor=\color{red}} + +\begin{lstlisting} +cp runs/_BackUpData/isoseq/polished/* runs/isoseq/polished/ +\end{lstlisting} +% printf "\nruns/isoseq/polished/\n" >> .gitignore + +\subsection{Coding Genome Definition} +\subsubsection{Reference Based} +\paragraph{Mapping FLNC reads.} + +While polishing is running happily, we're going to check out what happens +when we take the cleaned up "full-length non-chimeric consensus reads" and +map them to the genome directly. This method will depend upon the genome +to help clean up any errors in the reads, but let's face it, \emph{sometimes} +you probably should trust the genome over the new transcriptome. Not always of course, +but sometimes, often even. + +GMAP is a pretty fast way to map full transcripts or similar back to the genome. +But like most aligners, it expects .fasta or .fastq files. + +Open a new terminal (Ctr+Alt+T) so you can work while polishing runs. + +\lstset{language=bash, style=bashstyle} +\begin{lstlisting} +# activate the virtual environment in the new terminal +source $HOME/Documents/venv/bin/activate +\end{lstlisting} + +Now let's take a look at the `flnc' reads + +\begin{lstlisting} +# we have a .bam file for the flnc reads +samtools view runs/isoseq/unpolished/m16.unpolished.flnc.bam | \ + head > runs/isoseq/unpolished/head_flnc.sam +less runs/isoseq/unpolished/head_flnc.sam +# two things: +# First, we finally have some cleaned-up mappable looking reads +# Second, the data from a fasta file is just a subset of what we see +# the read name (column 1), and the sequence (column 10) +# we're just going to make a fasta file out of the bam with simple text manipulation +samtools view runs/isoseq/unpolished/m16.unpolished.flnc.bam | \ + awk '{print ">"$1"\n"$10}' > \ + runs/isoseq/unpolished/m16.unpolished.flnc.fa +# we simply googled 'bam to fasta awk', just FYI +head runs/isoseq/unpolished/m16.unpolished.flnc.fa +\end{lstlisting} + +Back to GMAP, like the other aligners we've seen, GMAP is going to need +an index, and then we can map to this. + +\begin{lstlisting} +# make index (-D where -d index_name input.fa) +gmap_build -D runs/gmap_index -d Athaliana \ + studies/AthalianaReferences/resources/Athaliana_167_TAIR9.fa +# map reads +mkdir -p runs/isoseq/mapped/ +# this should take about 8 min with one thread +# add -t N to speed it up if polishing is finished +gmap -D runs/gmap_index -d Athaliana -f samse -n 0 --cross-species \ + --max-intronlength-ends 200000 -z sense_force \ + runs/isoseq/unpolished/m16.unpolished.flnc.fa 2> \ + runs/isoseq/mapped/m16.flnc.gmap.log > runs/isoseq/mapped/m16.flnc.sam +# the parameters are simply those recommended for isoseq, but briefly +# -D, -d : to find the index made above +# -f samse : to export .sam format +# -n 0 : allow chimeric alignments +# --max-intronlength-ends 200000 : more realistic max for EUK introns +# -z sense_force : since sequences were oriented 5'-3' by poly-A tail +# 2> runs/isoseq/mapped/m16.flnc.gmap.log : redirect copius standard error output to file +# > runs/isoseq/mapped/m16.flnc.sam : finally redirect standard out to our sam file + +# now we have a "normal" sam/bam file +less runs/isoseq/mapped/m16.flnc.sam +# note: in columns 3, and 4 we have the chromosome and position of mapping + +# we'll want a sorted bam file for the next step +samtools sort runs/isoseq/mapped/m16.flnc.sam > runs/isoseq/mapped/m16.flnc.sorted.bam +\end{lstlisting} + +OK, we have mapped data, it's time to take a break and really look at our long +read data, and also how it compares to the Illumina reads we had before. +Feel free to pair up with your neighbor for the next part and have one person +run the Illumina half and the other the Iso-Seq half. + +%TODO: create samtools index in runs/ +%TODO: data missing in runs/hisat_results/treatment1.hisat.sam -> earlier script didn't run. +%TODO: %TODO: Stopped here for now. + + +\begin{lstlisting} +# first to visualize the reads we'll need indexes +# one for the genome +samtools faidx studies/AthalianaReferences/resources/Athaliana_167_TAIR9.fa +# one for the new mapping of flnc reads +samtools index runs/isoseq/mapped/m16.flnc.sorted.bam +# for the Illumina reads we need both to convert to bam and sort +samtools sort runs/hisat_results/treatment1.hisat.sam \ + > runs/isoseq/mapped/illumina.sorted.bam +# and we need an index +samtools index runs/isoseq/mapped/illumina.sorted.bam +\end{lstlisting} + +To look at the assembly, we want to briefly install a genome browser, +Tablet, \emph{on the host machine}. + +Open a new terminal on the host machine, and run +\begin{lstlisting} +cd $HOME/rnaseq-workshop/ +./workflows/tablet_linux_x64_1_21_02_08.sh +# click through the installer, it should open a +# GUI genome visualization tool when done. +\end{lstlisting} + +In Tablet, you'll want to click to "Import an assembly". +Keep in mind that in Tablet lingo "assembly" is referring +to what we've been calling "mapped" or "bam". You'll also +want to open the "Reference" + +\il{studies/AthalianaReferences/resources/Athaliana_167_TAIR9.fa}. + +Finally, once +you've imported the "assembly" you'll want to click on "Import features" +and open + +\il{studies/AthalianaReferences/resources/Athaliana_167_TAIR10.gene_exons.gtf}. + +Now take some time to look at the data. Check out your favorite gene perhaps. +Can you see why long reads are so much better for cases of complicated +alternative splicing? Do you see other differences between the data (besides +that they are from totally different samples with different genes expressed)? +Is the typical coverage distribution the same? + +Once you've got a feel for the differences between Illumina and Iso-Seq data +we'll move on to collapsing mapped reads into transcripts. + +\begin{lstlisting} +mkdir collapsed +# the next script requires a _sorted_ sam file, so one more conversion +samtools view runs/isoseq/mapped/m16.flnc.sorted.bam > \ + runs/isoseq/mapped/m16.flnc.sorted.sam +# and the actual command to collapse +collapse_isoforms_by_sam.py --input runs/isoseq/unpolished/m16.unpolished.flnc.fa \ + -s runs/isoseq/mapped/m16.flnc.sorted.sam -o collapsed/m16.flnc.to_genome \ + --flnc_coverage 2 + +ls collapsed +\end{lstlisting} + +To understand everything going on and the parameters for the previous step, +we need to think a little bit about some of the wet lab details of Iso-Seq. +In particular, reading a PacBio pamphlet, or just looking at the naming of +"full length, non chimeric" reads, it seems like one could simply collapse every +read that maps to the same position, perhaps allowing a small margin of flexibility +for what counts as "the same". Certainly, this is part of what is going on, and +the \il{--max_3_diff} of 100 bps is basically this "small margin of flexibility". +But what about the \il{--max_5_diff} which defaults to 1000 bps? This must be, +and is, much more +lenient because of wet lab constraints. Nothing in the general Iso-Seq protocol +actually has a way to tell if reads run from the 5' cap to the 3' poly-A tail, +but rather, what PacBio defines as "full length" are reads that have the poly-A tail +on one end, and the 5' adapter on the other end. +The 5' adapters are attached with a blunt end +ligation, so this does mean that reverse transcription successfully proceeded +to the end of the piece of RNA. It doesn't mean that this was still an intact +mRNA, and indeed, normally degradation will result in reads, particularly, +from longer transcripts, that are slightly to severely truncated on the 5' end. +That's why the collapsible margin is wider on the five prime end. + +Some care should be taken here depending upon whether one is working with +FLNC reads or high quality transcripts. This script appears to have been designed +primarily for performing the second round of collapsing for the high quality +transcripts. Therefore it's likely that both 5' and 3' margins should be increased +when working with FLNC reads, although \emph{most} of the time it seemed ok for the +FLNC reads here. If you have high quality transcripts from \il{isoseq3 polish} +that you trust a little more you could also set \il{--dun-merge-5-shorter}, +which will prevent alignments from collapsing that differ on the 5' end by less +than the margin, but which have a different set of exons. + +\includegraphics[width=0.65\textwidth]{dun_merge.pdf} + +Finally, the \il{--flnc_coverage 2} is \emph{only} appropriate for FLNC reads, +as it will cause every possible transcript that is supported by less than 2 +alignments to be sorted into the 'bad' gff file while the rest end up in the 'good' +gff file. With high quality transcripts, if the first step worked right for a transcript, +you only would have a coverage of 1, but definitely want to keep it (and consider it +'good') anyways. +We were skeptical this made sense for FLNC reads either, as particularly with +a dataset like this that is a little low on coverage in general, there will be +a lot of real transcripts with just one read. I personally hate throwing out +precious data. However, the \il{collapse_isoforms_by_sam.py} tool does not have a +complicated error model or anything behind it, it basically just applies a series +of thresholds to see if some alignments should be collapsed. This means it treats +a single unique transcript, which provides the only data for a genetic region the same +as it treats a single transcript that is 'unique' only because it was erroneous and +slightly misaligned at a splice junction. In the first case, we want +to keep the single read, because it's all we have. However, in the second case, +we definitely want to ignore the single outlying read and trust the majority +at this locus. Luckily splitting the transcripts into 'good' +and 'bad' allows us to handle them in the future with different trustworthiness and +basically set the 'bad' set to be ignored at every locus where there was a 'good' +model. This workaround is a bit unwieldy, and certainly not ideal, +but it introduces fewer errors than without using +\il{--flnc_coverage 2}, while keeping more data than just taking high quality +transcripts. + +Now you could stop here and just use the 'good' transcripts, or add one more +step to merge in the non-conflicting set of 'bad' transcripts. +That would provide a model for the transcribed parts of the genome. +The recommend Iso-Seq tutorials kinda peter out here, and maybe there +are projects where one would stop here, e.g. if you were running Iso-Seq to +learn about alternative splicing in your favorite tissue, but in an otherwise +well annotated species. That said, one of the most promising applications of +Iso-Seq is to help define gene models for newly sequenced species. And +here, where the goal is to describe full gene models, we really ought to take +full advantage of the genome to help clean up, in particular, omissions in our +transcript models. + +For instance, the genome can help further when: +\begin{itemize} +\item None of the reads / transcripts for a locus were truly full length +\item Transcripts for a locus were not expressed (enough to be measured) in any sequenced condition +\end{itemize} + +\fbox{\begin{minipage}{45em} +OK, typically to define genes for a new genome one would bring in \emph{all} +the extrinsic data that is available, whether it was Illumina RNAseq, Iso-Seq, +protein alignments from related species, ESTs, or something else and feed +all of this to help support +a \emph{de novo} gene caller such as SNAP or Augustus that has a model for typical +codon usage of genic and non-genic regions and can look for likely ORFs. +This is somewhat organized and automated in public tools such as MAKER or PASA. +\\ + +That said, that really starts to get beyond the scope of this course, so we will +simply look at how you could use \emph{just} the Iso-Seq data to support the +\emph{de novo} gene caller Augustus. \\ + +Hopefully, Iso-Seq will be incorporated more officially into gene calling +pipelines in the near future. +\end{minipage}} + +Augustus can incorporate extrinsic evidence, like our Iso-Seq data, by adjusting +its posterior probability of a particular call based upon the extrinsic data +provided as 'hints'. That is to say, it runs its \emph{de novo} prediction +and then adjusts them as more or less likely based on the hints. + +So we have to convert our collapsed gff file into a slightly different gff file that +Augustus can understand as hints. Basically what we're doing here is converting the +transcripts into a format where we can give AUGUSTUS more careful instructions on how to +consider each part. Are the splice junctions as trust worthy as the aligned regions +themselves? What about the transcription start site? Should exons and transcripts be considered +only as a whole, or also as a part? + +Let's just try it and look at the result. + +\begin{lstlisting} +# convert to gff3 (bc the hints script takes only gff3 correctly) +pfx=collapsed/m16.flnc.to_genome.collapsed. +gffread ${pfx}bad.gff -o ${pfx}bad.gff3 +gffread ${pfx}good.gff -o ${pfx}good.gff3 +# feel free to look at output with 'less' to see differences between gff and gff3 + +# convert our exons and transcripts to 'hints' for augustus +# note the 'good' hints (w/ 2+ reads) have a higher priority than the 'bad' +gff3_to_hints_isoseq.py -i ${pfx}good.gff3 -o ${pfx}good.hints.gff3 --priority=2 +gff3_to_hints_isoseq.py -i ${pfx}bad.gff3 -o ${pfx}bad.hints.gff3 --priority=1 + +# finally we'll need the hints to be in one file +cat ${pfx}good.hints.gff3 ${pfx}bad.hints.gff3 > ${pfx}hints.gff3 +\end{lstlisting} + +Let's look at the 'features' column of the gff, and how they've now changed +from the typical 'transcript' 'exon' that we had before. + +\begin{lstlisting} +# open file | get 3rd column | sort | count unique lines +less ${pfx}hints.gff3 |cut -f3 | sort | uniq -c +\end{lstlisting} + +OK, so first, all 'ep' is very similar to the number of exons +(if you run a similar count on the input files) and the features +tts and tss are very close to the number of transcripts. The +remaining features are in between. Huh, but the names are pretty +cryptic, so what's what? + +\begin{tabular}{r l} + \toprule + \parbox[t][][t]{15mm}{ass} & + \parbox[t][][t]{135mm}{Acceptor Splice Site, on 3' end of the intron (most commonly AG). +Not our name for it.} \\ \midrule + \parbox[t][][t]{15mm}{dss} & + \parbox[t][][t]{135mm}{Donor Splice Site, on the 5' end of the intron (most commonly GT)} \\ \midrule + + \parbox[t][][t]{15mm}{ep} & + \parbox[t][][t]{135mm}{Exon Part, so an exon with the edges trimmed (by \il{--trim_exonparts}) and Augustus will try to contain it in an exon, not match exactly} \\ \midrule + + \parbox[t][][t]{15mm}{ip} & + \parbox[t][][t]{135mm}{Intron Part, so an intron with the edges trimmed (by \il{--trim_intronparts}) and Augustus will try to contain it in an intron, not match exactly} \\ \midrule + + \parbox[t][][t]{15mm}{tss} & + \parbox[t][][t]{135mm}{Transcription Start Site} \\ \midrule + + + \parbox[t][][t]{15mm}{tts} & + \parbox[t][][t]{135mm}{Transcription Termination Site} \\ \midrule + + \bottomrule +\end{tabular} + + +I'd like to say we could run augustus now, but it would run overnight for the +whole genome. So we're just going to run it for a subset of the genome. + +The following script will cut a little chunk out of fa, gff3, and bam files +alike. This is appropriate for the workshop or other niche cases like iterating +through different parameters, etc to see what works best. But for the record, +if you're doing this for the main run on a real project, please don't +cut it smaller than your chromosomes/scaffolds. + +\begin{lstlisting} +# feel free to just type in place +official_gtf=studies/AthalianaReferences/resources/Athaliana_167_TAIR10.gene_exons.gtf + +subset_genome_related.py --fasta studies/AthalianaReferences/resources/Athaliana_167_TAIR9.fa \ + -sChr1 -f1 -t300000 --gff ${pfx}good.gff3,${pfx}bad.gff3,${pfx}hints.gff3,$official_gtf \ + --bam runs/isoseq/mapped/m16.flnc.sorted.bam + +# the script reports its output, but if it's easier +find -name *__Chr1* +\end{lstlisting} + +Now we can run augustus +\begin{lstlisting} +# I don't like typing the same things over and over again +where=Chr1_1-300000 +mkdir gene_models +augustus --hintsfile=${pfx}hints__${where}.gff3 --species=arabidopsis \ + --alternatives-from-evidence=true --extrinsicCfgFile=extrinsic.E.cfg \ + --UTR=on --allow_hinted_splicesites=atac \ + studies/AthalianaReferences/resources/Athaliana_167_TAIR9__${where}.fa \ + > gene_models/flnc.${where}.augustus + +less ${pfx}augustus +# the augustus output has the hints, commented protein sequence, explanation, +# and, what we are after, gtf lines with 'AUGUSTUS', the gene models. +# let's subset it to have just these. +less gene_models/flnc.${where}.augustus | grep AUGUSTUS > \ + gene_models/flnc.${where}.augustus.gtf +\end{lstlisting} +%# copying the commands for the full runs here so as to be able to provide +%# comparative results +%augustus --hintsfile=collapsed/m16.flnc.to_genome.collapsed.hints.gff3 --species=arabidopsis --alternatives-from-evidence=true --extrinsicCfgFile=extrinsic.E.cfg --UTR=on --allow_hinted_splicesites=atac studies/AthalianaReferences/resources/Athaliana_167_TAIR9.fa > ../BackUpData/augustus/m16.flnc.to_genome.full.gtf +%# and for the hq (note next section must also be finished) +%augustus --hintsfile=collapsed/m16.hq.to_genome.collapsed.hints.gff3 --species=arabidopsis --alternatives-from-evidence=true --extrinsicCfgFile=extrinsic.E.cfg --UTR=on --allow_hinted_splicesites=atac studies/AthalianaReferences/resources/Athaliana_167_TAIR9.fa> ../BackUpData/augustus/m16.hq.to_genome.full.gtf +%augustus --species=arabidopsis --UTR=on studies/AthalianaReferences/resources/Athaliana_167_TAIR9.fa> ../BackUpData/augustus/no_hints.full.gtf + +OK cool, we have our gene models, we recommend loading them into tablet +(with the subsetted .bam and .fa files) and comparing them to the subsetted versions of the +raw transcript (good/bad.gff3), and the official Arabidopsis annotation +(studies/AthalianaReferences/resources/Athaliana\_167\_TAIR10.gene\_exons.gtf). + +Tablet is a little frustrating in that loading multiple 'transcript' features +from different .gff's will pile them on top of each other in one line. You +may a) open multiple tablet instances (think teamwork) or b) adjust the 'feature' +column of the gff to have a different name. See below for an example + +\begin{lstlisting} +less gene_models/flnc.${where}.augustus.gtf | awk 'BEGIN {OFS = FS = "\t"} \ + { sub(/^/, "flnc.", $3) }1' > gene_models/flnc.${where}.augustus.gtf.tablet +less gene_models/flnc.${where}.augustus.gtf.tablet # check results +# don't worry about understanding all of the 'awk' command. The first bit +# is to tell it to use only tabs as a column separator. The second bit says +# find and replace the start of column 3 with "flnc." +# if you want to run this for the other files. The only part of the awk command +# you'll want to change is the "flnc." (e.g. to "original.") +\end{lstlisting} + +Final note on viewing multiple gffs worth of feature sets in Tablet, you'll have to go to the features tab +(blue triangle), click on 'select tracks', and then check all the tracks you want +to see + +In any case, congratulations for getting all the way to our first full +gene models! + +\paragraph{Mapping high quality isoforms} +We're now going to run (nearly) all the same steps for mapping high quality +isoforms that we ran for mapping flnc reads above. As it is essentially +the same pipeline, we are not going to give you the commands, +although we will point out every instance where the commands should +differ from above beyond the dataset being used. It is entirely up +to you if, when and how much you check the intermediate output +with e.g. less and tablet. For clarity, wherever you're supposed to fill +in something it will be marked with \il{...} for a whole line or \il{< >} +for a partial line. + +Since you no longer have an exact script, we highly recommend filling +in one as you go (e.g. in a text file) + +\begin{lstlisting} +# first, we need to look at the output from polishing +# if polishing did not finish, please stop it and copy the backup polishing results +ls polished +# The polished high quality draft transcripts are m16.polished.hq.fastq.gz +# extract them with 'gunzip' +... +# we'll start with gmap. +# we already have a database, so we don't have to run gmap_build, just gmap +# your input file will be 'polished/m16.polished.hq.fastq' +# please name the output in a way you know what it is and that it doesn't +# overwrite the previous analyses. +... +# we'll now need to sort, and convert back to sam +# think 'samtools view' for conversions, and 'samtools sort' to sort +... +# while you're at it and for later, please index your sorted bam +samtools index <your sorted bam> +# for the collapse_isoforms_by_sam.py, there are quite a few changes +collapse_isoforms_by_sam.py --input <the hq fastq file> -s \ + <the gmap sam output> -o <your output prefix> --fq +# we don't want a coverage filter as we had for the flnc reads, +# and we have to explicitly tell the script to parse fastq input (--fq) + +# if you check your output, you'll see fewer output files than before, +# as without the --flnc_coverage parameter, the output won't be split into +# 'good' and 'bad'. Can you find the main output? It should be named +# <your output prefix>.collapsed.gff +# please convert this to gff3 with 'gffread' +... +# please convert the .gff3 file to hints using gff3_to_hints_isoseq.py +# note that you do not need to set --priority (but it also won't hurt) +... +# please subset the new bam and gff files with ./subset_genome_related.py +subset_genome_related.py --bam <your sorted.bam> \ + --gff <your collapsed.gff>,<your hints.gff> -s Chr1 -f 1 -t 300000 +# and run augustus. +# All you will need to change are the hints and output files. +... +# and subset to just lines containing 'AUGUSTUS' with 'grep' +... +\end{lstlisting} +% and with answers.... +%# first, we need to look at the output from polishing +%# if polishing did not finish, please stop it and copy the backup polishing results +%ls polished +%# The polished high quality draft transcripts are m16.polished.hq.fastq.gz +%# extract them with 'gunzip' +%gunzip runs/isoseq/polished/m16.polished.hq.fastq.gz +%# we'll start with gmap. +%# we already have a database, so we don't have to run gmap_build, just gmap +%# your input file will be 'polished/m16.polished.hq.fastq' +%# please name the output in a way you know what it is and that it doesn't +%# overwrite the previous analyses. +%gmap -D runs/gmap_index -d Athaliana -f samse -n 0 --cross-species --max-intronlength-ends 200000 -z sense_force runs/isoseq/polished/m16.polished.hq.fastq 2> runs/isoseq/mapped/m16.hq.gmap.log > runs/isoseq/mapped/m16.hq.sam +%# we'll now need to convert to bam, sort, and convert back to sam +%# think 'samtools view [-b]' for conversions, 'samtools sort' to sort +%samtools view -b runs/isoseq/mapped/m16.hq.sam |samtools sort > runs/isoseq/mapped/m16.hq.sorted.bam +%samtools view runs/isoseq/mapped/m16.hq.sorted.bam > runs/isoseq/mapped/m16.hq.sorted.sam +%# while you're at it and for later, please index your sorted bam +%samtools index runs/isoseq/mapped/m16.hq.sorted.bam +%# for the collapse_isoforms_by_sam.py, there are quite a few changes +%collapse_isoforms_by_sam.py --input runs/isoseq/polished/m16.polished.hq.fastq -s \ +% runs/isoseq/mapped/m16.hq.sorted.sam -o collapsed/m16.hq.to_genome --fq +%# we don't want a coverage filter as we had for the flnc reads, +%# and we have to explicity tell the script to parse fastq input (--fq) +% +%# if you check your output, you'll see fewer output files than before, +%# as without the --flnc_coverage parameter, the output won't be split into +%# 'good' and 'bad'. Can you find the main output? It should be named +%# <your output prefix>.collapsed.gff +%# please convert this to gff3 with 'gffread' +%gffread collapsed/m16.hq.to_genome.collapsed.gff -o collapsed/m16.hq.to_genome.collapsed.gff3 + +%# please convert the .gff3 file to hints using gff3_to_hints_isoseq.py +%# note that you do not need to set --priority (but it also won't hurt) +%gffread collapsed/m16.hq.to_genome.collapsed.gff -o collapsed/m16.hq.to_genome.collapsed.gff3 +%# please subset the new bam and gff files with ./subset_genome_related.py +%subset_genome_related.py -s Chr1 -f1 -t300000 --bam runs/isoseq/mapped/m16.hq.sorted.bam --gff collapsed/m16.hq.to_genome.collapsed.gff,collapsed/m16.hq.to_genome.collapsed.hints.gff3 --fasta studies/AthalianaReferences/resources/Athaliana_167_TAIR9.fa +%# and run augustus. +%# All you will need to change are the hints and output files. +%augustus --hintsfile=collapsed/m16.hq.to_genome.collapsed.hints__Chr1_1-300000.gff3 --species=arabidopsis --alternatives-from-evidence=true --extrinsicCfgFile=extrinsic.E.cfg --UTR=on --allow_hinted_splicesites=atac studies/AthalianaReferences/resources/Athaliana_167_TAIR9__${where}.fa > gene_models/hq.$where.augustus +%# and subset to just lines containing 'AUGUSTUS' with 'grep' +%less gene_models/hq.$where.augustus|grep AUGUSTUS > gene_models/hq.Chr1_1-300000.augustus.gtf + +Alright, well done. You can get surprisingly far in executing a bioinformatics +pipeline by taking an example from somewhere else that does approximately +what you want, checking help functions to see if all parameters make sense +for your data, and then plugging in your data (as you basically just did above). +That said it is extremely important that you check as you go to see if the +output is giving you what you expect, to read what information is available +on the tools and generally try not to blindly trust it fits your use-case. + +We'll do a larger scale comparison of the methods later, but feel free to +compare the final gene models in tablet. + +\subsubsection{\emph{de Novo}} +So you've now seen two variations on reference based gene calling, and you've +probably also gotten a chance to appreciate some of the major advantages, such +as using the genome to fill in missing transcripts or 5' degraded transcripts. + +But sometimes the genome is not available, or it can be so fragmented that many +to most genes do not occur on one scaffold, or otherwise have quality issues. +In such cases it can be preferable to use a \emph{de Novo} approach. + +That said, the only \emph{de Novo} approaches we found very much consist of +\textbf{under-development software} that hasn't yet been polished for an end user. +In as much, we'll be approaching the next section as an \textbf{example of how you +have to sanity check work as you go and be ready to fill gaps.} + +The first step is, as with a genome based approach, to collapse draft transcripts +to somewhat better draft transcripts. When a genome was available this made perfect +sense, the genome brought in extra information to differentiate say alleles from paralogs. +However, it is still quite necessary for a \emph{de Novo} approach, as the first +clustering step was intentionally conservative on the logic it is easier to collapse +later than to deal with transcripts that have been too aggressively merged. +Thus, the next clustering step with cogent is required to shift from 'under clustering' +to a somewhat more aggressive 'best guess' sort of clustering. + +The main commands we've taken from \url{https://github.com/Magdoll/Cogent/wiki/Running-Cogent} + +Instead of having a \emph{clustering} step, as under-development +software, Cogent has a multi step process, with distance calculation, +partitioning, and consensus calling all performed separately + +\begin{lstlisting} +# distance calculation: +run_mash.py -h +# looks like we really just need an input fasta file +# which we just need to extract from the polishing step +gunzip runs/isoseq/polished/m16.polished.hq.fasta.gz + +run_mash.py --cpus 4 runs/isoseq/polished/m16.polished.hq.fasta +# check output +less runs/isoseq/polished/m16.polished.hq.fasta.s1000k30.dist +# you see pairs of transcripts with some info on +# distance and alignment length in the later columns. +# Seems reasonable. We'll also check the size of the file +# we don't know how large it is supposed to be, but I think +# we can say that if has fewer lines than input sequences or more +# than input sequences squared, we should worry. +wc --lines runs/isoseq/polished/m16.polished.hq.fasta.s1000k30.dist +\end{lstlisting} + +In the next step we move from pairwise distances to partitions / +clusters. + +\begin{lstlisting} +# partitioning +process_kmer_to_graph.py -h +# we'll need the input and output from above and we'll need +# to set an output directory & prefix +mkdir cogent +# also the wiki recommended using -c COUNTS_FILE, and showed an example +# COUNTS_FILE with (hq transcript ID, N flnc reads). +# Hmmm, do we have the number for how many full length reads +# went into each high quality transcript? +less runs/isoseq/polished/m16.polished.hq.fasta +# we do, we just have to parse it out of the fasta headers +less runs/isoseq/polished/m16.polished.hq.fasta | grep '>' | cut -f1 -d';' |\ + sed -e 's/>//g' -e 's/full_length_coverage=//g' -e 's/ /\t/g' \ + > runs/isoseq/polished/m16.polished.hq.weights +# for help understanding the above command, please truncate it +# to the grep, check the output, and then start extending it piece by +# piece (grep, grep | cut, grep | cut | sed -e, grep | cut | sed -e -e, etc...) + +# back to partitioning +process_kmer_to_graph.py -c runs/isoseq/polished/m16.polished.hq.weights \ + runs/isoseq/polished/m16.polished.hq.fasta \ + runs/isoseq/polished/m16.polished.hq.fasta.s1000k30.dist \ + cogent/ m16.polished.hq +ls cogent # puh, that's a lot of output +# let's keep looking at one folder +ls cogent/m16.polished.hq_0 +less cogent/m16.polished.hq_0/in.fa +less cogent/m16.polished.hq_0/in.weights +# alright, it looks like we have a bunch of folders that contain the +# bits of fasta and weight assigned to their partition. Simple actually. +\end{lstlisting} + +In the final Cogent step, the sequences within the partitions are finally +assigned to transcripts and collapsed. +\begin{lstlisting} +# consensus +reconstruct_contig.py -h +# from the wiki (more than the help function) we know +# this has to be ran for each partition. We'll use a for loop +for item in `ls cogent`; +do + reconstruct_contig.py -p $item cogent/$item/; +done + +ls cogent/m16.polished.hq_0/ +# the final output is always in cogent2.renamed.fasta +# now lets sanity check the results\ +# from the genome-based hq analysis we have some idea what to expect for +# number of transcripts +less collapsed/m16.hq.to_genome.collapsed.gff|awk '$3 == "transcript"'|wc --lines +# so nearly 2k +# counting from this output format is a bit harder +# but ultimately is just concatenating all the fasta files, and counting the +# headers as we learned on the first day +cat cogent/m16.polished.hq_*/cogent2.renamed.fasta | grep '>' | wc --lines +# a lot less... hmmm... +# while it's same order of magnitude and _could_ be reasonable +# it was worrisome enough to start double checking +# let's count home many draft transcripts were still there after partitioning +cat cogent/m16.polished.hq_*/in.fa |grep '>'|wc +# and compare to total hq draft transcripts +less runs/isoseq/polished/m16.polished.hq.fasta | grep '>' |wc +# so barely over half of the sequences remained after partitioning. +# if we look at one of the meta info outputs from process_kmers_to_graph.py +less m16.polished.hq.partition.txt +# we see a) that the vast majority of partitions have 2+ sequences and +# b) a list of unassigned sequences, we can count these +less m16.polished.hq.partition.txt | grep unassigned | sed 's/,/ /g' | wc --words +# exactly the number we were missing. Success. +# OK, but what should we do with these, if we were making orthogroups, +# we would just skip the 'unassigned' singletons. But in this case, being a +# singleton doesn't even mean it's low quality or an outlier, just that the FLNCs from +# this draft transcript were successfully collapsed in the polish step, and cogent +# didn't have anything left to do. +# We already put together a convenience python script that will join +# collapsed and singleton transcripts together in a single file. +clean_cogent_output.py -f runs/isoseq/polished/m16.polished.hq.fasta -c cogent/ \ + -o collapsed/m16.hq.de_novo.fa +# feel free to run with -h, or simply open if you want to know how it works +less collapsed/m16.hq.de_novo.fa|grep '>'|wc +# and now we have very comparable results to the genome based methods (numerically) +\end{lstlisting} + +\fbox{\begin{minipage}{45em} +To be fair to the Cogent developers, this was all documented, and they +do provide a way to get everything back together, see: +\href{https://github.com/Magdoll/Cogent/wiki/Tutorial\%3A-Using-Cogent-to-collapse-redundant-transcripts-in-absence-of-genome} +{Using-Cogent-to-collapse-redundant-transcripts-in-absence-of-genome}. +That said, we missed the documentation in the first run, and noticing the +discrepancy between expectation and result was therefore critical to getting +the output right. +Ultimately our solution was easier for the workshop, so we kept it. +\end{minipage}} + +OK, so we have transcript models now. Without a genome one certainly couldn't +use e.g. augustus to extend or fill gaps in the sequences, what's missing +is irrecoverable (without more sequencing). But we aren't done yet, the other +thing that Augustus gave us was predictions for cds/protein sequences. For this +we'll use ANGEL. When we were working with Augustus, we just used the pre-trained +model for arabidopsis, ANGEL doesn't ship with pre-trained models, but it is +easy to train. + +First we need a training set, we'll just take a naive ORF caller (you know, based +on start/stop codons and length), and grab the transcripts with the longest predicted CDS. +Then we'll filter them, e.g. to avoid having two nearly-identical transcripts for +training. Then we'll train. + +\begin{lstlisting} +mkdir angel + +dumb_predict.py -h +# since this is such a small dataset we set --min_aa_length kinda low +dumb_predict.py collapsed/m16.hq.de_novo.fa angel/m16.hq.de_novo --min_aa_length=290 +ls angel/ # we'll continue with *.final.* +# pick low-redundancy training set +angel_make_training_set.py -h +angel_make_training_set.py angel/m16.hq.de_novo.final angel/m16.hq.de_novo.final.train +ls angel/ # we'll continue with *.train.cds and *.train.utr +# and train, this will take a minute +angel_train.py -h +angel_train.py angel/m16.hq.de_novo.final.train.cds \ + angel/m16.hq.de_novo.final.train.utr angel/arabidopsis.pickle --cpus=4 +# our trained model is now at angel/arabidopsis.pickle, importantly the computer +# can read it, even if we can't +\end{lstlisting} + +Now that we have our trained model, we just need to run it on all the sequences + +\begin{lstlisting} +angel_predict.py -h +# angel can only output predictions to the same directory +cd angel/ +# this will take several minutes +angel_predict.py ../collapsed/m16.hq.de_novo.fa arabidopsis.pickle m16.predictions \ + --output_mode=best --min_angel_aa_length 100 --min_dumb_aa_length 100 +ls +cd .. +\end{lstlisting} + +Now we have our completed gene models! + +\subsection{Comparing and Evaluating Gene Models} +And with completed gene models, whether or not one has several methods to compare, +one definitely wants some feedback on whether they are any good or not. + +We'll run through a few examples of basic ways to evaluate the results. +\textbf{We'll just provide the first command, and then let you figure out how to change +it / if it makes sense for the rest of the methods.} + +Exactly what you compare is up to you, you could compare the results of processing +the high quality draft transcripts in a \emph{de novo} or reference based manner. +You could compare the augustus output when it gets hints from the flnc reads, +the hq transcripts, or is run without hints. You could compare any of the methods +to the official \emph{Arabidopsis} gene models. You could compare the +genome-based IsoSeq transcripts to the Trinity assembly (minidata.fastq from +day 1 consisted of just reads that mapped to Chr1:1-300000). +There are a lot of options. + +As it can be hard to compare just the genome fragment we used for Augustus to the +full genome/transcriptome predictions, we've provided full augustus +results if you want them. + +\begin{lstlisting} +cp -r ../BackUpData/augustus ./ +\end{lstlisting} + +First we need to have actual transcript/protein sequences, these are easy to create from +the gtf/gffs. + +\begin{lstlisting} +# get transcript sequences from gff with just transcripts/exons +# e.g. our collapsed gff sequences +gffread collapsed/m16.flnc.to_genome.collapsed.good.gff -O \ + -g studies/AthalianaReferences/resources/Athaliana_167_TAIR9.fa \ + -w collapsed/m16.flnc.to_genome.collapsed.good.transcript.fa + +# get transcript, cds, and protein sequences from a full gff (e.g. from Augustus) +basename=studies/AthalianaReferences/resources/Athaliana_167_TAIR10.gene_exons +gffread ${basename}.gtf -g studies/AthalianaReferences/resources/Athaliana_167_TAIR9.fa \ + -w ${basename}.transcript.fa -x ${basename}.cds.fa -y ${basename}.protein.fa +\end{lstlisting} + +Next we'll want to run a numerical comparison. +How many transcripts/proteins were +produced and how long were they? +\begin{lstlisting} +# quast is a program that can perform a fairly thorough quality control +# on a variety of DNA sequences / assemblies. +quast.py angel/m16.predictions.ANGEL.cds --min-contig=1 -o quast/angel_cds +# but quast will only work for DNA sequence, for proteins we can use the +# count_fasta.pl from the first day. Copy it into the directory, then +./count_fasta.pl -i 20 angel/m16.predictions.ANGEL.pep +\end{lstlisting} + +Knowing you have long sequences is nice, and it's particularly encouraging +when you have longer protein sequences. However, sometimes and particularly when +there is no reference available, it can be hard to interpret. Moreover, given +an 'omics sized analysis, you'll always have the occasional long ORF by chance, +etc. So once one has a short list of methods that produced decent +\emph{quantitative} results, it's good to perform a more \emph{qualitative} +check. This can include steps you've seen before, like \emph{looking} at the +models with Tablet. But another important question is always whether the +sequences can be annotated. + +In plant biology, Mercator (and the newly released Mercator 4) provide a nice +quick way to get functional annotations. +You can upload your sequences to \url{http://www.plabipd.de/portal/web/guest/mercator4}, +and the results should be there within a +few minutes. That said, if everyone does this at once, it may take a good deal +longer, so feel free to look over each others shoulders. + +The web interface provides some summary output, but most importantly you can +download the results as a tab separated annotation file and then calculate +whatever you need. +\begin{lstlisting} +# count all transcripts +less <your results> |grep "T$"|wc --lines +# count annotated transcripts +less <your results> |grep "T$"|grep -v "not assigned"|wc --lines +\end{lstlisting} + +One could similarly look at assigning pfam domains, assigning GO terms, +or comparing transcripts directly to other related species whether via +reciprocal best blast or orthogroup creation. + +\paragraph{Challenge Assignment.} If you've made it this far, and we haven't +yet interrupted you to start the Q\&A session, you might want to take this time +to try and apply some of your R skills to the comparison above. Can you make +a plot of the N50 or total length for your comparisons? +Can you make a comparative +plot that summarizes the Mercator/MapMan annotations? + +\subsection{Future perspectives} + +We spent a lot of this section talking about how this was un-stable software, +how it was changing a lot, how to check as you go when working with new +pipelines, etc... You might almost get the idea that we don't expect you +to be able to use this script as an appropriate guideline 6 months from now, and you'd be right. + +Here are some things that we think are needed, and we think will come +\emph{soonish}. +These are things I would definitely look for at the start of my next project. +\begin{itemize} +\item Generally, there will be a flux, better tools will come out, tools will change names and parameters +\item Some equivalents of FastQC that give more tailored feedback for PacBio / Nanopore data will become available +\item Cogent or a similar tool will likely provide a 1-step process for genome-free draft transcript collapsing +\item More tools will become available that are specialized for direct RNAseq +\begin{itemize} +\item Genome-based collapse for Nanopore reads that has poly-A tail detection on the 3' end and, +of course, accounts for the higher / biased error. +\item Read and or transcript polishing (self but very importantly also Illumina based) that are explicitly aware of the dynamic range of transcriptomics data +\end{itemize} +\item There's some interest in 5' and 3' selection in the Iso-Seq pipeline, +which already even has some early software available for the collapse step +\href{https://github.com/GenomeRIK/tama}{TAMA} +\item Finally, in good time, both Iso-Seq and direct RNAseq will be incorporated into +full gene annotation pipelines. +\end{itemize}