diff --git a/_reader/sections/04_section_longreads.tex b/_reader/sections/04_section_longreads.tex
index 0828439dd3b0457db4b7c82d6e0391a566d3ff7b..440b478d8f6352329c2b4c7dd75e0ac323d38e60 100644
--- a/_reader/sections/04_section_longreads.tex
+++ b/_reader/sections/04_section_longreads.tex
@@ -256,22 +256,18 @@ So if we were using more samples,
 we'd want to run \il{dataset create} first, which would link files
 in a way that they could all be processed together.
 
 \begin{lstlisting}
-mkdir -p runs/isoseq/polished/
+mkdir -p runs/isoseq/flnc/
 # from the help function:
 # isoseq refine [options] <ccs.demux.bam|xml> <primer.fasta|xml> <flnc.bam|xml>
 isoseq3 refine runs/isoseq/trimmed/m16.ccs.primer_5p--primer_3p.bam \
 workflows/pacbiosciences_lima/clonetech_SMARTer.fa \
- runs/isoseq/polished/m16.polished.bam --verbose --require-polya
+ runs/isoseq/flnc/m16.flnc.bam --verbose --require-polya
 
 # again, we have a lot of output
-ls -sh runs/isoseq/polished/
-# there's two files we really care about, and both are .bam
-# *.polished.flnc.bam contains the fully cleaned, AKA
-# "full-length non-chimeric" reads. Finally.
-# *.polished.bam has draft reconstructed transcripts
-
-# we can also get a report on how many flnc reads were used
-# for each draft transcript
-less runs/isoseq/polished/m16.polished.report.csv
+ls -sh runs/isoseq/flnc/
+# there's one file we really care about, and it's a .bam
+# *.flnc.bam contains the fully cleaned, AKA
+# "full-length non-chimeric" reads.
+
 \end{lstlisting}
 
 After this, the pipelines we're looking at today start to diverge,
@@ -283,64 +279,25 @@ We got to this step, compared it to the reported
 results from the paper to see if the basics (e.g. number of transcripts
 produced) seemed similar, and freaked out a bit.
 
-\begin{lstlisting}
-# how many draft reconstructed transcripts did we get?
-samtools view runs/isoseq/unpolished/m16.unpolished.bam|wc --lines
-\end{lstlisting}
+How many draft reconstructed transcripts did we get?
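+To check, count the records in the relevant .bam (just a sketch; shown
+here on the \il{flnc} output from the refine step above, but it works
+the same on any .bam, e.g. draft transcripts from \il{isoseq3 cluster}):
+
+\begin{lstlisting}
+# one record per read; samtools view omits the header by default
+samtools view runs/isoseq/flnc/m16.flnc.bam | wc --lines
+\end{lstlisting}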
 Can you see why we were worried?
 Can you also figure out why we decided the difference was nothing
 to worry about, just a reason to read papers with care?
 
 \subsubsection{Polishing}
-In terms of quantity, each draft transcript was constructed
-from an average of around 10 full-length non-chimeric reads, which sounds
-OK at first. but if you glanced at \il{runs/isoseq/unpolished/m16.summary.csv}
-or maybe even made a histogram,
-you know that they are definitely not evenly distributed.
-Indeed, well over half are constructed from just 2-3 flnc reads.
-But at the very start of this pipeline we had a lot more
-data, right? The subreads.
-
-The next step is to take the draft `unpolished' transcripts and carefully polish
-them with all that subread data we had at the very very start. This will allow
-any reads that were dropped (e.g. because they were not full length) to still
-contribute to the quality of the final sequences, and further allow for a
-high quality ccs read from 10+ subreads to essentially be trusted more than
-a ccs read with just a single subread.
-
-%%%%%%%%%
-%TODO : running into error: Missing .pbi
+Polishing has been deprecated and removed from the \il{isoseq3} pipeline
+(well, polishing is still done during ccs creation), as modern IsoSeq
+data is of sufficiently high quality that it isn't worth the time.
-\begin{lstlisting}
-# isoseq polish is no longer a thing...
-\end{lstlisting}
-
-OK, this step is going to take a long time (maybe an hour?). It's working with 5GB of reads after all
-But we actually want to be able to compare the pipelines in the end, so we didn't
-want to unnecessarily subset this data.
-
-In the mean time we can continue with the "reference based, mapped reads" /"left most"
+We can continue with the "reference based, mapped reads"/"left most"
 option.
 
-%%%%%%%%%
-%TODO : adapt back-up routine
-% Maybe add a back-up folder to runs?
-
-If something goes wrong or if you get through the "Mapping FLNC reads" part
-before it finishes.
-There is a copy of the expected output at \il{runs/_BackUpData/isoseq/polished/}
-Feel free to copy the data from there if need be.
-\lstset{language=bash, style=bashstyle, frame=trbl, rulecolor=\color{red}}
-
-\begin{lstlisting}
-cp runs/_BackUpData/isoseq/polished/* runs/isoseq/polished/
-\end{lstlisting}
-% printf "\nruns/isoseq/polished/\n" >> .gitignore
 
 \subsection{Coding Genome Definition}
 \subsubsection{Reference Based}
 \paragraph{Mapping FLNC reads.}
-While polishing is running happily, we're going to check out what happens
+We're going to check out what happens
 when we take the cleaned up "full-length non-chimeric consensus reads"
 and map them to the genome directly. This method will depend upon the
 genome to help clean up any errors in the reads, but let's face it, \emph{sometimes}
@@ -350,31 +307,24 @@ but sometimes, often even.
 
 GMAP is a pretty fast way to map full transcripts or similar back to the
 genome. But like most aligners, it expects .fasta or .fastq files.
-Open a new terminal (Ctr+Alt+T) so you can work while polishing runs.
- 
-\lstset{language=bash, style=bashstyle}
-\begin{lstlisting}
-# activate the virtual environment in the new terminal
-source $HOME/Documents/venv/bin/activate
-\end{lstlisting}
 
 Now let's take a look at the `flnc' reads
 
 \begin{lstlisting}
 # we have a .bam file for the flnc reads
-samtools view runs/isoseq/unpolished/m16.unpolished.flnc.bam | \
- head > runs/isoseq/unpolished/head_flnc.sam
-less runs/isoseq/unpolished/head_flnc.sam
+samtools view runs/isoseq/flnc/m16.flnc.bam | \
+ head > runs/isoseq/flnc/head_flnc.sam
+less runs/isoseq/flnc/head_flnc.sam
 
 # two things:
 # First, we finally have some cleaned-up mappable looking reads
 # Second, the data from a fasta file is just a subset of what we see
 # the read name (column 1), and the sequence (column 10)
 
 # we're just going to make a fasta file out of the bam with simple text manipulation
-samtools view runs/isoseq/unpolished/m16.unpolished.flnc.bam | \
+samtools view runs/isoseq/flnc/m16.flnc.bam | \
 awk '{print ">"$1"\n"$10}' > \
- runs/isoseq/unpolished/m16.unpolished.flnc.fa
+ runs/isoseq/flnc/m16.flnc.fa
 # we simply googled 'bam to fasta awk', just FYI
-head runs/isoseq/unpolished/m16.unpolished.flnc.fa
+head runs/isoseq/flnc/m16.flnc.fa
 \end{lstlisting}
 
 Back to GMAP, like the other aligners we've seen, GMAP is going to need
@@ -390,7 +340,7 @@ mkdir -p runs/isoseq/mapped/
-# add -t N to speed it up if polishing is finished
+# add -t N to use N threads and speed it up
 gmap -D runs/gmap_index -d Athaliana -f samse -n 0 --cross-species \
 --max-intronlength-ends 200000 -z sense_force \
- runs/isoseq/unpolished/m16.unpolished.flnc.fa 2> \
+ runs/isoseq/flnc/m16.flnc.fa 2> \
 runs/isoseq/mapped/m16.flnc.gmap.log > runs/isoseq/mapped/m16.flnc.sam
 # the parameters are simply those recommended for isoseq, but briefly
 # -D, -d : to find the index made above
@@ -400,6 +350,7 @@ gmap -D runs/gmap_index -d Athaliana -f samse -n 0 --cross-species \
 # -z sense_force : since sequences were oriented 5'-3' by poly-A tail
 # 2> runs/isoseq/mapped/m16.flnc.gmap.log : redirect
 # copious standard error output to file
 # > runs/isoseq/mapped/m16.flnc.sam : finally redirect standard out to our sam file
+# this will take some minutes...
 
 # now we have a "normal" sam/bam file
 less runs/isoseq/mapped/m16.flnc.sam
@@ -418,6 +369,7 @@ run the Illumina half and the other the Iso-Seq half.
 %TODO: data missing in runs/hisat_results/treatment1.hisat.sam -> earlier script didn't run.
 %TODO:
 %TODO: Stopped here for now.
+%TODO: add demo command to copy Hisat2 data from _backup
 
 \begin{lstlisting}
 # first to visualize the reads we'll need indexes
@@ -466,13 +418,14 @@ Once you've got a feel for the differences between Illumina and
 Iso-Seq data we'll move on to collapsing mapped reads into transcripts.
 
 \begin{lstlisting}
-mkdir collapsed
+mkdir -p runs/isoseq/collapsed
 # the next script requires a _sorted_ sam file, so one more conversion
 samtools view runs/isoseq/mapped/m16.flnc.sorted.bam > \
 runs/isoseq/mapped/m16.flnc.sorted.sam
 # and the actual command to collapse
-collapse_isoforms_by_sam.py --input runs/isoseq/unpolished/m16.unpolished.flnc.fa \
- -s runs/isoseq/mapped/m16.flnc.sorted.sam -o collapsed/m16.flnc.to_genome \
+collapse_isoforms_by_sam.py --input runs/isoseq/flnc/m16.flnc.fa \
+ -s runs/isoseq/mapped/m16.flnc.sorted.sam \
+ -o runs/isoseq/collapsed/m16.flnc.to_genome \
 --flnc_coverage 2
 
-ls collapsed
+ls runs/isoseq/collapsed
@@ -563,6 +516,8 @@ codon usage of genic and non-genic regions and can look for likely ORFs.
 This is somewhat organized and automated in public tools such as MAKER
 or PASA. \\
 
+That, or use Helixer ;-)
+
 That said, that really starts to get beyond the scope of this course, so we
 will simply look at how you could use \emph{just} the Iso-Seq data to support
 the \emph{de novo} gene caller Augustus. \\
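+
+As a preview, one way to get there (a sketch only; it assumes the
+\il{bam2hints} helper shipped with Augustus is installed, and reuses the
+sorted alignments from the mapping step above):
+
+\begin{lstlisting}
+# convert the Iso-Seq alignments into intron hints for Augustus
+bam2hints --intronsonly --in=runs/isoseq/mapped/m16.flnc.sorted.bam \
+  --out=runs/isoseq/mapped/m16.flnc.hints.gff
+\end{lstlisting}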