Skip to content
Snippets Groups Projects
Commit c283e72c authored by Alisandra Denton's avatar Alisandra Denton
Browse files

up to collapse isoforms is working

parent 867b707c
No related branches found
No related tags found
No related merge requests found
......@@ -256,22 +256,18 @@ So if we were using more samples, we'd want to run \il{dataset create}
first, which would link files in a way that they could all be processed together.
\begin{lstlisting}
mkdir -p runs/isoseq/polished/
mkdir -p runs/isoseq/flnc/
# from the help function:
# isoseq refine [options] <ccs.demux.bam|xml> <primer.fasta|xml> <flnc.bam|xml>
isoseq3 refine runs/isoseq/trimmed/m16.ccs.primer_5p--primer_3p.bam \
workflows/pacbiosciences_lima/clonetech_SMARTer.fa \
runs/isoseq/polished/m16.polished.bam --verbose --require-polya
runs/isoseq/flnc/m16.flnc.bam --verbose --require-polya
# again, we have a lot of output
ls -sh runs/isoseq/polished/
# there's two files we really care about, and both are .bam
# *.polished.flnc.bam contains the fully cleaned, AKA
# "full-length non-chimeric" reads. Finally.
# *.polished.bam has draft reconstructed transcripts
# we can also get a report on how many flnc reads were used
# for each draft transcript
less runs/isoseq/polished/m16.polished.report.csv
ls -sh runs/isoseq/flnc/
# there's one file we really care about, and both are .bam
# *.flnc.bam contains the fully cleaned, AKA
# "full-length non-chimeric" reads.
\end{lstlisting}
After this, the pipelines we're looking at today start to diverge,
......@@ -283,64 +279,25 @@ We got to this step, compared it to the reported results
from the paper to see if the basics (e.g. number of transcripts produced)
seemed similar, and freaked out a bit.
\begin{lstlisting}
# how many draft reconstructed transcripts did we get?
samtools view runs/isoseq/unpolished/m16.unpolished.bam|wc --lines
\end{lstlisting}
How many draft reconstructed transcripts did we get?
Can you see why we were worried? Can you also figure out why we decided
the difference was nothing to worry about, just a reason to read papers with care?
\subsubsection{Polishing}
In terms of quantity, each draft transcript was constructed
from an average of around 10 full-length non-chimeric reads, which sounds
OK at first. but if you glanced at \il{runs/isoseq/unpolished/m16.summary.csv}
or maybe even made a histogram,
you know that they are definitely not evenly distributed.
Indeed, well over half are constructed from just 2-3 flnc reads.
But at the very start of this pipeline we had a lot more
data, right? The subreads.
The next step is to take the draft `unpolished' transcripts and carefully polish
them with all that subread data we had at the very very start. This will allow
any reads that were dropped (e.g. because they were not full length) to still
contribute to the quality of the final sequences, and further allow for a
high quality ccs read from 10+ subreads to essentially be trusted more than
a ccs read with just a single subread.
%%%%%%%%%
%TODO : running into error: Missing .pbi
Polishing has been deprecated and removed from the `isoseq3` pipeline,
(well polishing is still done during ccs creation)
as modern IsoSeq data is of sufficient quality it's not worth the time.
\begin{lstlisting}
# isoseq polish is no longer a thing...
\end{lstlisting}
OK, this step is going to take a long time (maybe an hour?). It's working with 5GB of reads after all
But we actually want to be able to compare the pipelines in the end, so we didn't
want to unnecessarily subset this data.
In the mean time we can continue with the "reference based, mapped reads" /"left most"
We can continue with the "reference based, mapped reads" /"left most"
option.
%%%%%%%%%
%TODO : adapt back-up routine
% Maybe add a back-up folder to runs?
If something goes wrong or if you get through the "Mapping FLNC reads" part
before it finishes. There is a copy of the expected output at \il{runs/_BackUpData/isoseq/polished/}
Feel free to copy the data from there if need be.
\lstset{language=bash, style=bashstyle, frame=trbl, rulecolor=\color{red}}
\begin{lstlisting}
cp runs/_BackUpData/isoseq/polished/* runs/isoseq/polished/
\end{lstlisting}
% printf "\nruns/isoseq/polished/\n" >> .gitignore
\subsection{Coding Genome Definition}
\subsubsection{Reference Based}
\paragraph{Mapping FLNC reads.}
While polishing is running happily, we're going to check out what happens
We're going to check out what happens
when we take the cleaned up "full-length non-chimeric consensus reads" and
map them to the genome directly. This method will depend upon the genome
to help clean up any errors in the reads, but let's face it, \emph{sometimes}
......@@ -350,31 +307,24 @@ but sometimes, often even.
GMAP is a pretty fast way to map full transcripts or similar back to the genome.
But like most aligners, it expects .fasta or .fastq files.
Open a new terminal (Ctr+Alt+T) so you can work while polishing runs.
\lstset{language=bash, style=bashstyle}
\begin{lstlisting}
# activate the virtual environment in the new terminal
source $HOME/Documents/venv/bin/activate
\end{lstlisting}
Now let's take a look at the `flnc' reads
\begin{lstlisting}
# we have a .bam file for the flnc reads
samtools view runs/isoseq/unpolished/m16.unpolished.flnc.bam | \
head > runs/isoseq/unpolished/head_flnc.sam
less runs/isoseq/unpolished/head_flnc.sam
samtools view runs/isoseq/flnc/m16.flnc.bam | \
head > runs/isoseq/flnc/head_flnc.sam
less runs/isoseq/flnc/head_flnc.sam
# two things:
# First, we finally have some cleaned-up mappable looking reads
# Second, the data from a fasta file is just a subset of what we see
# the read name (column 1), and the sequence (column 10)
# we're just going to make a fasta file out of the bam with simple text manipulation
samtools view runs/isoseq/unpolished/m16.unpolished.flnc.bam | \
samtools view runs/isoseq/flnc/m16.flnc.bam | \
awk '{print ">"$1"\n"$10}' > \
runs/isoseq/unpolished/m16.unpolished.flnc.fa
runs/isoseq/flnc/m16.flnc.fa
# we simply googled 'bam to fasta awk', just FYI
head runs/isoseq/unpolished/m16.unpolished.flnc.fa
head runs/isoseq/flnc/m16.flnc.fa
\end{lstlisting}
Back to GMAP, like the other aligners we've seen, GMAP is going to need
......@@ -390,7 +340,7 @@ mkdir -p runs/isoseq/mapped/
# add -t N to speed it up if polishing is finished
gmap -D runs/gmap_index -d Athaliana -f samse -n 0 --cross-species \
--max-intronlength-ends 200000 -z sense_force \
runs/isoseq/unpolished/m16.unpolished.flnc.fa 2> \
runs/isoseq/flnc/m16.flnc.fa 2> \
runs/isoseq/mapped/m16.flnc.gmap.log > runs/isoseq/mapped/m16.flnc.sam
# the parameters are simply those recommended for isoseq, but briefly
# -D, -d : to find the index made above
......@@ -400,6 +350,7 @@ gmap -D runs/gmap_index -d Athaliana -f samse -n 0 --cross-species \
# -z sense_force : since sequences were oriented 5'-3' by poly-A tail
# 2> runs/isoseq/mapped/m16.flnc.gmap.log : redirect copius standard error output to file
# > runs/isoseq/mapped/m16.flnc.sam : finally redirect standard out to our sam file
# this will take some minutes...
# now we have a "normal" sam/bam file
less runs/isoseq/mapped/m16.flnc.sam
......@@ -418,6 +369,7 @@ run the Illumina half and the other the Iso-Seq half.
%TODO: data missing in runs/hisat_results/treatment1.hisat.sam -> earlier script didn't run.
%TODO: %TODO: Stopped here for now.
%TODO: add demo command to copy Hisat2 data from _backup
\begin{lstlisting}
# first to visualize the reads we'll need indexes
......@@ -466,13 +418,14 @@ Once you've got a feel for the differences between Illumina and Iso-Seq data
we'll move on to collapsing mapped reads into transcripts.
\begin{lstlisting}
mkdir collapsed
mkdir runs/isoseq/collapsed
# the next script requires a _sorted_ sam file, so one more conversion
samtools view runs/isoseq/mapped/m16.flnc.sorted.bam > \
runs/isoseq/mapped/m16.flnc.sorted.sam
# and the actual command to collapse
collapse_isoforms_by_sam.py --input runs/isoseq/unpolished/m16.unpolished.flnc.fa \
-s runs/isoseq/mapped/m16.flnc.sorted.sam -o collapsed/m16.flnc.to_genome \
collapse_isoforms_by_sam.py --input runs/isoseq/flnc/m16.flnc.fa \
-s runs/isoseq/mapped/m16.flnc.sorted.sam \
-o runs/isoseq/collapsed/m16.flnc.to_genome \
--flnc_coverage 2
ls collapsed
......@@ -563,6 +516,8 @@ codon usage of genic and non-genic regions and can look for likely ORFs.
This is somewhat organized and automated in public tools such as MAKER or PASA.
\\
That, or use Helixer ;-)
That said, that really starts to get beyond the scope of this course, so we will
simply look at how you could use \emph{just} the Iso-Seq data to support the
\emph{de novo} gene caller Augustus. \\
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment