HHU Plant Biochemistry / CEPLAS RNASeq Workshop 2022 / Commit c283e72c

up to collapse isoforms is working

authored 2 years ago by Alisandra Denton
parent 867b707c

1 changed file: _reader/sections/04_section_longreads.tex (+28 additions, −73 deletions)
...
...
@@ -256,22 +256,18 @@ So if we were using more samples, we'd want to run \il{dataset create}
first, which would link files in a way that they could all be processed together.
\begin{lstlisting}
- mkdir -p runs/isoseq/polished/
+ mkdir -p runs/isoseq/flnc/
# from the help function:
# isoseq refine [options] <ccs.demux.bam|xml> <primer.fasta|xml> <flnc.bam|xml>
isoseq3 refine runs/isoseq/trimmed/m16.ccs.primer_5p--primer_3p.bam \
    workflows/pacbiosciences_lima/clonetech_SMARTer.fa \
-   runs/isoseq/polished/m16.polished.bam --verbose --require-polya
+   runs/isoseq/flnc/m16.flnc.bam --verbose --require-polya
# again, we have a lot of output
- ls -sh runs/isoseq/polished/
- # there's two files we really care about, and both are .bam
- # *.polished.flnc.bam contains the fully cleaned, AKA
- # "full-length non-chimeric" reads. Finally.
- # *.polished.bam has draft reconstructed transcripts
- # we can also get a report on how many flnc reads were used
- # for each draft transcript
- less runs/isoseq/polished/m16.polished.report.csv
+ ls -sh runs/isoseq/flnc/
+ # there's one file we really care about, and it's a .bam
+ # *.flnc.bam contains the fully cleaned, AKA
+ # "full-length non-chimeric" reads.
\end{lstlisting}
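As a rough illustration of that \il{dataset create} linking step (an editor's sketch, not part of the workshop pipeline: it assumes the pbcoretools \il{dataset} CLI and uses made-up movie names), combining two demultiplexed movies might look something like this:
\begin{lstlisting}
# hypothetical sketch only: link two demultiplexed ccs movies into one dataset
# so that the following isoseq3 steps could process them together
dataset create --type ConsensusReadSet runs/isoseq/combined.consensusreadset.xml \
    m1.ccs.primer_5p--primer_3p.bam m2.ccs.primer_5p--primer_3p.bam
\end{lstlisting}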
After this, the pipelines we're looking at today start to diverge,
...
...
@@ -283,64 +279,25 @@ We got to this step, compared it to the reported results
from the paper to see if the basics (e.g. number of transcripts produced)
seemed similar, and freaked out a bit.
- \begin{lstlisting}
- # how many draft reconstructed transcripts did we get?
- samtools view runs/isoseq/unpolished/m16.unpolished.bam|wc --lines
- \end{lstlisting}
- How many draft reconstructed transcripts did we get?
- Can you see why we were worried? Can you also figure out why we decided
- the difference was nothing to worry about, just a reason to read papers with care?
- \subsubsection{Polishing}
- In terms of quantity, each draft transcript was constructed
- from an average of around 10 full-length non-chimeric reads, which sounds
- OK at first. but if you glanced at \il{runs/isoseq/unpolished/m16.summary.csv}
- or maybe even made a histogram,
- you know that they are definitely not evenly distributed.
- Indeed, well over half are constructed from just 2-3 flnc reads.
- But at the very start of this pipeline we had a lot more
- data, right? The subreads.
- The next step is to take the draft `unpolished' transcripts and carefully polish
- them with all that subread data we had at the very very start. This will allow
- any reads that were dropped (e.g. because they were not full length) to still
- contribute to the quality of the final sequences, and further allow for a
- high quality ccs read from 10+ subreads to essentially be trusted more than
- a ccs read with just a single subread.
- %%%%%%%%%
- %TODO : running into error: Missing .pbi
+ Polishing has been deprecated and removed from the `isoseq3` pipeline
+ (well, polishing is still done during ccs creation),
+ as modern Iso-Seq data is of sufficient quality that it's not worth the time.
+ \begin{lstlisting}
+ # isoseq polish is no longer a thing...
+ \end{lstlisting}
- OK, this step is going to take a long time (maybe an hour?). It's working with
- 5 GB of reads after all
- But we actually want to be able to compare the pipelines in the end, so we didn't
- want to unnecessarily subset this data.
- In the mean time we can continue with the "reference based, mapped reads" / "left most"
+ We can continue with the "reference based, mapped reads" / "left most"
option.
- %%%%%%%%%
- %TODO : adapt back-up routine
- % Maybe add a back-up folder to runs?
- If something goes wrong or if you get through the "Mapping FLNC reads" part
- before it finishes. There is a copy of the expected output at
- \il{runs/_BackUpData/isoseq/polished/}
- Feel free to copy the data from there if need be.
- \lstset{language=bash, style=bashstyle, frame=trbl, rulecolor=\color{red}}
- \begin{lstlisting}
- cp runs/_BackUpData/isoseq/polished/* runs/isoseq/polished/
- \end{lstlisting}
- % printf "\nruns/isoseq/polished/\n" >> .gitignore
\subsection{Coding Genome Definition}
\subsubsection{Reference Based}
\paragraph{Mapping FLNC reads.}
- While polishing is running happily, we're going to check out what happens
+ We're going to check out what happens
when we take the cleaned up "full-length non-chimeric consensus reads" and
map them to the genome directly. This method will depend upon the genome
to help clean up any errors in the reads, but let's face it, \emph{sometimes}
...
...
@@ -350,31 +307,24 @@ but sometimes, often even.
GMAP is a pretty fast way to map full transcripts or similar back to the genome.
But like most aligners, it expects .fasta or .fastq files.
- Open a new terminal (Ctr+Alt+T) so you can work while polishing runs.
- \lstset{language=bash, style=bashstyle}
- \begin{lstlisting}
- # activate the virtual environment in the new terminal
- source $HOME/Documents/venv/bin/activate
- \end{lstlisting}
Now let's take a look at the `flnc' reads
\begin{lstlisting}
# we have a .bam file for the flnc reads
- samtools view runs/isoseq/unpolished/m16.unpolished.flnc.bam | \
-     head > runs/isoseq/unpolished/head_flnc.sam
- less runs/isoseq/unpolished/head_flnc.sam
+ samtools view runs/isoseq/flnc/m16.flnc.bam | \
+     head > runs/isoseq/flnc/head_flnc.sam
+ less runs/isoseq/flnc/head_flnc.sam
# two things:
# First, we finally have some cleaned-up mappable looking reads
# Second, the data from a fasta file is just a subset of what we see
# the read name (column 1), and the sequence (column 10)
# we're just going to make a fasta file out of the bam with simple text manipulation
- samtools view runs/isoseq/unpolished/m16.unpolished.flnc.bam | \
+ samtools view runs/isoseq/flnc/m16.flnc.bam | \
    awk '{print ">"$1"\n"$10}' > \
-   runs/isoseq/unpolished/m16.unpolished.flnc.fa
+   runs/isoseq/flnc/m16.flnc.fa
# we simply googled 'bam to fasta awk', just FYI
- head runs/isoseq/unpolished/m16.unpolished.flnc.fa
+ head runs/isoseq/flnc/m16.flnc.fa
\end{lstlisting}
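A small aside from the editor: recent samtools releases also ship a built-in converter, so the awk one-liner above could likely be replaced by something like the following (untested here, shown only as an alternative):
\begin{lstlisting}
# alternative bam-to-fasta conversion using samtools directly
samtools fasta runs/isoseq/flnc/m16.flnc.bam > runs/isoseq/flnc/m16.flnc.fa
\end{lstlisting}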
Back to GMAP, like the other aligners we've seen, GMAP is going to need
...
...
@@ -390,7 +340,7 @@ mkdir -p runs/isoseq/mapped/
# add -t N to speed it up if polishing is finished
gmap -D runs/gmap_index -d Athaliana -f samse -n 0 --cross-species \
    --max-intronlength-ends 200000 -z sense_force \
-   runs/isoseq/unpolished/m16.unpolished.flnc.fa 2> \
+   runs/isoseq/flnc/m16.flnc.fa 2> \
    runs/isoseq/mapped/m16.flnc.gmap.log > runs/isoseq/mapped/m16.flnc.sam
# the parameters are simply those recommended for isoseq, but briefly
# -D, -d : to find the index made above
...
...
@@ -400,6 +350,7 @@ gmap -D runs/gmap_index -d Athaliana -f samse -n 0 --cross-species \
# -z sense_force : since sequences were oriented 5'-3' by poly-A tail
# 2> runs/isoseq/mapped/m16.flnc.gmap.log : redirect copious standard error output to file
# > runs/isoseq/mapped/m16.flnc.sam : finally redirect standard out to our sam file
# this will take some minutes...
# now we have a "normal" sam/bam file
less runs/isoseq/mapped/m16.flnc.sam
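# (editor's sketch, not part of this hunk: the sorted and indexed bam that later
#  steps reference, runs/isoseq/mapped/m16.flnc.sorted.bam, would presumably
#  be produced along these lines)
samtools sort -o runs/isoseq/mapped/m16.flnc.sorted.bam runs/isoseq/mapped/m16.flnc.sam
samtools index runs/isoseq/mapped/m16.flnc.sorted.bam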
...
...
@@ -418,6 +369,7 @@ run the Illumina half and the other the Iso-Seq half.
%TODO: data missing in runs/hisat_results/treatment1.hisat.sam -> earlier script didn't run.
%TODO: %TODO: Stopped here for now.
%TODO: add demo command to copy Hisat2 data from _backup
\begin{lstlisting}
# first to visualize the reads we'll need indexes
...
...
@@ -466,13 +418,14 @@ Once you've got a feel for the differences between Illumina and Iso-Seq data
we'll move on to collapsing mapped reads into transcripts.
\begin{lstlisting}
- mkdir collapsed
+ mkdir runs/isoseq/collapsed
# the next script requires a _sorted_ sam file, so one more conversion
samtools view runs/isoseq/mapped/m16.flnc.sorted.bam > \
    runs/isoseq/mapped/m16.flnc.sorted.sam
# and the actual command to collapse
- collapse_isoforms_by_sam.py --input runs/isoseq/unpolished/m16.unpolished.flnc.fa \
-     -s runs/isoseq/mapped/m16.flnc.sorted.sam -o collapsed/m16.flnc.to_genome \
+ collapse_isoforms_by_sam.py --input runs/isoseq/flnc/m16.flnc.fa \
+     -s runs/isoseq/mapped/m16.flnc.sorted.sam \
+     -o runs/isoseq/collapsed/m16.flnc.to_genome \
    --flnc_coverage 2
ls collapsed
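# (editor's note: going by the option name, --flnc_coverage 2 presumably keeps only
#  isoforms supported by at least 2 flnc reads; check the script's --help to confirm)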
...
...
@@ -563,6 +516,8 @@ codon usage of genic and non-genic regions and can look for likely ORFs.
This is somewhat organized and automated in public tools such as MAKER or PASA.
\\
That, or use Helixer ;-)
That said, that really starts to get beyond the scope of this course, so we will
simply look at how you could use \emph{just} the Iso-Seq data to support the
\emph{de novo} gene caller Augustus.
\\
...
...