
Notepad around the workshop

These notes were collected during the workshop.

RStudio Keyboard Shortcuts

  • Execute line / highlighted block of code: Ctrl (Strg) + Enter
  • Duplicate line / highlighted block: Ctrl (Strg) + Shift + D
  • Delete line / highlighted block: Ctrl (Strg) + D
  • Interrupt R: Esc
  • (Un)comment line / highlighted block: Ctrl (Strg) + Shift + C

Notes on code

  1. You can start an interpreter from the terminal / command line
    • just type python, try some Python commands, then run quit() to return to the terminal.
    • just type R, try some R commands, then run q() to return to the terminal.
  2. interpreter ~ environment ~ programming language
    • Most (not all) are pre-installed on your machine
    • there are a gazillion other languages and interpreters (Perl, Julia, F#, ...)
  3. You can write a script (in a simple text editor or an IDE), save it and execute it from the command line
    • bash <nameOfScript>.sh
    • Rscript <nameOfScript>.R
    • python <nameOfScript>.py
  4. File extensions
    • are more for humans than for the machine
    • for the machine, they only select the default software that handles a specific file type
    • the file <-> software association is not "fixed"
    • you can add any extension and the script still works (try bash <nameOfScript>.randomExtension)
  5. IDEs (Integrated Development Environments)
    • Multi-purpose: Visual Studio Code (+ extensions)
    • Good for R: RStudio
    • Good for Python: PyCharm
  6. Specify the interpreter in the first line of the script (so it can be executed directly from the terminal), e.g.
    • bash: #!/bin/bash
    • Python: #!/usr/bin/env python3
    • R: #!/usr/bin/env Rscript

Once you have made the script executable (chmod +x <script>.sh), you can execute it directly (i.e. ./<script>.sh instead of bash ./<script>.sh).
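The points about file extensions, the interpreter line and chmod can be tried out together in a few lines. This is only a minimal sketch: the file name hello.randomExtension and the temporary directory are made up for illustration, and it assumes your temporary directory allows running executables.

```shell
#!/bin/bash
# Work in a throwaway directory so nothing in your project is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write a tiny script whose first line names the interpreter.
# The extension is deliberately meaningless (hypothetical file name).
cat > hello.randomExtension <<'EOF'
#!/bin/bash
echo "hello from bash"
EOF

# Variant 1: hand the file to the interpreter explicitly.
out_explicit=$(bash hello.randomExtension)

# Variant 2: make the file executable and let the first line pick the interpreter.
chmod +x hello.randomExtension
out_direct=$(./hello.randomExtension)

echo "explicit: $out_explicit"
echo "direct:   $out_direct"
```

Both variants should print the same message; the extension plays no role in either case.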

Challenge exercise Day 3

The pipeline we showed you in class was designed to work (mostly smoothly) with the data, structure and parameters exactly as we provided them. Now let's take this to the next level, i.e. transfer it to a real-life challenge, by reproducing some RNA-Seq results from a published paper.

Tips:

  • Along the way, you'll probably run into other important topics concerning good scientific practice (or bad examples of it). So don't be discouraged if it's harder than it should be.
  • Consider this a challenge somewhere between peer review, data reproducibility, positive controls (also for yourself: is your pipeline correct?) and FAIR data management.
  • To make life easier, don't take the first paper you come across, but rather search for one where you can answer the questions in (A) fairly easily.

A. Find data

  1. Find a paper from your research area of interest that used mRNA-Sequencing.

  2. Within that paper, find a figure that plots gene expression / transcript abundance of any kind, e.g.

    • bar / dot plots of gene expression
    • heat maps
    • ...
  3. Identify the experimental design, e.g.

    • What species was/were sequenced?
    • How many replicates?
    • Controls?
    • Different genotypes, ecotypes, treatments, other conditions, ...
  4. What RNA-Seq data was produced or re-used for the analysis?

    • What reference was used?
      • Transcriptome? Genome?
      • Version?
      • Can you find and access (i.e. download) it?
    • What reads were produced (i.e. *.fastq files)?
      • Sequencer?
      • Read length?
      • Paired or single end?
      • Are the reads trimmed / filtered?
      • Can you find and access (i.e. download) them?

B. Design your in silico experiment

  • From step 3: Pick a small sample subset

    • e.g. 3 replicates wildtype and 3 replicates mutant

    • Write this down into a simple spreadsheet, e.g.

      file_name            sample      group
      wt_rep1.fastq.gz     WT_rep1     WT
      wt_rep2.fastq.gz     WT_rep2     WT
      wt_rep3.fastq.gz     WT_rep3     WT
      mut_rep1.fastq.gz    mut_rep1    mutant
      mut_rep2.fastq.gz    mut_rep2    mutant
      mut_rep3.fastq.gz    mut_rep3    mutant
  • From step 4: Download the relevant data, i.e.:

    • reference transcriptome or genome
    • raw fastq files for your sample subset
  • From step 2: pick a simple plot that you feel you can manage to reproduce
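A sample sheet with the three columns above can also be generated and checked from the command line, which helps avoid copy-paste slips such as a duplicated sample name. The following is a sketch only: the file name samples.tsv, the tab-separated layout and the 3 + 3 design are assumptions taken from the example, not requirements of any tool.

```shell
#!/bin/bash
# Write the sheet into a throwaway directory (hypothetical location and file name).
workdir=$(mktemp -d)
sheet="$workdir/samples.tsv"

# Header line, then one tab-separated row per fastq file.
printf 'file_name\tsample\tgroup\n' > "$sheet"
for rep in 1 2 3; do
  printf 'wt_rep%s.fastq.gz\tWT_rep%s\tWT\n' "$rep" "$rep" >> "$sheet"
done
for rep in 1 2 3; do
  printf 'mut_rep%s.fastq.gz\tmut_rep%s\tmutant\n' "$rep" "$rep" >> "$sheet"
done

# Sanity checks: row count, rows per group, and sample names used more than once.
n_samples=$(tail -n +2 "$sheet" | wc -l | tr -d ' ')
n_wt=$(awk -F '\t' 'NR > 1 && $3 == "WT"' "$sheet" | wc -l | tr -d ' ')
n_mut=$(awk -F '\t' 'NR > 1 && $3 == "mutant"' "$sheet" | wc -l | tr -d ' ')
duplicates=$(tail -n +2 "$sheet" | cut -f2 | sort | uniq -d)

cat "$sheet"
echo "samples: $n_samples (WT: $n_wt, mutant: $n_mut), duplicated names: ${duplicates:-none}"
```

If the duplicate check reports anything other than "none", fix the sheet before starting the analysis.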

C. Run your analysis

  1. Put all data together into one directory (make sure the Docker container can write to it)
  2. Start the rnaseq Docker container
  3. Build the kallisto index for the reference (kallisto index, section 6.4.1)
  4. Write a for loop that... (section 6.4.1)
    • ...trims the fastq files (trimmomatic)
    • ...runs fastqc before and after trimming
    • ...maps (pseudoaligns) the fastq reads against the reference (kallisto quant)
  5. Start your RStudio Docker container
  6. Write an R script to
    • import the kallisto results (requires library(sleuth))
    • analyse the differential gene expression via sleuth
    • plot the results using ggplot2
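Steps 3 and 4 above could look roughly like the following dry-run sketch, which only prints the commands unless DRY_RUN=0 is set. Everything specific in it is an assumption to adapt to your data: the file names, single-end reads, the adapter file, the trimming settings (borrowed from the standard Trimmomatic example), the fragment length values, the number of bootstraps, and a trimmomatic wrapper command being available on the PATH (in some installations you call the jar via java instead). The run helper is hypothetical and not part of any of these tools. Section 6.4.1 remains the reference for the course setup.

```shell
#!/bin/bash
# Dry-run by default: commands are printed, not executed, unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
n_commands=0

# Hypothetical helper: count each command, then print or execute it.
run() {
  n_commands=$((n_commands + 1))
  if [ "$DRY_RUN" = "1" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# Assumed file names; replace with your own.
reference="transcriptome.fa"
index="transcriptome.idx"
adapters="adapters.fa"
samples="wt_rep1 wt_rep2 wt_rep3 mut_rep1 mut_rep2 mut_rep3"

# Step 3: build the kallisto index once for the reference.
run kallisto index -i "$index" "$reference"

# fastqc expects its output directories to exist already.
run mkdir -p fastqc_raw fastqc_trimmed

# Step 4: per sample, QC, trim, QC again, then quantify.
for s in $samples; do
  raw="${s}.fastq.gz"
  trimmed="${s}.trimmed.fastq.gz"

  run fastqc -o fastqc_raw "$raw"
  run trimmomatic SE -phred33 "$raw" "$trimmed" \
    "ILLUMINACLIP:${adapters}:2:30:10" SLIDINGWINDOW:4:15 MINLEN:36
  run fastqc -o fastqc_trimmed "$trimmed"
  # Single-end mode needs an assumed fragment length (-l) and its sd (-s);
  # bootstraps (-b) are added because sleuth uses them downstream.
  run kallisto quant -i "$index" -o "kallisto/${s}" -b 100 \
    --single -l 200 -s 20 "$trimmed"
done

echo "planned $n_commands commands"
```

Reading the printed commands before running them for real is a cheap way to catch wrong paths or sample names. For paired-end data, the trimmomatic and kallisto quant calls take two read files per sample and different arguments.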