# Notepad around the workshop

> this is part of the notes collected during the workshop

- [RStudio Keyboard Shortcuts](#rstudio-keyboard-shortcuts)
- [Notes on codes](#notes-on-codes)
- [Challenge exercise Day 3](#challenge-exercise-day-3)
    - [A. Find data](#a-find-data)
    - [B. Design your *in silico* experiment](#b-design-your-in-silico-experiment)
    - [C. Run your analysis](#c-run-your-analysis)

## RStudio Keyboard Shortcuts

- Execute line / highlighted block of code: `Ctrl (Strg) + Enter`
- Duplicate line / highlighted block: `Ctrl (Strg) + Shift + D`
- Delete line / highlighted block: `Ctrl (Strg) + D`
- Interrupt R: `Esc`
- (Un)Comment line / highlighted block: `Ctrl (Strg) + Shift + C`

## Notes on codes

1. You can start an interpreter from terminal / command line

- just type `python`, try some python commands, run `quit()` to exit back to terminal.
- just type `R`, try some R commands, run `q()` to exit back to terminal.

2. interpreter ~ environment ~ programming language

- Most (not all) pre-installed on your machine
- gazillion other languages and interpreters (perl, julia, fsharp, ...)

3. You can write a script (in a simple text editor or IDE), store it and execute it from the command line

- `bash <nameOfScript>.sh`
- `Rscript <nameOfScript>.R`
- `python <nameOfScript>.py`

4. File extensions

- Extensions matter more to humans than to the machine
- For the machine, they mostly select the default software that handles a specific file type
- The file <-> software association is not "fixed"
- You can add any extension and the script still works (try `bash <nameOfScript>.randomExtension`)
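
A minimal sketch of the points above, assuming `bash` is available on your machine. The file names `demo.sh` and `demo.randomExtension` are made up for illustration. It writes a one-line script, copies it under a different extension and runs both copies through `bash`:

```shell
# Work in a throwaway directory so nothing in your project is touched.
workdir=$(mktemp -d)

# Write a one-line script to disk (hypothetical file name).
printf 'echo "the extension does not matter"\n' > "$workdir/demo.sh"

# Same content, different extension.
cp "$workdir/demo.sh" "$workdir/demo.randomExtension"

# bash reads the file content; it does not look at the extension.
out_sh=$(bash "$workdir/demo.sh")
out_random=$(bash "$workdir/demo.randomExtension")

echo "$out_sh"
echo "$out_random"
```

Both calls should print the same line, because the interpreter is chosen by the command you type, not by the file name.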

5. IDEs (Integrated Development Environments)

- Multi-purpose: Visual Studio Code (+ extensions)
- Good for R: RStudio
- Good for Python: PyCharm

6. Specify the interpreter in the first line (so the script can be executed directly from the terminal), e.g.

- bash: `#!/bin/bash`
- python: `#!/usr/bin/env python3`
- r: `#!/usr/bin/env Rscript`

> Once you make the script executable (`chmod +x <script>.sh`), you can execute it directly (i.e. `./<script>.sh` instead of `bash ./<script>.sh`).
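
The interpreter line and the `chmod +x` step can be tried out as follows. This is only a sketch: `greet.sh` is a hypothetical file name, and it assumes `bash` can be found via `/usr/bin/env` and that the temporary directory allows executing files.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)

# The first line of the script names the interpreter.
cat > "$workdir/greet.sh" <<'EOF'
#!/usr/bin/env bash
echo "run via the interpreter line"
EOF

# Make the script executable ...
chmod +x "$workdir/greet.sh"

# ... and call it directly, without typing "bash" in front of it.
shebang_output=$(cd "$workdir" && ./greet.sh)
echo "$shebang_output"
```

Without the `chmod +x` step, the direct call would typically fail with a "Permission denied" error, while `bash greet.sh` would still work.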


## Challenge exercise Day 3

The pipeline we've shown you in class was designed to work (mostly smoothly) with the data, structure and parameters just as we've provided them.
Now, let's try to take this to the next level, i.e. transfer it to a real-life challenge, by reproducing some RNA-Seq data from a published paper.

> Tips:
>
> - Along this adventure, you'll probably run into other important topics concerning *good scientific practice* (or bad examples of it). So don't worry if it's harder than it should be.
> - Consider this a challenge somewhere between peer review, data reproducibility, positive controls (also for yourself: is your pipeline correct?) and FAIR data management
> - To make life easier, don't just pick the first paper you find, but rather search for one where you can fairly easily answer the questions in (A)

#### A. Find data

1. Find a paper from your research area of interest that used **mRNA-Sequencing**.
2. Within that paper, find a figure that **plots gene expression** / transcript abundance of any kind, e.g.
    - bar / dot plots of gene expression
    - heat maps
    - ...

3. Identify the **experimental design**, e.g.
    - What species was/were sequenced?
    - How many replicates?
    - Controls?
    - Different genotypes, ecotypes, treatments, other conditions, ...

4. What **RNA-Seq data** was produced or re-used for the analysis?
    - What reference was used?
        - Transcriptome? Genome?
        - Version?
        - Can you find and access (i.e. download) it?
    - What reads were produced (i.e. `*.fastq` files)?
        - Sequencer?
        - Read length?
        - Paired or single end?
        - Are the reads trimmed / filtered?
        - Can you find and access (i.e. download) them?

#### B. Design your *in silico* experiment

- From step A.3: Pick a small sample subset
  - e.g. 3 replicates wildtype and 3 replicates mutant
  - Write this down into a simple spreadsheet, e.g.

    | file_name         | sample   | group  |
    |:----------------- |:-------- |:------ |
    | wt_rep1.fastq.gz  | WT_rep1  | WT     |
    | wt_rep2.fastq.gz  | WT_rep2  | WT     |
    | wt_rep3.fastq.gz  | WT_rep3  | WT     |
    | mut_rep1.fastq.gz | mut_rep1 | mutant |
    | mut_rep2.fastq.gz | mut_rep2 | mutant |
    | mut_rep3.fastq.gz | mut_rep3 | mutant |

- From step A.4: Download the relevant data, i.e.:
  - reference transcriptome or genome
  - raw fastq files for your sample subset
- From step A.2: Pick a simple plot that you feel you can manage to reproduce
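
If you prefer a plain-text file over a spreadsheet, the sample sheet can also be kept as a CSV file and checked from the terminal. The sketch below is only an illustration: the file name `samples.csv` is an assumption, and the entries follow the wildtype / mutant example with three replicates each. It checks that every sample name is unique and counts the replicates per group.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
samples_csv="$workdir/samples.csv"   # hypothetical file name

# One row per fastq file: file name, sample name, group.
cat > "$samples_csv" <<'EOF'
file_name,sample,group
wt_rep1.fastq.gz,WT_rep1,WT
wt_rep2.fastq.gz,WT_rep2,WT
wt_rep3.fastq.gz,WT_rep3,WT
mut_rep1.fastq.gz,mut_rep1,mutant
mut_rep2.fastq.gz,mut_rep2,mutant
mut_rep3.fastq.gz,mut_rep3,mutant
EOF

# Number of samples (skip the header line).
n_samples=$(tail -n +2 "$samples_csv" | wc -l | tr -d ' ')

# Number of distinct sample names (column 2); should equal n_samples.
n_unique=$(tail -n +2 "$samples_csv" | cut -d, -f2 | sort -u | wc -l | tr -d ' ')

echo "samples: $n_samples, unique sample names: $n_unique"

# Replicates per group (column 3).
tail -n +2 "$samples_csv" | cut -d, -f3 | sort | uniq -c
```

A mismatch between the two counts points to a copy-paste error such as a sample name that was used twice.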

#### C. Run your analysis

1. Put all data together into one directory (make it write-accessible by the docker container)
2. Start the *rnaseq* docker container
3. Build the index for the reference with `kallisto index` (section 6.4.1)
4. Write a for loop that... (section 6.4.1)
    - ...trims the fastq files (`trimmomatic`)
    - ...runs `fastqc` before and after trimming
    - ...maps the fastq reads against the reference (`kallisto quant`)
5. Start your RStudio docker
6. Write an R script to
    - import the kallisto results (requires `library(sleuth)`)
    - analyse the differential gene expression via `sleuth`
    - plot the results using `ggplot2`
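
Steps 3 and 4 could be sketched roughly like this. It is not the workshop's reference solution: all paths and sample names are hypothetical, it assumes single-end reads, and the trimming settings (`SLIDINGWINDOW:4:15 MINLEN:36`) as well as the fragment length and standard deviation passed to `kallisto quant` (`-l 200 -s 20`) are placeholders that you need to adapt to your data. By default the script only prints the commands (dry run); set `DRY_RUN=0` inside the container to actually run them.

```shell
# Sketch only: paths, sample names and parameters below are assumptions.
DRY_RUN=${DRY_RUN:-1}          # 1 = only print the commands, 0 = run them
reference="reference.fa"       # hypothetical reference transcriptome
index="reference.idx"          # index file written by kallisto index
reads_dir="reads"              # directory holding the raw fastq files
out_dir="results"              # everything produced goes here
n_commands=0                   # counts the commands that were planned

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    n_commands=$((n_commands + 1))
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Step 3: build the index once.
run kallisto index -i "$index" "$reference"

# Step 4: loop over the samples.
for sample in wt_rep1 wt_rep2 wt_rep3 mut_rep1 mut_rep2 mut_rep3; do
    raw="$reads_dir/$sample.fastq.gz"
    trimmed="$out_dir/trimmed/$sample.trimmed.fastq.gz"

    # fastqc needs its output directory to exist.
    run mkdir -p "$out_dir/fastqc_raw" "$out_dir/fastqc_trimmed" "$out_dir/trimmed"

    # Quality control before trimming.
    run fastqc -o "$out_dir/fastqc_raw" "$raw"

    # Trim single-end reads (placeholder settings).
    run trimmomatic SE "$raw" "$trimmed" SLIDINGWINDOW:4:15 MINLEN:36

    # Quality control after trimming.
    run fastqc -o "$out_dir/fastqc_trimmed" "$trimmed"

    # Quantify against the index (single-end mode needs -l and -s).
    run kallisto quant -i "$index" -o "$out_dir/kallisto/$sample" \
        --single -l 200 -s 20 "$trimmed"
done

echo "planned commands: $n_commands"
```

Printing the commands first makes it easy to check paths and sample names before spending compute time. For paired-end data, both the `trimmomatic` and the `kallisto quant` calls need to be changed accordingly.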
