deepSTABp
ARC for the paper DeepSTABp: A deep learning approach for the prediction of thermal protein stability. deepSTABp is a protein melting temperature (Tm) predictor and was developed to overcome the limitations of classical experimental approaches, which are expensive, labor-intensive, and have limited proteome and species coverage.
DeepSTABp uses a transformer-based protein language model for sequence embedding and state-of-the-art feature extraction, combined with other deep learning techniques, for end-to-end protein Tm prediction.
Usage
deepSTABp can be used directly via the web interface at: deepSTABp
An alternative to using the web interface is to clone this ARC and run it locally.
Setup environment
You can create a conda environment directly from the environment.yml file located in /workflows:
conda env create -f environment.yml
Running deepSTABp
Afterwards, you can use predict_main.py in workflows/TransformerBasedTMPrediction/prediction_model. Simply replace the example FASTA sequence with the path to your .fasta file and run predict_main.py.
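Before pointing the script at your own data, it can help to check that your .fasta file parses cleanly. The sketch below is a minimal, illustrative FASTA reader (it is not part of deepSTABp's code; the actual parsing inside predict_main.py may differ):

```python
from pathlib import Path

def read_fasta(path):
    """Parse a FASTA file into a {header: sequence} dict (minimal, no validation)."""
    records = {}
    header = None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequences may span multiple lines
    return {h: "".join(parts) for h, parts in records.items()}
```

A quick sanity check like this catches empty sequences or malformed headers before a long prediction run.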
Training your own model
In case you want to retrain deepSTABp, simply run the MLP_training.py file located in workflows/TransformerBasedTMPrediction/. You can also experiment with different architectures by directly editing the model structure defined in MLP_training.py. The other file in workflows/TransformerBasedTMPrediction/ is named tuning.py; run it after training, with your already pretrained model, to achieve optimal results.
Using the datafiles
All data files used to train deepSTABp can be found in the /runs/TransformerBasedTMPrediction/Datasets folder. The base folder contains the complete dataset; the training, testing, and validation folders contain the sampled datasets derived from the base dataset. The datasets are available in CSV and Parquet format.
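For a quick look at the CSV versions, the standard library is enough. The column names below are purely illustrative placeholders (check the actual files in runs/TransformerBasedTMPrediction/Datasets for the real schema):

```python
import csv
import io

# Hypothetical excerpt mimicking the dataset layout; the real column
# names may differ from protein_id / sequence / tm.
sample = io.StringIO(
    "protein_id,sequence,tm\n"
    "P12345,MKVLLA,54.2\n"
    "Q67890,ACDEFG,61.7\n"
)

rows = list(csv.DictReader(sample))
mean_tm = sum(float(r["tm"]) for r in rows) / len(rows)

# The Parquet versions can be read with pandas (requires pyarrow or
# fastparquet), e.g.:
#   import pandas as pd
#   df = pd.read_parquet("runs/TransformerBasedTMPrediction/Datasets/base/...")
```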
ARCs
for details, see https://github.com/nfdi4plants/ARC-specification
ARC directory structure
/isa.investigation.xlsx
- main investigation file (in XLSX format), contains top-level information about the investigation and links to studies and assays
/LICENSE
- license file for the arc
/arc.cwl
- (optional) executes all runs
Studies
All input material or (external) data resources are considered studies (immutable input data)
/studies/<yourStudyName>
- folder for a study, name is arbitrary
/studies/<yourStudyName>/resources
- study resources
/studies/<yourStudyName>/protocols
- study protocol as text (e.g. SOP)
/studies/<yourStudyName>/isa.study.xlsx
- per-study ISA file, contains only info about this study (in XLSX format)
Assays
All measurement (self-generated) datasets are considered assays (immutable input data)
/assays/<yourAssayName>
- folder for an assay, name is arbitrary
/assays/<yourAssayName>/dataset
- assay dataset
/assays/<yourAssayName>/protocols
- assay protocol as text (e.g. SOP)
/assays/<yourAssayName>/isa.assay.xlsx
- per-assay ISA file, contains only info about this assay (in XLSX format)
Workflows
All programmatic components of an ARC should go here, in the form of code and environment.
/workflows/<yourWorkflowName>/
- folder for code and its environment (workflow specifications); contains all files needed to specify a workflow. Packages and other includes should also go here.
/workflows/<yourWorkflowName>/Dockerfile
- top-level Dockerfile [optional] in which all code/workflow should execute
/workflows/<yourWorkflowName>/[codefiles;...]
- code files/scripts for computation
Runs
Runs are all artifacts that result from computation on studies, assays, or other runs.
/runs/<yourRunResultName>
- folder for the results of your run, i.e., a workflow execution
/runs/<yourRunResultName>/[files;...]
- run result files (plots, tables, etc.)
/runs/<yourRunResultName>.cwl
- executes a script or workflow and produces run results accordingly.