deepSTABp

ARC for the paper "DeepSTABp: A deep learning approach for the prediction of thermal protein stability". deepSTABp is a protein melting temperature (Tm) predictor and was developed to overcome the limitations of classical experimental approaches, which are expensive, labour-intensive, and have limited proteome and species coverage.

DeepSTABp uses a transformer-based protein language model for sequence embedding and state-of-the-art feature extraction, in combination with other deep learning techniques, for end-to-end protein Tm prediction.
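For orientation, here is a minimal sketch of how per-protein embeddings can be obtained from a ProtT5-style encoder via the Hugging Face transformers library. The checkpoint name and the mean pooling below are illustrative assumptions, not necessarily deepSTABp's exact embedding pipeline (see the code in workflows/ for that).

```python
# Minimal sketch (not deepSTABp's exact pipeline): per-protein embedding
# with a ProtT5-style encoder; checkpoint name and mean pooling are assumptions.
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

model_name = "Rostlab/prot_t5_xl_half_uniref50-enc"  # assumed encoder checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name)
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # example sequence
# ProtT5 expects space-separated residues; map rare amino acids to X
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(prepared, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len + 1, 1024)

# Mean-pool over the residue positions to get one vector per protein
per_protein = hidden[0, : len(sequence)].mean(dim=0)
print(per_protein.shape)  # torch.Size([1024])
```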

Usage

deepSTABp can either be used directly via the web interface: deepSTABp

An alternative to using the web interface is to clone this ARC and run it locally.

Setup environment

You can create a conda environment directly from the environment.yml file located in /workflows:

conda env create --name deepstabp --file environment.yml

Keep in mind that you must download the complete ARC, which requires git and git lfs as well as enough free disk space: the complete ARC has a size of around 67 GB, including all LFS files.
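If you have not cloned the ARC yet, a typical workflow with Git LFS looks like the following; the repository URL and folder name are placeholders for this ARC's actual address:

git lfs install
git clone <url-of-this-ARC>
cd <name-of-this-ARC>
git lfs pull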

Running deepSTABp

Afterwards you can use predict_cpu.py or predict_cuda.py in workflows/prediction_model: navigate to that folder in your command line and activate the conda environment you created.

conda activate deepstabp

Then run either predict_cpu.py or predict_cuda.py, depending on your hardware, followed by the FASTA file you want to analyse:

python predict_cpu.py filepath

You can specify the growth temperature and the protein environment as follows:

python predict_cpu.py filepath -g growthtemp<int> -m measurementcondition<str>

The only available measurement conditions are "Lysate" and "Cell". The default values for the growth temperature and the measurement condition are 22 degrees Celsius and "Cell", respectively. If you want to store the output file, specify the save path with the -s flag:

python predict_cpu.py filepath -g growthtemp<int> -m measurementcondition<str> -s savepath<str>
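For example, to analyse a FASTA file (here the hypothetical proteins.fasta) under the "Lysate" condition at a growth temperature of 37 degrees Celsius and save the output to a hypothetical results path:

python predict_cpu.py proteins.fasta -g 37 -m Lysate -s results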

Training your own model

In case you want to retrain deepSTABp, simply run the training script located at workflows/TransformerBasedTMPrediction/MLP_training.py. You can also experiment with different architectures by directly editing the model structure defined in MLP_training.py; a sketch of what such a model head can look like follows below. The other file found in workflows/TransformerBasedTMPrediction/ is named tuning.py, and you should run it after training with your already pretrained model to achieve optimal results.
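As an illustration of the kind of model structure you might edit, here is a hypothetical sketch of an MLP regression head in PyTorch. The layer sizes and activations are assumptions for demonstration only, not deepSTABp's actual architecture (see MLP_training.py for that).

```python
# Hypothetical sketch of an MLP head in the spirit of MLP_training.py;
# layer sizes and activations are illustrative assumptions, not
# deepSTABp's actual architecture.
import torch
import torch.nn as nn

class TmRegressor(nn.Module):
    def __init__(self, embedding_dim: int = 1024, hidden_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, 1),  # regresses the melting temperature (Tm)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x).squeeze(-1)

model = TmRegressor()
batch = torch.randn(4, 1024)  # four pooled per-protein embeddings
print(model(batch).shape)  # torch.Size([4])
```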

Using the datafiles

All datafiles that were used to train deepSTABp can be found in the /runs/TransformerBasedTMPrediction/Datasets folder. The base folder contains the complete dataset; the training, testing, and validation folders contain the sampled datasets derived from the base dataset. The datasets are available in CSV and Parquet format.
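For example, the datasets can be loaded with pandas; the file names below are assumptions, so check the Datasets folder for the actual names.

```python
# Minimal sketch for loading the datasets with pandas; the file names are
# assumptions -- check runs/TransformerBasedTMPrediction/Datasets for the
# actual ones. Reading Parquet requires pyarrow or fastparquet.
import pandas as pd

train = pd.read_parquet("runs/TransformerBasedTMPrediction/Datasets/training/training.parquet")
# or the CSV equivalent:
# train = pd.read_csv("runs/TransformerBasedTMPrediction/Datasets/training/training.csv")
print(train.head())
```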

ARCs

For details, see https://github.com/nfdi4plants/ARC-specification.

ARC directory structure

/isa.investigation.xlsx
main investigation file (in XLSX format), contains top-level information about the investigation and links to studies and assays
/LICENSE
license file for the ARC
/arc.cwl
(optional) executes all runs

Studies

All input material or (external) data resources are considered studies (immutable input data)

/studies/<yourStudyName>
folder for a study, name is arbitrary
/studies/<yourStudyName>/resources
study dataset
/studies/<yourStudyName>/protocols
study protocol as text (e.g. SOP)
/studies/<yourStudyName>/isa.study.xlsx
per-study ISA file, contains only info about this study (in XLSX format)

Assays

All measurement (self-generated) datasets are considered assays (immutable input data)

/assays/<yourAssayName>
folder for an assay, name is arbitrary
/assays/<yourAssayName>/dataset
assay dataset
/assays/<yourAssayName>/protocols
assay protocol as text (e.g. SOP)
/assays/<yourAssayName>/isa.assay.xlsx
per-assay ISA file, contains only info about this assay (in XLSX format)

Workflows

All programmatic components of an ARC should go here, in the form of code and environment.

/workflows/<yourWorkflowName>/
folder for code and its environment (workflow specifications); contains all files needed to specify a workflow. Packages and other required includes should also go here.
/workflows/<yourWorkflowName>/Dockerfile
(optional) top-level Dockerfile in which all code/workflows should execute
/workflows/<yourWorkflowName>/[codefiles;...]
code files/scripts for computation

Runs

Runs are all artifacts that result from computation on studies, assays, or other runs.

/runs/<yourRunResultName>
folder for the results of your run, i.e. a workflow execution
/runs/<yourRunResultName>/[files;...]
run result files (plots, tables, etc.)
/runs/<yourRunResultName>.cwl
executes a script or workflow and produces run results accordingly.