deepSTABp
ARC for the paper DeepSTABp: A deep learning approach for the prediction of thermal protein stability. deepSTABp is a protein melting temperature (Tm) predictor and was developed to overcome the limitations of classical experimental approaches, which are expensive, labor-intensive, and have limited proteome and species coverage.
DeepSTABp uses a transformer-based protein language model for sequence embedding and state-of-the-art feature extraction, combined with other deep learning techniques, for end-to-end protein Tm prediction.
Usage
deepSTABp can be used directly via the web interface at: deepSTABp
An alternative to using the web interface is to clone this ARC and run it locally.
Setup environment
You can create a conda environment directly from the environment.yml file located in /workflows:
conda env create -f environment.yml
Running deepSTABp
Afterwards, you can use predict_main.py in workflows/TransformerBasedTMPrediction/prediction_model. Simply replace the example FASTA sequence with the path to your .fasta file and run predict_main.py.
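Before pointing the script at your own data, it can help to check that your .fasta file parses cleanly. The sketch below is a minimal, illustrative FASTA reader (it is not part of deepSTABp's code; the actual parsing inside predict_main.py may differ):

```python
from pathlib import Path

def read_fasta(path):
    """Parse a FASTA file into a {header: sequence} dict (minimal, no validation)."""
    records = {}
    header = None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequences may span multiple lines
    return {h: "".join(parts) for h, parts in records.items()}
```

A quick sanity check like this catches empty sequences or malformed headers before a long prediction run.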
Training your own model
In case you want to retrain deepSTABp, simply run the MLP_training.py file located in workflows/TransformerBasedTMPrediction/. You can also experiment with different architectures by directly editing the model structure defined in MLP_training.py. The other file in workflows/TransformerBasedTMPrediction/ is named tuning.py; run it after training, with your already pretrained model, to achieve optimal results.
Using the datafiles
All data files used to train deepSTABp can be found in the /runs/TransformerBasedTMPrediction/Datasets folder. The base folder contains the complete dataset; the training, testing, and validation folders contain the sampled datasets derived from the base dataset. The datasets are available in CSV and Parquet format.
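For a quick look at the CSV versions, the standard library is enough. The column names below are purely illustrative placeholders (check the actual files in runs/TransformerBasedTMPrediction/Datasets for the real schema):

```python
import csv
import io

# Hypothetical excerpt mimicking the dataset layout; the real column
# names may differ from protein_id / sequence / tm.
sample = io.StringIO(
    "protein_id,sequence,tm\n"
    "P12345,MKVLLA,54.2\n"
    "Q67890,ACDEFG,61.7\n"
)

rows = list(csv.DictReader(sample))
mean_tm = sum(float(r["tm"]) for r in rows) / len(rows)

# The Parquet versions can be read with pandas (requires pyarrow or
# fastparquet), e.g.:
#   import pandas as pd
#   df = pd.read_parquet("runs/TransformerBasedTMPrediction/Datasets/base/...")
```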
ARCs
for details, see https://github.com/nfdi4plants/ARC-specification
ARC directory structure
/isa.investigation.xlsx
- main investigation file (in XLSX format), contains top-level information about the investigation and links to studies and assays
/LICENSE
- license file for the arc
/arc.cwl
- (optional) executes all runs
Studies
All input material or (external) data resources are considered studies (immutable input data)
/studies/<yourStudyName>
- folder for a study, name is arbitrary
/studies/<yourStudyName>/resources
- study resources
/studies/<yourStudyName>/protocols
- study protocol as text (e.g. SOP)
/studies/<yourStudyName>/isa.study.xlsx
- per-study ISA file, contains only info about this study (in XLSX format)
Assays
All measurement (self-generated) datasets are considered assays (immutable input data)
/assays/<yourAssayName>
- folder for an assay, name is arbitrary
/assays/<yourAssayName>/dataset
- assay dataset
/assays/<yourAssayName>/protocols
- assay protocol as text (e.g. SOP)
/assays/<yourAssayName>/isa.assay.xlsx
- per-assay ISA file, contains only info about this assay (in XLSX format)
Workflows
All programmatic components of an ARC should go here, in the form of code and environment.
/workflows/<yourWorkflowName>/
- folder for code and its environment (workflow specifications); contains all files needed to specify a workflow. Packages and other includes should also go here.
/workflows/<yourWorkflowName>/Dockerfile
- top-level Dockerfile [optional] in which all code/workflow should execute
/workflows/<yourWorkflowName>/[codefiles;...]
- code files/scripts for computation
Runs
Runs are all artifacts that result from computation on studies, assays, or other runs.
/runs/<yourRunResultName>
- folder for the results of your run, i.e., a workflow execution
/runs/<yourRunResultName>/[files;...]
- run result files (plots, tables, etc.)
/runs/<yourRunResultName>.cwl
- executes a script or workflow and produces run results accordingly.