deepSTABp
ARC for the paper DeepSTABp: A deep learning approach for the prediction of thermal protein stability. deepSTABp is a protein melting temperature (Tm) predictor developed to overcome the limitations of classical experimental approaches, which are expensive, labour-intensive, and offer only limited proteome and species coverage.
DeepSTABp uses a transformer-based protein language model for sequence embedding and state-of-the-art feature extraction, combined with other deep learning techniques, for end-to-end protein Tm prediction.
Usage
deepSTABp can be used directly via the web interface at: deepSTABp
Alternatively, you can clone this ARC and run it locally.
Setup environment
You can create a conda environment directly from the environment.yml file located in /workflows:
conda env create --name deepstabp --file environment.yml
Keep in mind that you must download the complete ARC, which requires git and git-lfs as well as sufficient disk space: the complete ARC is around 67 GB including all LFS files.
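A typical local setup might look like this (the repository URL and folder name are placeholders for the actual ARC remote):
git lfs install                 # one-time setup of the Git LFS hooks
git clone <ARC-repository-URL>
cd <ARC-folder>
git lfs pull                    # download the LFS payloads (~67 GB)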
Running deepSTABp
Afterwards you can use predict_cpu.py or predict_cuda.py in workflows/prediction_model. Navigate to that folder in your command line and activate the conda environment you created.
conda activate deepstabp
Then run either predict_cpu.py or predict_cuda.py, depending on your hardware, followed by the path of the FASTA file you want to analyse:
python predict_cpu.py <filepath>
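For reference, a FASTA input is a plain-text file with header lines starting with > followed by sequence lines; a minimal, purely hypothetical example:
>example_protein_1
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ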
You can specify the growth temperature and protein environment in the following way:
python predict_cpu.py <filepath> -g <growth_temp:int> -m <measurement_condition:str>
The only available measurement conditions are "Lysate" and "Cell". The defaults for growth temperature and measurement condition are 22 °C and "Cell", respectively. If you want to store the output file, you can specify the save path with the -s flag:
python predict_cpu.py <filepath> -g <growth_temp:int> -m <measurement_condition:str> -s <save_path:str>
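For example, to predict Tm values for a hypothetical input file proteome.fasta measured in lysate at a growth temperature of 28 °C and save the results (file names here are placeholders):
python predict_cpu.py proteome.fasta -g 28 -m Lysate -s results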
Training your own model
In case you want to retrain deepSTABp, simply run the MLP_training.py file located in workflows/TransformerBasedTMPrediction/. You can also experiment with different architectures by directly editing the model structure defined in MLP_training.py. The other file in workflows/TransformerBasedTMPrediction/ is tuning.py; run it after training, with your already pretrained model, to achieve optimal results (see the sketch below).
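A minimal sketch of the retraining steps, assuming both scripts run with their built-in defaults (check the files themselves for configurable paths and hyperparameters):
cd workflows/TransformerBasedTMPrediction
python MLP_training.py    # retrain the model on the provided datasets
python tuning.py          # tune the pretrained model for optimal results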
Using the datafiles
All datafiles that were used to train deepSTABp can be found in the /runs/TransformerBasedTMPrediction/Datasets folder. The base folder contains the complete dataset; the training, testing, and validation folders contain the sampled datasets derived from the base dataset. The datasets are available in CSV and Parquet format.
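A minimal sketch for loading one of the splits with pandas (the file name inside the split folder is an assumption; adjust it to the actual files in the ARC):
import pandas as pd
# "training.parquet" is a placeholder name for the file in the training folder
train = pd.read_parquet("runs/TransformerBasedTMPrediction/Datasets/training/training.parquet")
# the CSV variant can be read analogously:
# train = pd.read_csv("runs/TransformerBasedTMPrediction/Datasets/training/training.csv")
print(train.shape)
print(train.head())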
ARCs
For details, see https://github.com/nfdi4plants/ARC-specification
ARC directory structure
/isa.investigation.xlsx
- main investigation file (in XLSX format), contains top-level information about the investigation and links to studies and assays
/LICENSE
- license file for the ARC
/arc.cwl
- (optional) executes all runs
Studies
All input material or (external) data resources are considered studies (immutable input data)
/studies/<yourStudyName>
- folder for a study, name is arbitrary
/studies/<yourStudyName>/resources
- study dataset
/studies/<yourStudyName>/protocols
- study protocol as text (e.g. SOP)
/studies/<yourStudyName>/isa.study.xlsx
- per-study ISA file, contains only info about this study (in XLSX format)
Assays
All measurement (self-generated) datasets are considered assays (immutable input data)
/assays/<yourAssayName>
- folder for an assay, name is arbitrary
/assays/<yourAssayName>/dataset
- assay dataset
/assays/<yourAssayName>/protocols
- assay protocol as text (e.g. SOP)
/assays/<yourAssayName>/isa.assay.xlsx
- per-assay ISA file, contains only info about this assay (in XLSX format)
Workflows
All programmatic components of an ARC should go here, in the form of code and environment.
/workflows/<yourWorkflowName>/
- folder for code and its environment (workflow specifications); contains all files needed to specify a workflow. Packages and other dependencies should also go here.
/workflows/<yourWorkflowName>/Dockerfile
- top-level Dockerfile [optional] in which all code/workflow should execute
/workflows/<yourWorkflowName>/[codefiles;...]
- code files/scripts for computation
Runs
Runs are all artifacts that result from computation on studies, assays, or other runs.
/runs/<yourRunResultName>
- folder for results of your run, i.e. a workflow execution
/runs/<yourRunResultName>/[files;...]
- run result files (plots, tables, etc.)
/runs/<yourRunResultName>.cwl
- executes a script or workflow and produces run results accordingly.