Machine Learning Based Classification of CRISPR-Cas Proteins Using Complete Protein Spectrum




Madugula, Sita Sirisha
Arachchige, Vindi Mahesha Jayasinghe
Pham, Tyler
Nammi, Bharani
Wang, Shouyi
Liu, Jin


0000-0001-9944-117X (Madugula, Sita Sirisha)




Purpose: Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) and their associated (Cas) proteins together form the CRISPR-Cas system. The CRISPR-Cas system typically forms the machinery for an innate defense mechanism in prokaryotes against foreign genetic elements such as phages and plasmids. The recent development of this mechanism into a gene editing technology holds promise for correcting gene-level defects in several genetic diseases. The key elements of the CRISPR-Cas system are the Cas proteins, which are nucleases able to edit a gene of interest. Different types of Cas proteins are involved in different CRISPR-Cas systems. Cas proteins, however, suffer from inherent limitations such as limited specificity and off-target effects, which restrict their widespread application as gene editing tools. In the current study, a novel method has been developed for classifying the Cas9 and Cas12 families. Existing classification tools have low overall accuracy and are usually built using only a few types of protein features. We also attempt to understand the different protein features governing the Cas9 and Cas12 classes using a multitude of protein features.
Method: We built Random Forest (RF) binary classifiers to classify Cas12 and Cas9 proteins, respectively, using the complete spectrum of protein features (13,495 features) encoding physicochemical, constitutional, and evolutionary information. Additionally, we built multiclass RF classifiers that differentiate between Cas9, Cas12, and non-Cas proteins. The performance of all models was evaluated using 5-fold cross-validation and six evaluation metrics: accuracy, precision, recall, F1-score, AUC score, and specificity. We also tested our models on the respective independent datasets, which were developed in-house from various public domain databases.
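As an illustration of the six evaluation metrics named in the Method, the following minimal sketch computes them for a binary classifier from true labels and predicted scores. It is not the authors' code: the function name `binary_metrics`, the 0.5 decision threshold, and the toy labels and scores are assumptions made only for this example, and AUC is computed here with the pairwise ranking definition rather than any particular library routine.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int],
                   y_score: Sequence[float],
                   threshold: float = 0.5) -> Dict[str, float]:
    """Compute six metrics for a binary classifier (1 = positive class)."""
    # Turn scores into hard predictions (threshold is an assumed default).
    y_pred = [1 if s >= threshold else 0 for s in y_score]

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    accuracy = ratio(tp + tn, len(y_true))
    f1 = ratio(2 * precision * recall, precision + recall)

    # AUC: probability that a random positive outscores a random negative,
    # with ties counted as one half.
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = ratio(wins, len(pos) * len(neg))

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "auc": auc,
        "specificity": specificity,
    }


# Toy data (hypothetical, not from the study).
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
toy_metrics = binary_metrics(toy_labels, toy_scores)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.4f}")
```

In a 5-fold cross-validation setting, a function like this would be applied to the held-out fold of each split and the per-fold values averaged.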
Results: The Cas12 and Cas9 models achieved high overall accuracies of 0.97 and 0.96, respectively, on their independent datasets, while the multiclass classifier achieved a high F1 score of 1.0. We observed that amino acid composition, Quasi-sequence-order, and Composition-based protein features are particularly important for the Cas12 and Cas9 families of proteins.
Conclusions: We successfully built classification models for the Cas12 and Cas9 protein families and identified the protein features that are unique to each family. These features enhance the understanding of the structure and function of Cas9 and Cas12 proteins and provide valuable insights into plausible structural modifications of these proteins to achieve enhanced specificity and reduced off-target effects.
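Amino acid composition, one of the feature families highlighted in the Results, is commonly defined as the fraction of each of the 20 standard amino acids in a sequence. The sketch below shows that standard definition only. The record does not describe the feature extraction pipeline used in the study, so the function name `amino_acid_composition`, the handling of non-standard residues, and the toy peptide are assumptions for illustration. Other descriptor families, such as quasi-sequence-order, are more involved and are not shown.

```python
from collections import Counter
from typing import Dict

# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> Dict[str, float]:
    """Return the fraction of each standard amino acid in a protein sequence."""
    seq = sequence.upper()
    # Assumption: non-standard symbols (e.g. X, B, Z, gaps) are ignored.
    counts = Counter(ch for ch in seq if ch in STANDARD_AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # Always emit all 20 entries so every protein maps to a fixed-length vector.
    return {aa: counts[aa] / total for aa in STANDARD_AMINO_ACIDS}


# Toy peptide (hypothetical, not a real Cas sequence).
toy_peptide = "MKKLLAAG"
toy_composition = amino_acid_composition(toy_peptide)
print({aa: frac for aa, frac in toy_composition.items() if frac > 0})
```

Each protein then contributes a 20-dimensional vector that sums to 1, which can be concatenated with other descriptors to form the input of a classifier.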