Browsing by Author "Wang, Shouyi"

Design of Man-made Miniature CRISPR-Cas Proteins Using Computational and Artificial Intelligence Technologies
(2023) Jayasinghe-Arachchige, Vindi; Madugula, Sita Sirisha; Nammi, Bharani; Nukala, Nihitha; Wang, Shouyi; Liu, Jin
Purpose: The CRISPR/Cas system is a popular genome editing technique that relies on a guide RNA and specific proteins known as Cas proteins. A major challenge in harnessing CRISPR-Cas technology for applications in living organisms is the lack of an efficient delivery system. Because the available Cas proteins are large, it is difficult to package the CRISPR components into a single delivery vehicle. To address this issue, we used computational and artificial intelligence (AI) tools to design compact Cas proteins that have a similar function to, and are more efficient than, available Cas proteins. Methods: The available crystal structures of the smallest CRISPR-Cas systems were used as starting points and reduced further. A novel method termed the "Blocks and Gaps approach" was employed to design new mini-Cas proteins 450-500 amino acids in length. The generated protein sequences (1 million) were then passed through two machine learning-based classification models to filter out non-Cas proteins. The resulting Cas protein sequences were submitted to homology-modeling-based (SWISS-MODEL) and AI-based (AlphaFold2) protein structure prediction methods to obtain their 3D structures. The global and local structural features and the solubility of these proteins were then analyzed, and top candidates were subjected to molecular dynamics (MD) simulations that included substrate DNA and gRNA. Results/Conclusions: A library of man-made miniature Cas proteins was generated; these proteins are less than half the size of widely used CRISPR-Cas proteins such as Cas9 or Cas12a. 50% of these were predicted to be Cas proteins by both machine learning-based classification models. 90% of them show 3D structures similar to those of their original counterparts, and 10% of these passed the final validations. Experimental testing of the activity of the designed proteins remains to be carried out.
Identification of Family-Specific Features in Cas9 and Cas12 Proteins: A Machine Learning Approach Using Complete Protein Feature Spectrum
(Cold Spring Harbor Laboratory, 2024-02-08) Madugula, Sita S.; Pujar, Pranav; Bharani, Nammi; Wang, Shouyi; Jayasinghe-Arachchige, Vindi M.; Pham, Tyler; Mashburn, Dominic; Artilis, Maria; Liu, Jin
The recent development of CRISPR-Cas technology holds promise for correcting gene-level defects in genetic diseases. The key element of the CRISPR-Cas system is the Cas protein, a nuclease that can edit the gene of interest with the assistance of a guide RNA. However, Cas proteins suffer from inherent limitations such as large size, low cleavage efficiency, and off-target effects, which hinder their widespread application as gene editing tools. There is therefore a need to identify novel Cas proteins with improved editing properties, which in turn requires an understanding of the underlying features governing the Cas families. In the current study, we aim to elucidate the unique protein attributes associated with the Cas9 and Cas12 families and to identify the features that distinguish each family from the other. We built Random Forest (RF) binary classifiers to distinguish Cas12 and Cas9 proteins, respectively, from non-Cas proteins, using the complete protein feature spectrum (13,495 features) encoding physicochemical, topological, constitutional, and coevolutionary information of Cas proteins. We also built multiclass RF classifiers differentiating Cas9, Cas12, and non-Cas proteins. All models were evaluated on the test and independent datasets. The Cas12 and Cas9 binary models achieved overall accuracies of 95% and 97% on their respective independent datasets, while the multiclass classifier achieved an F1 score of 0.97. We observed that Quasi-sequence-order descriptors such as Schneider-lag descriptors and Composition descriptors such as charge, volume, and polarizability are essential for the Cas12 family. More interestingly, we discovered that Amino Acid Composition descriptors, especially the Tripeptide Composition (TPC) descriptors, are important for the Cas9 family. Four of the important descriptors identified for Cas9 classification are the tripeptides PWN, PYY, HHA, and DHI, which are conserved across all the Cas9 proteins and are located within different catalytically important domains of the Cas9 protein structure. Of these four, DHI and HHA are well known to be involved in the DNA cleavage activity of the Cas9 protein. We therefore propose that the other two tripeptides, PWN and PYY, may also be essential for the Cas9 family. The important descriptors we identified enhance the understanding of the catalytic mechanisms of Cas9 and Cas12 proteins and provide valuable insights into the design of novel Cas systems with enhanced gene-editing properties.
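As an illustration of the Tripeptide Composition (TPC) descriptors mentioned in this abstract, the following is a minimal sketch and not the authors' feature pipeline. It assumes the common definition of TPC as the count of each tripeptide divided by the number of overlapping three-residue windows; the function name `tripeptide_composition` and the toy sequence are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def tripeptide_composition(sequence):
    """Return the fraction of each of the 8,000 possible tripeptides.

    Assumption: frequency = count / number of valid overlapping windows.
    The abstract does not state the exact normalization the authors used.
    """
    seq = sequence.upper()
    windows = [seq[i:i + 3] for i in range(len(seq) - 2)]
    # Skip windows containing non-standard residue codes (e.g. X, B, Z).
    valid = [w for w in windows if all(c in AMINO_ACIDS for c in w)]
    counts = Counter(valid)
    total = len(valid)
    return {
        "".join(tri): (counts["".join(tri)] / total if total else 0.0)
        for tri in product(AMINO_ACIDS, repeat=3)
    }


# Toy sequence (not a real Cas9 fragment) containing DHI twice and PWN once.
tpc = tripeptide_composition("DHIPWNDHI")
print(tpc["DHI"], tpc["PWN"], tpc["PYY"])
```

Each sequence yields an 8,000-dimensional vector under this definition, which would form one part of a larger feature spectrum such as the 13,495 features reported above.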
Machine Learning Based Classification of CRISPR-Cas Proteins Using Complete Protein Spectrum
(2023) Madugula, Sita Sirisha; Arachchige, Vindi Mahesha Jayasinghe; Pham, Tyler; Nammi, Bharani; Wang, Shouyi; Liu, Jin
Purpose: Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) and CRISPR-associated (Cas) proteins together form the CRISPR-Cas system. The CRISPR-Cas system typically forms the machinery for an innate defense mechanism in prokaryotes against foreign genetic elements such as phages and plasmids. The recent development of this mechanism into a gene editing technology holds promise for correcting gene-level defects in several genetic diseases. The key element of the CRISPR-Cas system is the Cas protein, a nuclease that can edit a gene of interest. Different types of Cas proteins are involved in different CRISPR-Cas systems. Cas proteins, however, suffer from inherent limitations such as limited specificity and off-target effects, which limit their widespread application as gene editing tools. In the current study, a novel method has been developed for classifying the Cas9 and Cas12 families. Existing classification tools have low overall accuracy and are usually built using only a few types of protein features. We also attempt to understand the protein features governing the Cas9 and Cas12 classes using a multitude of protein features. Method: We built Random Forest (RF) binary classifiers to classify Cas12 and Cas9 proteins, respectively, using the complete spectrum of protein features (13,495 features) encoding physicochemical, constitutional, and evolutionary information. Additionally, we built multiclass RF classifiers that differentiate between Cas9, Cas12, and non-Cas proteins. The performance of all models was evaluated using 5-fold cross-validation and six evaluation metrics: accuracy, precision, recall, F1-score, AUC score, and specificity. We also tested our models on the respective independent datasets, which were developed in-house from various public domain databases. Results: The Cas12 and Cas9 models achieved overall accuracies of 0.97 and 0.96 on their independent datasets, respectively, while the multiclass classifier achieved an F1 score of 1.0. We observed that amino acid composition, Quasi-sequence-order, and Composition-based protein features are particularly important for the Cas12 and Cas9 families of proteins. Conclusions: We successfully built classification models for the Cas12 and Cas9 protein families and identified the protein features that are unique to each family. These features enhance the understanding of the structure and functions of Cas9 and Cas12 proteins and provide valuable insights into plausible structural modifications of these proteins to achieve enhanced specificity and reduced off-target effects.
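For readers unfamiliar with the evaluation metrics listed in this abstract, the sketch below shows how five of the six (accuracy, precision, recall, F1-score, and specificity) follow from a binary confusion matrix. It is an illustrative example only: the function name `binary_metrics` and the toy labels are hypothetical, label 1 is assumed to mean "Cas" and 0 "non-Cas", and AUC is omitted because it needs predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix metrics for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy labels, not data from the study.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

In a 5-fold cross-validation setting, a function like this would be applied to the held-out fold in each of the five splits and the results averaged.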