Browsing by Author "Madugula, Sita Sirisha"

Design of Man-made Miniature CRISPR-Cas Proteins Using Computational and Artificial Intelligence Technologies
(2023) Jayasinghe-Arachchige, Vindi; Madugula, Sita Sirisha; Nammi, Bharani; Nukala, Nihitha; Wang, Shouyi; Liu, Jin
Purpose: The CRISPR/Cas system is a popular genome editing technique that relies on a guide RNA and specific proteins known as Cas proteins. A major challenge in harnessing CRISPR-Cas technology for applications in living organisms is the lack of an efficient delivery system. Because the available Cas proteins are large, it is difficult to encapsulate all CRISPR components in a single delivery vehicle. To address this issue, we used computational and artificial intelligence (AI) tools to design compact Cas proteins that have a similar function and are more efficient than available Cas proteins. Methods: The available crystal structures of the smallest CRISPR-Cas systems were used as starting points and further reduced in size. A novel method termed the "Blocks and Gaps approach" was employed to design new mini-Cas proteins 450-500 amino acids in length. The generated protein sequences (1 million) were then passed through two machine learning-based classification models to filter out non-Cas proteins. The resulting Cas protein sequences were submitted to homology-modeling-based (SWISS-MODEL) and AI-based (AlphaFold2) protein structure prediction methods to obtain their 3D structures. The global and local structural features and the solubility of these proteins were then analyzed, and top candidates were subjected to molecular dynamics (MD) simulations that included substrate DNA and gRNA. Results/Conclusions: A library of man-made miniature Cas proteins was generated; these proteins are less than half the size of widely used CRISPR-Cas proteins such as Cas9 or Cas12a. Of these, 50% were predicted to be Cas proteins by both machine learning-based classification models. 90% of them show 3D structures similar to those of their original counterparts, and 10% of these passed the final validations.
Experimental testing of the activity of these designed proteins remains to be carried out at this stage of the study.
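The filtering step described in the Methods above (keeping only generated sequences that fall in the target length range and that both classifiers label as Cas) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `consensus_filter`, `clf_a`, and `clf_b` are hypothetical names, and each classifier is a stand-in callable that returns True for a predicted Cas protein.

```python
from typing import Callable, Iterable, List

# Hypothetical classifier interface: returns True if the sequence is predicted to be Cas.
Classifier = Callable[[str], bool]


def consensus_filter(
    sequences: Iterable[str],
    clf_a: Classifier,
    clf_b: Classifier,
    min_len: int = 450,
    max_len: int = 500,
) -> List[str]:
    """Keep sequences inside the length window that BOTH classifiers accept."""
    kept = []
    for seq in sequences:
        if not (min_len <= len(seq) <= max_len):
            continue  # outside the 450-500 residue design window
        if clf_a(seq) and clf_b(seq):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    # Toy stand-ins for trained models (assumptions, for illustration only).
    toy_a = lambda s: "H" in s
    toy_b = lambda s: "D" in s
    candidates = ["M" + "A" * 458 + "HD", "M" + "A" * 460, "MHD"]
    print(len(consensus_filter(candidates, toy_a, toy_b)))
```

Requiring agreement between both models trades recall for precision, which suits a setting where one million candidates must be narrowed before expensive structure prediction.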
Leveraging Graph Attention Mechanisms to Create an Explainable Multi-Function Machine Learning Model
(2024-03-21) Mathew, Ezek; Madugula, Sita Sirisha; Emmitte, Kyle; Liu, Jin
Purpose: Identifying target-specific ligands is a difficult task, especially when receptors display high structural similarity. Such is the case for metabotropic glutamate receptor subtype 2 (mGlu2) and metabotropic glutamate receptor subtype 3 (mGlu3), which are prime targets for various neurological treatments. However, signal transduction through these two receptors often yields opposing physiological functions and differentially affects pathologies. Methods: To differentiate ligands based on their binding to mGlu2 and mGlu3, we employed a machine learning (ML) approach. The ML model performed three distinct tasks and leveraged transfer learning to inform each subsequent task. Task 1: simple classification, in which the ML model predicted whether a ligand displayed selectivity for the mGlu2 or mGlu3 class. Task 2: regression, in which the ML model estimated the IC50 values of individual input ligands. The classification weights from Task 1 were broadcast into the attention layers of the ML model for Task 2, serving as a starting point. Task 3: classification, in which the ML model sought to determine whether a ligand displayed low or high potency for the target class. Classification weights and regression weights from the previous tasks were broadcast into the model. Results: The model yielded greater than 99% accuracy in the selectivity classification task, while also delivering satisfactory performance when predicting potency (72.80% error). The model yielded 83% accuracy in correctly identifying high-potency mGlu2 ligands as high-potency mGlu2 compounds. Meanwhile, the algorithm displayed 75% accuracy in correctly identifying high-potency mGlu3 ligands as high-potency mGlu3 compounds. Conclusions: This approach allows for prediction of multiple target properties using a single model.
With access to other high-quality datasets, this model could be applied to other ligand classes of interest, offering great potential for drug repurposing studies.
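The idea of carrying weights from one task into the next can be illustrated with a framework-free sketch. Everything here is an assumption for illustration: the abstract does not specify the framework, layer names, or shapes, so `warm_start`, the `"attention"` name prefix, and the nested-list weight format are hypothetical. The sketch copies only layers whose names match the prefix and whose shapes agree, leaving task-specific output heads untouched.

```python
from copy import deepcopy
from typing import Dict, List, Tuple


def shape(w) -> Tuple[int, ...]:
    """Return the shape of a (rectangular) nested list as a tuple."""
    dims = []
    while isinstance(w, list):
        dims.append(len(w))
        w = w[0] if w else None
    return tuple(dims)


def warm_start(
    target: Dict[str, list],
    source: Dict[str, list],
    shared_prefix: str = "attention",  # hypothetical naming convention
) -> Tuple[Dict[str, list], List[str]]:
    """Initialise `target` from `source` for matching shared layers.

    Returns the new parameter dict and the names of the layers copied.
    """
    out = deepcopy(target)
    copied = []
    for name, weights in source.items():
        same_layer = name.startswith(shared_prefix) and name in out
        if same_layer and shape(out[name]) == shape(weights):
            out[name] = deepcopy(weights)
            copied.append(name)
    return out, copied


if __name__ == "__main__":
    task1 = {"attention.0": [[1.0, 2.0], [3.0, 4.0]], "head": [[0.5], [0.5]]}
    task2 = {"attention.0": [[0.0, 0.0], [0.0, 0.0]], "head": [[0.0, 0.0], [0.0, 0.0]]}
    params, copied = warm_start(task2, task1)
    print(copied)
```

The copied weights serve only as a starting point; the downstream task would still be trained on its own objective.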
Machine Learning Based Classification of CRISPR-Cas Proteins Using Complete Protein Spectrum
(2023) Madugula, Sita Sirisha; Arachchige, Vindi Mahesha Jayasinghe; Pham, Tyler; Nammi, Bharani; Wang, Shouyi; Liu, Jin
Purpose: Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) and CRISPR-associated (Cas) proteins together form the CRISPR-Cas system. The CRISPR-Cas system typically forms the machinery of an innate defense mechanism in prokaryotes against foreign genetic elements such as phages and plasmids. The recent development of this mechanism into a gene editing technology holds promise for correcting gene-level defects in several genetic diseases. The key elements of the CRISPR-Cas system are the Cas proteins, which are nucleases and can edit a gene of interest. Different types of Cas proteins are involved in different CRISPR-Cas systems. Cas proteins, however, suffer from inherent limitations such as limited specificity and off-target effects, which limit their widespread application as gene editing tools. In the current study, a novel method was developed for classifying the Cas9 and Cas12 families. Existing classification tools have low overall accuracy and are usually built using only a few types of protein features. We also attempt to understand the protein features governing the Cas9 and Cas12 classes using a multitude of protein features. Method: We built Random Forest (RF) binary classifiers to classify Cas12 and Cas9 proteins, respectively, using the complete spectrum of protein features (13,495 features) encoding physicochemical, constitutional, and evolutionary information. Additionally, we built multiclass RF classifiers that differentiate between Cas9, Cas12, and non-Cas proteins. The performance of all models was evaluated using 5-fold cross-validation and six evaluation metrics: accuracy, precision, recall, F1-score, AUC score, and specificity. We also tested our models on independent datasets that were developed in-house from various public domain databases.
Results: The Cas12 and Cas9 models achieved high overall accuracies of 0.97 and 0.96, respectively, on their independent datasets, while the multiclass classifier achieved an F1 score of 1.0. We observed that amino acid composition, quasi-sequence-order, and composition-based protein features are particularly important for the Cas12 and Cas9 protein families. Conclusions: We built classification models for the Cas12 and Cas9 protein families and identified the protein features that are unique to each family. These features enhance the understanding of the structure and function of Cas9 and Cas12 proteins and provide insights into plausible structural modifications of these proteins to achieve enhanced specificity and reduced off-target effects.
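The six evaluation metrics named in the Method can be computed from binary labels and scores as in the following sketch. It is an illustration of the standard definitions only, not the authors' evaluation code: `binary_metrics` and `roc_auc` are hypothetical helper names, labels are assumed to be 1 (Cas) or 0 (non-Cas), and AUC is computed through its pairwise-ranking form (the fraction of positive/negative pairs ranked correctly, counting ties as one half).

```python
from typing import Dict, Sequence


def _safe_div(num: float, den: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, F1 and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)
    return {
        "accuracy": _safe_div(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": _safe_div(2 * precision * recall, precision + recall),
        "specificity": _safe_div(tn, tn + fp),
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a 5-fold cross-validation, these functions would be applied to each held-out fold and the per-fold values averaged.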
Prediction of Ligand Selectivity and Efficacy Using Artificial Intelligence Algorithms
(2023) Mathew, Ezek; Wang, Duen-Shian; Liu, Kevin; Pham, Tyler; Madugula, Sita Sirisha; Emmitte, Kyle; Liu, Jin
Purpose: Identifying target-specific ligands is extremely challenging in drug discovery, especially when receptors display high structural similarity. Such is the case for metabotropic glutamate receptor subtype 2 (mGlu2) and metabotropic glutamate receptor subtype 3 (mGlu3), which are prime targets for various neurological treatments. However, signal transduction through these two receptors often yields opposing physiological functions and differentially affects pathologies. The purpose of this study is to develop artificial intelligence (AI) methods to predict ligand selectivity and efficacy on similar targets. Methods: To differentiate ligands based on their binding to mGlu2 and mGlu3, we employed a machine learning approach. Data from patent-derived datasets were pre-processed into an eight-dimensional vector space. The data were then flattened, and a Multiple Input and Output (MIO) model was designed to receive the incoming vectors. A classification arm was designated as one output, differentiating input structures as mGlu2 or mGlu3 ligands. In addition, this novel MIO model, built with a Functional application programming interface (API) architecture, can estimate the efficacy of an input ligand by outputting its half-maximal inhibitory concentration (IC50) value. Results: The model yielded greater than 96% accuracy in the classification task of predicting ligand binding selectivity, while simultaneously delivering satisfactory performance when predicting efficacy. With regard to the regression arm, the model attained about 81% accuracy in correctly identifying high-affinity mGlu2 compounds and 62% accuracy in correctly identifying high-affinity mGlu3 compounds. We then used docking studies and the trained model to screen an available ZINC database, selecting the top 39 compounds out of 9270 ligands.
Conclusions: This approach can pave the way for computer-aided searches that screen for high-efficacy ligands belonging to a certain class of interest. More specifically, this model can be used in combination with other established structure-based methodologies such as molecular docking, allowing even more drug candidates to be screened for further study and validation. With access to other high-quality datasets, this model could be applied to other ligand classes of interest, offering great potential for drug repurposing studies.
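One way a screening step of this kind could combine the model's two outputs with docking results is sketched below. The abstract does not say how the three signals were combined, so the ranking rule here is purely an assumption: ligands are first restricted to the predicted target class, then ordered by predicted IC50 (lower first), with the docking score as a tie-breaker. The `Prediction` record, its field names, and `select_top` are hypothetical.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Prediction:
    ligand_id: str
    predicted_class: str   # "mGlu2" or "mGlu3", from the classification arm
    predicted_ic50: float  # from the regression arm; units are an assumption
    docking_score: float   # hypothetical docking energy; lower is better


def select_top(preds: Iterable[Prediction], target: str, k: int) -> List[Prediction]:
    """Return up to k ligands predicted for `target`, lowest IC50 first.

    Ties in predicted IC50 are broken by the lower docking score.
    """
    on_target = (p for p in preds if p.predicted_class == target)
    return heapq.nsmallest(
        k, on_target, key=lambda p: (p.predicted_ic50, p.docking_score)
    )


if __name__ == "__main__":
    library = [
        Prediction("lig-a", "mGlu2", 12.0, -8.1),
        Prediction("lig-b", "mGlu3", 3.0, -9.4),
        Prediction("lig-c", "mGlu2", 5.0, -7.2),
        Prediction("lig-d", "mGlu2", 5.0, -8.8),
    ]
    for hit in select_top(library, target="mGlu2", k=2):
        print(hit.ligand_id)
```

With `k=39` and a library of 9270 entries, the same call would reproduce the scale of the selection reported above, though not necessarily the same compounds.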
Prediction of Protospacer Adjacent Motif (PAM) dependencies in CRISPR-Cas9 systems and design of novel Cas9 with broad PAM compatibility
(2023) Pham, Tyler; Madugula, Sita Sirisha; Liu, Jin
Purpose: CRISPR-Cas9 gene editing faces many limiting factors and biological constraints that have prevented its rapid adoption and mass utilization. Of particular interest is the Cas9 protein’s need to recognize a unique Protospacer Adjacent Motif (PAM) DNA sequence before it can edit a gene. This recognition constraint limits the scope of targetable regions of DNA and thus prevents the Cas9 protein from accessing extensive sections of DNA. Given the multitude of Cas9 species, it has become a challenge to fully comprehend the wide variations in this relationship. Machine learning (ML) applications have increasingly been developed to discern obscure patterns and relationships to aid in the analysis and design of the next generation of proteins. In this project, we hypothesize that the relationship between PAM-interacting (PAM-I) domain sequences and their corresponding PAM DNA sequences can be computationally understood in order to predict new PAM DNA sequences and novel PAM-I protein domain sequences. Specifically, our model attempts to capture this relationship directly with a sequence-based approach linking amino acid and DNA sequences. Applying computational technologies to the understanding of biological function can help overcome the innate constraints of the CRISPR-Cas9 system and provide a pipeline toward modern protein engineering. Method: Protein and DNA sequence data were extracted from public database sources. Cas9 protein sequences for various species were obtained from EMBL-UniProt queries. CRISPR sequences were obtained from NCBI GenBank, using the genomic DNA corresponding to each protein, matched by accession. PAM DNA sequences were gathered from predictive alignments using NCBI BLAST. Our model uses a transformer architecture that applies text embedding to the sequence data.
Results: The final database contains a total of 795 unique Cas9 protein sequences, from which the corresponding PAM-I domains were extracted. From their respective genomic DNA, a total of 18,445 CRISPR sequences were found. From these, we aligned and collected a large set of PAM DNA targets for each protein species. With our collection of Cas9 protein domain sequences and their associated PAM DNA sequences, we have trained and tested a novel ML model to discern and classify the relationship between the two associated sequences. To further expand on this relationship, a similar transformer ML model will be developed to methodically generate unique protein domain sequences capable of recognizing PAM DNA sequence targets. Final accuracy results for our prediction and generation models are pending; we expect both models to reach at least 50%. Conclusion: Given the absence of a publicly available equivalent, a unique database of protein PAM-I domains and their associated PAM DNA sequences has been developed and curated to facilitate the development and testing of our novel ML models. The results of this project can be integrated directly into a modern protein engineering pipeline to build and test new libraries of Cas9 proteins.
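Before a transformer can embed paired protein and DNA sequences as text, each sequence has to be turned into integer token IDs. The sketch below shows one plausible character-level scheme; it is an assumption for illustration, since the abstract does not describe the tokenizer. The vocabulary layout, the special tokens, and the names `build_vocab` and `encode_pair` are hypothetical. The nucleotide alphabet includes `N`, the IUPAC code for any base, because PAM motifs are commonly written with it (for example `NGG`).

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues
NUCLEOTIDES = "ACGTN"                 # N = any base (IUPAC)
PAD, UNK, SEP = "<pad>", "<unk>", "<sep>"


def build_vocab() -> Dict[str, int]:
    """Shared vocabulary; prefixes keep protein 'A' distinct from DNA 'A'."""
    tokens = [PAD, UNK, SEP]
    tokens += [f"aa:{c}" for c in AMINO_ACIDS]
    tokens += [f"nt:{c}" for c in NUCLEOTIDES]
    return {tok: i for i, tok in enumerate(tokens)}


def encode_pair(protein: str, pam: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Encode 'protein <sep> pam' as IDs, truncated or padded to max_len."""
    unk = vocab[UNK]
    ids = [vocab.get(f"aa:{c}", unk) for c in protein.upper()]
    ids.append(vocab[SEP])
    ids += [vocab.get(f"nt:{c}", unk) for c in pam.upper()]
    ids = ids[:max_len]                           # truncate long inputs
    ids += [vocab[PAD]] * (max_len - len(ids))    # right-pad short inputs
    return ids


if __name__ == "__main__":
    vocab = build_vocab()
    print(encode_pair("MK", "NGG", vocab, max_len=8))
```

The resulting fixed-length ID lists are what an embedding layer would consume; a classifier head would sit on top for prediction, and a decoder for generation.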