Prediction of Protospacer Adjacent Motif (PAM) dependencies in CRISPR-Cas9 systems and design of novel Cas9 with broad PAM compatibility




Pham, Tyler
Madugula, Sita Sirisha
Liu, Jin





Purpose: CRISPR-Cas9 gene editing faces several limiting factors and biological constraints that have slowed its rapid adoption and widespread use. Of particular interest is the Cas9 protein’s need to recognize a specific Protospacer Adjacent Motif (PAM) DNA sequence before it can edit a target. This recognition constraint limits the scope of targetable regions and leaves extensive sections of DNA inaccessible to the Cas9 protein. Given the multitude of Cas9 species, fully characterizing the wide variation in this relationship has become a challenge. Machine learning (ML) applications have increasingly been developed to discern obscure patterns and relationships and to aid in the analysis and design of the next generation of proteins. In this project, we hypothesize that the relationship between PAM-interacting (PAM-I) domain sequences and their corresponding PAM DNA sequences can be modeled computationally to predict new PAM DNA sequences and novel PAM-I protein domain sequences. Specifically, our model attempts to capture this relationship directly with a sequence-based approach linking amino acid and DNA sequences. Integrating computational methods into the study of biological function can help overcome the innate constraints of the CRISPR-Cas9 system and provide a pipeline toward modern protein engineering.
Method: Protein and DNA sequence data were extracted from public databases. Cas9 protein sequences for various species were obtained from EMBL-UniProt queries. CRISPR sequences for each protein were obtained from NCBI-GenBank, from the protein’s respective genomic DNA matched by accession. PAM DNA sequences were collected from predictive alignments using NCBI-BLAST. Our model uses a transformer architecture that applies text embedding to the sequence data.
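To illustrate what text embedding on sequence data can involve, the following is a minimal sketch of one possible preprocessing step: turning a PAM-I domain sequence and a PAM DNA sequence into a single list of integer token ids that a transformer embedding layer could consume. The abstract does not specify the tokenization scheme, so the character-level vocabulary, the special tokens, and the function names `build_vocab` and `encode_pair` are assumptions made for illustration only, not the authors' implementation.

```python
# Hypothetical character-level tokenizer for (protein domain, PAM DNA) pairs.
# The vocabulary layout and special tokens are assumptions, not the authors' scheme.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard amino acids
NUCLEOTIDES = "ACGT"
SPECIAL = ["<pad>", "<cls>", "<sep>", "<unk>"]


def build_vocab():
    """Map every token to an integer id; prefixes keep 'A' (alanine) distinct from 'A' (adenine)."""
    tokens = (
        SPECIAL
        + ["aa:" + a for a in AMINO_ACIDS]
        + ["nt:" + n for n in NUCLEOTIDES]
    )
    return {tok: i for i, tok in enumerate(tokens)}


VOCAB = build_vocab()


def encode_pair(domain_seq, pam_seq, max_len=None):
    """Encode as <cls> domain <sep> PAM, optionally truncated or padded to max_len."""
    unk = VOCAB["<unk>"]
    ids = [VOCAB["<cls>"]]
    ids += [VOCAB.get("aa:" + c, unk) for c in domain_seq.upper()]
    ids.append(VOCAB["<sep>"])
    ids += [VOCAB.get("nt:" + c, unk) for c in pam_seq.upper()]
    if max_len is not None:
        # Truncate long inputs, then right-pad short ones with <pad>.
        ids = ids[:max_len] + [VOCAB["<pad>"]] * max(0, max_len - len(ids))
    return ids


if __name__ == "__main__":
    # Toy fragment for demonstration only; not a real PAM-I domain.
    print(encode_pair("KRYT", "TGG", max_len=12))
```

Prefixing tokens by alphabet is one way to keep the amino acid and nucleotide letters from colliding in a shared vocabulary; separate vocabularies or k-mer tokens would be equally plausible choices.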
Results: The final database contains 795 unique Cas9 protein sequences, from which the corresponding PAM-I domains were extracted. From their respective genomic DNA, a total of 18,445 CRISPR sequences were found, from which we aligned and collected a large set of PAM DNA targets for each protein species. With this collection of Cas9 protein domain sequences and their associated PAM DNA sequences, we have trained and tested a novel ML model to discern and classify the relationship between the two. To extend this work, a similar transformer ML model will be developed to generate unique protein domain sequences capable of recognizing PAM DNA sequence targets. Final accuracy results for the prediction and generation models are pending, with an expected accuracy of at least 50% for both models.
Conclusion: Because no such resource was publicly available, a unique database of PAM-I protein domains and their associated PAM DNA sequences has been developed and curated to support the development and testing of our novel ML models. The outcomes of this project could be integrated directly into a modern protein engineering pipeline to build and test new libraries of Cas9 proteins.