Selection of Highly Informative Markers for Apportionment of Ancestry and Population Affiliation




Zeng, Xiangpei




Ancestry informative markers (AIMs) can be used to detect and adjust for population stratification and to predict the ancestry of the source of an evidence sample. Autosomal single nucleotide polymorphisms (SNPs) are the best candidates for AIMs. It is essential to identify the most informative AIM SNPs across relevant populations. Several informativeness measures for ancestry estimation have been used for AIM selection: absolute allele frequency difference (δ), the F statistic (FST), and the informativeness for assignment measure (In). However, their efficacy has not been compared objectively, particularly for determining affiliations of major US populations. This doctoral dissertation research was conducted under the hypotheses that δ and FST perform better than In and that highly informative AIMs can be selected among human populations by using these three marker informativeness measures. The primary goal of this project was to develop a robust AIM panel with a minimum number of markers that can be used for apportionment of ancestry and population affiliation of four major US populations: African American, US Caucasian, East Asian, and Hispanic American. First, candidate SNPs were searched and downloaded from the HapMap Project. These SNPs were then ranked for their informativeness based on the three measures (δ, FST, and In) in a population pairwise manner. FST appeared to be the most informative measure, performing slightly better than δ. With this approach and population statistics assessment, a minimum set of 23 AIMs was selected to characterize the four major American populations. The efficacy of these 23 SNPs was tested in silico using nine populations from the HapMap Project and the 1000 Genomes Project. Finally, empirical testing was performed using 189 individuals collected from four US populations to further evaluate the performance of the 23-AIM panel.
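To make the pairwise ranking step concrete, the following is a minimal Python sketch of the three informativeness measures for a biallelic SNP compared between two populations. It is an illustration under stated assumptions, not the dissertation's implementation: FST is computed here with a simple heterozygosity-based form using equal population weights (the abstract does not say which estimator was used), In follows the standard definition with natural logarithms, and the SNP names and allele frequencies are hypothetical.

```python
import math


def _xlogx(x):
    """Return x * ln(x), taking the limit 0 at x == 0."""
    return x * math.log(x) if x > 0 else 0.0


def delta(p1, p2):
    """Absolute allele frequency difference between two populations."""
    return abs(p1 - p2)


def fst(p1, p2):
    """Heterozygosity-based FST = (H_T - H_S) / H_T, equal weights (assumed form)."""
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)          # expected heterozygosity, pooled
    if h_t == 0:
        return 0.0                         # monomorphic in both populations
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population
    return (h_t - h_s) / h_t


def informativeness(p1, p2):
    """Informativeness for assignment (In) for a biallelic SNP, two populations."""
    total = 0.0
    # Loop over the two alleles: frequencies (p1, p2) and (1 - p1, 1 - p2).
    for a1, a2 in ((p1, p2), (1 - p1, 1 - p2)):
        mean = (a1 + a2) / 2
        total += -_xlogx(mean) + (_xlogx(a1) + _xlogx(a2)) / 2
    return total


# Hypothetical allele frequencies in two populations (illustration only).
snps = {"snp_a": (0.95, 0.05), "snp_b": (0.60, 0.40), "snp_c": (0.80, 0.30)}

for name, measure in (("delta", delta), ("FST", fst), ("In", informativeness)):
    ranked = sorted(snps, key=lambda s: measure(*snps[s]), reverse=True)
    print(name, ranked)
```

All three measures are zero when the two populations share the same allele frequency and reach their maximum (1 for δ and FST, ln 2 for In) when the populations are fixed for different alleles; they can rank intermediate markers differently, which is why comparing them is informative.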
The results of this dissertation research indicated that these 23 AIMs can correctly assign individuals to the major population categories in silico. Empirical testing showed that one SNP (rs12149261) on chromosome 16 had a duplicated region on chromosome 1. This SNP was removed from the panel to avoid erroneous results. The resultant 22-AIM panel was able to resolve the four major populations, as in the in silico study. Principal component analysis (PCA) showed that eight individuals were not assigned to the expected major population categories. The assignments from the 22 AIMs for these samples were consistent with the AIM results from the ForenSeq™ panel. No departures from Hardy-Weinberg equilibrium (HWE) and no linkage disequilibrium (LD) were detected for any of the 22 SNPs in the four US populations (after removing the eight problematic samples). The results indicated that the 22 AIMs can correctly assign individuals to the four major US population categories. These 22 SNPs could contribute to the candidate pool of AIMs for potential forensic identification purposes and for population stratification studies in biomedical research in the major US populations.
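As an illustration of the HWE check mentioned above, the sketch below applies a chi-square goodness-of-fit test to genotype counts at a single biallelic SNP. The abstract does not state which test was actually used (exact tests are often preferred for small samples), so this is an assumed, simplified approach; the function name and genotype counts are hypothetical.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic (1 degree of freedom) for departure from HWE."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)        # frequency of allele A
    q = 1 - p                              # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic SNP) to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


CRITICAL_0_05 = 3.841  # chi-square critical value, 1 df, alpha = 0.05

# Hypothetical genotype counts (AA, AB, BB) for one SNP in one population.
stat = hwe_chi_square(12, 22, 13)
print("departure from HWE" if stat > CRITICAL_0_05 else "consistent with HWE")
```

When many SNPs and populations are tested together, as with a 22-SNP panel across four populations, a multiple-testing correction would normally be applied to the significance threshold.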