The effect of ascertainment bias on detecting signatures of selection
Genotyping arrays have been widely used to identify signatures of selection in genome-wide scans. It has been reported that the markers on arrays do not accurately represent the variation in full sequence data, especially in non-European populations, and that this may affect the results of selection studies. The availability of whole genome sequence (WGS) data from various African populations has made it possible to analyse the extent to which ascertainment bias affects the detection of selection signals on this continent. Seven commonly used genotyping arrays were represented by creating in silico single nucleotide polymorphism (SNP) panels from WGS data of the African Genome Variation Project (AGVP) Baganda, Ethiopia and Zulu samples. Four types of selection scans (FST, iHS, XP-EHH and Tajima’s D) were performed on both the array and WGS datasets, and the accuracy of selection signals identified from array data was assessed in relation to the WGS results. Selection scans performed with array data produced a significant proportion of false positive and false negative signals. The EHH-based methods were least affected by ascertainment bias, and arrays with higher marker density generally produced more accurate results. The two arrays ascertained from African populations outperformed a more European-based array of similar size. Variation in marker density across the genome was found to underlie discrepancies between array and WGS selection signals, as genomic regions containing fewer markers in the array data were less likely to be detected as selection signals. Of the selection signals identified from WGS but not array data, most were missed due to insufficient SNP density. To investigate the extent to which the selection signals from one Southeastern Bantu-speaking (SEB) group are shared by another SEB group, selection scans were performed on two independent SEB groups, namely the Bt20 and AGVP Zulu samples.
The overlap in selection signals between the samples was found to be limited, consistent with differential KhoeSan gene flow into these groups. The selection scan methods were differentially affected by ascertainment bias, and limited concordance was observed between the selection signals identified by different methods. A comparison of selection signals between the three AGVP populations revealed high population specificity of signals. Regions displaying signatures of selection were annotated for gene names and functionality, and both canonical and less well-established selection candidates were identified. These included genes associated with infectious diseases, cancer, metabolism, pigmentation, neuro-motor functions and high altitude adaptation.
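The assessment of array-derived signals against the WGS results can be illustrated with a minimal Python sketch. It treats the WGS signals as the reference set and counts array signals as true positives, false positives or false negatives. The function name, the representation of a signal as a (chromosome, window start) pair and the example windows are all hypothetical and are not taken from the analysis pipeline used in this work.

```python
def compare_signals(wgs_signals, array_signals):
    """Score array-derived selection signals against WGS-derived signals.

    Each signal is a hashable window identifier, here assumed to be a
    (chromosome, window_start) tuple. WGS signals are the reference set.
    """
    wgs = set(wgs_signals)
    arr = set(array_signals)

    true_pos = arr & wgs    # detected in both array and WGS data
    false_pos = arr - wgs   # detected in array data only
    false_neg = wgs - arr   # detected in WGS data only (missed by the array)

    # Guard against empty sets to avoid division by zero.
    precision = len(true_pos) / len(arr) if arr else 0.0
    recall = len(true_pos) / len(wgs) if wgs else 0.0

    return {
        "true_positives": true_pos,
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "precision": precision,
        "recall": recall,
    }


# Hypothetical example windows, for illustration only.
wgs_windows = [("chr2", 100000), ("chr2", 200000), ("chr6", 300000), ("chr15", 400000)]
array_windows = [("chr2", 100000), ("chr2", 200000), ("chr11", 500000)]

result = compare_signals(wgs_windows, array_windows)
print(len(result["true_positives"]),
      len(result["false_positives"]),
      len(result["false_negatives"]))
```

Exact window matching is the simplest possible criterion. A real comparison might instead allow signals to match when their genomic intervals overlap, which would change the counts.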
A Dissertation submitted to the Faculty of Health Science, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Master of Science in Medicine. Johannesburg, 2019