A systematic assessment of the copy number variation (CNV) landscape in ADME genes in Sub-Saharan African populations
Cottino, Laura Ann Alice
Various types of genetic variation exist within the human genome, ranging from single nucleotide changes to large-scale structural alterations. In recent years, several studies have identified a class of intermediate-scale structural variation (SV), known as copy number variation (CNV), and revealed its importance. Despite its substantial impact on genetic diversity and disease, CNV remains widely understudied. The African continent carries a disproportionate burden of disease, and the treatments available are often ineffective or result in adverse drug reactions (ADRs). Genetic variation in absorption, distribution, metabolism, and excretion (ADME) genes has been shown to have a major impact on the pharmacokinetics, efficacy and safety of drugs. While single nucleotide variants in ADME genes have been well characterised, studies focused on the CNV landscape in ADME genes are lacking, especially in African populations. Additionally, CNVs can be identified using several approaches, including short-read and long-read whole genome sequencing (WGS) as well as genotyping array data, each with its own advantages and limitations. This study aimed to (i) describe the CNV landscape in ADME genes in sub-Saharan African (SSA) populations, (ii) compare the CNV results from short- and long-read sequencing data across the genome in order to assess whether short-read sequencing is adequate for accurate CNV calling and (iii) assess the utility of the Infinium H3Africa Consortium Array v2 for calling CNVs. CNVs were called from short-read, high-coverage WGS data of 953 individuals from across SSA using several CNV calling tools. A merged dataset consisting of 362 ADME CNVs was generated from the outputs of Genome STRiP, Manta, Delly, Lumpy and GATK. The results of this study show that CNV is an additional and important source of genetic variability within ADME genes, with significant implications for drug response and precision therapy.
There are, however, various limitations of SV calling from short-read sequencing data, resulting in generally low sensitivity and high false discovery rates, and long-read approaches have been shown to outperform short-read approaches. In this study, eight samples were selected for PacBio SMRT sequencing and a total of 42 041 CNVs were identified throughout the genome. When comparing the short- and long-read CNVs from the same eight individuals, ~50% of the short-read CNVs were concordant with the long-read data. Overall, the long-read sequencing approach allowed for the identification of a large proportion of additional CNVs and provided invaluable insights into the SSA CNV landscape. Sequencing of additional samples will further improve our knowledge. Lastly, genotype data generated from over 10 000 South African individuals on the Infinium H3Africa Consortium Array v2 were used to call CNVs with PennCNV. A total of 7 312 CNVs were identified across the genome. This approach was able to call large, rare CNVs but called far fewer CNVs than the sequencing approaches discussed above. This lack of resolution limits the utility of the array in generating a comprehensive population-wide CNV call set. Additionally, the concordance between the array CNVs and the short-read CNVs was very low, whereas the concordance between the array CNVs and the long-read CNVs was far higher (~50%). Overall, this study has shown that CNV is an important source of genetic variation and has provided a more complete set of pharmacogenetically relevant variants in the African context. Additionally, it has highlighted the challenges of calling CNVs, with a diverse set of results being obtained from the different approaches.
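The concordance figures reported above depend on how two CNV calls are judged to be the same event. As a minimal illustrative sketch, and not the procedure used in the thesis, the Python below treats two calls as concordant when they lie on the same chromosome, have the same type, and share at least 50% reciprocal overlap. The `CNV` record, the function names, the 50% threshold and the example coordinates are all assumptions introduced for illustration.

```python
from typing import NamedTuple, Sequence


class CNV(NamedTuple):
    """Hypothetical minimal CNV record (not a format used in the thesis)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    kind: str   # "DEL" or "DUP"


def reciprocal_overlap(a: CNV, b: CNV) -> float:
    """Return the smaller of the two overlap fractions (0.0 if no overlap)."""
    if a.chrom != b.chrom or a.kind != b.kind:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def concordance(query: Sequence[CNV], reference: Sequence[CNV],
                min_overlap: float = 0.5) -> float:
    """Fraction of query calls matched by at least one reference call.

    The 0.5 default is an assumed threshold, chosen only for illustration.
    """
    if not query:
        return 0.0
    hits = sum(
        1 for q in query
        if any(reciprocal_overlap(q, r) >= min_overlap for r in reference)
    )
    return hits / len(query)


# Made-up example calls; the coordinates carry no biological meaning.
short_read_calls = [
    CNV("chr1", 1000, 2000, "DEL"),
    CNV("chr1", 5000, 6000, "DUP"),
]
long_read_calls = [
    CNV("chr1", 1100, 2100, "DEL"),
]

print(concordance(short_read_calls, long_read_calls))
```

Note that concordance defined this way is directional: the fraction of short-read calls supported by long-read calls generally differs from the fraction of long-read calls supported by short-read calls, so the query and reference sets should always be stated.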
A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy to the Faculty of Health Science, School of Pathology, University of the Witwatersrand, Johannesburg, 2022