Detection of copy number variation for chromosomal sliding windows using high throughput sequencing data in the R environment

Brian Knaus: USDA-ARS, Horticultural Crops Research Unit

<div>Inference of copy number variation presents a technical challenge because variant callers typically require the copy number of a genome or genomic region to be known a priori. Here we present a method to infer copy number that uses variant call format (VCF) data as input and is implemented in the R package vcfR. The method is based on the relative frequency of each allele (both genic and non-genic) sequenced at heterozygous positions throughout a genome. These heterozygous positions are summarized by grouping them into arbitrarily sized windows, binning the allele frequencies within each window, and selecting the bin that contains the most positions. This provides a non-parametric summary of the frequency at which alleles were sequenced in each window. The method is applicable to organisms whose reference genomes consist of full chromosomes or sub-chromosomal contigs. It differs from other software designed to detect copy number variation in that it does not rely on an assumption of base ploidy, but instead infers it. We validated this approach with the model system <em>Saccharomyces cerevisiae</em> and applied it to the oomycete <em>Phytophthora infestans</em>, both of which are known to vary in ploidy. This functionality has been incorporated into the current release of the R package vcfR to provide modular and flexible methods to investigate copy number variation in genomic projects.</div>
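The windowing and binning step described in the abstract can be illustrated with a minimal sketch. It is written in plain Python rather than R and does not use or reproduce the vcfR API. The function names (`peak_allele_frequency`, `windowed_peaks`), the fixed-width coordinate windows, the default window size and bin width, and the rule for breaking ties are all assumptions made for illustration. The input is assumed to be the position and allele frequency of each heterozygous site on one chromosome or contig.

```python
from collections import Counter


def peak_allele_frequency(freqs, bin_width=0.02):
    """Return the midpoint of the most populated frequency bin.

    freqs: allele frequencies in [0, 1] for the heterozygous positions
    of one window. Returns None for an empty window.
    """
    if not freqs:
        return None
    n_bins = round(1 / bin_width)
    # Assign each frequency to a bin; a frequency of exactly 1.0 is
    # placed in the last bin.
    counts = Counter(min(int(f * n_bins), n_bins - 1) for f in freqs)
    # Pick the bin with the most positions; ties go to the lower bin
    # (an arbitrary choice made for this sketch).
    best_bin = max(counts, key=lambda b: (counts[b], -b))
    return (best_bin + 0.5) / n_bins


def windowed_peaks(positions, freqs, win_size=10_000, bin_width=0.02):
    """Summarize allele frequencies per window of win_size bases.

    positions: 1-based coordinates of heterozygous sites.
    freqs: allele frequency observed at each of those sites.
    Returns a dict mapping (window_start, window_end) to the peak
    frequency; windows with no heterozygous sites are omitted.
    """
    windows = {}
    for pos, f in zip(positions, freqs):
        windows.setdefault((pos - 1) // win_size, []).append(f)
    return {
        (w * win_size + 1, (w + 1) * win_size): peak_allele_frequency(fs, bin_width)
        for w, fs in sorted(windows.items())
    }
```

Calling `windowed_peaks(positions, freqs)` returns one peak frequency per window. Because the summary is the most populated bin rather than a mean, a few positions with outlying frequencies do not shift the result.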