Recurrent miscalling of missense variation from short-read genome sequence data

Matthew A. Field; Gaetan Burgio; Aaron Chuah; Jalila Al Shekaili; Batool Hassan; Nashat Al Sukaiti; Simon J. Foote; Matthew C. Cook; T. Daniel Andrews

doi:10.1186/s12864-019-5863-2

Matthew A. Field, Gaetan Burgio, Aaron Chuah, Jalila Al Shekaili, Batool Hassan, Nashat Al Sukaiti, Simon J. Foote, Matthew C. Cook, T. Daniel Andrews*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

7 Citations (Scopus)

Abstract

Background: Short-read resequencing of genomes produces abundant information on the genetic variation of individuals. Because these variants are so numerous, they are rarely exhaustively validated. Furthermore, low levels of undetected variant miscalling will have a systematic and disproportionate impact on the interpretation of individual genome sequence information, especially should these miscalls also be carried through into reference databases of genomic variation. Results: We find that sequence variation from short-read sequence data is subject to recurrent-yet-intermittent miscalling that occurs in a sequence-intrinsic manner and is very sensitive to sequence read length. The miscalls arise from difficulties aligning short reads to redundant genomic regions, where the rate of sequencing error approaches the sequence diversity between redundant regions. We find the resultant miscalled variants to be sensitive to small sequence variations between genomes, and they are therefore often intrinsic to an individual, pedigree, strain or human ethnic group. In human exome sequences, we identify 2-300 recurrent false positive variants per individual, almost all of which are present in public databases of human genomic variation. From the exomes of non-reference strains of inbred mice, we identify 3-5000 recurrent false positive variants per mouse, a number that increases with greater distance between an individual mouse strain and the reference C57BL6 mouse genome. We show that recurrently miscalled variants may be reproduced for a given genome from repeated simulation rounds of read resampling, realignment and recalling. As such, it is possible to identify more than two-thirds of false positive variation from only ten rounds of simulation. Conclusion: Identification and removal of recurrent false positive variants from specific individual variant sets will improve overall data quality. Variant miscalls are highly sequence intrinsic and are often specific to an individual, pedigree or ethnicity. Further, read length is a strong determinant of whether given false variants will be called for any given genome, which has profound significance for cohort studies that pool datasets collected and sequenced at different points in time.

Original language	English
Article number	546
Journal	BMC Genomics
Volume	20
DOIs	https://doi.org/10.1186/s12864-019-5863-2
Publication status	Published - Jul 16 2019

Keywords

Alignment
Exome
Miscall
Resampling
Single nucleotide variant

ASJC Scopus subject areas

Biotechnology
Genetics

Cite this

@article{e11fd1eb461a4a208d17e01ad3e84032,

title = "Recurrent miscalling of missense variation from short-read genome sequence data",

abstract = "Background: Short-read resequencing of genomes produces abundant information on the genetic variation of individuals. Because these variants are so numerous, they are rarely exhaustively validated. Furthermore, low levels of undetected variant miscalling will have a systematic and disproportionate impact on the interpretation of individual genome sequence information, especially should these miscalls also be carried through into reference databases of genomic variation. Results: We find that sequence variation from short-read sequence data is subject to recurrent-yet-intermittent miscalling that occurs in a sequence-intrinsic manner and is very sensitive to sequence read length. The miscalls arise from difficulties aligning short reads to redundant genomic regions, where the rate of sequencing error approaches the sequence diversity between redundant regions. We find the resultant miscalled variants to be sensitive to small sequence variations between genomes, and they are therefore often intrinsic to an individual, pedigree, strain or human ethnic group. In human exome sequences, we identify 2-300 recurrent false positive variants per individual, almost all of which are present in public databases of human genomic variation. From the exomes of non-reference strains of inbred mice, we identify 3-5000 recurrent false positive variants per mouse, a number that increases with greater distance between an individual mouse strain and the reference C57BL6 mouse genome. We show that recurrently miscalled variants may be reproduced for a given genome from repeated simulation rounds of read resampling, realignment and recalling. As such, it is possible to identify more than two-thirds of false positive variation from only ten rounds of simulation. Conclusion: Identification and removal of recurrent false positive variants from specific individual variant sets will improve overall data quality. Variant miscalls are highly sequence intrinsic and are often specific to an individual, pedigree or ethnicity. Further, read length is a strong determinant of whether given false variants will be called for any given genome, which has profound significance for cohort studies that pool datasets collected and sequenced at different points in time.",

keywords = "Alignment, Exome, Miscall, Resampling, Single nucleotide variant",

author = "Field, {Matthew A.} and Gaetan Burgio and Aaron Chuah and {Al Shekaili}, Jalila and Batool Hassan and {Al Sukaiti}, Nashat and Foote, {Simon J.} and Cook, {Matthew C.} and Andrews, {T. Daniel}",

note = "Publisher Copyright: {\textcopyright} 2019 The Author(s).",

year = "2019",

month = jul,

day = "16",

doi = "10.1186/s12864-019-5863-2",

language = "English",

volume = "20",

journal = "BMC Genomics",

issn = "1471-2164",

publisher = "BioMed Central",

}

TY - JOUR

T1 - Recurrent miscalling of missense variation from short-read genome sequence data

AU - Field, Matthew A.

AU - Burgio, Gaetan

AU - Chuah, Aaron

AU - Al Shekaili, Jalila

AU - Hassan, Batool

AU - Al Sukaiti, Nashat

AU - Foote, Simon J.

AU - Cook, Matthew C.

AU - Andrews, T. Daniel

PY - 2019/7/16

Y1 - 2019/7/16

AB - Background: Short-read resequencing of genomes produces abundant information on the genetic variation of individuals. Because these variants are so numerous, they are rarely exhaustively validated. Furthermore, low levels of undetected variant miscalling will have a systematic and disproportionate impact on the interpretation of individual genome sequence information, especially should these miscalls also be carried through into reference databases of genomic variation. Results: We find that sequence variation from short-read sequence data is subject to recurrent-yet-intermittent miscalling that occurs in a sequence-intrinsic manner and is very sensitive to sequence read length. The miscalls arise from difficulties aligning short reads to redundant genomic regions, where the rate of sequencing error approaches the sequence diversity between redundant regions. We find the resultant miscalled variants to be sensitive to small sequence variations between genomes, and they are therefore often intrinsic to an individual, pedigree, strain or human ethnic group. In human exome sequences, we identify 2-300 recurrent false positive variants per individual, almost all of which are present in public databases of human genomic variation. From the exomes of non-reference strains of inbred mice, we identify 3-5000 recurrent false positive variants per mouse, a number that increases with greater distance between an individual mouse strain and the reference C57BL6 mouse genome. We show that recurrently miscalled variants may be reproduced for a given genome from repeated simulation rounds of read resampling, realignment and recalling. As such, it is possible to identify more than two-thirds of false positive variation from only ten rounds of simulation. Conclusion: Identification and removal of recurrent false positive variants from specific individual variant sets will improve overall data quality. Variant miscalls are highly sequence intrinsic and are often specific to an individual, pedigree or ethnicity. Further, read length is a strong determinant of whether given false variants will be called for any given genome, which has profound significance for cohort studies that pool datasets collected and sequenced at different points in time.

KW - Alignment

KW - Exome

KW - Miscall

KW - Resampling

KW - Single nucleotide variant

UR - http://www.scopus.com/inward/record.url?scp=85069468357&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85069468357&partnerID=8YFLogxK

U2 - 10.1186/s12864-019-5863-2

DO - 10.1186/s12864-019-5863-2

M3 - Article

C2 - 31307400

AN - SCOPUS:85069468357

SN - 1471-2164

VL - 20

JO - BMC Genomics

JF - BMC Genomics

M1 - 546

ER -
