WebSTR: a population-wide database of short tandem repeat variation in humans.
Publication date: 2023 Sep 05
Authors:
Oxana Sachenkova Lundström, Max Verbiest, Feifei Xia, Helyaneh Ziaei Jam, Inti Zlobec, Maria Anisimova, Melissa Gymrek
Source:
JOURNAL OF MOLECULAR BIOLOGY
Abstract:
Short tandem repeats (STRs) are consecutive repetitions of motifs one to six nucleotides long. They are hypervariable because repeat units are frequently inserted or deleted, primarily through polymerase slippage during replication. Genetic variation at STRs has been shown to influence a range of traits in humans, including gene expression, cancer risk, and autism. Until recently, STRs were poorly studied because they pose significant challenges to bioinformatics analyses. Moreover, genome-wide analysis of STR variation in population-scale cohorts requires large amounts of data and computational resources. However, the recent advent of genome-wide analysis tools has resulted in multiple large genome-wide datasets of STR variation spanning nearly two million genomic loci in thousands of individuals from diverse populations. Here we present WebSTR, a database of genetic variation and other characteristics of genome-wide STRs across human populations. WebSTR is based on reference panels of more than 1.7 million human STRs created with state-of-the-art repeat annotation methods and can easily be extended to include additional cohorts or species. It currently contains data based on STR genotypes for individuals from the 1000 Genomes Project, H3Africa, the Genotype-Tissue Expression (GTEx) Project, and colorectal cancer patients from the TCGA dataset. WebSTR is implemented as a relational database, with programmatic access available through an API and a web portal for browsing data. The web portal is publicly available at http://webstr.ucsd.edu.
Copyright © 2023. Published by Elsevier Ltd.
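To make the definition of an STR in the abstract concrete, the following is a minimal Python sketch that scans a DNA string for perfect tandem repeats of motifs one to six nucleotides long. It is an illustration only, not the repeat annotation method used to build WebSTR: the function name `find_perfect_strs` and the `min_copies` threshold are assumptions introduced here, and real annotation tools also handle imperfect repeats, which this sketch ignores.

```python
import re


def find_perfect_strs(seq, min_copies=3, max_motif_len=6):
    """Return (start, motif, copies) for perfect tandem repeats in seq.

    Hypothetical helper for illustration; not part of WebSTR or its API.
    Only exact repeats are found, and start positions are 0-based.
    """
    hits = []
    for k in range(1, max_motif_len + 1):
        # A k-mer followed by at least (min_copies - 1) exact copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AA" or "CACA"); the shorter motif length reports them.
            if motif in (motif + motif)[1:-1]:
                continue
            hits.append((m.start(), motif, len(m.group(0)) // k))
    return hits


if __name__ == "__main__":
    print(find_perfect_strs("TTGACAGCAGCAGCAGTTGA"))
```

In this toy sequence the only repeat reported is the trinucleotide motif `CAG` repeated four times starting at position 4. An insertion or deletion of one repeat unit, as described in the abstract, would change the copy count while leaving the flanking sequence unchanged.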