A First Look at DisGeNET

Updated: 2024-10-27 03:32:35

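While looking into the relationship between diseases and the genome, I came across this database. Some of its score calculations are interesting, so I am recording them here for later reference and study.

The material in this post comes from the About page of DisGeNET - a database of gene-disease associations.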

DisGeNET Metrics

We have developed two scores to rank the gene-disease and variant-disease associations according to their level of evidence. These scores range from 0 to 1 and take into account the number and type of sources (level of curation, model organisms) and the number of publications supporting the association.

GDA Score

The DisGeNET Score (S) for GDAs (gene-disease associations) is computed according to:

[Formula image: GDA score]

where:

  • N_sources_i is the number of CURATED sources supporting a GDA, with i ∈ CGI, CLINGEN, GENOMICS ENGLAND, CTD, PSYGENET, ORPHANET, UNIPROT
  • j ∈ Rat, Mouse from RGD, MGD, and CTD (the model-organism sources)
  • k ∈ HPO, CLINVAR, GWASCAT, GWASDB
  • N_pubs is the number of publications supporting a GDA in the sources LHGDN and BEFREE

# Note: taken together, the score tops out at 1. A higher score appears to mean stronger support from the various databases and the literature.
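The component structure described above can be sketched in Python. This is a rough sketch, not DisGeNET's implementation: the formula image is missing from this copy, so the function name `gda_score`, the parameter names, and every numeric weight and threshold below are assumptions (recalled from the DisGeNET v7 documentation) that should be checked against the About page. Only the four kinds of evidence and the 0 to 1 range come from the text.

```python
def gda_score(n_curated: int, n_model_organism: int, n_k_sources: int, n_pubs: int) -> float:
    """Sum four evidence components into a score between 0 and 1.

    Every weight and threshold here is ASSUMED, not taken from this post.
    """
    if min(n_curated, n_model_organism, n_k_sources, n_pubs) < 0:
        raise ValueError("counts must be non-negative")

    # Curated sources (index i): more independent sources, higher weight.
    if n_curated > 2:
        curated = 0.6
    elif n_curated == 2:
        curated = 0.5
    elif n_curated == 1:
        curated = 0.3
    else:
        curated = 0.0

    # Model-organism sources (index j): only presence matters.
    model = 0.2 if n_model_organism > 0 else 0.0

    # Sources indexed by k (HPO, CLINVAR, GWASCAT, GWASDB): only presence matters.
    other = 0.1 if n_k_sources > 0 else 0.0

    # Literature (LHGDN, BEFREE): grows with each publication, then saturates.
    literature = 0.1 if n_pubs > 9 else 0.01 * n_pubs

    # Round to hide floating-point noise from summing decimal weights.
    return round(curated + model + other + literature, 2)


print(gda_score(3, 1, 1, 10))  # every component at its assumed maximum
print(gda_score(1, 0, 0, 5))   # one curated source plus five publications
```

Under these assumed weights the components sum to at most 1, which agrees with the note above that the score tops out at 1.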

Distribution of the DisGeNET score for GDAs according to the number of sources reporting the association (figure omitted)

VDA Score

The DisGeNET Score (S) for VDAs (variant-disease associations) is computed according to:

[Formula image: VDA score]

where:

  • N_sources_i is the number of CURATED sources supporting a VDA, with i ∈ UNIPROT, CLINVAR, GWASCAT, GWASDB
  • N_pubs is the number of publications supporting a VDA in the source BeFree

# Note: BeFree appears to be a text-mining tool developed by the team behind this database.

Distribution of the DisGeNET Score for VDAs according to the number of sources reporting the association (figure omitted)

Disease Specificity Index

There are genes (or variants) that are associated with multiple diseases (e.g. TNF), while others are associated with a small set of diseases or even a single disease. The Disease Specificity Index (DSI) measures this property of genes (and variants): it reflects whether a gene (or variant) is associated with many or few diseases. It is computed according to:

$$\mathrm{DSI} = \frac{\log_2(N_d / N_T)}{\log_2(1 / N_T)}$$

where:

  • N_d is the number of diseases associated with the gene/variant
  • N_T is the total number of diseases in DisGeNET

The DSI ranges from 0.25 to 1. Example: TNF, associated with more than 1,500 diseases, has a DSI of 0.263, while HCN2 is associated with one disease and has a DSI of 1.

# Note: so the smaller the DSI, the more diseases the gene or variant is associated with.
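The DSI calculation can be sketched as follows, assuming it is defined as log2(N_d / N_T) / log2(1 / N_T), which agrees with the HCN2 example (a single disease gives a DSI of 1). The function name `dsi` is hypothetical, and so is the total of 20,000 diseases used in the demonstration, because the post does not state N_T.

```python
import math
from typing import Optional


def dsi(n_d: int, n_t: int) -> Optional[float]:
    """Disease Specificity Index: log2(n_d / n_t) / log2(1 / n_t)."""
    if n_d == 0:
        # No associated diseases (e.g. phenotypes only): DSI stays empty.
        return None
    if n_t < 2 or not 0 < n_d <= n_t:
        raise ValueError("need 0 < n_d <= n_t and n_t >= 2")
    return math.log2(n_d / n_t) / math.log2(1 / n_t)


N_T = 20_000  # hypothetical total; the real value is not given in this post

print(dsi(1, N_T))      # a gene tied to a single disease
print(dsi(1_500, N_T))  # a gene tied to many diseases scores lower
```

Because the numerator shrinks toward zero as N_d approaches N_T, genes linked to many diseases get a low DSI and genes linked to one disease get exactly 1.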

If the DSI is empty, it implies that the gene/variant is associated only with phenotypes.

Disease Pleiotropy Index

The rationale is similar to that of the DSI, but here we consider whether the multiple diseases associated with the gene (or variant) are similar to one another (they belong to the same MeSH disease class, e.g. Cardiovascular Diseases) or are completely different diseases that belong to different disease classes. The Disease Pleiotropy Index (DPI) is computed according to:

$$\mathrm{DPI} = \frac{N_{dc}}{N_{TC}}$$

where:

  • N_dc is the number of different MeSH disease classes of the diseases associated with the gene/variant
  • N_TC is the total number of MeSH disease classes in DisGeNET (29)

The DPI ranges from 0 to 1. Example: gene KCNT1 is associated with 39 diseases, 4 disease groups, and 18 phenotypes. 29 of the 39 diseases have a MeSH disease class, and these 29 diseases map to 5 different MeSH classes. The DPI for KCNT1 is therefore 5/29 ≈ 0.172. In contrast, gene APOE, associated with more than 700 diseases in 27 disease classes, has a DPI of 0.931.
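The KCNT1 arithmetic above can be reproduced with a short sketch, assuming the DPI is the number of distinct MeSH disease classes divided by the 29 classes in DisGeNET (as the 5/29 example suggests). The function name `dpi` and the placeholder class labels are hypothetical and are not real MeSH codes.

```python
from typing import Iterable, Optional

N_TC = 29  # total number of MeSH disease classes in DisGeNET, per the text


def dpi(classes_per_disease: Iterable[Iterable[str]], n_tc: int = N_TC) -> Optional[float]:
    """Fraction of all MeSH disease classes covered by a gene's diseases."""
    distinct = set()
    for classes in classes_per_disease:
        distinct.update(classes)
    if not distinct:
        # Phenotypes only, or no disease maps to a MeSH class: no DPI value.
        return None
    return len(distinct) / n_tc


# KCNT1-like toy input: 29 classified diseases spread over 5 placeholder classes.
kcnt1_like = [[f"class_{n % 5 + 1}"] for n in range(29)]
print(round(dpi(kcnt1_like), 3))  # 0.172
```

Taking each disease as a collection of classes allows a disease to belong to more than one MeSH class; the set union counts each class once.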

If the gene/variant has no DPI value, it implies that the gene/variant is associated only with phenotypes, or that the associated diseases do not map to any MeSH class.

Evidence Level

The Evidence Level (EL) is a metric developed by ClinGen that measures the strength of evidence for a gene-disease relationship and corresponds to a qualitative classification: "Definitive", "Strong", "Moderate", "Limited", "Disputed" (Strande et al., 2017). GDAs that have been reported by ClinGen carry their corresponding Evidence Level. Furthermore, we have adapted a similar metric reported by Genomics England PanelApp to correspond to the same ClinGen categories: GDAs marked by Genomics England PanelApp as High Evidence are labeled as strong in DisGeNET, those marked as Moderate Evidence are labeled as moderate, and Low Evidence associations are labeled as limited.
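The PanelApp-to-ClinGen correspondence described above is a three-entry lookup. A minimal sketch follows; the function name `evidence_level_from_panelapp` and the whitespace-insensitive matching are assumptions made for illustration, since the post does not say how the labels are stored.

```python
from typing import Optional

# PanelApp label (spaces removed, lower-cased) -> DisGeNET Evidence Level.
_PANELAPP_TO_EL = {
    "highevidence": "strong",
    "moderateevidence": "moderate",
    "lowevidence": "limited",
}


def evidence_level_from_panelapp(label: str) -> Optional[str]:
    """Map a Genomics England PanelApp label to a DisGeNET Evidence Level."""
    key = "".join(label.split()).lower()
    return _PANELAPP_TO_EL.get(key)  # None for labels outside the mapping


print(evidence_level_from_panelapp("High Evidence"))  # strong
print(evidence_level_from_panelapp("LowEvidence"))    # limited
```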

Evidence Index

The "Evidence Index" (EI) indicates the existence of contradictory results in publications supporting the gene/variant-disease associations. This index is computed for the sources BeFree and PsyGeNET by identifying the publications that report a negative finding on a particular VDA or GDA. Note that only in the case of PsyGeNET has the information used to compute the EI been validated by experts. The EI is computed as follows:

$$\mathrm{EI} = \frac{N_{pubs\,positive}}{N_{pubs\,total}}$$

EI = 1 indicates that all the publications support the GDA or the VDA, while EI < 1 indicates that there are publications asserting that there is no association between the gene/variant and the disease. If the gene/variant has no EI value, the index has not been computed for this association.

where:

  • N_pubs_positive is the number of publications supporting a GDA in BeFree or PsyGeNET, or a VDA in BeFree
  • N_pubs_total is the total number of publications in BeFree or PsyGeNET reporting on that GDA, or in BeFree for VDAs
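A small sketch of the EI calculation, assuming it is the ratio of supporting publications to all publications on the association, which matches the statement that EI = 1 when every publication supports it. The function name `evidence_index` and the publication counts in the demonstration are hypothetical.

```python
from typing import Optional


def evidence_index(n_pubs_positive: int, n_pubs_total: int) -> Optional[float]:
    """Share of publications that support the association."""
    if n_pubs_total == 0:
        # Nothing to compute from: the EI is left empty.
        return None
    if not 0 <= n_pubs_positive <= n_pubs_total:
        raise ValueError("need 0 <= n_pubs_positive <= n_pubs_total")
    return n_pubs_positive / n_pubs_total


print(evidence_index(8, 8))  # 1.0: all publications support the association
print(evidence_index(6, 8))  # 0.75: two publications report a negative finding
```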

Data Attributes

To ease the interpretation and analysis of gene-disease, variant-disease, and disease-disease associations, we provide the following information for the data.

Genes

For human genes, HGNC symbols and UniProt accession numbers are converted to NCBI Entrez gene identifiers using an in-house dictionary that cross-references HGNC, UniProt, and NCBI Gene information. To map mouse and rat genes, we used the files HOM_MouseHumanSequence and RGD_ORTHOLOGS, which carry orthology information from MGD and RGD respectively, to map rat and mouse Entrez gene identifiers to human Entrez identifiers. We discarded the associations involving rat or mouse genes without a human ortholog.
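The ortholog step above (map model-organism genes to human genes, drop those without a human ortholog) can be illustrated with a dictionary lookup. This is only a sketch: the function name `map_to_human`, the in-memory table, and all identifiers are made up for illustration and do not reflect the layout of the HOM_MouseHumanSequence or RGD_ORTHOLOGS files.

```python
from typing import Dict, List, Tuple

Association = Tuple[str, str]  # (gene identifier, disease identifier)


def map_to_human(associations: List[Association],
                 orthologs: Dict[str, str]) -> List[Association]:
    """Rewrite model-organism gene IDs as human IDs; drop unmapped genes."""
    mapped = []
    for gene_id, disease_id in associations:
        human_id = orthologs.get(gene_id)
        if human_id is None:
            continue  # no human ortholog: the association is discarded
        mapped.append((human_id, disease_id))
    return mapped


# Hypothetical ortholog table and associations.
orthologs = {"mouse:101": "human:9001", "rat:202": "human:9002"}
associations = [
    ("mouse:101", "disease_A"),
    ("rat:202", "disease_B"),
    ("rat:303", "disease_C"),  # no ortholog entry, so it is dropped
]
print(map_to_human(associations, orthologs))
```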

Genes in DisGeNET are annotated with:

  • the official gene symbol, from the NCBI
  • the NCBI Official Full Name
  • the UniProt accession
  • the Disease Specificity Index (DSI)
  • the Disease Pleiotropy Index (DPI)
  • the pLI, defined as the probability of being loss-of-function intolerant, a gene constraint metric provided by the gnomAD consortium. A gene constraint metric aims to measure how far naturally occurring LoF (loss-of-function) variation has been depleted from a gene by natural selection (in other words, how intolerant a gene is to LoF variation). LoF-intolerant genes have a high pLI value (>= 0.9), while LoF-tolerant genes have a low pLI value (<= 0.1). The LoF variants considered are nonsense and essential splice site variants.
  • the top level category from the Drug Target Ontology
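The pLI thresholds quoted in the list above can be turned into a small classifier. In this sketch the function name `classify_pli` and the label "intermediate" for values strictly between 0.1 and 0.9 are assumptions; the text only names the two extremes.

```python
def classify_pli(pli: float) -> str:
    """Bucket a pLI value using the thresholds quoted in the text."""
    if not 0.0 <= pli <= 1.0:
        raise ValueError("pLI is a probability and must lie in [0, 1]")
    if pli >= 0.9:
        return "LoF intolerant"
    if pli <= 0.1:
        return "LoF tolerant"
    return "intermediate"  # assumed label; the text does not name this range


print(classify_pli(0.97))  # LoF intolerant
print(classify_pli(0.02))  # LoF tolerant
print(classify_pli(0.50))  # intermediate
```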

Variants

Variants in DisGeNET are annotated with:

  • The position in the chromosome
  • The reference and alternative alleles
  • The class of the variant: SNP, deletion, insertion, indel, somatic SNV, substitution, sequence alteration, and tandem repeat

    These attributes are retrieved from dbSNP, the NCBI Short Genetic Variations database, a catalog of short variations in nucleotide sequences from a wide range of organisms (Sherry et al., 2001). The data was retrieved in January 2020 (corresponding to NCBI dbSNP Human Build 153 and to assembly GRCh38).

  • The allelic frequency in genomes and exomes according to gnomAD

    The Genome Aggregation Database (gnomAD) is a resource that aggregates and harmonizes both exome and genome sequencing data from a wide variety of large-scale sequencing projects (Exome Aggregation Consortium, 2016). The data spans 125,748 exomes and 71,702 genomes from unrelated individuals sequenced as part of various disease-specific and population genetic studies (release 2.1.1 for exomes and 3.0 for genomes).

  • The most severe consequence type according to the Variant Effect Predictor (VEP)

    The Ensembl Variant Effect Predictor determines the effect of a variant, or a list of variants (SNPs, insertions, deletions, CNVs, or structural variants), on genes, transcripts, and protein sequence, as well as regulatory regions (McLaren et al., 2016). We use the Ensembl API (release 11.2) to obtain the most severe consequence type of the variant.

  • The gene corresponding to the consequence type assigned by VEP
  • The Disease Specificity Index (DSI)
  • The Disease Pleiotropy Index (DPI)

Diseases

The vocabulary used for diseases in DisGeNET is the set of Concept Unique Identifiers (CUIs) from the Unified Medical Language System® (UMLS) Metathesaurus® (version UMLS 2019AA). The repositories of gene-disease associations use different disease vocabularies: OMIM® terms for diseases from UniProt, CTD™, and MGD; MeSH terms for CTD™, LHGDN, and RGD; MONDO for ClinGen; HPO identifiers for HPO; UMLS® CUIs for ClinVar and PsyGeNET; EFO for the GWAS Catalog; and MeSH, EFO, and DO vocabularies for GWASdb. Orphanet identifiers are mapped using Orphanet cross-references. Disease names from the Genomics England PanelApp and the Cancer Genome Interpreter are normalized using the UMLS Metathesaurus. We also used the UMLS® Metathesaurus® concept structure to map OMIM, HPO, and MeSH terms to UMLS® CUIs.

Diseases in DisGeNET are annotated with:

  • the disease name, provided by the UMLS® Metathesaurus®
  • the UMLS® semantic types
  • the MeSH class: we classify the diseases according to the MeSH hierarchy using the upper level concepts of the MeSH tree branch C (Diseases) plus three concepts of the F branch (Psychiatry and Psychology): "Behavior and Behavior Mechanisms", "Psychological Phenomena and Processes", and "Mental Disorders"
  • the top level concepts from the Human Disease Ontology
  • the DisGeNET disease type: disease, phenotype, or group

We classify as a disease any entry that maps to one of the following UMLS® semantic types:

  • Disease or Syndrome
  • Neoplastic Process
  • Acquired Abnormality
  • Anatomical Abnormality
  • Congenital Abnormality
  • Mental or Behavioral Dysfunction

We classify as a phenotype any entry that maps to one of the following UMLS® semantic types:

  • Pathologic Function
  • Sign or Symptom
  • Finding
  • Laboratory or Test Result
  • Individual Behavior
  • Clinical Attribute
  • Organism Attribute
  • Organism Function
  • Organ or Tissue Function
  • Cell or Molecular Dysfunction

These classifications were manually checked. In addition, disease entries referring to disease groups, such as "Cardiovascular Diseases", "Autoimmune Diseases", "Neurodegenerative Diseases", and "Lung Neoplasms", were classified as disease group.

Additionally, we have removed terms that other sources consider diseases but that are not strictly diseases, such as terms belonging to the following UMLS® semantic types:

  • Gene or Genome
  • Genetic Function
  • Immunologic Factor
  • Injury or Poisoning
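The semantic-type rules above amount to three lookup sets. The sketch below encodes them; the function name `disgenet_type` is hypothetical, an entry is simplified to a single semantic type, and the manually curated "group" category is deliberately left out because the text says it was assigned by hand.

```python
from typing import Optional

DISEASE_TYPES = frozenset({
    "Disease or Syndrome", "Neoplastic Process", "Acquired Abnormality",
    "Anatomical Abnormality", "Congenital Abnormality",
    "Mental or Behavioral Dysfunction",
})
PHENOTYPE_TYPES = frozenset({
    "Pathologic Function", "Sign or Symptom", "Finding",
    "Laboratory or Test Result", "Individual Behavior", "Clinical Attribute",
    "Organism Attribute", "Organism Function", "Organ or Tissue Function",
    "Cell or Molecular Dysfunction",
})
EXCLUDED_TYPES = frozenset({
    "Gene or Genome", "Genetic Function", "Immunologic Factor",
    "Injury or Poisoning",
})


def disgenet_type(semantic_type: str) -> Optional[str]:
    """Return "disease" or "phenotype", or None if excluded or not listed."""
    if semantic_type in EXCLUDED_TYPES:
        return None
    if semantic_type in DISEASE_TYPES:
        return "disease"
    if semantic_type in PHENOTYPE_TYPES:
        return "phenotype"
    return None


print(disgenet_type("Neoplastic Process"))  # disease
print(disgenet_type("Finding"))             # phenotype
print(disgenet_type("Gene or Genome"))      # None
```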

Gene-Disease Associations

For a seamless integration of gene-disease association data, we developed the DisGeNET association type ontology. All association types found in the original source databases are represented as ontological classes, formally structured under a parent GeneDiseaseAssociation class that applies whenever there is a relationship between the gene/protein and the disease. For more information, see here.

  • the DisGeNET score
  • the DisGeNET Gene-Disease Association Type
  • the Evidence Level
  • the Evidence Index
  • the year initial: first time that the association was reported
  • the year final: last time that the association was reported
  • the publication(s) that report the gene-disease association, with the PubMed identifier
  • a representative sentence from the publication describing the association between the gene and the disease (if a representative sentence is not found, we provide the title of the paper)
  • the original source reporting the gene-disease association
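The "year initial" and "year final" attributes in the list above amount to the earliest and latest publication year recorded for an association. A minimal sketch, in which the function name `year_range` and the sample years are hypothetical:

```python
from typing import Iterable, Optional, Tuple


def year_range(publication_years: Iterable[int]) -> Optional[Tuple[int, int]]:
    """Return (year initial, year final) for an association's publications."""
    years = list(publication_years)
    if not years:
        return None  # no dated publication for this association
    return min(years), max(years)


print(year_range([2005, 1998, 2019]))  # (1998, 2019)
```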

Variant-Disease Associations

  • the DisGeNET score
  • the Evidence Index
  • the publication(s) that report the variant-disease association, with the PubMed identifier
  • the year initial: first time that the association was reported
  • the year final: last time that the association was reported
  • a representative sentence from the publication describing the association between the variant and the disease (if a representative sentence is not found, we provide the title of the paper)
  • the original source reporting the variant-disease association

Disease-Disease Associations

  • Jaccard Index based on shared genes
  • p-value based on shared genes
  • Jaccard Index based on shared variants
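The Jaccard index named in the list above is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to the gene sets of two diseases; the function name `jaccard_index` and the gene labels are placeholders, and returning 0.0 for two empty sets is a convention chosen here, not something the post specifies. The p-value is not sketched because the post does not describe how it is computed.

```python
from typing import Iterable


def jaccard_index(genes_a: Iterable[str], genes_b: Iterable[str]) -> float:
    """|A ∩ B| / |A ∪ B| for the gene (or variant) sets of two diseases."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Placeholder gene labels: two shared genes out of four distinct genes.
disease_1 = {"GENE1", "GENE2", "GENE3"}
disease_2 = {"GENE2", "GENE3", "GENE4"}
print(jaccard_index(disease_1, disease_2))  # 0.5
```

The same function works for the variant-based index by passing sets of variant identifiers instead of gene labels.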

Published: 2023-06-10 20:29:00
Link: https://www.elefans.com/category/jswz/34/1341939.html