Background: Maturity-onset diabetes of the young (MODY) is a group of dominantly inherited forms of monogenic diabetes, of which HNF4A-MODY, GCK-MODY, and HNF1A-MODY are the three most common, each named after its causal gene. Molecular diagnosis of MODY is important for precise treatment. Although a DNA variant causing MODY can be assessed using the criteria of the American College of Medical Genetics and Genomics (ACMG) guidelines, gene-specific assessment of disease-causing mutations is important to differentiate among MODY subtypes. Because the ACMG criteria were not originally designed for machine-learning algorithms, they are not truly independent variables.
Objective: The aim of this study was to develop machine-learning models for interpretation of DNA variants and MODY diagnosis using the ACMG criteria.
Methods: We applied machine-learning models for interpretation of DNA variants in MODY genes defined by the ACMG criteria based on the Human Gene Mutation Database (HGMD) and ClinVar database.
Results: With a machine-learning procedure, we found that the weight matrix of the ACMG criteria was significantly different between the three MODY genes HNF1A, HNF4A, and GCK. The models showed high predictive abilities with accuracy over 95%.
Conclusions: Our results highlight the need for applying different weights of the ACMG criteria to different MODY genes for accurate functional classification. As proof of principle, we applied the ACMG criteria as feature vectors in a machine-learning model and obtained accurate, gene-specific classification.
Monogenic diabetes results from DNA mutations in a single gene and accounts for about 1%-4% of all cases of diabetes in the United States. The most common form of monogenic diabetes is maturity-onset diabetes of the young (MODY), an autosomal dominant disease that most commonly occurs in adolescence or early adulthood [ ]. Genetic sequencing is needed to identify the causal mutations and to diagnose the different subtypes of MODY [ ]. A DNA variant causing MODY can be assessed using the criteria established by the American College of Medical Genetics and Genomics (ACMG), as published in their guidelines [ ]. Although the ACMG guidelines can be applied universally to all human DNA variants, our previous study suggested that a gene-specific assessment is important for identifying disease-causing mutations in different MODY genes [ ]. In addition, contradictory evidence is commonly seen in the functional classification of genetic variants when using the ACMG guidelines [ ]. In such cases, the ACMG guidelines may classify a variant as being of uncertain significance, even though some variants with contradictory evidence can ultimately be given a reliable, definite classification.
Machine learning has been advocated as an important tool for both clinical and research purposes in human diseases [, ]. In this study, we aimed to develop machine-learning models for interpretation of DNA variants using the ACMG criteria, with a focus on DNA variants of three MODY genes (HNF1A, HNF4A, and GCK) underlying the three most common types of MODYs [ ].
Data Collection for Machine-Learning Procedures
Known DNA variants of the three MODY genes HNF1A, HNF4A, and GCK were acquired from dbSNP, the ClinVar database [ ], and the Human Gene Mutation Database (HGMD) 2019 professional version [ ]. Among the hundreds of variants reported in these genes, approximately half are classified as pathogenic/likely pathogenic (P/LP) according to the annotation in ClinVar or HGMD. According to the HGMD, the three genes were curated by Professor Andrew Hattersley, a leading genetic expert in MODY [ ]. The classification of variants as benign/likely benign (B/LB), which is based on the annotation in ClinVar or dbSNP, varies between the databases. Overall, 899 unique variants are reported in HNF1A, including 569 P/LP sites and 330 B/LB sites; 1037 in HNF4A, including 182 P/LP sites and 855 B/LB sites; and 1664 in GCK, including 1065 P/LP sites and 599 B/LB sites. However, several of these variants are annotated differently in the different databases.
Feature Vector Generation
The feature vectors used for machine-learning modeling were the criteria defined in the ACMG guidelines. The criteria terms were generated with InterVar [ ], a computational tool that takes a preannotated or variant call format file as input and generates an automated interpretation based on the ACMG criteria. It should be noted that not all 33 ACMG criteria can be scored computationally. For example, the PS3 criterion requires well-established in vitro or in vivo functional studies supportive of a damaging effect on the gene or gene product, which cannot be derived from sequence annotation alone. As a result, the following 15 ACMG criteria were used, which also set the length of the feature vectors for the three MODY genes: PVS1, PS1, PS4, PM1, PM2, PM4, PM5, PP2, PP3, PP5, BA1, BS1, BP4, BP6, and BP7.
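As an illustration of this encoding, the sketch below turns the set of criteria triggered for one variant into a 15-element feature vector. The function name `encode_variant` and the example hits are hypothetical, and treating each criterion as a 0/1 indicator is an assumption, since the text does not state how the InterVar output was encoded.

```python
# The 15 computationally scorable ACMG criteria, in the order listed in the text.
ACMG_FEATURES = [
    "PVS1", "PS1", "PS4", "PM1", "PM2", "PM4", "PM5",
    "PP2", "PP3", "PP5", "BA1", "BS1", "BP4", "BP6", "BP7",
]


def encode_variant(hits):
    """Return a 0/1 list with one entry per criterion in ACMG_FEATURES.

    `hits` is an iterable of criterion names triggered for one variant.
    Binary encoding is an assumption made for this sketch.
    """
    hits = set(hits)
    unknown = hits - set(ACMG_FEATURES)
    if unknown:
        raise ValueError(f"criteria not in the feature set: {sorted(unknown)}")
    return [1 if name in hits else 0 for name in ACMG_FEATURES]


# Hypothetical example: a rare missense variant in a functional domain.
example = encode_variant(["PM1", "PM2", "PP3"])
print(example)
```

A criterion outside the 15-term set (for example PS3) raises an error rather than being silently dropped, which keeps the vector length fixed across genes.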
Using machine-learning regression procedures, we normalized the weights for the evidence of different categories in accordance with the ACMG guidelines, assuming that the weight coefficient of PVS1 is 1, that of PS is 1/2, that of PM is 1/6, and that of PP is 1/12. We additionally assumed that the weight coefficients of BA1, BS, and BP are –1, –1/2, and –1/4, respectively.
Because the ACMG criteria were not originally designed for machine learning, they are not truly independent variables. Multicollinearity among features is commonly seen within each gene, as is the case for the PM1 and PP2 criteria. By definition, a PM1 hit means that the variant is located in a mutational hotspot or in a critical and well-established functional domain without benign variation, and a PP2 hit means that the variant is a missense variant in a gene that has a low rate of benign missense variation and in which missense variants are a common mechanism of disease. In many situations, PM1 and PP2 agree with each other, and this collinearity increases the risk of weighting the two criteria inappropriately. To detect collinearity among the features, we calculated the variance inflation factor (VIF) and the pairwise correlation coefficient for the ACMG criteria. Features with a VIF greater than 10 or a correlation coefficient larger than 0.8 were removed before the learning procedures.
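The collinearity screen can be sketched as follows, using NumPy only. The thresholds (VIF above 10, correlation above 0.8) come from the text; the function names, the toy data, and the choice to flag the later-listed member of a correlated pair are assumptions of this sketch, because the text does not say which member of a pair was removed.

```python
import numpy as np


def vif(X):
    """VIF of each column: 1 / (1 - R^2), where R^2 comes from regressing
    that column on all other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        ss_res = np.sum((y - A @ coef) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)  # assumes the column is not constant
        r2 = 1.0 - ss_res / ss_tot
        out[j] = np.inf if r2 > 1.0 - 1e-12 else 1.0 / (1.0 - r2)
    return out


def flag_collinear(X, names, vif_max=10.0, r_max=0.8):
    """Return the sorted names of features exceeding either threshold."""
    X = np.asarray(X, dtype=float)
    flagged = {name for name, v in zip(names, vif(X)) if v > vif_max}
    corr = np.corrcoef(X, rowvar=False)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > r_max:
                flagged.add(names[j])  # assumption: flag the later-listed feature
    return sorted(flagged)


# Toy indicators (not study data): PP2 differs from PM1 in one of 20 variants.
pm1 = [1] * 10 + [0] * 10
pp2 = [1] * 9 + [0] * 11
pp3 = [1, 0] * 10
X_toy = np.column_stack([pm1, pp2, pp3])
flagged = flag_collinear(X_toy, ["PM1", "PP2", "PP3"])
print(flagged)
```

In the toy data, PP2 is flagged through its pairwise correlation with PM1 even though no VIF exceeds 10, which shows why the two checks are applied together.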
Learning Procedures and Predictive Modeling
The machine-learning procedure used in this study was a standard logistic regression implemented with the Scikit-learn package in Python. To estimate the weight matrix of the ACMG criteria, all variants, including P/LP and B/LB variants, were taken into account. For predictive modeling, we split the data using a 2-fold random shuffle process: the P/LP and B/LB variants were split randomly into two equally sized sets, with one set serving as training data and the other as testing data, to determine the predictive capability of the model. This process was repeated 20 times to obtain the mean and standard deviation of the accuracy measures, including sensitivity and specificity.
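A minimal sketch of this evaluation loop is given below, assuming scikit-learn's `LogisticRegression` with default regularization and a class-stratified 50/50 split; neither setting is stated in the text. The data are synthetic stand-ins, and `repeated_half_split` is a hypothetical helper name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedShuffleSplit


def repeated_half_split(X, y, n_repeats=20, seed=0):
    """Train on a random half, test on the other half, n_repeats times.

    Returns arrays of per-repeat sensitivity and specificity.
    Stratifying by class is an assumption of this sketch.
    """
    splitter = StratifiedShuffleSplit(
        n_splits=n_repeats, test_size=0.5, random_state=seed
    )
    sens, spec = [], []
    for train_idx, test_idx in splitter.split(X, y):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        tn, fp, fn, tp = confusion_matrix(y[test_idx], pred, labels=[0, 1]).ravel()
        sens.append(tp / (tp + fn))
        spec.append(tn / (tn + fp))
    return np.array(sens), np.array(spec)


# Synthetic data (not study data): 1 = P/LP, 0 = B/LB, three noisy indicators.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=400)
base = np.column_stack([y_demo, y_demo, 1 - y_demo])
noise = rng.random(base.shape) < 0.1  # flip about 10% of the entries
X_demo = np.where(noise, 1 - base, base)

sens, spec = repeated_half_split(X_demo, y_demo)
print(f"sensitivity: {sens.mean():.3f} (SD {sens.std(ddof=1):.3f})")
print(f"specificity: {spec.mean():.3f} (SD {spec.std(ddof=1):.3f})")
```

Fixing `random_state` makes the 20 splits reproducible; a different seed gives different splits and slightly different means.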
Variation in the Weight Matrix of ACMG Criteria Among the Three MODY Genes
Based on the machine-learning procedure, we found that the weight matrix of the ACMG criteria differed significantly among the three MODY genes HNF1A, HNF4A, and GCK. These differences are nontrivial and must be taken into consideration in the clinical interpretation of DNA variants for genetic diagnosis.
| HNF1A P/LP^a | HNF1A B/LB^b | HNF4A P/LP | HNF4A B/LB | GCK P/LP | GCK B/LB | HNF1A | HNF4A | GCK |
^a P/LP: pathogenic/likely pathogenic variant.
^b B/LB: benign/likely benign variant.
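One way to obtain gene-specific weights is to fit a separate logistic regression for each gene and compare the coefficient vectors. The sketch below does this on toy data; `fit_gene_weights`, the gene labels `gene_A` and `gene_B`, and the three-criterion feature set are hypothetical, and the study's actual fitting settings are not given in the text.

```python
import itertools

import numpy as np
from sklearn.linear_model import LogisticRegression

FEATURES = ["PM1", "PM2", "PP3"]  # reduced feature set, for illustration only


def fit_gene_weights(datasets, feature_names):
    """Fit one logistic regression per gene; return {gene: {criterion: weight}}."""
    weights = {}
    for gene, (X, y) in datasets.items():
        model = LogisticRegression(max_iter=1000).fit(X, y)
        weights[gene] = dict(zip(feature_names, model.coef_[0]))
    return weights


# Toy data (not study data): every 0/1 combination of three criteria, 10 copies.
grid = np.array(list(itertools.product([0, 1], repeat=3)) * 10)
datasets = {
    "gene_A": (grid, grid[:, 0]),  # label driven entirely by PM1
    "gene_B": (grid, grid[:, 1]),  # label driven entirely by PM2
}

weights = fit_gene_weights(datasets, FEATURES)
top = {gene: max(w, key=w.get) for gene, w in weights.items()}
print(top)
```

The same criteria receive different weights in the two toy genes because the labels depend on different indicators, which is the kind of gene specificity reported in this section.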
Strong (PS) evidence is rarely observed for the MODY variants, with the exception of PS4 (ie, the prevalence of the variant in affected individuals is significantly increased compared with the prevalence in controls), which is commonly observed but is often misclassified. As an example, the HNF1A variant 12:121420807-G-A (rs1183910) was reported to be associated with C-reactive protein, a marker of inflammation, in a genome-wide association study. However, as a common single nucleotide polymorphism with a minor allele frequency of 0.292 in European populations, this variant cannot cause the rare and dominantly inherited HNF1A-MODY.
With respect to evidence for PM criteria, PM1, which is defined as a variant located in a mutational hotspot or in a critical and well-established functional domain (eg, active site of an enzyme) without benign variation, and PM2 (absent from controls or at extremely low frequency if recessive in Exome Sequencing Project, 1000 Genomes Project, or Exome Aggregation Consortium) are both commonly observed, in support of pathogenic variants in the three MODY genes. However, PM2 is also commonly seen among B/LB variants in these three genes, thus lacking specificity for functional classification. In this study, PM2 showed a VIF of 79.0 in HNF1A and a VIF of 247 in GCK. Therefore, although PM2 is much more common than PM1 for the three MODY genes, the weight of PM2 in HNF1A is lower than that of PM1.
With respect to the evidence for PP criteria, PP2 (missense variant in a gene that has a low rate of benign missense variation and in which missense variants are a common mechanism of disease) is absent in HNF1A and GCK, but is commonly seen in HNF4A. However, PP2 showed a correlation coefficient of 0.932 with PM1, and therefore does not add substantial weight to the classification of P/LP variants in HNF4A.
Highly Accurate Predictive Ability for MODY Gene Pathogenicity
HNF4A-MODY (MODY1), GCK-MODY (MODY2), and HNF1A-MODY (MODY3) are the three most common types of MODY, accounting for ~70% of all MODY cases. Therefore, a predictive model that can accurately recognize pathogenic variants would be useful for the diagnosis of novel mutations in these genes. As described in the Methods section, we used 2-fold random shuffle testing with 50% of the 3600 mutations as training data and the other 50% as testing data, and repeated the analysis 20 times. The logistic regression machine-learning model showed overall accuracy above 95% (1676/1786) for MODY gene mutations ( ). Both HNF1A (true negatives=163, false positives=2) and HNF4A (true negatives=428, false positives=0) had a specificity close to 100%, and the specificity in GCK was also above 95% (true negatives=289, false positives=10). The slightly lower specificity for GCK is consistent with the benign phenotype and mild clinical expression of GCK-MODY.
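The per-gene specificities follow directly from the reported counts, using specificity = true negatives / (true negatives + false positives). The short sketch below only reproduces that arithmetic; the dictionary layout and function name are illustrative.

```python
# (true negatives, false positives) among test-set B/LB variants, as reported above.
reported = {
    "HNF1A": (163, 2),
    "HNF4A": (428, 0),
    "GCK": (289, 10),
}


def specificity(tn, fp):
    """Fraction of benign variants correctly classified as benign."""
    return tn / (tn + fp)


for gene, (tn, fp) in reported.items():
    print(f"{gene}: specificity = {specificity(tn, fp):.3f}")
```

This gives 0.988 for HNF1A, 1.000 for HNF4A, and 0.967 for GCK.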
These results support the principle that the ACMG criteria can be applied as meaningful feature vectors in a machine-learning model, and that such a model could provide accurate, gene-specific pathogenicity classification for other Mendelian disease genes.
Our results highlight the need for applying different weights of the ACMG criteria in the functional classification of DNA variants of different MODY genes. In the past decade, sequencing technologies have evolved rapidly with the advance of high-throughput next-generation sequencing (NGS). By adopting NGS, clinical laboratories are now performing an ever-increasing volume of genetic testing for genetic disorders. However, the increased complexity of genetic testing has been accompanied by new challenges in sequence interpretation, and multiple new standards have been implemented for physicians and genetic counselors regarding the interpretation and reporting of sequence variants at different levels of pathogenicity. Multiple computational tools, based on different algorithms and databases, are currently used to predict the pathogenicity of DNA variants. For coding variants, these include SIFT, MutationTaster [ ], the likelihood ratio test [ ], FATHMM, which uses a supervised machine-learning model [ ], and GERP++, which uses maximum-likelihood evolutionary rate estimation [ ]; DANN uses a deep neural network for both coding and noncoding variants [ ]. However, all of these tools assess every gene with a common rule that ignores gene-specific biology, whereas this study proposes that a gene-specific assessment of pathogenicity is required, at least for MODY genes [ ].
The evolutionary selection pressure on MODY varies across genes and is considered to be lowest for GCK-MODY. Similar issues exist with functional classification based on the ACMG criteria, which are applied globally to all human genes. The ACMG criteria, one of the most commonly used standards, contain 33 terms that lead to five categories of mutations (“pathogenic,” “likely pathogenic,” “uncertain significance,” “likely benign,” and “benign”).
MODY represents a group of dominantly inherited forms of monogenic diabetes, and HNF4A-MODY (MODY1), GCK-MODY (MODY2), and HNF1A-MODY (MODY3) are its three most common subtypes. These MODY genes are involved in different molecular pathways. MODY variants of different genes show different clinical features and thus require different treatments. For example, HNF1A-MODY is characterized by a reduced beta cell mass or impaired function, and has been treated with sulfonylureas for decades with excellent results. Patients with HNF1A-MODY are highly sensitive to sulfonylurea treatment and may be susceptible to developing hypoglycemia during the treatment [ ]. HNF4A-MODY has clinical features similar to those of HNF1A-MODY, and the affected transcription network plays a role in the early development of the pancreas. The pancreatic beta cells produce adequate insulin in infancy, but the capacity for insulin production declines thereafter [ ]. The beta cells in GCK-MODY have a normal capacity to make and secrete insulin, but do so only above an abnormally high glucose threshold, which results in a chronic, mild increase in blood sugar that is usually asymptomatic [ ]. Accordingly, GCK-MODY can be managed with a healthy diet and exercise, whereas oral hypoglycemic agents and insulin are of no benefit for these patients [ ]. Therefore, accurate molecular diagnosis of these MODY subtypes is important for precise treatment.
In conclusion, we applied a computational machine-learning method together with the ACMG criteria for functional classification of genetic variants of the three most common MODY genes, HNF1A, HNF4A and GCK. Our results show that a typical machine-learning model using 15 computational ACMG criteria as the feature vector has predictive abilities that are highly accurate (>95% accuracy) for hundreds of annotated variants in three MODY genes. Therefore, this model could serve as a fast, gene-specific method for physicians or genetic counselors assisting with diagnosis and reporting, especially when confronted by contradictory ACMG criteria. Moreover, we show that the weight of the ACMG criteria exhibits gene specificity, which advocates for the application of machine-learning methods with the ACMG criteria to capture the most relevant information for each disease-related variant.
The study was supported by Institutional Development Funds from the Children’s Hospital of Philadelphia to the Center for Applied Genomics, and The Children’s Hospital of Philadelphia Endowed Chair in Genomic Research to HH. We thank the Center for Applied Genomics staff and support from Children’s Hospital of Philadelphia.
YL and HQ conceptualized and designed the study, drafted the initial manuscript, and reviewed and revised the manuscript. AW, JQ, XC, JG, PS, and LT collected data, carried out the initial analyses, and reviewed and revised the manuscript. HH conceptualized and designed the study, and critically reviewed the manuscript.
Conflicts of Interest
- National Institute of Diabetes and Digestive and Kidney Diseases. URL: https://www.niddk.nih.gov/health-information/diabetes/overview/what-is-diabetes/monogenic-neonatal-mellitus-mody [accessed 2020-05-20]
- Shields BM, Hicks S, Shepherd MH, Colclough K, Hattersley AT, Ellard S. Maturity-onset diabetes of the young (MODY): how many cases are we missing? Diabetologia 2010 Dec 25;53(12):2504-2508.
- Rubio-Cabezas O, Hattersley AT, Njølstad PR, Mlynarski W, Ellard S, White N, International Society for Pediatric Adolescent Diabetes. ISPAD Clinical Practice Consensus Guidelines 2014. The diagnosis and management of monogenic diabetes in children and adolescents. Pediatr Diabetes 2014 Sep 03;15(Suppl 20):47-64.
- Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, ACMG Laboratory Quality Assurance Committee. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med 2015 May;17(5):405-424.
- Li Q, Liu X, Gibbs RA, Boerwinkle E, Polychronakos C, Qu H. Gene-specific function prediction for non-synonymous mutations in monogenic diabetes genes. PLoS One 2014;9(8):e104452.
- Qu H, Wang X, Tian L, Hakonarson H. Application of ACMG criteria to classify variants in the human gene mutation database. J Hum Genet 2019 Nov 26;64(11):1091-1095.
- Luo W, Phung D, Tran T, Gupta S, Rana S, Karmakar C, et al. Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View. J Med Internet Res 2016 Dec 16;18(12):e323.
- Pande A, Mohapatra P, Nicorici A, Han JJ. Machine Learning to Improve Energy Expenditure Estimation in Children With Disabilities: A Pilot Study in Duchenne Muscular Dystrophy. JMIR Rehabil Assist Technol 2016 Jul 19;3(2):e7.
- Pihoker C, Gilliam LK, Ellard S, Dabelea D, Davis C, Dolan LM, SEARCH for Diabetes in Youth Study Group. Prevalence, characteristics and clinical diagnosis of maturity onset diabetes of the young due to mutations in HNF1A, HNF4A, and glucokinase: results from the SEARCH for Diabetes in Youth. J Clin Endocrinol Metab 2013 Oct;98(10):4055-4062.
- Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001 Jan 01;29(1):308-311.
- Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, et al. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 2014 Jan;42(Database issue):D980-D985.
- Stenson PD, Mort M, Ball EV, Evans K, Hayden M, Heywood S, et al. The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum Genet 2017 Jun;136(6):665-677.
- Human Gene Mutation Database. URL: http://www.hgmd.cf.ac.uk/docs/new_back.html [accessed 2020-05-20]
- Bahcall OG. Genetic testing. ACMG guides on the interpretation of sequence variants. Nat Rev Genet 2015 May;16(5):256-257.
- Li Q, Wang K. InterVar: Clinical Interpretation of Genetic Variants by the 2015 ACMG-AMP Guidelines. Am J Hum Genet 2017 Feb 02;100(2):267-280.
- Pedregosa F, Varoquaux G, Gramfort A, Vincent M, Thirion B. Scikit-learn: Machine learning in Python. J Machine Learn Res 2011;12:2825-2830.
- Ligthart S, Vaez A, Hsu Y, Inflammation Working Group of the CHARGE Consortium, PMI-WG-XCP, LifeLines Cohort Study, et al. Bivariate genome-wide association study identifies novel pleiotropic loci for lipids and inflammation. BMC Genomics 2016 Jun 10;17:443.
- Ellard S, Bellanné-Chantelot C, Hattersley AT, European Molecular Genetics Quality Network (EMQN) MODY group. Best practice guidelines for the molecular genetic diagnosis of maturity-onset diabetes of the young. Diabetologia 2008 Apr 23;51(4):546-553.
- Ng PC, Henikoff S. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 2003 Jul 01;31(13):3812-3814.
- Schwarz JM, Cooper DN, Schuelke M, Seelow D. MutationTaster2: mutation prediction for the deep-sequencing age. Nat Methods 2014 Apr;11(4):361-362.
- Chun S, Fay JC. Identification of deleterious mutations within three human genomes. Genome Res 2009 Sep;19(9):1553-1561.
- Rogers MF, Shihab HA, Mort M, Cooper DN, Gaunt TR, Campbell C. FATHMM-XF: accurate prediction of pathogenic point mutations via extended features. Bioinformatics 2018 Feb 01;34(3):511-513.
- Davydov E, Goode D, Sirota M, Cooper G, Sidow A, Batzoglou S. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol 2010 Dec 02;6(12):e1001025.
- Quang D, Chen Y, Xie X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics 2015 Mar 01;31(5):761-763.
- Chakera AJ, Steele AM, Gloyn AL, Shepherd MH, Shields B, Ellard S, et al. Recognition and Management of Individuals With Hyperglycemia Because of a Heterozygous Glucokinase Mutation. Diabetes Care 2015 Jul 23;38(7):1383-1392.
- Pearson ER, Starkey BJ, Powell RJ, Gribble FM, Clark PM, Hattersley AT. Genetic cause of hyperglycaemia and response to treatment in diabetes. Lancet 2003 Oct 18;362(9392):1275-1281.
- Odom DT, Zizlsperger N, Gordon DB, Bell GW, Rinaldi NJ, Murray HL, et al. Control of pancreas and liver gene expression by HNF transcription factors. Science 2004 Feb 27;303(5662):1378-1381.
ACMG: American College of Medical Genetics and Genomics
B/LB: benign/likely benign
HGMD: Human Gene Mutation Database
MODY: maturity-onset diabetes of the young
NGS: next-generation sequencing
P/LP: pathogenic/likely pathogenic
VIF: variance inflation factor
Edited by R Kukafka; submitted 30.05.20; peer-reviewed by C Doogan, F Palmieri; comments to author 22.06.20; revised version received 19.07.20; accepted 03.11.20; published 01.12.20
Copyright
©Yichuan Liu, Hui-Qi Qu, Adam S Wenocur, Jingchun Qu, Xiao Chang, Joseph Glessner, Patrick Sleiman, Lifeng Tian, Hakon Hakonarson. Originally published in JMIR Biomedical Engineering (http://biomedeng.jmir.org), 01.12.2020.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Biomedical Engineering, is properly cited. The complete bibliographic information, a link to the original publication on http://biomedeng.jmir.org/, as well as this copyright and license information must be included.