A report from UCLA describes how large-scale databases can be used to impute phenotypes that were never recorded. In these databases, phenotypes are often missing for many individuals, which limits how useful the databases are. The group developed a new phenotype imputation method built on the UK Biobank dataset. Their method, dubbed AutoComplete, was more accurate than the next best method, SoftImpute, with the greatest improvements in psychiatric and cardiovascular phenotypes.
Their imputer is significant because it considers not only the relationships between phenotypes and the available data, but also the metadata of the database itself. The authors took into account the "patterns of missingness" in the data, which are often structured because of how the data is gathered and entered into the databases.
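To make "structured missingness" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the toy data, the helper name `copy_mask`, and the column-mean baseline are all hypothetical. The idea is to hide some observed values using missingness patterns borrowed from other individuals, so that an imputer is scored on gaps that look like the real ones rather than on uniformly random gaps.

```python
import numpy as np

def copy_mask(X, rng):
    """Hide observed entries of X using missingness patterns borrowed from other rows.

    Hypothetical helper for illustration only.
    """
    missing = np.isnan(X)
    # Assign every row a donor row via a random permutation of the rows.
    donors = rng.permutation(X.shape[0])
    # Hold out entries that are observed in a row but missing in its donor.
    held_out = ~missing & missing[donors]
    X_masked = X.copy()
    X_masked[held_out] = np.nan
    return X_masked, held_out

rng = np.random.default_rng(0)

# Toy data (assumption): 200 individuals, 6 phenotypes.
X = rng.normal(size=(200, 6))

# Structured missingness: phenotypes 3-5 stand in for one questionnaire,
# so they are missing together for roughly 40% of individuals.
skipped = rng.random(200) < 0.4
X[skipped, 3:] = np.nan

X_masked, held_out = copy_mask(X, rng)

# Baseline imputer: fill each gap with the column mean of the remaining values.
col_means = np.nanmean(X_masked, axis=0)
imputed = np.where(np.isnan(X_masked), col_means, X_masked)

# Score the imputer only on entries whose true values were hidden on purpose.
rmse = np.sqrt(np.mean((imputed[held_out] - X[held_out]) ** 2))
print(f"held-out entries: {held_out.sum()}, column-mean RMSE: {rmse:.3f}")
```

Because the hidden entries inherit the block structure of the real gaps (whole questionnaires missing at once), the evaluation reflects the situation an imputer faces in practice. Any imputation method could be substituted for the column-mean baseline.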
This report illustrates not only the importance of strong computational methods in genetics but also the significance of reporting data properly. Complete datasets are easier to work with and yield more accurate results once the data is processed.
An, U., Pazokitoroudi, A., Alvarez, M. et al. Deep learning-based phenotype imputation on population-scale biobank data increases genetic discoveries. Nat Genet (2023). https://doi.org/10.1038/s41588-023-01558-w
Reference to SoftImpute
Hastie, T., Mazumder, R., Lee, J. D., & Zadeh, R. (2015). Matrix completion and low-rank SVD via fast alternating least squares. The Journal of Machine Learning Research, 16(1), 3367-3402.