Imputation of Missing Values in Clinical Research
Currently, the pharmaceutical and scientific communities are highly interested in the functional role of “-omics” data in clinical research. Omics data (e.g., DNA, mRNA, microRNA) help researchers understand pathways and biological processes and identify genetic variation or differentially expressed genes as potential biomarkers for drug target discovery and patient stratification. Omics data, particularly gene expression data, often contain missing values. This missingness can arise from insufficient resolution, image corruption, dust or scratches on the slide, and various other experimental and technical causes, or from a lack of collected tissue or limited funds. Many statistical methods for expression profile analyses require a complete (i.e., non-missing) data set of gene features. If gene features with missing values are excluded from these analyses, the statistical methods may yield biased results and the statistical power of the study may decrease. Choosing an effective imputation method is therefore necessary.
At BioStat Solutions, Inc. (BSSI), we see the importance of estimating missing values accurately and of applying to the imputed data statistical methods that give more statistical power to identify biomarkers while controlling the Type I error rate (i.e., the rate of calling non-significant biomarkers significant). We are also highly interested in imputation methods that incorporate additional covariates, such as demographics, lifestyle, and clinical characteristics of patients, when imputing missing values in order to obtain more accurate and valid estimates.
However, conventional imputation methods have some limitations, such as “failure to account for uncertainty in imputed values, failure to make full use of observed values, possibilities for bias, and artificial low variance” [1]. The K-nearest neighbors (KNN) approach is one imputation method that has been widely used with some effectiveness in omics clinical research [2]. It replaces a patient's missing value with a weighted average of the values observed in the K most similar patients. But the KNN method “replaces a missing value with a single number … and can threaten the validity of study results” [3]. Another method considered effective for estimating missing values in clinical research is multiple imputation. This method replaces each missing value with multiple substitute values, say m. Each set of draws creates one completed data set, so the m imputations for each missing value create m completed data sets. However, the analyses of the completed data sets in multiple imputation treat the imputed data as fully observed and do not consider the imputation-induced dependence [4].
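The conventional weighted KNN step described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the cited studies: the function name `knn_impute`, the root-mean-square distance over jointly observed features, and the inverse-distance weights are all assumptions chosen for the example.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaNs in X (subjects x features) with an inverse-distance weighted
    average over the k most similar subjects that have the feature observed."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        # Candidate donors: other subjects with feature j observed.
        donors = np.array([d for d in range(X.shape[0])
                           if d != i and not np.isnan(X[d, j])], dtype=int)
        dists = np.full(len(donors), np.inf)
        for pos, d in enumerate(donors):
            # Compare subjects only on features observed in both (assumption).
            shared = ~np.isnan(X[i]) & ~np.isnan(X[d])
            if shared.any():
                dists[pos] = np.sqrt(np.mean((X[i, shared] - X[d, shared]) ** 2))
        usable = np.isfinite(dists)
        if not usable.any():
            continue  # no comparable donor: leave the value missing
        donors, dists = donors[usable], dists[usable]
        nearest = np.argsort(dists)[:k]
        weights = 1.0 / (dists[nearest] + 1e-8)  # closer donors count more
        filled[i, j] = np.sum(weights * X[donors[nearest], j]) / np.sum(weights)
    return filled

# Toy data: subject 1 is missing its third feature.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.0, np.nan],
              [10.0, 20.0, 30.0],
              [1.1, 2.1, 3.2]])
completed = knn_impute(X, k=2)
```

Note that the result is a single completed matrix: each missing entry receives one number, which is exactly the limitation quoted above.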
A recent modification to the KNN method makes full use of the observed values, efficiently accounts for additional covariates, and considers uncertainty in the imputed values [1,4]. The modified KNN method, referred to as KNN dependent [4], was motivated by a colorectal cancer study in which microRNA expression was measured in paired tumor-normal samples of hundreds of patients, but data for many normal samples were missing due to lack of tissue availability. The method finds the K most similar subjects based on the subjects' demographic and lifestyle covariates and calculates the imputed value as a weighted average of those K subjects' values. The novelty of the KNN dependent method is that it considers the dependence induced (among imputed and fully observed subjects) by weighted KNN [4].
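The covariate-based neighbor search can be illustrated as follows. This sketch covers only the imputation step, not the dependence adjustment that distinguishes the KNN dependent method. The function name `covariate_knn_impute`, the standardized Euclidean distance, and the inverse-distance weights are assumptions, and each subject's expression profile is assumed to be either fully observed or fully missing.

```python
import numpy as np

def covariate_knn_impute(covariates, expression, k=2):
    """Impute fully missing expression profiles from the k subjects that are
    most similar on covariates. Returns the completed matrix and the weights."""
    covariates = np.asarray(covariates, dtype=float)
    expression = np.asarray(expression, dtype=float)
    # Standardize covariates so no single scale dominates the distance.
    sd = covariates.std(axis=0)
    sd[sd == 0] = 1.0
    z = (covariates - covariates.mean(axis=0)) / sd

    missing = np.isnan(expression).all(axis=1)
    donors = np.where(~missing)[0]
    imputed = expression.copy()
    n = expression.shape[0]
    weights = np.zeros((n, n))  # weights[i, d]: contribution of donor d to subject i
    for i in np.where(missing)[0]:
        dist = np.linalg.norm(z[donors] - z[i], axis=1)
        nearest = np.argsort(dist)[:k]
        w = 1.0 / (dist[nearest] + 1e-8)
        w /= w.sum()
        imputed[i] = w @ expression[donors[nearest]]
        weights[i, donors[nearest]] = w
    return imputed, weights

# Hypothetical covariates (age, smoker flag) and two expression features.
covs = [[30, 0], [31, 0], [60, 1], [61, 1]]
expr = [[1.0, 2.0], [np.nan, np.nan], [5.0, 6.0], [np.nan, np.nan]]
imputed, weights = covariate_knn_impute(covs, expr, k=1)
```

The returned weight matrix records which observed subjects each imputed subject borrows from. Because imputed values are built from observed ones, the two are no longer independent, which is the dependence the KNN dependent method is designed to account for.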
As demonstrated in a colorectal cancer study [4], the KNN dependent method can be applied efficiently to clinical studies with 400 or more subjects in which at least half of the subjects may have partially or completely missing values. The data sets should satisfy at least the missing-at-random (MAR) assumption, i.e., missing values in the data sets are not randomly distributed across all observations but are randomly distributed within one or more subsamples of the data [4].
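A small simulation may help make the missing-at-random assumption concrete. In this sketch the probability that a value is missing depends only on an observed subgroup label, never on the value itself. The function name `simulate_mar`, the two-group design, and the missingness rates of 30% and 70% are illustrative assumptions, not figures from the cited study.

```python
import numpy as np

def simulate_mar(n=400, p_missing_by_group=(0.3, 0.7), seed=0):
    """Simulate values that are missing at random given an observed group label."""
    rng = np.random.default_rng(seed)
    group = rng.integers(0, 2, size=n)   # observed covariate (hypothetical subgroup)
    values = rng.normal(size=n)          # the measurement of interest
    # Missingness rate differs between groups but is random within each group.
    p_missing = np.where(group == 1, p_missing_by_group[1], p_missing_by_group[0])
    is_missing = rng.random(n) < p_missing
    observed = np.where(is_missing, np.nan, values)
    return group, values, observed

group, values, observed = simulate_mar()
for g in (0, 1):
    rate = np.isnan(observed[group == g]).mean()
    print(f"group {g}: {rate:.0%} missing")
```

Across the whole sample the missing values are not evenly spread, since one group loses far more data than the other, but within each group they are a random subset.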
In colorectal cancer studies, the KNN dependent method demonstrated higher precision of imputed values and better control of the Type I error rate than the multiple imputation methods, and better sensitivity than doing no imputation at all. The advantages of the KNN dependent method over multiple imputation become more apparent for larger sample sizes (400 or more subjects) and higher percentages of missingness (50% or more) (see Figure 1) [4].
Figure 1 compares the performance (including power and false discovery rate (FDR) control) of the KNN dependent method with that of weighted KNN ignoring the dependence (KNN independent), multiple imputation techniques using Markov chain Monte Carlo (MCMC) and Expectation-Maximization (EM) algorithms, and the case deletion technique, which considers only fully observed subjects (Case deletion), on simulated data sets with sample sizes of 100, 200, and 400 subjects and missing percentages of 10, 30, and 50. For comparison, the performance of differential expression testing on the full data set (Full), i.e., with no missingness, is also shown in Figure 1.
Figure 1 shows that the power (i.e., the true positive rate (TPR)) increases with larger sample sizes. For 400 subjects and 50% missing values, which are the characteristics of the colorectal cancer study, there are distinct clusters of TPR and FDR values: the Full, KNN dependent, and Case deletion methods each cluster separately, whereas the KNN independent, MCMC, and EM methods are grouped together. Although the KNN dependent method has slightly lower power than the other imputation methods (TPR values in the range of 0.93-0.98 for 400 subjects and 50% missing), it keeps the FDR values below the threshold of 0.05, which is represented by red dotted lines in the figures. The KNN independent method and the MCMC and EM algorithms have the highest power (TPR values in the range of 0.985-1 for 400 subjects and 50% missing) but lack control of the FDR, i.e., their FDR values cross the threshold of 0.05 for all sample sizes and missing percentages. The case deletion method shows the lowest power but maintains control of the FDR for all sample sizes and percentages of missing values [4].
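For readers less familiar with these two metrics, the sketch below shows how TPR and FDR would be computed for one simulated data set in which the truly differentially expressed features are known. The function name `tpr_fdr` and the toy inputs are assumptions for illustration. The conventions of returning NaN when there are no true positives to find and 0 when nothing is called are also choices made here.

```python
import numpy as np

def tpr_fdr(called, truly_de):
    """Return (TPR, FDR) given which features were called significant and
    which are truly differentially expressed (known only in simulation)."""
    called = np.asarray(called, dtype=bool)
    truly_de = np.asarray(truly_de, dtype=bool)
    true_pos = np.sum(called & truly_de)
    false_pos = np.sum(called & ~truly_de)
    # TPR (power): share of truly DE features that were detected.
    tpr = float(true_pos / truly_de.sum()) if truly_de.sum() else float("nan")
    # FDR: share of called features that are not truly DE.
    fdr = float(false_pos / called.sum()) if called.sum() else 0.0
    return tpr, fdr

# Five features: three are called significant, three are truly DE.
called = [1, 1, 1, 0, 0]
truly_de = [1, 1, 0, 1, 0]
tpr, fdr = tpr_fdr(called, truly_de)
print(f"TPR = {tpr:.2f}, FDR = {fdr:.2f}")
```

In this toy case two of the three truly differentially expressed features are detected and one of the three calls is false. An FDR that high would be far above the 0.05 threshold discussed above.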
Depending on the study goals, researchers could select the conventional KNN or multiple imputation methods, which ignore the imputation-induced dependence, to achieve more statistical power (at a nominal false discovery rate threshold of 0.05) but risk a higher proportion of false discoveries. Otherwise, they could use the KNN dependent method, which considers the dependence, and accept a moderate loss of power in exchange for a lower risk of false discoveries. In addition, the KNN dependent method is non-parametric, robust, and computationally efficient.
The highly trained team at BSSI uses our Precision Statistics® approach to address our clients’ needs with sound statistical methods. By implementing up-to-date imputation methods, we accurately predict missing values and account for uncertainty in the imputed data while considering demographic, clinical, general health, genetic, and lifestyle variables, as well as other biologically related information.
1. Stevens J.R., Suyundikov A., and Slattery M.L. Accounting for Missing Data in Clinical Research. JAMA 2016;315(5):517-518.
2. Troyanskaya O., Cantor M., Sherlock G., Brown P., Hastie T., Tibshirani R., et al. Missing value estimation methods for DNA microarrays. Bioinformatics 2001;17(6):520-525.
3. Newgard C.D., Lewis R.J. Missing data: how to best account for what is not known. JAMA 2015;314(9):940-941.
4. Suyundikov A., Stevens J.R., Corcoran C., Herrick J., Wolff R.K., and Slattery M.L. Incorporation of subject-level covariates in quantile normalization of miRNA data. BMC Genomics 2015;16:1045.