--- title: "Analysis of Multiply Imputed Data Sets" output: rmarkdown::html_vignette: css: "css/vignette.css" vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Analysis of Multiply Imputed Data Sets} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE, cache=FALSE} library(knitr) set.seed(123) options(width=87) opts_chunk$set(background="#ffffff", comment="#", collapse=FALSE, fig.width=9, fig.height=9, warning=FALSE, message=FALSE) ``` This vignette is intended to provide an overview of the analysis of multiply imputed data sets with `mitml`. Specifically, this vignette addresses the following topics: 1. Working with multiply imputed data sets 2. Rubin's rules for pooling individual parameters 3. Model comparisons 4. Parameter constraints Further information can be found in the other [vignettes](https://github.com/simongrund1/mitml/wiki) and the package [documentation](https://cran.r-project.org/package=mitml/mitml.pdf). ## Example data (`studentratings`) For the purposes of this vignette, we make use of the `studentratings` data set, which contains simulated data from 750 students in 50 schools including scores on reading and math achievement, socioeconomic status (SES), and ratings on school and classroom environment. The package and the data set can be loaded as follows. ```{r} library(mitml) library(lme4) data(studentratings) ``` As evident from its `summary`, most variables in the data set contain missing values. ```{r} summary(studentratings) ``` In the present example, we investigate the differences in mathematics achievement that can be attributed to differences in SES when controlling for students' sex. Specifically, we are interested in the following model. $$ \mathit{MA}_{ij} = \gamma_{00} + \gamma_{10} \mathit{Sex}_{ij} + \gamma_{20} (\mathit{SES}_{ij}-\overline{\mathit{SES}}_{\bullet j}) + \gamma_{01} \overline{\mathit{SES}}_{\bullet j} + u_{0j} + e_{ij} $$ Note that this model also employs group-mean centering to separate the individual and group-level effects of SES. ## Generating imputations In the present example, we generate 20 imputations from the following imputation model. ```{r, results="hide"} fml <- ReadDis + SES ~ 1 + Sex + (1|ID) imp <- panImpute(studentratings, formula = fml, n.burn = 5000, n.iter = 200, m = 20) ``` The completed data are then extracted with `mitmlComplete`. ```{r} implist <- mitmlComplete(imp, "all") ``` ## Transforming the imputed data sets In empirical research, the raw data rarely enter the analyses but often require to be transformed beforehand. For this purpose, the `mitml` package provides the `within` function, which applies a given transformation directly to each data set. In the following, we use this to (a) calculate the group means of SES and (b) center the individual scores around their group means. ```{r} implist <- within(implist, { G.SES <- clusterMeans(SES, ID) # calculate group means I.SES <- SES - G.SES # center around group means }) ``` This method can be used to apply arbitrary transformations to all of the completed data sets simultaneously. > **Note regarding** `dplyr`**:** > Due to how it is implemented, `within` cannot be used directly with `dplyr`. > Instead, users may use `with` instead of `within` with the following workaround. >```{r, eval=FALSE} >implist <- with(implist,{ > df <- data.frame(as.list(environment())) > df <- ... # dplyr commands > df >}) >implist <- as.mitml.list(implist) >``` > Advanced users may also consider using `lapply` for a similar workaround.` ## Fitting the analysis model In order to analyze the imputed data, each data set is analyzed using regular complete-data techniques. For this purpose, `mitml` offers the `with` function. In the present example, we use it to fit the model of interest with the R package `lme4`. ```{r} fit <- with(implist, { lmer(MathAchiev ~ 1 + Sex + I.SES + G.SES + (1|ID)) }) ``` This results in a list of fitted models, one for each of the imputed data sets. ## Pooling The results obtained from the imputed data sets must be pooled in order to obtain a set of final parameter estimates and inferences. In the following, we employ a number of different pooling methods that can be used to address common statistical tasks, for example, for (a) estimating and testing individual parameters, (b) model comparisons, and (c) tests of constraints about one or several parameters. #### Parameter estimates Individual parameters are commonly pooled with the rules developed by Rubin (1987). In `mitml`, Rubin's rules are implemented in the `testEstimates` function. ```{r} testEstimates(fit) ``` In addition, the argument `extra.pars = TRUE` can be used to obtain pooled estimates of variance components, and `df.com` can be used to specify the complete-data degrees of freedom, which provides more appropriate (i.e., conservative) inferences in smaller samples. For example, using a conservative value for the complete-data degrees of freedom for the fixed effects in the model of interest (Snijders & Bosker, 2012), the output changes as follows. ```{r} testEstimates(fit, extra.pars = TRUE, df.com = 46) ``` #### Multiple parameters and model comparisons Oftentimes, statistical inference concerns more than one parameter at a time. For example, the combined influence of SES (within and between groups) on mathematics achievement is represented by two parameters in the model of interest. Multiple pooling methods for Wald and likelihood ratio tests (LRTs) are implemented in the `testModels` function. This function requires the specification of a full model and a restricted model, which are then compared using (pooled) Wald tests or LRTs. Specifically, `testModels` allows users to pool Wald tests ($D_1$), $\chi^2$ test statistics ($D_2$), and LRTs ($D_3$ and $D_4$; for a comparison of these methods, see also Grund, Lüdtke, & Robitzsch, 2016b). To examine the combined influence of SES on mathematics achievement, the following restricted model can be specified and compared with the model of interest (using $D_1$). ```{r} fit.null <- with(implist, { lmer(MathAchiev ~ 1 + Sex + (1|ID)) }) testModels(fit, fit.null) ``` > **Note regarding the order of arguments:** > Please note that `testModels` expects that the first argument contains the full model, and the second argument contains the restricted model. > If the order of the arguments is reversed, the results will not be interpretable. Similar to the test for individual parameters, smaller samples can be accommodated with `testModels` (with method $D_1$) by specifying the complete-data degrees of freedom for the denominator of the $F$ statistic. ```{r} testModels(fit, fit.null, df.com = 46) ``` The pooling method used by `testModels` is determined by the `method` argument. For example, to calculate the pooled LRT corresponding to the Wald test above (i.e., $D_3$), the following command can be issued. ```{r} testModels(fit, fit.null, method="D3") ``` #### Constraints on parameters Finally, it is often useful to investigate functions (or constraints) of the parameters in the model of interest. In complete data sets, this can be achieved with a test of linear hypotheses or the delta method. The `mitml` package implements a pooled version of the delta method in the `testConstraints` function. For example, the combined influence of SES on mathematics achievement can also be tested without model comparisons by testing the constraint that the parameters pertaining to `I.SES` and `G.SES` are both zero. This constraint is defined and tested as follows. ```{r} c1 <- c("I.SES", "G.SES") testConstraints(fit, constraints = c1) ``` This test is identical to the Wald test given in the previous section. Arbitrary constraints on the parameters can be specified and tested in this manner, where each character string denotes an expression to be tested against zero. In the present example, we are also interested in the *contextual* effect of SES on mathematics achievement (e.g., Snijders & Bosker, 2012). The contextual effect is simply the difference between the coefficients pertaining to `G.SES` and `I.SES` and can be tested as follows. ```{r} c2 <- c("G.SES - I.SES") testConstraints(fit, constraints = c2) ``` Similar to model comparisons, constraints can be tested with different methods ($D_1$ and $D_2$) and can accommodate smaller samples by a value for `df.com`. Further examples for the analysis of multiply imputed data sets with `mitml` are given by Enders (2016) and Grund, Lüdtke, and Robitzsch (2016a). ###### References Enders, C. K. (2016). Multiple imputation as a flexible tool for missing data handling in clinical research. *Behaviour Research and Therapy*. doi: 10.1016/j.brat.2016.11.008 ([Link](https://doi.org/10.1016/j.brat.2016.11.008)) Grund, S., Lüdtke, O., & Robitzsch, A. (2016a). Multiple imputation of multilevel missing data: An introduction to the R package pan. *SAGE Open*, *6*(4), 1–17. doi: 10.1177/2158244016668220 ([Link](https://doi.org/10.1177/2158244016668220)) Grund, S., Lüdtke, O., & Robitzsch, A. (2016b). Pooling ANOVA results from multiply imputed datasets: A simulation study. *Methodology*, *12*, 75–88. doi: 10.1027/1614-2241/a000111 ([Link](https://doi.org/10.1027/1614-2241/a000111)) Rubin, D. B. (1987). *Multiple imputation for nonresponse in surveys*. Hoboken, NJ: Wiley. Snijders, T. A. B., & Bosker, R. J. (2012). *Multilevel analysis: An introduction to basic and advanced multilevel modeling*. Thousand Oaks, CA: Sage. --- ```{r, echo=F} cat("Author: Simon Grund (simon.grund@uni-hamburg.de)\nDate: ", as.character(Sys.Date())) ```