--- title: "Testing hierarchical parent degradation kinetics with residue data on dimethenamid and dimethenamid-P" author: Johannes Ranke date: Last change on 5 January 2023, last compiled on `r format(Sys.time(), "%e %B %Y")` geometry: margin=2cm toc: true bibliography: references.bib output: pdf_document: extra_dependencies: ["float", "listing"] --- ```{r setup, echo = FALSE, cache = FALSE} options(width = 80) # For summary listings knitr::opts_chunk$set( comment = "", tidy = FALSE, cache = TRUE, fig.pos = "H", fig.align = "center" ) ``` \clearpage # Introduction The purpose of this document is to demonstrate how nonlinear hierarchical models (NLHM) based on the parent degradation models SFO, FOMC, DFOP and HS can be fitted with the mkin package. It was assembled in the course of work package 1.1 of Project Number 173340 (Application of nonlinear hierarchical models to the kinetic evaluation of chemical degradation data) of the German Environment Agency carried out in 2022 and 2023. The mkin package is used in version `r packageVersion("mkin")`. It contains the test data and the functions used in the evaluations. The `saemix` package is used as a backend for fitting the NLHM, but is also loaded to make the convergence plot function available. This document is processed with the `knitr` package, which also provides the `kable` function that is used to improve the display of tabular data in R markdown documents. For parallel processing, the `parallel` package is used. ```{r packages, cache = FALSE, message = FALSE} library(mkin) library(knitr) library(saemix) library(parallel) n_cores <- detectCores() if (Sys.info()["sysname"] == "Windows") { cl <- makePSOCKcluster(n_cores) } else { cl <- makeForkCluster(n_cores) } ``` \clearpage # Data The test data are available in the mkin package as an object of class `mkindsg` (mkin dataset group) under the identifier `dimethenamid_2018`. The following preprocessing steps are still necessary: - The data available for the enantiomer dimethenamid-P (DMTAP) are renamed to have the same substance name as the data for the racemic mixture dimethenamid (DMTA). The reason for this is that no difference between their degradation behaviour was identified in the EU risk assessment. - The data for transformation products and unnecessary columns are discarded - The observation times of each dataset are multiplied with the corresponding normalisation factor also available in the dataset, in order to make it possible to describe all datasets with a single set of parameters that are independent of temperature - Finally, datasets observed in the same soil (`Elliot 1` and `Elliot 2`) are combined, resulting in dimethenamid (DMTA) data from six soils. The following commented R code performs this preprocessing. ```{r data} # Apply a function to each of the seven datasets in the mkindsg object to create a list dmta_ds <- lapply(1:7, function(i) { ds_i <- dimethenamid_2018$ds[[i]]$data # Get a dataset ds_i[ds_i$name == "DMTAP", "name"] <- "DMTA" # Rename DMTAP to DMTA ds_i <- subset(ds_i, name == "DMTA", c("name", "time", "value")) # Select data ds_i$time <- ds_i$time * dimethenamid_2018$f_time_norm[i] # Normalise time ds_i # Return the dataset }) # Use dataset titles as names for the list elements names(dmta_ds) <- sapply(dimethenamid_2018$ds, function(ds) ds$title) # Combine data for Elliot soil to obtain a named list with six elements dmta_ds[["Elliot"]] <- rbind(dmta_ds[["Elliot 1"]], dmta_ds[["Elliot 2"]]) # dmta_ds[["Elliot 1"]] <- NULL dmta_ds[["Elliot 2"]] <- NULL ``` \clearpage The following tables show the `r length(dmta_ds)` datasets. ```{r results = "asis"} for (ds_name in names(dmta_ds)) { print(kable(mkin_long_to_wide(dmta_ds[[ds_name]]), caption = paste("Dataset", ds_name), label = paste0("tab:", ds_name), booktabs = TRUE)) cat("\n\\clearpage\n") } ``` # Separate evaluations In order to obtain suitable starting parameters for the NLHM fits, separate fits of the four models to the data for each soil are generated using the `mmkin` function from the `mkin` package. In a first step, constant variance is assumed. Convergence is checked with the `status` function. ```{r f-sep-const, dependson = "data"} deg_mods <- c("SFO", "FOMC", "DFOP", "HS") f_sep_const <- mmkin( deg_mods, dmta_ds, error_model = "const", quiet = TRUE) status(f_sep_const) |> kable() ``` In the table above, OK indicates convergence, and C indicates failure to converge. All separate fits with constant variance converged, with the sole exception of the HS fit to the BBA 2.2 data. To prepare for fitting NLHM using the two-component error model, the separate fits are updated assuming two-component error. ```{r f-sep-tc, dependson = "f-sep-const"} f_sep_tc <- update(f_sep_const, error_model = "tc") status(f_sep_tc) |> kable() ``` Using the two-component error model, the one fit that did not converge with constant variance did converge, but other non-SFO fits failed to converge. \clearpage # Hierarchichal model fits The following code fits eight versions of hierarchical models to the data, using SFO, FOMC, DFOP and HS for the parent compound, and using either constant variance or two-component error for the error model. The default parameter distribution model in mkin allows for variation of all degradation parameters across the assumed population of soils. In other words, each degradation parameter is associated with a random effect as a first step. The `mhmkin` function makes it possible to fit all eight versions in parallel (given a sufficient number of computing cores being available) to save execution time. Convergence plots and summaries for these fits are shown in the appendix. ```{r f-saem, dependson = c("f-sep-const", "f-sep-tc")} f_saem <- mhmkin(list(f_sep_const, f_sep_tc), transformations = "saemix") ``` The output of the `status` function shows that all fits terminated successfully. ```{r dependson = "f-saem"} status(f_saem) |> kable() ``` The AIC and BIC values show that the biphasic models DFOP and HS give the best fits. ```{r dependson = "f-saem"} anova(f_saem) |> kable(digits = 1) ``` The DFOP model is preferred here, as it has a better mechanistic basis for batch experiments with constant incubation conditions. Also, it shows the lowest AIC and BIC values in the first set of fits when combined with the two-component error model. Therefore, the DFOP model was selected for further refinements of the fits with the aim to make the model fully identifiable. ## Parameter identifiability based on the Fisher Information Matrix Using the `illparms` function, ill-defined statistical model parameters such as standard deviations of the degradation parameters in the population and error model parameters can be found. ```{r dependson = "f-saem"} illparms(f_saem) |> kable() ``` According to the `illparms` function, the fitted standard deviation of the second kinetic rate constant `k2` is ill-defined in both DFOP fits. This suggests that different values would be obtained for this standard deviation when using different starting values. The thus identified overparameterisation is addressed by removing the random effect for `k2` from the parameter model. ```{r f-saem-dfop-tc-no-ranef-k2, dependson = "f-saem"} f_saem_dfop_tc_no_ranef_k2 <- update(f_saem[["DFOP", "tc"]], no_random_effect = "k2") ``` For the resulting fit, it is checked whether there are still ill-defined parameters, ```{r f-saem-dfop-tc-no-ranef-k2-illparms, dependson = "f-saem-dfop-tc-no-ranef-k2"} illparms(f_saem_dfop_tc_no_ranef_k2) ``` which is not the case. Below, the refined model is compared with the previous best model. The model without random effect for `k2` is a reduced version of the previous model. Therefore, the models are nested and can be compared using the likelihood ratio test. This is achieved with the argument `test = TRUE` to the `anova` function. ```{r f-saem-dfop-tc-no-ranef-k2-comparison, dependson = "f-saem-dfop-tc-no-ranef-k2"} anova(f_saem[["DFOP", "tc"]], f_saem_dfop_tc_no_ranef_k2, test = TRUE) |> kable(format.args = list(digits = 4)) ``` The AIC and BIC criteria are lower after removal of the ill-defined random effect for `k2`. The p value of the likelihood ratio test is much greater than 0.05, indicating that the model with the higher likelihood (here the model with random effects for all degradation parameters `f_saem[["DFOP", "tc"]]`) does not fit significantly better than the model with the lower likelihood (the reduced model `f_saem_dfop_tc_no_ranef_k2`). Therefore, AIC, BIC and likelihood ratio test suggest the use of the reduced model. The convergence of the fit is checked visually. ```{r convergence-saem-dfop-tc-no-ranef-k2, fig.cap = "Convergence plot for the NLHM DFOP fit with two-component error and without a random effect on 'k2'", dependson = "f-saem-dfop-tc-no-ranef-k2", echo = FALSE, fig.width = 9, fig.height = 8} plot(f_saem_dfop_tc_no_ranef_k2$so, plot.type = "convergence") ``` All parameters appear to have converged to a satisfactory degree. The final fit is plotted using the plot method from the mkin package. ```{r plot-saem-dfop-tc-no-ranef-k2, fig.cap = "Plot of the final NLHM DFOP fit", dependson = "f-saem-dfop-tc-no-ranef-k2", fig.width = 9, fig.height = 5} plot(f_saem_dfop_tc_no_ranef_k2) ``` Finally, a summary report of the fit is produced. ```{r summary-saem-dfop-tc-no-ranef-k2, dependson = "f-saem-dfop-tc-no-ranef-k2"} summary(f_saem_dfop_tc_no_ranef_k2) ``` \clearpage ## Alternative check of parameter identifiability The parameter check used in the `illparms` function is based on a quadratic approximation of the likelihood surface near its optimum, which is calculated using the Fisher Information Matrix (FIM). An alternative way to check parameter identifiability [@duchesne_2021] based on a multistart approach has recently been implemented in mkin. The graph below shows boxplots of the parameters obtained in 50 runs of the saem algorithm with different parameter combinations, sampled from the range of the parameters obtained for the individual datasets fitted separately using nonlinear regression. ```{r multistart-full, dependson = "f-saem"} f_saem_dfop_tc_multi <- multistart(f_saem[["DFOP", "tc"]], n = 50, cores = 15) ``` ```{r multistart-full-par, dependson = "multistart_full", fig.cap = "Scaled parameters from the multistart runs, full model", fig.width = 10, fig.height = 6} par(mar = c(6.1, 4.1, 2.1, 2.1)) parplot(f_saem_dfop_tc_multi, lpos = "bottomright", ylim = c(0.3, 10), las = 2) ``` The graph clearly confirms the lack of identifiability of the variance of `k2` in the full model. The overparameterisation of the model also indicates a lack of identifiability of the variance of parameter `g`. The parameter boxplots of the multistart runs with the reduced model shown below indicate that all runs give similar results, regardless of the starting parameters. ```{r multistart-reduced, dependson = "f-saem-dfop-tc-no-ranef-k2"} f_saem_dfop_tc_no_ranef_k2_multi <- multistart(f_saem_dfop_tc_no_ranef_k2, n = 50, cores = 15) ``` ```{r multistart-reduced-par, dependson = "multistart_reduced", fig.cap = "Scaled parameters from the multistart runs, reduced model", fig.width = 10, fig.height = 5} par(mar = c(6.1, 4.1, 2.1, 2.1)) parplot(f_saem_dfop_tc_no_ranef_k2_multi, ylim = c(0.5, 2), las = 2, lpos = "bottomright") ``` When only the parameters of the top 25% of the fits are shown (based on a feature introduced in mkin 1.2.2 currently under development), the scatter is even less as shown below. ```{r multistart-reduced-par-llquant, dependson = "multistart_reduced", fig.cap = "Scaled parameters from the multistart runs, reduced model, fits with the top 25\\% likelihood values", fig.width = 10, fig.height = 5} par(mar = c(6.1, 4.1, 2.1, 2.1)) parplot(f_saem_dfop_tc_no_ranef_k2_multi, ylim = c(0.5, 2), las = 2, llquant = 0.25, lpos = "bottomright") ``` \clearpage # Conclusions Fitting the four parent degradation models SFO, FOMC, DFOP and HS as part of hierarchical model fits with two different error models and normal distributions of the transformed degradation parameters works without technical problems. The biphasic models DFOP and HS gave the best fit to the data, but the default parameter distribution model was not fully identifiable. Removing the random effect for the second kinetic rate constant of the DFOP model resulted in a reduced model that was fully identifiable and showed the lowest values for the model selection criteria AIC and BIC. The reliability of the identification of all model parameters was confirmed using multiple starting values. # Acknowledgements The helpful comments by Janina Wöltjen of the German Environment Agency are gratefully acknowledged. # References \vspace{1em} ::: {#refs} ::: \clearpage # Appendix ## Hierarchical model fit listings ```{r listings, results = "asis", echo = FALSE} for (deg_mod in deg_mods) { for (err_mod in c("const", "tc")) { caption <- paste("Hierarchical mkin fit of the", deg_mod, "model with error model", err_mod) summary_listing(f_saem[[deg_mod, err_mod]], caption) } } ``` ## Hierarchical model convergence plots ```{r convergence-saem-sfo-const, fig.cap = "Convergence plot for the NLHM SFO fit with constant variance", dependson = "f-saem", echo = FALSE, fig.width = 9, fig.height = 5} plot(f_saem[["SFO", "const"]]$so, plot.type = "convergence") ``` \clearpage ```{r convergence-saem-sfo-tc, fig.cap = "Convergence plot for the NLHM SFO fit with two-component error", dependson = "f-saem", echo = FALSE, fig.width = 9, fig.height = 5} plot(f_saem[["SFO", "tc"]]$so, plot.type = "convergence") ``` \clearpage ```{r convergence-saem-fomc-const, fig.cap = "Convergence plot for the NLHM FOMC fit with constant variance", dependson = "f-saem", echo = FALSE, fig.width = 9, fig.height = 8} plot(f_saem[["FOMC", "const"]]$so, plot.type = "convergence") ``` \clearpage ```{r convergence-saem-fomc-tc, fig.cap = "Convergence plot for the NLHM FOMC fit with two-component error", dependson = "f-saem", echo = FALSE, fig.width = 9, fig.height = 8} plot(f_saem[["FOMC", "tc"]]$so, plot.type = "convergence") ``` \clearpage ```{r convergence-saem-dfop-const, fig.cap = "Convergence plot for the NLHM DFOP fit with constant variance", dependson = "f-saem", echo = FALSE, fig.width = 9, fig.height = 8} plot(f_saem[["DFOP", "const"]]$so, plot.type = "convergence") ``` \clearpage ```{r convergence-saem-dfop-tc, fig.cap = "Convergence plot for the NLHM DFOP fit with two-component error", dependson = "f-saem", echo = FALSE, fig.width = 9, fig.height = 8} plot(f_saem[["DFOP", "tc"]]$so, plot.type = "convergence") ``` \clearpage ```{r convergence-saem-hs-const, fig.cap = "Convergence plot for the NLHM HS fit with constant variance", dependson = "f-saem", echo = FALSE, fig.width = 9, fig.height = 8} plot(f_saem[["HS", "const"]]$so, plot.type = "convergence") ``` \clearpage ```{r convergence-saem-hs-tc, fig.cap = "Convergence plot for the NLHM HS fit with two-component error", dependson = "f-saem", echo = FALSE, fig.width = 9, fig.height = 8} plot(f_saem[["HS", "tc"]]$so, plot.type = "convergence") ``` \clearpage ## Session info ```{r, echo = FALSE} parallel::stopCluster(cl) sessionInfo() ``` ## Hardware info ```{r, echo = FALSE} if(!inherits(try(cpuinfo <- readLines("/proc/cpuinfo")), "try-error")) { cat(gsub("model name\t: ", "CPU model: ", cpuinfo[grep("model name", cpuinfo)[1]])) } if(!inherits(try(meminfo <- readLines("/proc/meminfo")), "try-error")) { cat(gsub("model name\t: ", "System memory: ", meminfo[grep("MemTotal", meminfo)[1]])) } ```