--- title: "Introduction to `ungroup`" author: "Marius D. Pascariu, Maciej J. Dańko, Jonas Schöley and Silvia Rizzi" date: "September 1, 2018" output: pdf_document: highlight: tango number_sections: yes toc: no fontsize: 11pt geometry: margin=1.1in bibliography: REFERENCES.bib biblio-style: apalike link-citations: true vignette: > %\VignetteIndexEntry{Introducing ungroup} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} library(knitr) opts_chunk$set(collapse = TRUE) ``` # Abstract The `ungroup` R package introduces a versatile method for ungrouping histograms (binned count data) assuming that counts are Poisson distributed and that the underlying sequence on a fine grid to be estimated is smooth. The method is based on the composite link model and estimation is achieved by maximizing a penalized likelihood. Smooth detailed sequences of counts and rates are so estimated from the binned counts. Ungrouping binned data can be desirable for many reasons: Bins can be too coarse to allow for accurate analysis; comparisons can be hindered when different grouping approaches are used in different histograms; and the last interval is often wide and open-ended and, thus, covers a lot of information in the tail area. Age-at-death distributions grouped in age classes and abridged life tables are examples of binned data. Because of modest assumptions, the approach is suitable for many demographic and epidemiological applications. For a detailed description of the method and applications see @rizzi2015. # Package Structure The package has two top level functions `pclm` and `pclm2D`, two auxiliary functions (`control.pclm` and `control.pclm2D`), several generic functions (`plot`, `summary`, `fitted`, `residuals`). A dataset (`ungroup.data`) is provided as well for testing purposes. All functions are documented in the standard way, which means that once you load the package using `library(ungroup)` you can just type for example `?pclm` to see the help file. ```{r} # Load the package library(ungroup) ``` # Usage ## Univariate Penalized Composite Link Model (PCLM) The PCLM method [@eilers2007] is based on the composite link model [@thompson1981], which extends standard generalized linear models. It implements the idea that the observed counts, interpreted as realizations from Poisson distributions, are indirect observations of a finer (ungrouped) but latent sequence. This latent sequence represents the distribution of expected means on a fine resolution and has to be estimated from the aggregated data. Estimates are obtained by maximizing a penalized likelihood. This maximization is performed efficiently by a version of the iteratively reweighted least-squares algorithm. Optimal values of the smoothing parameter are chosen by minimizing Bayesian or Akaike's Information Criterion [@hastie1990]. This is an example of estimation of the smooth age at death distributions from grouped death counts. First we have to define some grouped data: ```{r} # Input data # x: Age groups x <- c(0, 1, seq(5, 85, by = 5)) x # y: Death counts in the age group y <- c(294, 66, 32, 44, 170, 284, 287, 293, 361, 600, 998, 1572, 2529, 4637, 6161, 7369, 10481, 15293, 39016) # offset: Population exposed to risk in the age group offset <- c(114, 440, 509, 492, 628, 618, 576, 580, 634, 657, 631, 584, 573, 619, 530, 384, 303, 245, 249) * 1000 # nlast: the size of the last age interval (usually open) nlast <- 26 # This results in the last group being [85, 110). ``` ### Fitting and ungrouping data using PCLM The model can be fitted using `pclm` function: ```{r, message=FALSE, results='hide'} M1 <- pclm(x, y, nlast) ``` ### Output It generates different types of output stored in the created object. See `pclm` help page for detailed information about the output list (`?pclm`). ```{r} ls(M1) ``` ### Summary ```{r} summary(M1) ``` ### plot.pclm Generic plot: ```{r, fig.align='center', fig.asp=0.8, out.width = '60%'} plot(M1) # Print first 6 fitted values fitted(M1)[1:6] ``` ### out.step By default `pclm` ungroups data in intervals of length `1`. If higher granularity is required `out.step` argument can be used to specify this. For example, obtaining groups 222 groups of length 0.5 one can try: ```{r, message=FALSE, results='hide', fig.align='center', fig.asp=0.8, out.width = '60%'} M2 <- pclm(x, y, nlast, out.step = 0.5) plot(M2) ``` ```{r} # Print first 6 fitted values fitted(M2)[1:6] # Number of fitted values length(fitted(M2)) ``` ### control.pclm For controlling the PCLM fitting process `control.pclm` provides several options. The list of arguments needs to be specified using the `control` argument. For example, if we want to optimize the smoothing parameters in order to obtain a fit characterized by the small `AIC` level one can write: ```{r, eval=FALSE} # Optimise smoothing parameter: lambda, kr and deg M3 <- pclm(x, y, nlast, control = list(lambda = NA, opt.method = "AIC")) ``` ### Offset The `offset` argument can be used to estimate smooth death rates. `offset` must be a vector of the same length as `y`. ```{r, message=FALSE, results='hide'} M5 <- pclm(x, y, nlast, offset) ``` Generic plot: ```{r, fig.align='center', fig.asp=0.8, out.width = '60%'} plot(M5, type = "s") ``` ## Two-dimensional Penalized Composite Link Model The PCLM can be extended to a two-dimensional regression problem. The two-dimensional Penalized Composite Link Model to ungroup simultaneously coarse distributions for adjacent years can be fitted using `pclm2D` function, and the structure of the functions works as `pclm`. See the examples provided in the help page. Note that `pclm2D` might be slower, depending on the data and model specification provided in the functions. The two-dimensional regression analysis combines two approaches: the PCLM for ungrouping in one dimension and two-dimensional smoothing with P-splines [@currie2004]. As an example we can ungroup age-specific distributions from the coarsely grouped data and smooth across adjacent calendar years to estimate both detailed age-at-death distributions and mortality time trends. # Acknowledgment We thank Paul H.C. Eilers who provided insight and expertise that greatly supported the creation of this R package; and Catalina Torres and Tim Riffe for testing and offering feedback on the early versions of the software. The authors are also grateful to the following institutions for their support: * University of Southern Denmark; * Max-Planck Institute for Demographic Research; * SCOR Corporate Foundation for Science. # References