ungroup
The ungroup
R package introduces a versatile method for
ungrouping histograms (binned count data) assuming that counts are
Poisson distributed and that the underlying sequence on a fine grid to
be estimated is smooth. The method is based on the composite link model
and estimation is achieved by maximizing a penalized likelihood. Smooth
detailed sequences of counts and rates are so estimated from the binned
counts. Ungrouping binned data can be desirable for many reasons: Bins
can be too coarse to allow for accurate analysis; comparisons can be
hindered when different grouping approaches are used in different
histograms; and the last interval is often wide and open-ended and,
thus, covers a lot of information in the tail area. Age-at-death
distributions grouped in age classes and abridged life tables are
examples of binned data. Because of modest assumptions, the approach is
suitable for many demographic and epidemiological applications. For a
detailed description of the method and applications see Rizzi, Gampe, and Eilers (2015).
The package has two top level functions pclm
and
pclm2D
, two auxiliary functions (control.pclm
and control.pclm2D
), several generic functions
(plot
, summary
, fitted
,
residuals
). A dataset (ungroup.data
) is
provided as well for testing purposes.
All functions are documented in the standard way, which means that
once you load the package using library(ungroup)
you can
just type for example ?pclm
to see the help file.
The PCLM method (Eilers 2007) is based on the composite link model (Thompson and Baker 1981), which extends standard generalized linear models. It implements the idea that the observed counts, interpreted as realizations from Poisson distributions, are indirect observations of a finer (ungrouped) but latent sequence. This latent sequence represents the distribution of expected means on a fine resolution and has to be estimated from the aggregated data. Estimates are obtained by maximizing a penalized likelihood. This maximization is performed efficiently by a version of the iteratively reweighted least-squares algorithm. Optimal values of the smoothing parameter are chosen by minimizing Bayesian or Akaike’s Information Criterion (Hastie and Tibshirani 1990).
This is an example of estimation of the smooth age at death distributions from grouped death counts. First we have to define some grouped data:
# Input data
# x: Age groups
x <- c(0, 1, seq(5, 85, by = 5))
x
## [1] 0 1 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85
# y: Death counts in the age group
y <- c(294, 66, 32, 44, 170, 284, 287, 293, 361, 600, 998,
1572, 2529, 4637, 6161, 7369, 10481, 15293, 39016)
# offset: Population exposed to risk in the age group
offset <- c(114, 440, 509, 492, 628, 618, 576, 580, 634, 657,
631, 584, 573, 619, 530, 384, 303, 245, 249) * 1000
# nlast: the size of the last age interval (usually open)
nlast <- 26
# This results in the last group being [85, 110).
The model can be fitted using pclm
function:
It generates different types of output stored in the created object.
See pclm
help page for detailed information about the
output list (?pclm
).
summary(M1)
##
## Penalized Composite Link Model (PCLM)
##
## Call:
## pclm(x = x, y = y, nlast = nlast)
##
## PCLM Type : Univariate
## Number of input groups : 19
## Number of fitted values : 111
## Length of estimate bins : 1
## Smoothing parameter lambda : 0
## B-splines intervals/knot (kr): 2
## B-splines degree (deg) : 3
## AIC : 39.97
## BIC : 59.81
Generic plot:
By default pclm
ungroups data in intervals of length
1
. If higher granularity is required out.step
argument can be used to specify this. For example, obtaining groups 222
groups of length 0.5 one can try:
For controlling the PCLM fitting process control.pclm
provides several options. The list of arguments needs to be specified
using the control
argument. For example, if we want to
optimize the smoothing parameters in order to obtain a fit characterized
by the small AIC
level one can write:
The PCLM can be extended to a two-dimensional regression problem. The
two-dimensional Penalized Composite Link Model to ungroup simultaneously
coarse distributions for adjacent years can be fitted using
pclm2D
function, and the structure of the functions works
as pclm
. See the examples provided in the help page. Note
that pclm2D
might be slower, depending on the data and
model specification provided in the functions.
The two-dimensional regression analysis combines two approaches: the PCLM for ungrouping in one dimension and two-dimensional smoothing with P-splines (Currie, Durban, and Eilers 2004). As an example we can ungroup age-specific distributions from the coarsely grouped data and smooth across adjacent calendar years to estimate both detailed age-at-death distributions and mortality time trends.
We thank Paul H.C. Eilers who provided insight and expertise that greatly supported the creation of this R package; and Catalina Torres and Tim Riffe for testing and offering feedback on the early versions of the software.
The authors are also grateful to the following institutions for their support: