Dealing with ugly data: Generalized Estimating Equations (GEE)

Recently I’ve been struggling with incorporating auto-correlation into analyses. Auto-correlation can be accounted for with relative ease when your data are normally distributed or can be transformed to be normally distributed. However, if you’re anything like me, you are rarely so lucky. My data are generally proportions, presence/absence (binary), or counts. In most instances my response of interest is over-dispersed, highly zero-inflated, and utterly impervious to transformations. In short, it’s ugly.

Enter Generalized Estimating Equations (GEEs). Before I delve into the wonders that are GEEs, a caveat – I’m an ecology graduate student trying to navigate the rapidly expanding world of statistics. I am by no means a statistician. As such I’m going to limit my discussion to the general strengths and weaknesses of GEEs. For a more thorough discussion including the statistical nitty-gritty, I would recommend Zuur et al. (2009), Hocking (2012), and references therein.

First introduced by Liang and Zeger (1986), GEEs extend generalized linear models (GLMs) by incorporating a working correlation structure for observations within the same subject or cluster. Simply put, this means GEEs can accommodate both auto-correlated and non-normal data. Finally, a way to deal with my ugly data.
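
For readers who want to see the idea in symbols, here is a rough sketch of the estimating equation in a commonly used modern notation. The symbols are assumptions of this sketch, not taken from the post: y_i and mu_i are the observed and expected responses for subject i, A_i is the diagonal matrix of GLM variance functions, R_i(alpha) is the working correlation matrix, and phi is the dispersion parameter. Some texts, including the original paper, parameterize phi differently.

```latex
% GEE for K subjects (clusters): solve U(beta) = 0 for the regression coefficients beta
U(\beta) = \sum_{i=1}^{K} D_i^{\top} V_i^{-1} \bigl( y_i - \mu_i(\beta) \bigr) = 0,
\qquad
D_i = \frac{\partial \mu_i}{\partial \beta},
\qquad
V_i = \phi \, A_i^{1/2} \, R_i(\alpha) \, A_i^{1/2}
```

If R_i is the identity matrix, this reduces to the usual quasi-likelihood score equations of a GLM fit to independent data.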

Advantages of GEEs:

  1. Can model non-normal responses. As a GLM hybrid, GEEs borrow the link function and mean-variance relationship of a GLM family (Poisson, binomial, etc.), so the mean response is modelled on a linearized scale.
  2. Users specify a working correlation structure to describe the relationship among observations of the response within a subject or cluster. This can be used to accommodate longitudinal or spatial data, interacting individuals, or situations in which responses are related up to a threshold distance or time.
  3. GEEs have an inherent over-dispersion (scale) term.
  4. Assumptions of homogeneity of variances are relaxed.
  5. Models provide population level estimates, making them computationally simpler than GLMMs.
  6. GEEs perform better than GLMMs when there are few observations of each of many subjects.
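
To make the idea of a correlation structure (advantage 2) more concrete, here is a minimal sketch in plain Python of what a few common working correlation matrices look like for one subject with n repeated observations. The function names and the parameter `alpha` are hypothetical and chosen for illustration; they are not the API of geepack or any other package, which build these matrices for you.

```python
def independence(n):
    """Observations within a subject are treated as uncorrelated."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def exchangeable(n, alpha):
    """Every pair of observations within a subject shares the same correlation."""
    return [[1.0 if i == j else alpha for j in range(n)] for i in range(n)]


def ar1(n, alpha):
    """Correlation decays as alpha**lag, e.g. for evenly spaced time points."""
    return [[alpha ** abs(i - j) for j in range(n)] for i in range(n)]


def m_dependent(n, alphas):
    """Correlation alphas[k-1] at lag k, and zero beyond lag len(alphas).

    This mimics responses that are related only up to a threshold lag.
    """
    m = len(alphas)

    def corr(lag):
        if lag == 0:
            return 1.0
        return alphas[lag - 1] if lag <= m else 0.0

    return [[corr(abs(i - j)) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Print the AR(1) matrix for four observations with alpha = 0.5.
    for row in ar1(4, 0.5):
        print(row)
```

The structure is a "working" guess about the within-subject correlation, which is why it is specified by the user rather than derived from a full probability model.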

Limitations of GEEs:

  1. GEEs are fit by quasi-likelihood rather than full likelihood, so likelihood-based tools (such as AIC and likelihood ratio tests) are not appropriate for testing fit, comparing models, and conducting inference about parameters. There are other options, though (check out Dan Hocking’s blog on it).
  2. GEEs don’t give subject-specific estimates.
  3. These models perform poorly when there are many observations from a handful of subjects.
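
The difference between population-level and subject-specific estimates can be illustrated with a small numerical sketch. It assumes a logistic model with a normally distributed random intercept, which is an assumption of this example and not something from the post. The function names, `eta`, and `sigma` are hypothetical. The code compares the probability for a "typical" subject (random intercept of zero) with the probability averaged over all subjects, which is the kind of quantity a GEE targets.

```python
import math


def expit(x):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def marginal_probability(eta, sigma, n_points=2001, width=8.0):
    """Average expit(eta + b) over b ~ Normal(0, sigma^2).

    Uses the trapezoid rule on a grid spanning +/- width standard deviations.
    """
    if sigma == 0:
        return expit(eta)
    lo, hi = -width * sigma, width * sigma
    step = (hi - lo) / (n_points - 1)
    total = 0.0
    for k in range(n_points):
        b = lo + k * step
        density = math.exp(-0.5 * (b / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        weight = 0.5 if k in (0, n_points - 1) else 1.0  # trapezoid end points
        total += weight * expit(eta + b) * density * step
    return total


if __name__ == "__main__":
    eta, sigma = 2.0, 2.0  # illustrative values only
    print("typical subject (b = 0):", expit(eta))
    print("population average:     ", marginal_probability(eta, sigma))
```

With substantial between-subject variation, the population average is pulled toward 0.5 relative to the typical subject, so the two kinds of estimates answer different questions even when both models fit well.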

GEEs have many strengths and seem ideally suited for dealing with ugly data, particularly when the response of a population is of greatest interest. Additionally, they can be implemented in most statistical programs, including R (geepack) and SAS. Despite this, GEEs have not gained traction amongst ecologists. Does anyone have thoughts as to why?

References/Additional Resources:
Hocking, D. J. 2012. The role of red-backed salamanders in ecosystems. Dissertation. University of New Hampshire.

Liang, K., and S. L. Zeger. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73:13–22.

Zuur, A. F., E. N. Ieno, N. J. Walker, A. A. Saveliev, and G. M. Smith. 2009. Mixed effects models and extensions in ecology with R. Springer, New York.


One Comment

  1. Very nice post Britt, it really distills the pros and cons of GEE to the main points. I’ll just elaborate on the last limitation you mention: These models perform poorly when there are many observations from a handful of subjects.

    This line gets thrown around a lot in textbooks and such. However, I rarely if ever see an explanation or a comparison with GLMMs. It seems to me that it’s not so much that the models somehow perform badly or are unstable or the like. It seems more a matter of inference. If you try to make inference on population-level responses (averaged over the random levels), it’s unlikely to be especially useful if there were few random levels. However, I would imagine that the same thing is true for GLMMs. If you only have a few levels of a random effect, you assume that they were drawn from some distribution (usually normal), but the estimates probably only apply to those levels (e.g. sites, individuals, years, etc.) and shouldn’t be generalized too much. The difference is that with few random levels the GEE probably isn’t of any use, whereas the subject-specific (conditional) estimates from the GLMM are at least useful in those cases, if not broadly applicable.

    It seems like ecologists always use GLMMs even when they are interested in more of the marginal (population-level) inference. That’s probably a mistake. I’m not sure why GEEs aren’t used in ecology, maybe just because of a paucity of examples. Also, ecologists tend to be overly obsessed with model selection, which is less well developed for GEE models. I will say that I’ve stuck with GLMMs in part because they are “easy” to extend. They’re a type of hierarchical model, so it’s easy to move around in hierarchical modeling space (simple to super complex) within a unified framework.
