
If you haven't tried @dailyzad's concurve package to build p-value functions in R, make sure you give it a try - it is very nice. Details about the package can be found here:

data.lesslikely.com/concurve//inde…. This thread illustrates some of the package functionality. #rstats (1)

First, install the latest version of the concurve package from Github:

```r
install.packages("remotes")
remotes::install_github("zadrafi/concurve@master", dependencies = TRUE)
library(concurve)
```

The package home on Github is: github.com/zadrafi/concur… .

(2)

Next, let's say you would like to construct a p-value function (aka confidence interval function) in the context of a simple linear regression model:

```r
data(cars)
model <- lm(dist ~ speed, data = cars)
summary(model)
```

(3)
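With the model fit, the p-value function for the slope can be generated and plotted with concurve. A minimal sketch, assuming the `curve_gen()` and `ggcurve()` interface from the package docs (check `?curve_gen` for the version you installed):

```r
library(concurve)

data(cars)
model <- lm(dist ~ speed, data = cars)

# Generate the consonance (p-value) function for the 'speed' coefficient
curve <- curve_gen(model, var = "speed", method = "lm")

# Plot the p-value function
ggcurve(data = curve[[1]], type = "c")
```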

Just came upon this interpretation in "Regression and Other Stories" by Gelman, Hill and Vehtari:

"We can say that, under the fitted model, the average difference in earnings, comparing two people of the same sex but one inch different in height, is $600."

#rstats

I struggle with the interpretation provided because I don't understand why the authors talk about an "average difference" when comparing two individuals. I would understand talking about it when comparing two groups of individuals.

A linear regression model can be used to: 1. predict true response values for individuals, or 2. estimate mean response values for groups of individuals (conditional on predictor values). The Gelman et al. interpretation seems to me to muddy the waters between these two purposes.
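The distinction between these two purposes shows up directly in R. A sketch using the built-in cars data as a stand-in (the earnings data from the book is not reproduced here): prediction intervals for individuals are wider than confidence intervals for the group mean at the same predictor value.

```r
model <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 15)

# 1. Predict the true response for an individual car (wider interval)
predict(model, newdata = new, interval = "prediction")

# 2. Estimate the mean response for the group of cars with speed = 15
#    (narrower interval)
predict(model, newdata = new, interval = "confidence")
```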

@tangming2005 Since nobody has mentioned the p-value function yet in this thread, I will jump into the fray to say that one way to convey more completely what is going on is by constructing the entire p-value function rather than reporting a single p-value. /1

@tangming2005 When plotted, the p-value function (also known as the confidence interval function) looks like a tepee. As explained at ebrary.net/72024/health/v…, it is closely related to the set of all confidence intervals for a given parameter. /2

@tangming2005 In fact, the p-value function is obtained by stacking all possible confidence intervals for the parameter of interest (e.g., 0%, 1%, ..., 99%, 100%) on top of each other, so that they all share the same center, which is the estimated value of that parameter. /3
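The stacking idea can be sketched in base R: compute confidence intervals at many levels for one coefficient, then plot interval endpoints against 1 minus the confidence level to trace out the tepee shape.

```r
model <- lm(dist ~ speed, data = cars)

# Confidence intervals for the slope at levels 1%, 2%, ..., 99%
levels <- seq(0.01, 0.99, by = 0.01)
ci <- t(sapply(levels, function(l) confint(model, "speed", level = l)))

# Left limb: lower bounds; right limb: upper bounds.
# The curve peaks at the point estimate, where the 0% interval collapses.
plot(c(rev(ci[, 1]), ci[, 2]), c(rev(1 - levels), 1 - levels),
     type = "l", xlab = "speed coefficient", ylab = "p-value")
```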

An interesting question on Cross Validated got me thinking about the fact that heteroscedasticity is not a concept that applies to a count regression model. See below for my musings. The question is here: stats.stackexchange.com/questions/4890…. #rstats

Say we are fitting a Poisson regression model to our data:

```r
m <- glm(y ~ x, family = poisson(link = "log"), data = data)
```

and plot the residuals versus the fitted values. Should we worry about the obvious pattern of heteroscedasticity present in the plot?

Count regression models are predicated on the fact that the (conditional) variance of the response IS a function of the (conditional) mean of the response. What we need to do is consider whether the relationship between the mean and variance is adequately captured in our model.
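A sketch with simulated data (variable names are illustrative): under a Poisson model the residual spread grows with the fitted mean by design, so the thing to check is whether the assumed mean-variance relationship holds, e.g. via a rough overdispersion statistic.

```r
# Simulated Poisson data: residual "heteroscedasticity" is expected here,
# since Var(y | x) = E(y | x) under the Poisson model
set.seed(1)
x <- runif(200)
y <- rpois(200, lambda = exp(1 + 2 * x))
m <- glm(y ~ x, family = poisson(link = "log"))

# Rough check of the mean-variance relationship: the sum of squared
# Pearson residuals over the residual degrees of freedom should be
# near 1 if the Poisson assumption is adequate
sum(residuals(m, type = "pearson")^2) / df.residual(m)
```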

When specifying mixed effects models in R, it helps to think of 'identifiers' versus 'characteristics' of random grouping factors.

These two end up being included in different parts of the model, so it pays off to distinguish between them. #rstats

1/n

Imagine we select a number of ocean sites at random where we wish to monitor fish abundance repeatedly each year.

Site is a random grouping factor.

SiteID (e.g., 1, 2, 3) is a site identifier.

SiteType (e.g., offshore, nearshore) is a site characteristic.

2/n

So we would specify our model for Abundance (a count variable) something like this:

```r
glmer(Abundance ~ Year + SiteType + (1 | SiteID),
      family = poisson(link = "log"), nAGQ = 100)
```

Site characteristic is a predictor; site identifier identifies the random grouping factor. 3/n

glmer(Abundance ~ Year + SiteType + (1|SiteID),

family=poisson(link="log"), nAGQ = 100)

Site characteristic is a predictor; site identifier identifies the random grouping factor. 3/n
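A self-contained sketch of this model with hypothetical data (the data frame and effect sizes below are invented for illustration; the thread does not supply them):

```r
library(lme4)

# Hypothetical fish-abundance data with the structure described above:
# 12 sites monitored over 6 years, each site offshore or nearshore
set.seed(42)
fish <- expand.grid(SiteID = factor(1:12), Year = 2015:2020)
fish$SiteType <- ifelse(as.integer(fish$SiteID) <= 6, "offshore", "nearshore")
site_eff <- rnorm(12, sd = 0.4)  # random site-level deviations
fish$Abundance <- rpois(nrow(fish),
                        exp(2 + 0.1 * (fish$Year - 2015) +
                            0.5 * (fish$SiteType == "offshore") +
                            site_eff[fish$SiteID]))

fit <- glmer(Abundance ~ Year + SiteType + (1 | SiteID),
             family = poisson(link = "log"), data = fish, nAGQ = 100)
summary(fit)
```

Note that `nAGQ = 100` (adaptive Gauss-Hermite quadrature) is only available in lme4 when the model has a single scalar random effect, as here.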

Did you know you can use the map() function from the purrr package in R to simultaneously apply the same "computational or graphical recipe" to each column of a dataset? In this tweetorial, I will explain how you can do this. Ready? Let's go! #rstats

Let's say you work with R's airquality dataset and wish to compute the number of missing data values for each of its quantitative variables. Here is what you need to use map():

Columns: Ozone Solar.R Wind Temp

Recipe: Function for computing the number of missing values.

The R package naniar has the recipe you need - a function called n_miss(), which reports the number of missing values in a dataset column/variable.

You can use the select() function in the dplyr package to select the columns of interest from the dataset.
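Putting the pieces together, a short sketch (using map_int(), the type-stable variant of map() for integer results):

```r
library(purrr)
library(dplyr)
library(naniar)

# Apply the same recipe (count missing values) to each selected column
airquality %>%
  select(Ozone, Solar.R, Wind, Temp) %>%
  map_int(n_miss)
```

In R's built-in airquality data this reports 37 missing values for Ozone, 7 for Solar.R, and none for Wind or Temp.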