Let's start with the {stats} way. The key function here is glm().
A logistic regression is a GLM using the binomial distribution. Thus, set `family = binomial` in glm().
Of course, you need response and predictor variables. Specify this with a formula and a data.frame/tibble.
You can save the output from glm() in a variable. Treat that variable like a list that contains the fitted values.
In this case, these values are probabilities. Using a threshold, say 50%, we can turn these predicted probabilities into predictions of our penguins' sex.
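A minimal sketch with the {palmerpenguins} data (the data set this thread uses; variable names like `body_mass_g` come from that package):

```r
library(palmerpenguins)

# Drop rows with missing values so glm() gets complete cases
penguins_complete <- na.omit(penguins)

# Logistic regression: sex modeled by weight, species and bill length
penguins_glm <- glm(
  sex ~ body_mass_g + species + bill_length_mm,
  family = binomial,
  data = penguins_complete
)

# The glm object behaves like a list; the fitted probabilities live in it.
# They are probabilities of the second factor level of sex ("male").
probs <- penguins_glm$fitted.values

# A 50% threshold turns probabilities into sex predictions
predicted_sex <- ifelse(probs > 0.5, "male", "female")
```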
We can also use our glm object with predict() to, well, predict probabilities from observations that have not been in the training data set.
Note that predict() returns the value of the linear predictor (the log-odds) by default. What you really want are the probabilities, so set `type = "response"`.
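For instance (the new observations here are made-up values, just for illustration):

```r
library(palmerpenguins)

# Fit the model as before
penguins_glm <- glm(
  sex ~ body_mass_g + species + bill_length_mm,
  family = binomial, data = na.omit(penguins)
)

# Made-up penguins that were not in the training data
new_penguins <- data.frame(
  body_mass_g = c(3400, 5100),
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(39, 48)
)

# Default: values of the linear predictor (log-odds)
predict(penguins_glm, newdata = new_penguins)

# What you usually want: probabilities
predict(penguins_glm, newdata = new_penguins, type = "response")
```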
Now you know how to do logistic regression with stats::glm(). The same approach works for any other GLM.
For example, to do a Poisson regression, change `family = binomial` to `family = poisson`.
Also, you can change the link function, e.g. `family = binomial(link = "probit")`.
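As a sketch (the Poisson call uses hypothetical variable names; the probit call reuses the penguin model):

```r
# Poisson regression for count data (hypothetical variables, shown as a pattern)
# glm(n_events ~ exposure, family = poisson, data = my_data)

# Logistic regression with a probit instead of a logit link
glm(
  sex ~ body_mass_g + species + bill_length_mm,
  family = binomial(link = "probit"),
  data = na.omit(palmerpenguins::penguins)
)
```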
Next, let us do the {tidymodels} way.
Notice that - just like {tidyverse} - {tidymodels} is not actually one package but a whole ecosystem of packages.
So technically speaking, let's do the {parsnip} way (that's the package that handles model specifications).
At first, {parsnip} looks way more complicated than glm(). That's because it's more general.
But the beautiful thing is that you can use the same interface to use different engines or even models.
For example, you could decide to switch from GLM to random forest.
To define a logistic regression, use logistic_reg() and apply set_engine() + set_mode() to it.
Each part of that chain refers to one part of the model spec. And you can easily swap out each one.
More on that later.
You could also do everything in one line by specifying logistic_reg(engine = "glm", mode = "classification").
If you ask me, that's just a matter of taste. I prefer the set_engine() and set_mode() way.
In the end, our model specification is really nothing but an instruction to do
- a classification
- using a logistic regression
- based on the stats::glm() function/engine (and not e.g. "keras" or "glmnet")
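Put together, that spec might look like this (a sketch; the native pipe `|>` requires R >= 4.1):

```r
library(parsnip)

# Build the spec step by step
log_reg_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

# Equivalent one-liner
log_reg_spec <- logistic_reg(engine = "glm", mode = "classification")
```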
Saving that specification in a variable allows us to fit the described model using data.
To do so, pass the model spec to fit() and describe response and predictor variables.
This is similar to what you've done with glm().
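A sketch of the fitting step with the penguins data:

```r
library(parsnip)
library(palmerpenguins)

log_reg_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

# fit() takes the model spec, a formula and the data
penguins_fit <- log_reg_spec |>
  fit(sex ~ body_mass_g + species + bill_length_mm,
      data = na.omit(penguins))
```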
Saving that fitted model in a variable lets us make predictions.
Once again, this is done with predict(). And it isn't actually much different from using {stats} (other than that the output is a bit nicer).
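For example (note that parsnip's predict() uses the `new_data` argument, not `newdata`; the observations are made up):

```r
library(parsnip)
library(palmerpenguins)

# Spec and fit in one chain
penguins_fit <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification") |>
  fit(sex ~ body_mass_g + species + bill_length_mm,
      data = na.omit(penguins))

# Made-up observations to predict on
new_penguins <- data.frame(
  body_mass_g = c(3400, 5100),
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(39, 48)
)

# Class predictions come back as a tidy tibble with a .pred_class column
predict(penguins_fit, new_data = new_penguins)

# Probabilities instead of classes
predict(penguins_fit, new_data = new_penguins, type = "prob")
```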
Finally, if you want to do a Poisson regression, exchange logistic_reg() for poisson_reg() and set the mode to "regression".
In fact, that’s how you switch to any other model, e.g. to random forests via `rand_forest()`. Within {parsnip} it's always one and the same interface 👌
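A sketch of both swaps (assumptions: poisson_reg()'s engine support comes from the {poissonreg} package, and the "ranger" engine needs {ranger} installed):

```r
library(parsnip)

# Poisson regression: a count response, so the mode is "regression"
pois_spec <- poisson_reg() |>   # engine support lives in {poissonreg}
  set_engine("glm") |>
  set_mode("regression")

# Random forest: same interface, different model
rf_spec <- rand_forest() |>
  set_engine("ranger") |>       # assumes {ranger} is installed
  set_mode("classification")
```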
Alright, alright, alright. That's a wrap for today. 🥳 Hope you learned a lot from this thread.
If you've enjoyed this thread, then feel free to follow @rappa753 to not miss out on any of my upcoming adventures.
And remember: this thread is based on my newest blog post, which covers both the math and the implementation of GLMs.
Enjoy the rest of your day and see you next time 👋
Ever heard of logistic regression? Or Poisson regression? Both are generalized linear models (GLMs).
They're versatile statistical models. And by now, they've probably been reframed as super hot #MachineLearning. You can brush up on their math with this 🧵. #rstats #Statistics
Let's start with logistic regression. Assume you want to classify a penguin as male or female based on its
* weight,
* species and
* bill length
Better yet, let's make this specific. Here's a data viz for this exact scenario. It is based on the {palmerpenguins} data set.
As you can see, the male and female penguins form clusters that do not overlap too much.
However, regular linear regression won't help us distinguish them. Think about it: its output is numerical, but here we want to find classes.
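That's where the logistic function comes in: it squeezes the numerical output of a linear predictor into (0, 1), which we can read as a class probability. A quick sketch:

```r
# The logistic (inverse logit) function maps any number into (0, 1),
# so its output can be read as a probability
logistic <- function(x) 1 / (1 + exp(-x))

logistic(-5)  # close to 0
logistic(0)   # exactly 0.5
logistic(5)   # close to 1
```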
I am rebuilding my #rstats blog from the bottom up with #quarto. This will let me use quarto's cool new tricks like tabs and easy columns.
I've already spent hours using quarto's great docs to build a custom blog. If you want to do the same, let me show you what I did.
Today, I will show you the first of many steps to your own quarto blog. First, create a new quarto blog project via RStudio.
Make sure to create a git repo as well. This lets you revert changes if you break your blog. You can follow along with my repo at github.com/AlbertRapp/qua…
You can render your blog with `Render Website` from RStudio's `Build` tab.
The first easy changes happen in the `_quarto.yml` file.
1⃣ Set `theme: default`
2⃣ Name your blog via `title`
3⃣ Link your GitHub profile etc.
This will change the navbar at the top of your blog.
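As a sketch, the relevant bits of `_quarto.yml` might look like this (the title and GitHub URL are placeholders):

```yml
website:
  title: "My Quarto Blog"        # 2⃣ name your blog
  navbar:
    right:
      - icon: github
        href: https://github.com/your-username   # 3⃣ placeholder URL

format:
  html:
    theme: default               # 1⃣
```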