Albert Rapp
Aug 19 · 18 tweets · 6 min read
With #rstats, it's dead simple to implement a logistic regression or a Poisson regression. Or any other kind of generalized linear model.

Here's how you can do that with {stats} or with {tidymodels}. 🧵
#Statistics #MachineLearning
Need to brush up on the math behind these models before we get started?

My most popular thread may help you.
One more hint before we start:

All of my code examples can be copied from my newest blog post.

The data that I use here comes from {palmerpenguins}. And we're going to classify a penguin's sex based on its weight, species and bill length. 🐧 🐧

albert-rapp.de/posts/14_glms/…
Let's start with the {stats} way. The key function here is glm().

A logistic regression is a GLM using the binomial distribution. Thus, set `family = binomial` in glm().

Of course, you need response and predictor variables. Specify this with a formula and a data.frame/tibble.
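A minimal sketch of this step could look like the following. It is not the thread's original code (that lives in the blog post). The column names come from {palmerpenguins}; the object names `penguins_clean` and `glm_fit`, and dropping incomplete rows with na.omit(), are assumptions made for illustration.

```r
library(palmerpenguins)

# Keep only the columns used in this thread and drop rows with missing values
penguins_clean <- na.omit(
  penguins[, c("sex", "body_mass_g", "species", "bill_length_mm")]
)

# Logistic regression = GLM with the binomial family.
# Formula: response on the left, predictors on the right.
glm_fit <- glm(
  sex ~ body_mass_g + species + bill_length_mm,
  family = binomial,
  data = penguins_clean
)

summary(glm_fit)
```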
You can save the output from glm() in a variable. Treat that variable like a list: among other things, it contains the fitted values (in `fitted.values`).

In this case, these values are probabilities. Using a threshold, say 50%, we can turn these predicted probabilities into predictions of our penguins' sex.
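As a rough sketch (again with the hypothetical names `penguins_clean` and `glm_fit`), extracting the fitted probabilities and applying a 50% threshold might look like this. With a factor response, glm() treats the first factor level as "failure", so the probabilities here refer to the second level of `sex`, which is "male".

```r
library(palmerpenguins)

penguins_clean <- na.omit(
  penguins[, c("sex", "body_mass_g", "species", "bill_length_mm")]
)

glm_fit <- glm(
  sex ~ body_mass_g + species + bill_length_mm,
  family = binomial,
  data = penguins_clean
)

# One fitted probability per training observation: P(sex == "male")
probs <- glm_fit$fitted.values

# Turn probabilities into class predictions with a 50% threshold
predicted_sex <- ifelse(probs > 0.5, "male", "female")

# Share of training observations that are classified correctly
mean(predicted_sex == penguins_clean$sex)
```

The threshold is a choice, not a given. 50% is just the natural starting point when both classes are about equally common.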
We can also use our glm object with predict() to, well, predict probabilities from observations that have not been in the training data set.

Note that predict() will show you the value of the linear predictor by default. But what you really want is the response, i.e. the probability. You get it by setting `type = "response"`.
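Here is a small sketch of the difference between the two outputs. The two penguins in `new_penguins` are made-up values for illustration only, and all object names are assumptions.

```r
library(palmerpenguins)

penguins_clean <- na.omit(
  penguins[, c("sex", "body_mass_g", "species", "bill_length_mm")]
)

glm_fit <- glm(
  sex ~ body_mass_g + species + bill_length_mm,
  family = binomial,
  data = penguins_clean
)

# Two hypothetical penguins that are not part of the training data
new_penguins <- data.frame(
  body_mass_g = c(3500, 5200),
  species = factor(
    c("Adelie", "Gentoo"),
    levels = levels(penguins_clean$species)
  ),
  bill_length_mm = c(38, 49)
)

# Default: the linear predictor (log-odds for a logit link)
link_pred <- predict(glm_fit, newdata = new_penguins)

# What we usually want: predicted probabilities
prob_pred <- predict(glm_fit, newdata = new_penguins, type = "response")

link_pred
prob_pred
```

For the default logit link, applying plogis() to the linear predictor gives the same probabilities as `type = "response"`.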
Now you know how to do logistic regression with stats::glm(). The same approach works for any other GLM.

For example, to do a Poisson regression change "family = binomial" to "family = poisson".

Also, you can change the link function, e.g. `family = binomial(link = "probit")`.
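Both changes can be sketched in a few lines. The penguins data has no obvious count variable, so the Poisson example below uses base R's built-in `warpbreaks` data set instead. That choice is an assumption for illustration and does not come from the thread.

```r
library(palmerpenguins)

penguins_clean <- na.omit(
  penguins[, c("sex", "body_mass_g", "species", "bill_length_mm")]
)

# Same model as before, but with a probit instead of the default logit link
probit_fit <- glm(
  sex ~ body_mass_g + species + bill_length_mm,
  family = binomial(link = "probit"),
  data = penguins_clean
)

# Poisson regression: only the family (and a count response) changes.
# warpbreaks ships with R; breaks is the number of breaks per loom.
poisson_fit <- glm(
  breaks ~ wool + tension,
  family = poisson,
  data = warpbreaks
)

family(probit_fit)
family(poisson_fit)
```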
Next, let us do the {tidymodels} way.

Note that, just like {tidyverse}, {tidymodels} is not actually one package but a whole ecosystem of packages.

So technically speaking, let's do the {parsnip} way (that's the package handling model specifications).
At first, {parsnip} looks way more complicated than glm(). That's because it's more general.

But the beautiful thing is that you can use the same interface to use different engines or even models.

For example, you could decide to switch from GLM to random forest.
To define a logistic regression, use logistic_reg() and apply set_engine() + set_mode() to it.

Each part in that chain refers to 1 part of the model spec. And you can easily exchange each one.

More on that later.
You could also do everything in one line by specifying logistic_reg(engine = "glm", mode = "classification").

If you ask me, that's just a matter of taste. I prefer the set_engine() and set_mode() way.
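The two ways of writing the specification can be sketched as follows. The object names are made up, and the native pipe `|>` assumes R 4.1 or newer (the {magrittr} pipe `%>%` works the same way here).

```r
library(parsnip)

# Step-by-step: model type, then engine, then mode
spec_chain <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

# Everything in one call
spec_one_line <- logistic_reg(engine = "glm", mode = "classification")

# Both describe the same model. Nothing has been fitted yet.
spec_chain
spec_one_line
```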
In the end, our model specification is really nothing but an instruction to do

- a classification
- using a logistic regression
- based on the stats::glm() function/engine (and not e.g. "keras" or "glmnet")
Saving that specification in a variable allows us to fit the described model using data.

To do so, pass the model spec to fit() and describe response and predictor variables.

This is similar to what you've done with glm().
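A minimal sketch of the fitting step, using the same hypothetical `penguins_clean` data as in the glm() examples:

```r
library(parsnip)
library(palmerpenguins)

penguins_clean <- na.omit(
  penguins[, c("sex", "body_mass_g", "species", "bill_length_mm")]
)

# The model specification: what to do, but not yet on which data
spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

# fit() takes the spec, a formula and the data, just like glm()
parsnip_fit <- spec |>
  fit(sex ~ body_mass_g + species + bill_length_mm, data = penguins_clean)

parsnip_fit
```

The underlying glm object is still there if you need it. It is stored in `parsnip_fit$fit`.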
Saving that fitted model into a variable lets us do predictions.

Once again, this is done with predict(). And it isn't actually much different from using {stats} (other than that the output is a bit nicer).
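Here is a sketch of what prediction with the fitted {parsnip} model might look like. The two penguins in `new_penguins` are again made-up values. Note that {parsnip} uses the argument name `new_data` (with an underscore) and returns tibbles.

```r
library(parsnip)
library(palmerpenguins)

penguins_clean <- na.omit(
  penguins[, c("sex", "body_mass_g", "species", "bill_length_mm")]
)

parsnip_fit <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification") |>
  fit(sex ~ body_mass_g + species + bill_length_mm, data = penguins_clean)

# Two hypothetical penguins that are not part of the training data
new_penguins <- data.frame(
  body_mass_g = c(3500, 5200),
  species = factor(
    c("Adelie", "Gentoo"),
    levels = levels(penguins_clean$species)
  ),
  bill_length_mm = c(38, 49)
)

# Default for classification: the predicted class in a column .pred_class
class_pred <- predict(parsnip_fit, new_data = new_penguins)

# Probabilities: one column per class (.pred_female, .pred_male)
prob_pred <- predict(parsnip_fit, new_data = new_penguins, type = "prob")

class_pred
prob_pred
```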
Finally, if you want to do a Poisson regression, exchange logistic_reg() for poisson_reg() and set the mode to "regression".

In fact, that’s how you switch to any other model, e.g. to random forests via `rand_forest()`. Within {parsnip} it's always one and the same interface 👌
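A hedged sketch of such a switch is shown below. Two things here are assumptions that the thread does not mention: the engines for poisson_reg() come from the extension package {poissonreg}, which has to be loaded, and the count data is base R's built-in `warpbreaks`, because the penguins data has no obvious count response. The random forest spec uses the "ranger" engine, which needs the {ranger} package installed before you can actually fit it.

```r
library(parsnip)
library(poissonreg) # assumption: supplies the engines for poisson_reg()

# Poisson regression: swap the model function and set the mode to "regression"
poisson_spec <- poisson_reg() |>
  set_engine("glm") |>
  set_mode("regression")

poisson_fit <- poisson_spec |>
  fit(breaks ~ wool + tension, data = warpbreaks)

poisson_fit

# Random forest: same interface, different model function and engine.
# This only defines the spec. Fitting it would require {ranger}.
rf_spec <- rand_forest() |>
  set_engine("ranger") |>
  set_mode("classification")

rf_spec
```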
Alright, alright, alright. That's a wrap for today. 🥳 Hope you learned a lot from this thread.

If you've enjoyed this thread, then feel free to follow @rappa753 to not miss out on any of my upcoming adventures.
And remember: This thread is based on my newest blog post. It contains both the math and the implementation of GLMs.

Enjoy the rest of your day and see you next time 👋

albert-rapp.de/posts/14_glms/…
One more way to stay in touch is via my biweekly newsletter. Every other week, I write about
📈 dataviz,
🌐 Shiny or
🧮 statistics

Subscribe to the newsletter and all content goes straight into your inbox. 👌 alberts-newsletter.beehiiv.com/subscribe
