Since various SW benchmarks are going around today... A short thread on why I use #rstats.

Put simply, it offers by far the fastest & most efficient tools for the work I do (i.e. mostly data wrangling & applied econometrics).
(Disclaimer: This thread is *not* trying to get you to change from your preferred SW. You should use whatever you feel comfortable with. But I will try to highlight some objective facts that matter to me.)
For data wrangling, nothing comes close to consistently matching the performance of #rdatatable. Benchmarks here: [image]
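To give a flavour of the kind of wrangling meant here, a minimal sketch of a grouped aggregation and a by-reference update in data.table follows. The data are made up and this is not the benchmark code; it only illustrates the `DT[i, j, by]` syntax.

```r
library(data.table)

# Made-up data: one million rows spread over five groups
set.seed(1)
dt <- data.table(
  g = sample(letters[1:5], 1e6, replace = TRUE),
  x = rnorm(1e6)
)

# Grouped aggregation: mean of x and row count within each level of g
res <- dt[, .(mean_x = mean(x), n = .N), by = g]

# Add a group-demeaned column by reference (dt is modified in place, not copied)
dt[, x_dm := x - mean(x), by = g]
```

The `:=` operator is a large part of why this stays memory-efficient: the new column is attached to the existing table rather than to a copy.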
In January, @Dadoscope1 published findings on open data about pensioners covering November and December 2018.
Now, with the new data obtained and processed by @_fiquemsabendo, it is possible to expand the analysis from two months to two decades. Thread follows 🧶
#RStats #openData
In January we found that adult unmarried daughters received larger pensions than other categories of daughters. In the accumulated total since 1994, this pattern holds firm.
Looking at age groups, we see that healthy unmarried women over 50 make up the largest number among the categories analysed.
1 #SCBMelb20
Biodiverse cities = healthy cities but how do we encourage wildlife to thrive?
#urbanEcology can help us decide where to prioritise resources for nature.🌿
In 2018 we showed how @cityofmelbourne can use #connectivity theory as a planning tool:
2 #SCBMelb20
All animals need to move for the same reasons as people🍔🏕️💑
But resources are fragmented across city-scapes, with barriers in between.
Ecological connectivity quantifies how easy it is for an animal to move around the landscape & where to place new connections.
3 #SCBMelb20
We use the Spanowicz & Jaeger (2019) method to quantify the degree of habitat fragmentation across a landscape.
Habitat patches are connected if they are within a reasonable distance for that animal to cross, AND are not separated by a barrier. 🛣️🏫
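The connection rule above (close enough AND no barrier in between) can be sketched in base R. This is only a toy illustration of the rule, not the Spanowicz & Jaeger computation: the patch coordinates, the crossing distance `max_gap`, and the straight north-south road at `barrier_x` are all hypothetical.

```r
# Hypothetical habitat patches (centroid coordinates in metres)
patches <- data.frame(
  id = c("A", "B", "C", "D"),
  x  = c(0, 40, 120, 160),
  y  = c(0, 10, 0, 10)
)
max_gap   <- 100  # assumed maximum distance the animal will cross
barrier_x <- 100  # assumed road running north-south at x = 100

# Rule 1: the two patches are within crossing distance of each other
close_enough <- as.matrix(dist(patches[, c("x", "y")])) <= max_gap

# Rule 2: the two patches lie on the same side of the barrier
side      <- patches$x < barrier_x
same_side <- outer(side, side, "==")

connected <- close_enough & same_side
dimnames(connected) <- list(patches$id, patches$id)

# Group patches into connected clusters: each patch repeatedly takes the
# smallest label among the patches it is connected to, until nothing changes
cluster <- seq_len(nrow(patches))
repeat {
  new <- sapply(seq_along(cluster), function(i) min(cluster[connected[i, ]]))
  if (identical(new, cluster)) break
  cluster <- new
}
```

In this toy landscape B and C are near enough to each other but sit on opposite sides of the road, so the four patches split into two clusters.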
I spent the weekend putting together a "Meta RMarkdown" blog post!

4 R Markdown Strategies:

1 - Literate Programming
2 - Data Product
3 - Control Document
4 - Template…

#rstats #datascience #tidyverse
1 - Literate Programming

Use RMarkdown like a reproducible scientific notebook, capturing code, comments, and specific outputs in an output document.

All in plain text that is easily human-readable in version control!
2 - Data Product

Generate all sorts of fancy outputs from RMarkdown, such as:
- Presentations (PowerPoint or web native like remark.js)
- Dashboards w/ flexdashboard
- Reports as HTML, PDF, Word, etc
- Entire websites w/ blogdown, hugodown, distill
I find stats talk around causality wanting: most of the time, the concept is simply side-stepped. To remedy this, I'll read through @yudapearl's seminal work 'Causality' using his 'Book of Why' as an interpretative key. In this thread, I'll share what I learn.
Bayesian Networks and Probability Distributions. How can we model a joint probability distribution in a computer? What is the probabilistic connection? d-separation and testable implications from graphical methods, Ladder of Causation.

#rstats

Joint probability distributions are tricky to represent: in our heads and our computers. Let's model our conditional independence assumptions with a Graph: for any node, conditional on its parents, the variable is conditionally independent of all other preceding variables.
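A small sketch of that idea in base R, assuming a hypothetical chain A -> B -> C over binary variables with made-up probabilities. The joint distribution is built from one table per node given its parents, and the implied conditional independence (C is independent of A once B is known) can be checked numerically.

```r
# Hypothetical chain A -> B -> C; every number below is made up
p_a   <- c(0.7, 0.3)                      # P(A = 0), P(A = 1)
p_b_a <- rbind(c(0.9, 0.1), c(0.4, 0.6))  # P(B | A): rows index A, columns index B
p_c_b <- rbind(c(0.8, 0.2), c(0.25, 0.75))# P(C | B): rows index B, columns index C

# Joint via the graph factorisation: P(a, b, c) = P(a) * P(b | a) * P(c | b)
lv <- c("0", "1")
joint <- array(0, dim = c(2, 2, 2), dimnames = list(A = lv, B = lv, C = lv))
for (i in 1:2) for (j in 1:2) for (k in 1:2) {
  joint[i, j, k] <- p_a[i] * p_b_a[i, j] * p_c_b[j, k]
}

# P(C = 1 | A, B) recovered from the joint: rows index A, columns index B
p_c1_given_ab <- joint[, , 2] / (joint[, , 1] + joint[, , 2])

# Conditional on its parent B, C does not depend on A: both rows are identical
p_c1_given_ab
```

The graph needs 1 + 2 + 2 = 5 free numbers here, against 7 for an unrestricted joint table over three binary variables; the saving grows quickly with more nodes.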
After listening to an interview with @ScottGottliebMD where he referenced @OpenTable data, I wanted to explore Texas and its restaurant activity. Here’s a graph mapping the percentage change of reservations from the same dates in 2019 and 2020, including the phases of reopening.
That spike during the 75% capacity phase is from Father’s Day (June 21), which is the closest Texas restaurant reservations have returned to the 2019 values since the shutdown (-6%). The next closest? June 20 (-41.55%).
We’ve seen a slow growth but again, the fear of the virus is coming back, dropping that growth after Abbott rolled back some of the reopening, lowering capacity.
a tweetorial on #broom and linear models with categorical features in #rstats based on a super common confusion and related feature request we get

suppose you estimate a linear regression where a categorical predictor is of primary interest, something like this:
by default, stats::lm() uses treatment coding, so the intercept represents the mean of the base level of cyl, in this case cyl=4. if you have other covariates, you'd get the mean conditional on cyl=4 after partialing out those covariates, more or less
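a minimal sketch of such a model, assuming the built-in mtcars data (the cyl levels mentioned above suggest it, but the exact model in the original thread is not shown here). it checks that the intercept is the mean of the base level and that the other coefficients are differences from it:

```r
# Assumption: built-in mtcars data, with cyl converted to a factor
fit <- lm(mpg ~ factor(cyl), data = mtcars)
coef(fit)

# Group means of mpg by cyl, for comparison
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)

# The intercept is the mean of the base level (cyl = 4) ...
base_ok <- all.equal(unname(coef(fit)[1]), unname(group_means["4"]))

# ... and the cyl6 coefficient is the cyl = 6 mean minus the cyl = 4 mean
diff_ok <- all.equal(
  unname(coef(fit)[2]),
  unname(group_means["6"] - group_means["4"])
)
```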
when the regression is like this people often want to compare means between groups like you would with an ANOVA

(note that we have estimated four parameters: the intercept, cyl6, cyl8, and a variance parameter)
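one way to get ANOVA-style comparisons from this kind of fit, sketched with base R only and again assuming mtcars. this is an illustration, not necessarily the approach the thread goes on to recommend:

```r
# Assumption: built-in mtcars data, with cyl converted to a factor
fit <- lm(mpg ~ factor(cyl), data = mtcars)

n_mean_params <- length(coef(fit))  # intercept, cyl6, cyl8
resid_sd      <- sigma(fit)         # square root of the estimated variance parameter

# Overall F test: do the group means differ at all?
aov_table <- anova(fit)

# All pairwise differences between group means, with adjusted intervals
tukey <- TukeyHSD(aov(mpg ~ factor(cyl), data = mtcars))
```

the regression coefficients only compare each level with the base level; the pairwise table also includes the cyl 8 vs cyl 6 comparison that the default coding leaves out.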
We built a new app for tracking central government expenditure, now from the perspective of COFOG, the UN international standard.
It is a dashboard with four tabs for exploratory analysis. (1/n)
#RStats #DataViz #OpenData…
The first tab gives a general explanation of the dashboard and puts COFOG in context.
The first findings (2/n)
The second tab shows how government sub-functions combine to form government functions.
Long #Missouri #COVID19 evening update 🧵 for Thursday, 7/7. I’ve pushed updates to all metrics to the website -….

The statewide 7-day average is basically unchanged, though the #KansasCity average hit another new peak value yesterday. 1/19
Expect this status quo to change tomorrow, as we absorb the 773 new cases in today’s DHSS release (which aren’t in my data tonight). @erinheff's reporting for @stltoday indicates that the state believes that this is a backlog due to the holiday weekend. 2/19…
What matters more than our single day values is the trend - this may indeed be an outlier day in terms of raw numbers, but we’ll have to see what the next week brings in terms of new cases. 3/19
Map of Brazil made from 3 to 500 edges
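d <- geobr::read_country()
# fraction of vertices to keep at each simplification level
k <- c(.1, .32, .67, 1, 16.77) / 1000

# draw the country outline, simplified to keep a fraction k of its vertices
f <- function(k) {
  ggplot2::ggplot() +
    ggplot2::geom_sf(data = rmapshaper::ms_simplify(d, k))
}
t <- lapply(k, f)
g <- cowplot::plot_grid(t[[1]], t[[2]], t[[3]], t[[4]], t[[5]])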
It's my first reproducible example that also fits in a tweet! #rstats
For those that are new to the DataCamp controversy: I have been following it from the beginning, and I'm here to share a bit so you know why this is big news and why the #rstats community is so fired up:
Firstly, read this article: It is stellar reporting that includes a summary of the major happenings where executives at DataCamp knew about an incident where an employee was sexually harassed by the CEO, and then they decided to brush it under the rug.…
Every step that DataCamp has taken since then has been absolutely idiotic and insane. Some are documented in the article, and some haven't been. DataCamp released insincere apologies, and then doubled down on threatening and harassing people that disagreed with their actions.
I've just come to the end of a 4-year tenure, redeveloping & teaching a flexible online-first, textbook-free, skills-focused and accessible first-year psychology unit.
In this time, I've taught 5000+ students. I have opinions I'll roll out here, but ask me questions. #HPS111 1/n
To start with, we chucked the textbook, lectures, and conventional readings, and replaced them with weekly video series (each video ~12 mins) with time-specific overlaid links to primary articles using @H5PTechnology. That's most of the on-campus teaching gone, back in 2017. Why?
Why kill textbooks? Because they cost too much, are frequently wrong, and put students' attention in the wrong place. I don't want students citing textbooks as evidence, so we modelled the use of evidence we want to see from them.
It says a lot about the #rstats community that, despite yet more badness involving DataCamp (*massive eyeroll*), they continue to generate and share free resources for learning R:
A short thread of #rstats resources I have made freely available online! First off, an introductory textbook on statistics with R pitched at beginners focusing on base R rather than tidyverse tools.
A tidyverse focused course on "robust data science tools" covering data manipulation, data visualisation, an introduction to github and blogdown, as well as programming tutorials that incidentally teach you the basics of making generative artwork!
My YouTube channel has a series of videos that walk you through the entire robust tools class…
Reminder: there are hundreds of great, FREE learning resources for #rstats out there.

There's no need to sign up to take courses with a disgusting, ethically bankrupt company with sniveling, feckless leadership.
I'm completely self-taught in R. Here's a list of the FREE, OPEN materials I've used on my journey:

For data wrangling and visualization, nothing beats Hadley's "R for Data Science"

Want to learn about data visualization? Check out @ClausWilke's "Fundamentals of Data Visualization." While not a book about R specifically, it's a great resource for learning what makes a good, interpretable viz. Plus, the book's code is available!

Short #Missouri #COVID19 evening update 🧵 for Tuesday, 6/30. I’ve pushed updates to all metrics to the website -….

The statewide 7-day average jumped significantly to a new peak, with #KansasCity hitting a new peak, #StLouis at its highest point… 1/15
… since early May, and the outstate trend up as well, though it is a bit short of its peak three days ago.

Every county in SW MO that I have been focusing on added cases yesterday. 2/15
In SE MO, Butler County added 24 new cases yesterday - easily the largest single day bump there - and is averaging over 5 new cases per day right now. Perry and Cape Girardeau have both continued to add new cases, too. 3/15
Spatial plotting just improved a lot in the development version of ggplot2. In a nutshell, you can now mix and match regular geoms with `geom_sf()` and `coord_sf()`. If you're doing any geospatial plotting, please test this out. 1/n
#rstats #ggplot2
The key idea is that `coord_sf()` now has a default coordinate system that it uses for any objects that are not sf (i.e., don't come with their own coordinate information). The default is longitude/latitude, which makes it super easy to, for example, mark a specific location. 2/n
This works with essentially any geom, e.g., we can mark a couple of cities and draw an enclosing polygon. 3/n
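A rough sketch of mixing `geom_sf()` with regular geoms, assuming the North Carolina shapefile that ships with sf and approximate city coordinates chosen for illustration. It passes `default_crs` to `coord_sf()` explicitly so that it does not depend on what the default happens to be in your installed ggplot2 version (the argument must be available in that version).

```r
library(ggplot2)
library(sf)

# North Carolina counties shipped with sf: an sf object carrying its own CRS
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Approximate longitude/latitude of three cities in a plain data frame (not sf)
cities <- data.frame(
  name = c("Raleigh", "Charlotte", "Wilmington"),
  lon  = c(-78.64, -80.84, -77.94),
  lat  = c(35.78, 35.23, 34.23)
)

p <- ggplot(nc) +
  geom_sf(fill = "grey95") +
  # regular geoms: coordinates are read as longitude/latitude
  geom_polygon(data = cities, aes(lon, lat), fill = NA, colour = "steelblue") +
  geom_point(data = cities, aes(lon, lat), colour = "red") +
  coord_sf(default_crs = st_crs(4326))
```

Printing `p` draws the counties with the three cities marked and a triangle enclosing them.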
I’ve been having a difficult time lately — partly because of [insert frantic gesturing at the state of the world], partly personal — but one thing has been a real bright light for me in the last few months. I think it has some broader lessons that might give some hope, so THREAD
In March I started teaching an #rstats class at the intro level to almost 700 psych undergrads. It was my first time teaching it, which meant spending months on an insane sprint creating 25 lectures, weekly tutorials, assessments, and answering zillions of emails.
Still, I asked for this. I love teaching SO much, and I love teaching coding and math more than any other kind. Coding and math are my happy place — meditative, soothing — and I love the challenge of getting people who fear and loathe it to see some of its beauty.
Projecting single-cell transcriptomics data onto a reference T cell atlas to interpret immune responses…
Definition of robust, biologically relevant cell clusters or states by scRNA-seq analysis is typically an iterative, time-consuming process that requires advanced bioinformatics and biological domain expertise
Even after successful analyses, clusters are not directly comparable between studies, preventing us from learning general biological rules across cohorts, conditions and models…
📢 NEW PAPER! #PaperThread #SciComm

Master’s student @samherniman recently published an article about avian habitat suitability in Remote Sensing Applications: Society and Environment. Sam has written this thread summarizing his findings...

In general, when a habitat has more #birds, it also has more of all living things. In scientific terms, we say that birds are good surrogates or indicators of #biodiversity.

This is excellent, because counting birds is really easy. Many of them sing or call. So, we can do a field survey of birds in a habitat and use that number to find a relative count of all biodiversity.

more birdsong ≈ more birds ≈ more biodiversity


Short #Missouri #COVID19 evening update 🧵 for Tuesday, 6/16. I’ve pushed updates to all metrics to the website -….

The headline - outstate cases hit another new peak today - we’re averaging ~84 new cases per day outside of #StLouis and #KansasCity. 1/
The statewide 7-day average plots have a noticeable anomaly on the green trend line - a drop then spike back up ☝️ that is an artifact of the largely missed day of reporting Sunday that I mentioned in my 🧵 yesterday. I don’t think yesterday’s data release overcompensates... 2/17
… dumping more new cases than we would expect. Instead, this is reversion to the mean - i.e. our 7-day averages right now are probably a bit lower than they would have been had the state provided the regular update to their dashboard on Sunday, but not by all that much. 3/17
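To see why a missed reporting day shows up as a dip and rebound in a 7-day average, here is a toy sketch in base R. All numbers are made up (they are not Missouri data), and the partial catch-up on the following day is an assumption chosen for illustration.

```r
# Toy series: a steady 70 true new cases per day for two weeks
true_cases <- rep(70, 14)

reported <- true_cases
reported[10] <- 0    # the missed reporting day
reported[11] <- 110  # next release catches up only partly (assumed)

# Trailing 7-day average: mean of the current day and the six days before it
avg7 <- function(x) as.numeric(stats::filter(x, rep(1 / 7, 7), sides = 1))

avg_true     <- avg7(true_cases)
avg_reported <- avg7(reported)

# Days 9 to 12: the reported average drops on day 10, then partly recovers
round(cbind(day = 9:12, true = avg_true[9:12], reported = avg_reported[9:12]), 1)
```

The reported average stays a little below the true one for as long as both the missed day and the incomplete catch-up sit inside the 7-day window.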
Here comes a thread analysing positivity in the country's hardest-hit departments for #COVID19, using open data from @msalnacion up to 3/6 (3 June), the latest update. Includes reproducible code.

#COVID19argentina #COVID19latam #RR #rstats #openscience
In CABA, positivity has been growing since early May. High positivity is a sign of under-testing. Here the same information is shown more intuitively: on a logarithmic scale, if the red curve moves closer to the blue one, positivity is growing.
(Cases are growing faster than tests.) It is growing in CABA, but in PBA and Chaco the positivity rate is above 20%. Now let's analyse the data by department.
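The log-scale reading can be checked with a few lines of base R. The counts below are made up (they are not the Ministry's data), and treating the blue curve as tests and the red curve as positives is an assumption about the original plot.

```r
# Made-up daily counts for illustration only
tests     <- c(1000, 1200, 1500, 1800, 2000)  # assumed to be the blue curve
positives <- c(100, 150, 240, 360, 500)       # assumed to be the red curve

# Positivity: share of tests that come back positive
positivity <- positives / tests

# On a log scale the vertical gap between the curves is
# log10(tests) - log10(positives) = -log10(positivity)
gap <- log10(tests) - log10(positives)

data.frame(tests, positives, positivity, gap)
```

As positivity rises from 10% to 25% in this toy series, the gap shrinks, which is exactly the red curve closing in on the blue one.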
The code for the analysis that follows can be reviewed here…
So you may have seen the "8 Can't Wait" policies for police reform circulating the last few days. Since most of their hype comes from being "data-driven," I've taken a deeper look at the data.

A thread:
First of all, even if enacting all these policies really *would* reduce police killings by 72%, that's not enough. The system needs dismantling: you should be listening to the many Black abolitionists who have excellent critiques of why the reform approach is flawed.
But I also wanted to dig into the data itself. Something about the framing didn't sit right with me.

So I pulled the data from Campaign Zero and the related project, Mapping Police Violence.
An important release of `modelsummary` just hit CRAN. This `R` package helps you build tables to summarize your statistical models. Yeah yeah, I know: tons of other packages can do that. So why should you try this particular one?! (Thread)
Before we get going, let's install version 0.3.0: install.packages("modelsummary") Done? Great! You now have the power to create PDF/LaTeX tables like this one:
Here's a similar HTML version. The cool thing is that both of those tables were created entirely programmatically. No manual editing whatsoever! `modelsummary` is both easy to use and super flexible.
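For readers who want to try it, a minimal sketch of producing both kinds of table programmatically. The two models on mtcars are placeholders chosen here, not the ones from the tables in the thread, and the accepted `output` values may differ between package versions.

```r
library(modelsummary)

# Placeholder models on the built-in mtcars data (an assumption for illustration)
models <- list(
  "Bivariate"    = lm(mpg ~ hp, data = mtcars),
  "Multivariate" = lm(mpg ~ hp + wt, data = mtcars)
)

# The same list of models rendered as LaTeX and as HTML
tab_tex  <- modelsummary(models, output = "latex")
tab_html <- modelsummary(models, output = "html")
```

Passing a file name such as "table.tex" or "table.html" as `output` writes the table to disk instead of returning it.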

