👋 @LucyStats here! Today we’re going to do a little stats primer on testing for non-linear terms when fitting a model.
What do you do when trying to decide whether to include a non-linear term in a model?

1️⃣ test the nonlinear term, if significant leave it in
2️⃣ if you have enough dfs, include the nonlinear term regardless of significance
3️⃣ never include nonlinear terms
4️⃣ comment
It turns out if you make a decision to include the nonlinear term based on a significance test, you are at risk of inflating your Type 1 error 😱

📃 source:…
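A quick way to see the inflation for yourself is a small simulation. This is only a sketch under made-up settings (the sample size, number of simulations, and the quadratic form of the nonlinear term are all assumptions, not taken from the source): `y` is generated independently of `x`, so every rejection is a false positive. It compares "pretest the quadratic term, then pick the model" against "always fit the prespecified quadratic model".

```r
set.seed(2019)

n_sims <- 2000   # number of simulated datasets (assumed)
n_obs  <- 50     # observations per dataset (assumed)
alpha  <- 0.05

one_run <- function() {
  x <- rnorm(n_obs)
  y <- rnorm(n_obs)                      # true null: y is unrelated to x

  fit_null   <- lm(y ~ 1)
  fit_linear <- lm(y ~ x)
  fit_quad   <- lm(y ~ x + I(x^2))

  # Pretest: is the quadratic term "significant"?
  p_quad_term <- summary(fit_quad)$coefficients["I(x^2)", "Pr(>|t|)"]

  # Test of association from each candidate model
  p_linear <- anova(fit_null, fit_linear)[2, "Pr(>F)"]  # 1 df test
  p_both   <- anova(fit_null, fit_quad)[2, "Pr(>F)"]    # 2 df test

  # Data-driven strategy: keep the quadratic model only if the pretest passes
  p_pretest <- if (p_quad_term < alpha) p_both else p_linear

  c(pretest = p_pretest < alpha, prespecified = p_both < alpha)
}

results <- replicate(n_sims, one_run())  # 2 x n_sims logical matrix
rates   <- rowMeans(results)             # empirical Type 1 error rates
print(rates)
```

If the pretest strategy is inflating the error rate, `rates["pretest"]` should land above the nominal 0.05 while `rates["prespecified"]` stays close to it.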
A basic guide to troubleshooting problems with #rstats
(A thread for my students and others new to R)

Something in R not working? Weird error message?
Go through this list of steps to try to resolve the problem. [thread; suggestions & other tips highly welcome]
1) Did you run a line of code without a ")" at the end?

Look at your code in the console. Is there a little "+" at the far left? R is probably waiting for a complete line of code. Click into the console, press ESC, add the ")", and try again
2) Is R waiting for user input?
Occasionally R will ask for user input via a pop-up window or within the console itself. Look in the console to see if you need to give a numeric response (option 1, option 2), or look for the GUI's pop-up.
100% proud 😊 of this community achievement 💪🏽: a translation to Spanish of a must-read 📖 for those who are starting with #rstats - or learned it a while ago! ⌛️

Thread 1/n
As announced at #LatinR2019, the translation to Spanish of “R for Data Science” by @hadleywickham & @StatGarrett is complete ✔️ and can be found here

🔗 #rstat #rstates #r4ds

(yes! free, open, for the whole 🗺️) 2/n
I take advantage of this great @r4ds_es achievement to tell you an anecdote.

Its audience is meant to be those that, like me_version_2018 🐣, don’t dare to talk to the R community 🌟celebrities🌟 3/n
The 𝚋𝚊𝚌𝚔𝚋𝚘𝚗𝚎 package is now available for #rstats, co-authored with @rdomagalski and Bruce Sagan. It extracts a binary or signed network from a weighted network, getting rid of hairballs. (1/3) @SocNetAnalysts @ConnectionsSNA @PolNetworks @net_science
The first version includes three models for weighted networks that arise from bipartite projections: hypergeometric, fixed degree sequence, & stochastic degree sequence. Future versions will include functions for other types of weighted networks. (2/3)
Thanks to @schochastics & @GiovanniStrona for helpful feedback, and @NSF_SBE for funding. Details about the models here:… (3/3)
TGIF: I have a @CircOutcomes #CQOSpotlight on a simulation study evaluating the impact of clinic-based processes to achieve Million Hearts 2022 goals.
@CircOutcomes @mad_sters @bnallamo @thebyrdlab @KBibbinsDomingo @CircAHA @jordanbking @markjpletcher @ValyFontil @ADAlthousePhD Manuscript was originally published in June 2019. I just needed some time to get myself organized. Link is here:…
@CircOutcomes @mad_sters @bnallamo @thebyrdlab @KBibbinsDomingo @CircAHA @jordanbking @markjpletcher @ValyFontil @ADAlthousePhD Intro: The authors set out to evaluate, given what we already know, which reasonable assumptions would be required to achieve adequate blood pressure control per the U.S. Million Hearts goal.
Is the party deciding on @ewarren? @GregoryKoger and I take a look at the Democratic horse-race through the lens of Democratic voters' second-choice candidates, via @MorningConsult. Pretty pictures ensued. Now, on @MisOfFact.…
@ewarren @GregoryKoger @MorningConsult @MisOfFact For the geek-inclined, the #dataviz I used in the above-linked piece is really straightforward and a great way to visualize directional networks using #rstats. Here's the code I used:
@ewarren @GregoryKoger @MorningConsult @MisOfFact Consistent with our finding, a new poll shows Warren leading in Iowa. h/t @daveamp…
The House redistricting committee posted maps that were chosen by the PowerBall people yesterday. You can see them here:…

I am dubious though...
Here's the map for Alamance County that was randomly selected from the "best of 5."
The House Committee didn't make shapefiles available to the general public, so I redrew this county group in Dave's Redistricting App. I'd like to see how this district plan compares to the 1011 unique random plans that I drew for these districts.
1/7 In 2017, Hedge, Powell, and Sumner showed that robust cognitive tasks are unreliable, which calls into question the use of behavioral tasks for studying individual differences. In this blog post, I show that this conclusion is misguided (…)
2/7 Specifically, Hedge et al. found that robust effects such as the Stroop effect have test-retest correlations in the range of .5 to .6 (r = .5 in the plot of their Stroop effects shown here), which severely impacts our ability to rank the performance of individual subjects.
3/7 However, this conclusion is based on a model of behavior that does not account for variability at the individual-subject level. While this may not seem very problematic at first glance, the assumption of no measurement error leads to substantially biased inference.
Your MDS Curator this week, @TiffanyTimbers, here.

This morning I would like to share with you some of the most influential resources that have shaped my #DataScience workflow:

1. @swcarpentry 's Version Control with Git lesson:
@TiffanyTimbers @swcarpentry @JennyBryan 3. "Good enough practices in scientific computing" by @gvwilson @JennyBryan Karen Cranston Justin Kitzes @lexnederbragt & @tracykteal…
Part 2️⃣ of the thread on the impact of renewable energy on GHG emissions in France. ⤵️⤵️⤵️
Part 1️⃣ is here:
Since I didn't want to stop at a simple critique of the ADEME study, I did some research to find out more… 😇
For both wind and photovoltaics (PV), there were prospective studies in 2008-2009 (see sources), but I found nothing after the fact. The @ademe 2015 “Bilan/Prospective” study on PV doesn't even contain the term “CO2”…
Rebecca ⁦@rlbarter⁩ kicking off by showing how you can present R code from a jupyter notebook #Rstats 👍👍👍
Oooo nice shortcut select your code and then “command I” makes everything nicely aligned! #rstats
The tribble function is a handy way of making a df #rstats - not as handy as datapasta of course but still pretty handy
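For anyone who hasn't seen it, here is roughly what `tribble()` looks like in use. A minimal sketch: the data and column names below are made up for illustration and are not from the talk.

```r
library(tibble)

# Hypothetical example data; column names are invented for illustration.
# Each ~name defines a column, then values are typed row by row.
workshop <- tribble(
  ~attendee, ~language, ~years_using,
  "Ana",     "R",       3,
  "Ben",     "Python",  1,
  "Chen",    "R",       5
)

print(workshop)
```

The row-wise layout is the point: the code reads like the table it creates, which is handy for small hand-typed data frames.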
Getting ready to start our Intro to R Workshop! #rstats #RLadies
The fabulous instructors for today’s workshop! @nikkirubinstein @goknurginer #RLadies
For those who are new, the #RLadies community is extremely supportive and welcoming!
1. PhD thesis 'Modelling BCG vaccination in the UK: What is the impact of changing policy?' submitted and everything is totally normal....

Written with #rstats, #bookdown, and #thesisdown.

Read here:

#phdchat #epitwitter #openscience #tuberculosis
2. Thanks to my supervisors @n3113n and @Christensen_H + my funders @HPRU_EI + @PHE_uk for the data.
3. Some results: 'getTBinR: an R package for accessing and summarising the World Health Organisation Tuberculosis data'



NEW PREPRINT 🎉 Synthetic datasets: A primer

By sharing synthetic datasets that mimic original datasets that could not otherwise be made open, researchers can ensure the reproducibility of their results while maintaining participant privacy

Openly accessible biomedical research data provides ENORMOUS utility for science and society. With open data, scholars can verify results, generate new knowledge, form new hypotheses, and reduce the unnecessary duplication of data collection.
Researchers who wish to share data while reducing the risk of disclosure have traditionally used data anonymization procedures to mask identities, in which explicit identifiers such as names, addresses, and national identity numbers are removed.
I've compiled a short list of #rstats -based #bioinformatics and computational biology books and tutorials. (THREAD)
A cool-looking freebie: "A Little Book of R for Bioinformatics!" by Avril Coghlan

Topics include: Alignment, Multiple alignment, phylo trees, gene finding, comparative genomics, HMM, protein-protein interactions
All #rstats code in the document.
Coghlan's "Little Book of R for Bioinformatics" frequently references Cristianini and Hahn's (@3rdreviewer) "Introduction to Computational Genomics: A Case Studies Approach".
MATLAB code on their website
1. First release of {#idmodelr} is now on CRAN. {idmodelr} is a library of #infectiousdisease models and utilities for using them. Use case is exploration/education/research + signposting.

#rstats #epitwitter #dynamics #opensource #openscience
2. Current status is WIP with more features - and additional models - planned. Contributions much appreciated!

Planned features:…
3. {#idmodelr} has an accompanying #shiny app that demos some of the functionality and can be used as a standalone tool for exploring infectious disease models.

See here:…
I need to get ~40 new #rstats students up and running with #RStudio and using #rmarkdown. Here are some resources I'm checking out for use in my class
"Introduction to R Markdown" by Michael Clark has good overview information on #rmarkdown…
2/n #rstats
"Getting used to R, RStudio, and R Markdown" by the prolific @old_man_chester looks like it has some very practical material for #rmarkdown and #rstudio, including GIFs…
2/n #rstats
I'm going to live stream the creation of a scientific paper, including both #Rstats coding and writing.

Here's why and how...

Watching other people play computer games has become phenomenally popular. If you think this sounds absurd then you're probably over the age of thirty or don't have any kids yourself. E-sports viewer numbers will soon overtake conventional sports…
Along with e-sports, there’s a smaller community of people that live stream their coding and writing. I've tried live streaming a few R coding sessions myself. These were daunting (but fun) experiences
A bit late to this party, but I spent some time today trying to understand how the group_split() function from the dplyr package works. This thread illustrates its use when we need to split our data into several groups and create the same type of plot for each group. #rstats
The idea behind group_split() is simple. If we write something like this inside a function:

data %>%
group_split( {{ g }} )

then R will split the data into as many groups as unique values/categories of the factor g.
But what makes the use of group_split() a bit challenging is the fact that the data groups it creates are unnamed. So I found it easier to keep track separately of the name of the grouping variable (groupvar) and the values of the grouping variable (groupval).
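Here is one way that idea could look in practice. This is a sketch, not the thread author's actual code: the helper name `plot_mpg_by()` is hypothetical, and the `mtcars` data and its `wt` and `mpg` columns are assumed purely for illustration. It groups first, so that `group_vars()` and `group_keys()` can recover groupvar and groupval in the same order as the pieces from `group_split()`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: one scatterplot of mpg vs wt per level of g.
plot_mpg_by <- function(data, g) {
  grouped <- data %>% group_by({{ g }})

  pieces   <- group_split(grouped)       # unnamed list of data frames
  groupvar <- group_vars(grouped)        # name of the grouping variable
  groupval <- group_keys(grouped)[[1]]   # its values, same order as pieces

  plots <- lapply(seq_along(pieces), function(i) {
    ggplot(pieces[[i]], aes(x = wt, y = mpg)) +
      geom_point() +
      ggtitle(paste(groupvar, "=", groupval[i]))
  })

  # Name the plots by group value so they are easy to look up later
  setNames(plots, as.character(groupval))
}

plots <- plot_mpg_by(mtcars, cyl)
names(plots)
```

Naming the list at the end is the design choice that makes up for `group_split()` returning unnamed groups: `plots[["6"]]` then pulls out the plot for six-cylinder cars.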
Got an email from a new grad student asking for recommendations for resources to better understand research design and statistical inference. Here’s what I’m going to tell them...
My first recommendation on learning these concepts is to begin by improving *how* you learn and work. I use the pomodoro technique, which I outline in this thread
I’m not impressed when someone tells me they work 60+ hours a week—anyone can sit in front of a computer for 60 hours a week but what matters is what you *do* when you’re in front of your computer
1. New (first) paper now available: Exploring the effects of BCG #vaccination in patients diagnosed with #tuberculosis: Observational study using the Enhanced Tuberculosis Surveillance system


#phdchat #rstats
2. Highlights

Evidence of an association between BCG vaccination and reduced all-cause mortality in TB cases.

Weaker evidence of an association between BCG vaccination and reduced repeat TB episodes in TB cases.

Little evidence of an association with other TB outcomes.
3. Background: Bacillus Calmette–Guérin (BCG) is one of the most widely-used vaccines worldwide. BCG primarily reduces the progression from infection to disease, however, there is evidence that BCG may provide additional benefits.
tidyverse getting a lot of heat lately. I was a dyed-in-the-wool #rstats base user until about two years ago. Here’s what changed.
[terrible gifs warning]
1/ I started teaching R to people who knew a lot of SQL. SQL is ubiquitous & dplyr syntax does a great job of making R easy to switch to. It makes R an easy sell (it matters) & brings in new users.
2/ I started working in places where code was frequently shared & reused. Consistency helps with communication & tidyverse is strong on style principles. [As an academic, code was often one-off and this didn’t matter as much.]
Happy 4th of July!!
One area of integration of ML and econometrics is providing inference after variable selection (Post selection Inference) #rstats #econtwitter #Rladies #ML #econometrics 1/n
Most popular technique in economics is the 'Double LASSO' which provides inference on the treatment effect after variable selection using LASSO. Check out the R package 'hdm'.…
#rstats #econtwitter 2/n
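A minimal sketch of what a double-selection fit with 'hdm' might look like. The simulated data, dimensions, and true effect size below are assumptions made up for illustration; check the package documentation for the full set of arguments to `rlassoEffect()`.

```r
library(hdm)

set.seed(1)
n <- 200   # observations (assumed)
p <- 50    # candidate controls (assumed)

x <- matrix(rnorm(n * p), nrow = n, ncol = p)   # high-dimensional controls
d <- x[, 1] + rnorm(n)                          # treatment depends on a control
y <- 1 * d + 2 * x[, 1] + rnorm(n)              # assumed true treatment effect: 1

# LASSO selects controls for both y and d, then the effect of d is estimated
fit <- rlassoEffect(x = x, y = y, d = d, method = "double selection")
summary(fit)
```

The summary reports the estimated treatment effect with a standard error, which is the post-selection inference the thread is pointing at.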
We are often interested in conducting inference on other selected covariates (controls) as well. Check out R package 'selectiveInference' which conducts inference on multiple covariates.…
#rstats #econtwitter 3/n
First up: The LASSO family

Perhaps the most popular ML technique in economics is LASSO - a variable selection technique. The R package 'glmnet' gives users a range of distributions of the response variable to choose from: normal, binomial, poisson, multinomial, cox and others!
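To make that concrete, here is a small sketch of LASSO variable selection with 'glmnet'. The simulated data (only the first two of twenty predictors matter) is an assumption for illustration. Swapping `family = "gaussian"` for `"binomial"`, `"poisson"`, etc. is how you would pick a different response distribution.

```r
library(glmnet)

set.seed(1)
n <- 200   # observations (assumed)
p <- 20    # candidate predictors (assumed)

x <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(x) <- paste0("x", seq_len(p))
y <- 2 * x[, 1] - 1.5 * x[, 2] + rnorm(n)   # only x1 and x2 truly matter

# alpha = 1 is the LASSO penalty; lambda is chosen by cross-validation
cv_fit <- cv.glmnet(x, y, family = "gaussian", alpha = 1)

# Coefficients at the "one standard error" lambda; nonzero ones are selected
coefs    <- as.matrix(coef(cv_fit, s = "lambda.1se"))[, 1]
selected <- setdiff(names(coefs)[coefs != 0], "(Intercept)")
print(selected)
```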
Interaction terms in the model are common in econ (& other soc. sciences). Want to select interactions along with the main effects? R package 'hierNet' implements Hierarchical LASSO. Users can choose the kind of hierarchy condition based on the research Q!
In grouped data, we might want to select whole groups of variables as well as individual variables within each group. Check out Sparse-Group LASSO using R package 'SGL'.
