Thank you all for the invite and attendance! It was awesome! What an awesome group! While I didn't post my slides since I have others' images and screenshots in them, I'll do a quick tweetorial here, also as a way of recognizing resources/people who have influenced me in DS 1/n
My #1 rule, which was also the only rule to be numbered, is "Always back up your work!" Do this either with a continuous option like Dropbox or with a daily backup to GitHub, Google Drive, etc. 2/n
Next, I emphasized that data analysis is a many-stage process and gave the example of the "data pipeline" from @rdpeng's and @jtleek's great commentary nature.com/news/statistic…, noting that I do a lot of work on data cleaning, exploratory data analysis, and statistical modeling 3/n
In particular, I want to understand the data and the scientific question of interest really well before just returning a list of genes/proteins/metabolites of interest. Having an appropriate study design is essential, as is having the data in an analysis-ready format... 4/n
Which brings me to... how do you organize your data or ask collaborators to organize it? Here I basically shared @kwbroman's amazing resource kbroman.org/dataorg/, now also available as an article tandfonline.com/doi/full/10.10… by Karl and @kara_woo 5/n
My summary of Karl's points + my own experience: 6/n
Much of this is related to the concept of "tidy data" and "tidy data analysis," as introduced by @hadleywickham in jstatsoft.org/article/view/v…. There are many packages for tidy data analysis at tidyverse.org. The principles are valuable whether or not you use these packages 7/n
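A minimal sketch of one common tidy-data reshaping, in plain Python so it does not depend on any package: turning a "wide" table (one column per gene) into a "long" one where each row is a single observation and each column is a single variable. The table, the column names, and the helper `to_long` are all made up for illustration and are not taken from the talk or from the tidyverse.

```python
# Made-up "wide" table: one row per sample, one column per gene.
wide = [
    {"sample": "s1", "gene_a": 2.0, "gene_b": 5.0},
    {"sample": "s2", "gene_a": 8.0, "gene_b": 5.1},
]

def to_long(rows, id_key, var_name, value_name):
    """Return one row per (id, variable) pair instead of one row per id."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_key:
                long_rows.append(
                    {id_key: row[id_key], var_name: key, value_name: value}
                )
    return long_rows

# Each row now holds one sample, one gene, and one expression value.
tidy = to_long(wide, "sample", "gene", "expression")
for row in tidy:
    print(row)
```

Whether the long form is the right one depends on the analysis; the point is only that each variable ends up in its own column.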
I then moved on to "naming things." After all, we'd much rather not be in this scenario, right? phdcomics.com/comics/archive… I ❤️ @JennyBryan's "low-tech" wisdom at speakerdeck.com/jennybc/how-to…. As with many things, a little upfront planning now can save heartache/headache later. 8/n
Also, Jenny is so awesome that she even uses the ISO 8601 standard for items in her freezer!
_So_ useful! 9/n
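A minimal sketch of why ISO 8601 dates (YYYY-MM-DD) work well at the start of file names: a plain alphabetical sort is then also a chronological sort. The helper `dated_name`, the dates, and the file names below are hypothetical, chosen only to illustrate the idea.

```python
from datetime import date
import re

def dated_name(day, *parts, ext="csv"):
    """Build a name like 2020-01-05_some-label_more.csv (ISO 8601 date first)."""
    # Lowercase each part and replace runs of other characters with "-".
    slugs = [re.sub(r"[^a-z0-9]+", "-", p.lower()).strip("-") for p in parts]
    return "_".join([day.isoformat(), *slugs]) + "." + ext

# Made-up examples, deliberately listed out of order.
names = [
    dated_name(date(2020, 11, 2), "model fit", "v2", ext="md"),
    dated_name(date(2020, 1, 5), "EDA checklist", "draft", ext="md"),
]

# Alphabetical sorting puts the January file before the November one.
for name in sorted(names):
    print(name)
```

Names with the month or day first, or with unpadded numbers, would not sort this way without extra parsing.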
For collaborative papers, it can help to use a system like Google Docs or Overleaf (for LaTeX), though it depends on your collaboration/number of people. At least have a system that makes sense for you so you're not drowning in poorly named versions. 10/n
In terms of organizing the entire analysis, try to separate code, data, R objects, etc. I still use this approach that my PhD adviser Giovanni Parmigiani introduced me to way back when. 11/n
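A minimal sketch of setting up separate folders for code, data, and saved objects at the start of a project. The folder names and the helper `make_project` are assumptions for illustration; the actual layout from the slide is not reproduced in this thread.

```python
from pathlib import Path
import tempfile

# Hypothetical folder names. The thread only says to keep code, data,
# and R objects apart; the exact layout from the slide is not shown.
SUBDIRS = ("code", "data", "objects")

def make_project(root):
    """Create one subfolder per kind of file and return the paths."""
    paths = [Path(root) / name for name in SUBDIRS]
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
    return paths

# Demo in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    created = make_project(Path(tmp) / "my_analysis")
    created_names = sorted(p.name for p in created)
    all_exist = all(p.is_dir() for p in created)

print(created_names, all_exist)
```

Running the same function again on an existing project is harmless, since `exist_ok=True` leaves existing folders alone.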
I also talked a bit about reproducibility. There's obviously so much here, so I just referenced the original paper by @rdpeng science.sciencemag.org/content/334/60… and mentioned R Markdown. I also love this motivation for reproducibility 12/n
I then spent a bunch of time on exploratory data analysis because it's _so important_. My major overview is below. I would also recommend @rdpeng's book leanpub.com/exdata and maybe a look at my checklist at siminab.github.io/2018/09/05/omi… 13/n
Exploratory data analysis can help check for different kinds of confounding healthknowledge.org.uk/e-learning/epi… Ex: Socioeconomic factors can confound the relationship between diet and disease risk. Remember: Correlation is not causation! (true even if you have lots of data, “big data” etc!) 14/n
Why plot? What do you consider when plotting? Here is my take. Also, @jtleek has some great examples in leanpub.com/datastyle 15/n
My overview on visualizing big data: Remember, though, that although we see plots in 2D, adding point color, type, and size can add layers that you can think of as "extra dimensions." 16/n
I personally find this to be one of the most beautiful examples of PCA plots ncbi.nlm.nih.gov/core/lw/2.0/ht… (genes mirroring geography in Europe) from ncbi.nlm.nih.gov/pmc/articles/P…. Maybe it's just that I'm a bit of a genetics/history nerd! 17/n
Another cool example of PCA and MDS: Gene expression data from post-mortem prefrontal cortex samples. Note the clear age dependence! ncbi.nlm.nih.gov/pmc/articles/P… (panel C) from ncbi.nlm.nih.gov/pmc/articles/P… 18/n
Also an example from my own paper 😂 journals.plos.org/plosone/articl… I usually do PCA plots to check for artifacts, so I was very happy to see that cases and controls _did not cluster separately_ (which could indicate batch effects) and also that there was no clustering by study site 19/n
In response to a question about what processing should be done before PCA, I recommended @SherlockpHolmes's paper journals.plos.org/plosone/articl…. This paper + the convo we had on it convinced me to stop scaling omics data prior to PCA 20/n
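A minimal sketch of what the scaling choice does, using made-up numbers and only the Python standard library. It does not run PCA and is not taken from the paper; it only compares the two preprocessing options (centering alone versus centering plus unit-variance scaling) on a toy pair of features, so the feature names and values are assumptions.

```python
from statistics import mean, stdev, variance

def center(column):
    """Subtract the column mean; the variance is unchanged."""
    m = mean(column)
    return [x - m for x in column]

def center_and_scale(column):
    """Subtract the mean and divide by the standard deviation."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]

# Made-up toy data: one highly variable feature and one nearly flat feature.
features = {
    "gene_a": [2.0, 8.0, 4.0, 10.0, 6.0],
    "gene_b": [5.0, 5.1, 4.9, 5.0, 5.0],
}

for name, values in features.items():
    centered_var = variance(center(values))
    scaled_var = variance(center_and_scale(values))
    print(f"{name}: centered variance={centered_var:.3f}, "
          f"scaled variance={scaled_var:.3f}")
```

After scaling, both features have the same variance, so a nearly flat feature carries as much weight in a variance-based method like PCA as a highly variable one. With centering alone, the original differences in variance are kept.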
Finally, my tips on modeling - just one slide since it was already running long/I was talking too fast and there are so many great resources on this (also you can get a PhD in stat/biostat if you really want to go in depth 😂)! 21/n
And a plug for our new "Health Informatics and Data Science" 1-year MS program @ICBI_Georgetown healthinformatics.georgetown.edu as well as for our awesome annual informatics symposium, to be held on October 18th this year! icbi.georgetown.edu/symposium/ Thank you again!! 22/22