The 17 #BICCN @nature papers on the primary motor cortex in mouse (plus some human & marmoset) that were published yesterday are a major step forward in open science for an @NIH consortium. For reference, links to the open access papers are here: nature.com/collections/ci… 1/🧵
First, the #BICCN required preprints of all the papers to be posted on @biorxivpreprint, and as a result the papers were already online 1-1.5 years ago. Of course the final versions now published have been revised in response to peer review. 2/
Speaking of peer review, almost all the papers were published along with their reviews. In combination with the preprints, this provides an unprecedented view of how consortium work is reviewed and how authors respond. Real data for this perennial debate: 3/
Some of the reviews were superficial. For example referee #1 of the “flagship” paper (nature.com/articles/s4158…) wrote one paragraph summarizing the work + 3 minor comments (for a paper whose goal was to synthesize results from complex data published in 11 other papers!). 4/
Some reviews were brutally honest. Referee #1 of nature.com/articles/s4158… wrote "what we have..is a very well collected catalogue utterly devoid of either a conceptual framework or even an idea... the experience is like reading a phone book". They signed the review (@blamlab). 5/
Some reviews were serious(ly helpful). E.g., in our paper (one of the 17, namely @sinabooeshaghi et al., nature.com/articles/s4158…), a referee was concerned about batch effects & artifacts, leading us into a deep dive that revealed batch effects in the consortium #scRNAseq data. 6/
This helped us clean up our analysis in an important way. 7/
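The kind of first-pass check a concern like this prompts can be sketched in a few lines. This is a toy illustration with simulated data, not the consortium's actual analysis: for each gene, compare mean expression across sequencing batches, and flag genes whose shift is large relative to the overall spread.

```python
import random
import statistics as stats

def batch_shift_score(a, b):
    """Absolute difference in batch means, in units of pooled std dev.
    A large value for a gene suggests a batch effect rather than biology."""
    pooled = stats.pstdev(a + b) or 1e-9
    return abs(stats.mean(a) - stats.mean(b)) / pooled

# Simulated log-expression for one gene across two sequencing batches.
random.seed(0)
batch1 = [random.gauss(5, 1) for _ in range(200)]
batch2_shifted = [random.gauss(8, 1) for _ in range(200)]  # technical shift
batch2_clean = [random.gauss(5, 1) for _ in range(200)]    # no shift

print(batch_shift_score(batch1, batch2_shifted))  # large: flag for review
print(batch_shift_score(batch1, batch2_clean))    # small
```

A real pipeline would do this per gene with proper statistics and multiple-testing control; the point is only that the check is cheap relative to the cost of missing the artifact.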
This is not the first time reviews of papers are published (@eLife has been doing this for a while), but having the referee reports (+ responses & non-responses) exposed for an entire consortium-worth of papers is a dataset ripe for study (& beyond the scope of this thread). 8/
Another aspect of open science is freely available data. In that regard, the #BICCN consortium has been exemplary. All the data generated is freely available, for example the #scRNAseq data used in @booeshaghi et al. is here (and has been for years): data.nemoarchive.org/biccn/grant/u1… 10/
Unfortunately, despite the fact that computational methods (including machine learning tools) are an essential part of the #BICCN, many of the consortium papers fail to even medal by the bronze/silver/gold reproducibility standards of @autobencoder et al.
Many papers released no code to reproduce their results or figures, and omitted key analysis details. This is not specific to #BICCN; it reflects a widespread belief in the genomics community that data trumps methods, and a rejection of the idea that #methodsmatter. 13/
However, most of the data for the 17 @nature papers published by the #BICCN was generated quickly; it's the analysis that has taken several years. The difficulty of the analysis can be seen in the papers that did release code (some achieved bronze 🥉). 14/
The #BICCN datasets were so large that enabling reproducibility was challenging. In our paper (nature.com/articles/s4158…) @booeshaghi struggled to achieve the "one-click" reproducibility with @GoogleColab that we strive for. I'd say we achieved bronze trending towards silver. 16/
In summary, the steps taken towards open science by the #BICCN represent real progress. Having now participated in consortia from the mouse genome (2002) to the mouse brain (2021), I can say the progress is astounding. But we're still not at platinum. 17/17
In 2008, as a new professor of molecular and cell biology @UCBerkeley I presented at a seminar series intended to introduce 1st year students to research in the department. Two profs. presented each time, with food beforehand. I was paired with Thai food and Peter Duesberg. 2/
I knew of Peter Duesberg and his HIV/AIDS denialism, but I hadn't realized that he worked @UCBerkeley. We were now colleagues in the same department. 😱 3/
But I propose an additional platinum standard: one-click reproducibility. 1/
By "one click", I mean that the entire analysis be reproducible in a (free) interactive online session of @colab (or other similar service). All steps of the analysis, from downloading data to generating figures are then not only automated but accessible for users. 2/
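A minimal sketch of the pattern (hypothetical URL and filenames, not our actual pipeline): every step is a function that writes its output to disk and skips work already done, so a fresh Colab session can run the whole chain top to bottom with a single call.

```python
# Sketch of the "one-click" pattern: each step caches its output so the
# whole analysis re-runs unattended in a fresh session. The data URL and
# filenames below are placeholders.
import csv
import os
import urllib.request

DATA_URL = "https://example.org/counts.csv"  # placeholder URL
RAW = "counts.csv"
SUMMARY = "summary.csv"

def fetch(url=DATA_URL, dest=RAW):
    """Step 1: download the raw data (cached after the first run)."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest

def summarize(src=RAW, dest=SUMMARY):
    """Step 2: compute per-gene totals from the raw count matrix."""
    with open(src) as f:
        rows = list(csv.reader(f))
    body = rows[1:]  # skip the header row
    with open(dest, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["gene", "total"])
        for row in body:
            w.writerow([row[0], sum(int(x) for x in row[1:])])
    return dest

def main():
    """The one click: run every step in order."""
    fetch()
    summarize()
    # a figure-generating step would go here

# In a Colab notebook the first (and only) cell calls main().
```

The design choice is that no step assumes a manually prepared file: anything the analysis needs, some function in the chain produces.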
In response to questions & comments by @hippopedoid, @adamgayoso, @akshaykagrawal et al. on "The Specious Art of Single-Cell Genomics", Tara Chari & I have posted an update with some new results. Tl;dr: definitely time to stop making t-SNE & UMAP plots.🧵biorxiv.org/content/10.110…
In a previous thread I talked about the (von Neumann) elephant in the dimension reduction room: t-SNE & UMAP don't preserve local or global structure, they distort distances, and they are arbitrary. Almost everybody knows this but they are used anyway...
There were some interesting technical questions about our work. One question was the extent to which PCA pre-conditioning affects results. We examined this (Supp. Fig. 3). Tl;dr: it's time to stop making t-SNE & UMAP plots (with or without PCA pre-conditioning).
It's time to stop making t-SNE & UMAP plots. In a new preprint w/ Tara Chari we show that while they display some correlation with the underlying high-dimensional data, they don't preserve local or global structure & are misleading. They're also arbitrary. 🧵 biorxiv.org/content/10.110…
On t-SNE & UMAP preserving structure: 1) we show massive distortion by examining what happens to equidistant cells and cell types. 2) neighbors aren't preserved. 3) Biologically meaningful metrics are distorted. E.g., see below:
These distortions are inevitable: cells or cell types that are equidistant in high dimension must exhibit increasing distortion as their number grows. In fact, the UMAP and t-SNE distortions are even worse (much worse!) than the lower bounds from theory.
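The counting argument can be checked directly. A minimal sketch, using a square grid as a stand-in for a near-optimal planar embedding: n equidistant points are trivial in n dimensions, but for any placement of n points in the plane the ratio of largest to smallest pairwise distance must grow (roughly like √n), so equidistance cannot survive a 2D embedding.

```python
import math
from itertools import combinations

def distance_ratio(points):
    """Max over min pairwise distance; 1.0 means perfectly equidistant."""
    d = [math.dist(p, q) for p, q in combinations(points, 2)]
    return max(d) / min(d)

def simplex(n):
    """n equidistant points: the standard basis vectors in n dimensions."""
    return [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]

def grid(n):
    """n points on a square grid, a near-optimal planar spread (n a square)."""
    k = math.isqrt(n)
    return [(float(x), float(y)) for x in range(k) for y in range(k)]

for n in (16, 100):
    print(n, distance_ratio(simplex(n)), distance_ratio(grid(n)))
# Equidistance is exact in high dimension (ratio 1.0), but the planar
# ratio grows with n (~sqrt(2n) for the grid). t-SNE/UMAP layouts are
# typically far worse than this near-optimal spread.
```

This is only the lower-bound side of the argument; the preprint's point is that the embeddings observed in practice exceed even these unavoidable distortions.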
While it’s fun to banter about what constitutes a good lab, the part of this that is uncomfortable to discuss is that leaving a bad lab is in many cases near impossible. Few universities offer much support and PIs can and do retaliate, in some cases ending careers.
My first committee meeting of a biology student @UCBerkeley, when I was still a junior prof., resulted in a student breaking down in tears as he told us of abuse his advisor was inflicting on him. We brought this up with the advisor and department.
What happened? A few years later the professor was promoted to chair of the department.
If you're working on spatial transcriptomics, I think you'll find @LambdaMoses' "Museum of Spatial Transcriptomics", which analyzes the field via its metadata, to be an incredibly useful resource. biorxiv.org/content/10.110… 1/11
The museum is organized as a main paper that provides an overview of a book (i.e., the Supplementary Material), which is based on a database of papers in the field compiled by @LambdaMoses. First, the database: docs.google.com/spreadsheets/d… It contains several hundred papers. 2/11
To undertake a comprehensive study of the field, @LambdaMoses read all these papers carefully, starting with the "prequel" literature to establish historical context. The database has detailed metadata, including a summary of each paper. This timeline covers just the prequel. 3/11