A thread about curated databases in genomics:

The first database I curated by hand was for my Ph.D. thesis. It consisted of 117 orthologous human and mouse genes (this was in the late 90s, before either genome was sequenced!). It's still up: cb.csail.mit.edu/cb/crossspecie…
Compiling this database was hard. It required combing through GENBANK, performing alignments to check for orthology, examining proteins for homology, etc. The database was generated for benchmarking a gene prediction tool, but I found that the curation had much more value than that.
The process of compiling the database taught me a ton about the state of gene sequences in GENBANK, challenges in sequence alignment, functional annotation etc. I learned a lot making this database. Also others found it useful in derivative work: korflab.ucdavis.edu/~genis/documen….
Of course my database of 117 human and mouse genes is now obsolete. This ends up being the case with most hand-curated databases. I think that's ok. Engaging in the process is, in my opinion, undervalued. And the databases can be very useful while they last.
One database that is very useful in the single-cell RNA-seq domain right now is this one compiled by @vallens, that we just published:
To make this database, @vallens didn't just scrape Google Scholar with some script. The detailed information in the different fields required reading the papers. It is a Herculean task, and probably impossible if one started now. @vallens started this not long after day one.
There is a complementary dataset, not of papers publishing #scRNAseq datasets, but of tools for their analysis made by @_lazappi_: journals.plos.org/ploscompbiol/a…
Together these databases tell an interesting story of a rapidly expanding field, where new datasets are driving tool development and vice versa:
One of the great things about these databases (unlike mine in the late 90s) is that they can be scraped. The plot above was made using this @GoogleColab notebook today: colab.research.google.com/github/pachter…
FYI: Nucleic Acids Research has a database issue every year, and many of the databases are valuable, but sadly not all are open and usable in the ways described above.

academic.oup.com/nar/issue/48/D1
Back to my thesis: I had one hardcopy that I kept for the last 20 years. It was stored in a box in our lab, and two months ago it was destroyed in a flood (thank you 2020!). Then again, half of it consisted of a printout of the entire 117 gene database. It wasn't on anyone's reading list...

