PhD. Systems Bio, comparing things to each other, protein language models, plants, evo, proteomics. Princeton, Lewis-Sigler Scholar, prev: UT Austin
Nov 8, 2022 • 10 tweets • 5 min read
New preprint with @ProfMonaSingh. We present vcMSA, a totally new algorithm for multiple sequence alignment that's based on clustering protein language representations of amino acids. No gap penalties, substitution matrices, or guide trees required. biorxiv.org/content/10.110…
Language models output a long numeric vector for each amino acid in a sequence. Amino acids at equivalent positions in different sequences should have similar vectors. Clustering these vectors is the basis of vcMSA (vector-clustering Multiple Sequence Alignment).
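To make the clustering idea concrete, here is a toy sketch in Python. It is not the vcMSA implementation: the 3-dimensional vectors are hand-made stand-ins for real language-model output, and the greedy cosine-similarity clustering, the 0.95 threshold, and the names cluster_residues and render are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def cluster_residues(embeddings, threshold=0.95):
    """Greedily group residues whose vectors are similar (hypothetical helper).

    embeddings maps a sequence name to a list of per-residue vectors.
    Each cluster holds at most one residue per sequence, like an alignment column.
    """
    clusters = []  # each entry: {"members": [(name, position)], "rep": vector}
    for name, vectors in embeddings.items():
        for pos, vec in enumerate(vectors):
            best, best_sim = None, threshold
            for cluster in clusters:
                if any(member[0] == name for member in cluster["members"]):
                    continue  # this sequence already occupies the column
                sim = cosine(vec, cluster["rep"])
                if sim >= best_sim:
                    best, best_sim = cluster, sim
            if best is None:
                clusters.append({"members": [(name, pos)], "rep": vec})
            else:
                best["members"].append((name, pos))
    return [cluster["members"] for cluster in clusters]

def render(sequences, clusters):
    """Write each cluster as a column, using '-' where a sequence has no residue."""
    rows = {}
    for name, seq in sequences.items():
        row = []
        for members in clusters:
            pos = next((p for n, p in members if n == name), None)
            row.append("-" if pos is None else seq[pos])
        rows[name] = "".join(row)
    return rows

# Made-up sequences and made-up vectors (assumption: real vectors are much longer).
sequences = {"seqA": "MKV", "seqB": "MV"}
embeddings = {
    "seqA": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    "seqB": [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]],
}
clusters = cluster_residues(embeddings)
alignment = render(sequences, clusters)
print(alignment)
```

In this toy case the two M residues and the two V residues end up in shared clusters, so seqB is rendered as M-V against seqA's MKV. The columns come out in the right order only because the longest sequence is processed first; ordering clusters consistently for real data is a separate problem that this sketch does not attempt.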
Jan 1, 2019 • 7 tweets • 7 min read
This morning I attacked my problem of showing hierarchical clusters without just making a dendrogram tree (left pic). Lots of plots were made over the course of the day to get to the plot on the right, so I decided to show the whole process #dataviz 1/5
First approach was putting the unclustered graph into Large Graph Layout. I have to consult my own blogpost to run the program, so I guess write for yourself: clairemcwhite.github.io/lgl-guide/. The graph is too dense for even the magic bullet of LGL to get good separation w/o clustering 2/5
May 27, 2018 • 5 tweets • 1 min read
It is a frustrating situation. To give context to others:
I track plant proteomics. When I read Mesnage 2016 something felt off. In the suppl data, protein masses were way too small, and there were protein IDs listed multiple times w/ both increasing & decreasing fold changes 1/5
I found that while the papers claimed to find protein-level changes, they actually quantified peptides. If any peptide measured differently between samples, the entire protein was called perturbed, even if all other peptides mapping to that protein had no/conflicting changes 2/5
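A small sketch of why the "any peptide" rule is a problem. This is not the papers' pipeline or a reanalysis of their data: the peptide values are made up, and the log2 fold-change cutoff of 1.0, the function name summarize_proteins, and the stricter "all peptides agree" rule are assumptions chosen only to show the contrast.

```python
from collections import defaultdict

def summarize_proteins(peptide_log2fc, cutoff=1.0):
    """Compare two ways of calling a protein changed (hypothetical helper).

    peptide_log2fc is a list of (protein_id, log2 fold change) pairs,
    one pair per quantified peptide.
    """
    by_protein = defaultdict(list)
    for protein, log2fc in peptide_log2fc:
        by_protein[protein].append(log2fc)

    summary = {}
    for protein, values in by_protein.items():
        up = sum(v >= cutoff for v in values)
        down = sum(v <= -cutoff for v in values)
        summary[protein] = {
            # Rule described in the thread: one changed peptide flags the protein.
            "any_peptide_changed": up + down > 0,
            # Stricter rule (an assumption): every peptide changes, same direction.
            "all_peptides_agree": up == len(values) or down == len(values),
            # Peptides of the same protein moving in opposite directions.
            "conflicting": up > 0 and down > 0,
        }
    return summary

# Made-up peptide measurements, for illustration only.
peptides = [
    ("P1", 1.5), ("P1", -1.2), ("P1", 0.1),  # peptides disagree
    ("P2", 1.3), ("P2", 1.6),                # peptides agree (both up)
    ("P3", 0.0), ("P3", 0.2),                # nothing changes
]
summary = summarize_proteins(peptides)
for protein, flags in summary.items():
    print(protein, flags)
```

Under the "any peptide" rule, P1 is called perturbed even though its peptides move in opposite directions, which matches the pattern of the same protein ID appearing with both increasing and decreasing fold changes.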