In Feb 2020 SARS-CoV-2 still felt far away, a twitter feed of China + isolated cases in other countries (remember #ncov2019?)
I was driving to NC to start my new job at @UNC_Lineberger and was thinking about making a peptide vaccine for SARS-CoV-2...
2/
Why a peptide vaccine? It was honestly the primary approach I had experience with from my work with @BhardwajLab / @OpenVax on the PGV trials at @IcahnMountSinai. I had seen that peptides+poly-ICLC could get strong T-cell responses *and*...
3/
...even seen evidence from other trials for antibody/B-cell responses against linear B-cell epitopes.
In March, these nascent thoughts merged with work the @bgvincentlab was doing on viral epitope identification and we started working on SARS-CoV-2 vaccine design together.
4/
Since we were working in the frantic 1st month of Covid-19 in the US, we tried being radically collaborative. I started an open Slack channel (#DownWithTheCrown), invited people from different groups, and everyone who contributed data or analyses ended up being an author.
5/
The basic approach we settled on was:
- predicting T-cell epitopes
- identifying linear B-cell epitopes from convalescent patient data
- filtering epitope predictions on accessibility/polymorphism/&c
- rolling up epitope predictions into longer vaccine peptides
6/
T-cell epitope prediction:
Like many other groups we started with population coverage of predicted HLA-I/II ligands for common alleles.
We then further filtered these predictions to increase the specificity of our epitope selection...
7/
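A minimal sketch of what "population coverage" means here, assuming Hardy-Weinberg proportions and independence between HLA loci. The function name and the toy allele frequencies are made up for illustration; they are not the code or data from the paper.

```python
def population_coverage(allele_freqs, covered_alleles):
    """Fraction of a population carrying at least one covered allele.

    allele_freqs: {locus: {allele: frequency}}
    covered_alleles: alleles with at least one predicted ligand
    Assumes Hardy-Weinberg proportions and independent loci.
    """
    p_uncovered = 1.0
    for freqs in allele_freqs.values():
        covered_freq = sum(f for a, f in freqs.items() if a in covered_alleles)
        # Diploid: both chromosomes must carry an uncovered allele at this locus
        p_uncovered *= (1.0 - covered_freq) ** 2
    return 1.0 - p_uncovered


# Toy frequencies, invented for illustration (not real population data)
toy_freqs = {
    "HLA-A": {"A*02:01": 0.25, "A*01:01": 0.15, "A*03:01": 0.10},
    "HLA-B": {"B*07:02": 0.10, "B*08:01": 0.10},
}
coverage = population_coverage(toy_freqs, {"A*02:01", "B*07:02"})
print(f"coverage: {coverage:.3f}")
```

Adding ligands for more alleles can only raise this number, which is why coverage alone is a sensitive but not very specific selection criterion.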
I think the most significant filters were:
- only keeping the most abundant viral proteins (which we identified as S, M, & N from mass spec data)
- dropping epitopes which overlap polymorphic sites (based on GISAID data in Spring 2020)
- immunogenicity prediction!
8/
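The three filters above could be chained roughly like this. Everything here is a hypothetical sketch: the record fields, the score threshold, the coordinates and the placeholder peptide names are assumptions, not values from our pipeline.

```python
ABUNDANT_PROTEINS = {"S", "M", "N"}  # proteins kept by the abundance filter


def passes_filters(epitope, polymorphic_sites, min_score):
    """Apply abundance, polymorphism and immunogenicity filters to one candidate.

    epitope: dict with "protein", "start", "end" (half-open) and "score"
    polymorphic_sites: {protein: set of polymorphic residue positions}
    """
    protein = epitope["protein"]
    if protein not in ABUNDANT_PROTEINS:
        return False
    sites = polymorphic_sites.get(protein, set())
    # Drop the candidate if any of its residues is a polymorphic site
    if any(pos in sites for pos in range(epitope["start"], epitope["end"])):
        return False
    return epitope["score"] >= min_score


# Toy candidates with placeholder names and invented coordinates/scores
candidates = [
    {"peptide": "pepA", "protein": "S", "start": 100, "end": 109, "score": 0.8},
    {"peptide": "pepB", "protein": "S", "start": 200, "end": 209, "score": 0.9},
    {"peptide": "pepC", "protein": "ORF1b", "start": 50, "end": 59, "score": 0.95},
    {"peptide": "pepD", "protein": "N", "start": 10, "end": 19, "score": 0.2},
]
polymorphic = {"S": {205}}
kept = [e["peptide"] for e in candidates
        if passes_filters(e, polymorphic, min_score=0.5)]
print(kept)
```

In this toy run pepB is dropped for overlapping a polymorphic site, pepC for being in a low-abundance protein, and pepD for a low immunogenicity score.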
We also predicted binding of peptides to murine MHCs (H2b, H2d haplotypes) since we wanted to be able to test the vaccine in common mouse strains (BALB/c & BL/6).
Hotspots w/ predicted high affinity / high frequency HLA-I/II ligands were in the low abundance ORF1b polyprotein
9/
At this point I think it's important to say that MHC binding prediction != T-cell epitope prediction.
Even if your MHC predictions are perfect, they only capture *potential* T-cell epitopes. High sensitivity, low specificity. To go further...
10/
...you need to model both intracellular factors which determine which peptides can even make it onto MHC (antigen processing) and factors relating to T-cell recognition of presented MHC ligands (e.g. preference for larger residues bulging out of the MHC).
11/
To capture immunogenicity beyond MHC binding we built a model based on viral peptide T-cell tetramer data in IEDB.
We used tetramer data (and not e.g. ELISpot) because tetramers have biologically unambiguous allele assignment.
12/
The features were a mix of amino acid sequence features and predictor outputs from NetMHCpan (NetMHCIIpan for CD4+), as well as MHCflurry's multiple outputs (for CD8+)
CD4+ T-cell responses turned out to be easier to predict, I think artifactually (HLA-II tetramers are trickier)
13/
For linear B-cell epitopes, we combined data from multiple assays screening convalescent patient serum for reactivity against linear regions of the SARS-CoV-2 spike protein.
(peptide arrays, PhIPseq & good old ELISA)
We kept regions that were recurrent across sources...
14/
...and further filtered by an estimate of glycosylated residue accessibility computed by @glycam.
The accessible recurrent linear B-cell epitopes were further filtered to remove glycosites, polymorphic residues and restricted to be near known functional regions.
15/
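One way to picture the "recurrent across sources" step is a per-residue vote across assays. This is a sketch under assumptions: the two-source minimum, the source names and the intervals are invented, and the real analysis may have defined recurrence differently.

```python
def recurrent_regions(source_regions, min_sources=2):
    """Merge residues supported by at least min_sources distinct sources.

    source_regions: {source_name: [(start, end), ...]} with half-open intervals
    Returns a sorted list of (start, end) half-open intervals.
    """
    support = {}
    for regions in source_regions.values():
        covered = set()
        for start, end in regions:
            covered.update(range(start, end))
        for pos in covered:  # each source votes once per residue
            support[pos] = support.get(pos, 0) + 1
    kept = sorted(pos for pos, n in support.items() if n >= min_sources)
    merged = []
    for pos in kept:
        if merged and pos == merged[-1][1]:
            merged[-1][1] = pos + 1  # extend the current run
        else:
            merged.append([pos, pos + 1])  # start a new run
    return [tuple(run) for run in merged]


# Toy reactive regions per assay (invented coordinates)
toy_sources = {
    "peptide_array": [(10, 25)],
    "phipseq": [(18, 30), (50, 60)],
    "elisa": [(20, 28), (55, 58)],
}
regions = recurrent_regions(toy_sources, min_sources=2)
print(regions)
```

The accessibility, glycosite, polymorphism and functional-region filters would then be applied to the intervals this returns.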
Which functional regions? The receptor binding domain (RBD), fusion peptide (FP) and HR1/2 (heptad repeats).
These were what we knew from the SARS literature to be domains which were essential for viral entry.
After all these filters we had only 3 linear B-cell epitopes!
16/
One was in the receptor binding motif (RBM), which makes contact with ACE2 on human cells. A promising target!
Another was immediately downstream of the RBD in a hinge region and the last was at the end of the fusion peptide, which also overlaps with the S2' cleavage site.
17/
Now are these the most important antibody epitopes on the spike protein? No way! We know the contact sites of many neutralizing antibodies for SARS-CoV-2 and almost all of them are highly conformational.
But if you're going to use peptides, these are a good start
18/
We took all these human T-cell epitope predictions, murine MHC ligands, filtered linear B-cell epitopes and rolled them up into different 27mer vaccine peptide sets according to combinatorial criteria:
19/
"Rolling up" means choosing subsequences of the spike (S), membrane (M), or nucleocapsid (N) proteins which have the most epitopes of the desired categories.
Different vaccine peptide set criteria often struck on the same regions, so we ended up with 22 vaccine peptides
20/
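A toy version of "rolling up": slide a 27-residue window along a protein and keep the window that fully contains the most epitopes. This is only a sketch. The real criteria were combinatorial across epitope categories, and the function name and intervals below are made up.

```python
def best_window(protein_length, epitopes, window=27):
    """Return (start, count) for the window containing the most epitopes.

    epitopes: list of (start, end) half-open intervals on the protein
    An epitope counts only if it lies entirely inside the window.
    Ties are broken in favour of the leftmost window.
    """
    best_start, best_count = 0, -1
    last_start = max(0, protein_length - window)
    for start in range(last_start + 1):
        end = start + window
        count = sum(1 for s, e in epitopes if s >= start and e <= end)
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count


# Toy epitope intervals on a hypothetical 100-residue protein
toy_epitopes = [(10, 19), (15, 24), (20, 35), (60, 69)]
start, count = best_window(100, toy_epitopes)
print(f"best 27mer starts at {start} and contains {count} epitopes")
```

To build a whole peptide set you would repeat this per protein and per criterion, then merge windows that land on the same region.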
We then tested our vaccine peptide selection in two ways:
(1) Since some time had passed since our preprint, by the end of 2020 there were a good number of published SARS-CoV-2 T-cell epitopes. How many of those did our vaccine peptides contain?
(2) Vaccinate some mice!
21/
When we looked at overlap with known T-cell epitopes we found that we had captured the two most recurrent identified T-cell epitopes in humans, both of which are in the nucleocapsid protein. However, our strict abundance filter made us miss epitopes in nsp3, ORF3a, &c
22/
For vaccine experiments we reduced the vaccine peptide set to 16 peptides (due to overlap) and then vaccinated BALB/c mice with poly(I:C) +/- vaccine peptides.
We got T-cell responses to the peptides containing T-cell epitopes but not those containing only B-cell epitopes
23/
Unfortunately, we didn't get any binding of mouse antibodies to spike protein (in an ELISA experiment) after vaccination with our selected linear B-cell epitopes.
My hunch is that the flexibility of linear peptides means antibodies raised against them don't cross-react with the protein
24/
tl;dr
We designed a peptide vaccine for SARS-CoV-2 using mostly computational methods (but with some B-cell assay data). Some of the T-cell targets ended up being common epitopes in humans & elicited the desired responses in mice. Ab responses are either deficient or irrelevant!
Let's say you want to publish in a top-tier journal and need to have a high accuracy predictor of something of great medical importance, such as survival of cancer patients in response to immunotherapy. The easiest route is just to cheat: model.fit(X,y).predict(X)
Since you're just memorizing the test data (or maybe a superset of it), the predictor works better as you add more features.
Conversely, if you don't cheat but use kinda uninformative features (eg extracted 4mer subsequences from predicted neoantigens), you get models of very low predictive value -- there might be a tiny bit of signal in there (maybe driven by TMB) but mostly survival curves overlap
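Here is the cheat in miniature, as a sketch. A 1-nearest-neighbor "model" stands in for any memorizer (it is not the model from any particular paper), the features are random numbers and the labels are pure noise, so there is nothing real to learn.

```python
import random


def nearest_neighbor_predict(train_X, train_y, query):
    """Predict the label of the closest training row (pure memorization)."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    best = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], query))
    return train_y[best]


def accuracy(train_X, train_y, eval_X, eval_y):
    """Fraction of eval rows whose predicted label matches the true label."""
    hits = sum(nearest_neighbor_predict(train_X, train_y, x) == y
               for x, y in zip(eval_X, eval_y))
    return hits / len(eval_y)


rng = random.Random(0)
n_features = 20
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(200)]
y = [rng.randint(0, 1) for _ in range(200)]  # labels are pure noise

train_X, train_y = X[:100], y[:100]
test_X, test_y = X[100:], y[100:]

# Cheating: evaluate on the same rows the model was fit on
cheating_acc = accuracy(train_X, train_y, train_X, train_y)
# Honest: evaluate on rows the model has never seen
honest_acc = accuracy(train_X, train_y, test_X, test_y)
print(f"cheating: {cheating_acc:.2f}, honest: {honest_acc:.2f}")
```

The cheating accuracy is perfect because every row is its own nearest neighbor, while the honest accuracy hovers around a coin flip.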
Finishing up my Defense Against the Dark Arts syllabus and realized I don't have anything that's really specific to large deep learning models applied to -omics or other biomedical data.
What are some egregious missteps there which show up in high profile papers?
Tentative list of topics:
1) Just make it up: image manipulation and tables which don't add up
2) Try everything and report the small p-values: uncorrected multiple hypothesis testing
3) My predictor is perfect on the training set: lack of independent validation data over-estimates accuracy
4) My predictor is perfect if I tell it the labels: information leakage between training and test datasets
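For topic 2, a small sketch of why uncorrected testing misleads: under the null, p-values are uniform on [0, 1], so about 5% of tests fall below 0.05 by chance. The Bonferroni and Benjamini-Hochberg helpers below are plain-Python illustrations written for this list, not a library API.

```python
import random


def bonferroni_hits(p_values, alpha=0.05):
    """Indices significant after Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p < threshold]


def benjamini_hochberg_hits(p_values, alpha=0.05):
    """Indices significant under the Benjamini-Hochberg FDR procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank  # largest rank that passes its threshold
    return sorted(order[:cutoff_rank])


rng = random.Random(0)
null_p = [rng.random() for _ in range(1000)]  # 1000 tests with no real effect
naive_hits = [i for i, p in enumerate(null_p) if p < 0.05]
print("uncorrected 'discoveries':", len(naive_hits))
print("after Bonferroni:", len(bonferroni_hits(null_p)))
print("after Benjamini-Hochberg:", len(benjamini_hochberg_hits(null_p)))

# A small hand-checkable example
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
example_bh = benjamini_hochberg_hits(example_p)
example_bonf = bonferroni_hits(example_p)
```

In the hand-checkable example, five p-values are below 0.05 but only the two smallest survive Benjamini-Hochberg and only the smallest survives Bonferroni.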
Preprint live: biorxiv.org/content/10.110…
tl;dr We did not disprove the role of T-cells in clearing SARS-CoV-2 (sorry to disappoint) but we did manage to make a vaccine which induces strong but useless T-cell responses against SARS-CoV-2. Details & guess at interpretation below 1/n
Many SARS-CoV-2 vaccines present the immune system with the spike protein or the parts of the spike most vulnerable to neutralization (S1 portion or even just the RBD). This antigenic content can be made directly (protein subunit), encoded as mRNA, DNA, DNA in a virus, &c
2/n
Versions of all of these approaches can effectively induce neutralizing antibodies. They also often achieve T-cell responses whose significance has been hotly debated.
(some people feel strongly that T-cells save us from new variants, some claim they do nothing at all)
3/n