Self-supervised learning (SSL) algorithms can drastically reduce the need for labeling by pretraining on unlabeled data
But designing SSL methods is hard and can require lots of domain-specific intuition and trial and error
2/
We designed DABS to drive progress in domain-agnostic SSL
Our benchmark addresses three core modeling components in SSL algorithms:
(1) architectures (2) pretraining objectives (3) transfer methods
3/
1) Architectures:
Most models are designed for particular modalities (e.g. ResNets for images)
But Transformers have recently been applied to many settings, and Perceivers are even more general
What architectures are general, efficient, and learn the best representations?
4/
2) Pretraining objectives:
We currently have domain-specific ways to extract signal from unlabeled data
Language modeling prevails in NLP, while contrastive learning is more common in vision
Can we uncover unifying principles and methods that work well on any domain?
5/
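For readers less familiar with contrastive learning, here's a minimal sketch of an InfoNCE-style loss in plain Python. It is only an illustration of the general idea (pull an anchor toward its positive, push it away from negatives), not the objective used by the DABS baselines. The function names, the toy vectors, and the temperature of 0.1 are all assumptions made for this example, and the inputs are assumed to be nonzero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: negative log-softmax of the positive's similarity.

    Hypothetical helper for illustration; the temperature value is an assumption.
    """
    # The positive's logit goes first, followed by one logit per negative.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_denominator - logits[0]


# Toy usage: the loss is low when the positive matches the anchor,
# and high when a negative matches the anchor better than the positive does.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

In vision, the anchor and positive would typically be embeddings of two augmented views of the same image; which augmentations make sense in other domains is exactly the kind of domain-specific choice the question above is about.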
3) Transfer learning:
Full finetuning, linear evaluation, p-tuning, prompt tuning, prefix tuning… there's a whole range of techniques to adapt models to downstream tasks.
Do these work equally well across domains? What are the tradeoffs, and do better methods exist?
6/
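To make the difference between two of these transfer methods concrete, here's a toy sketch in plain Python. It assumes a deliberately tiny "model" with one scalar encoder weight and one scalar head weight, trained with squared error; the function name `sgd_step` and all the numbers are hypothetical and not part of DABS. Linear evaluation freezes the pretrained encoder and trains only the head, while full finetuning updates both.

```python
def sgd_step(w_enc, w_head, x, y, lr=0.1, train_encoder=False):
    """One SGD step on squared error for the toy model pred = w_head * (w_enc * x).

    train_encoder=False -> linear evaluation (encoder frozen, head trained)
    train_encoder=True  -> full finetuning (encoder and head both trained)
    """
    feature = w_enc * x          # output of the "pretrained" encoder
    pred = w_head * feature      # output of the linear head
    err = pred - y
    # Gradients of (pred - y)^2 with respect to each weight.
    grad_head = 2 * err * feature
    grad_enc = 2 * err * w_head * x
    new_head = w_head - lr * grad_head
    new_enc = w_enc - lr * grad_enc if train_encoder else w_enc
    return new_enc, new_head


# Same starting weights and data, two transfer methods.
linear_eval = sgd_step(1.0, 0.5, x=2.0, y=3.0, train_encoder=False)
full_finetune = sgd_step(1.0, 0.5, x=2.0, y=3.0, train_encoder=True)
```

In a real framework the same choice usually comes down to which parameters are handed to the optimizer; the tradeoff is that linear evaluation is cheaper and leaves the pretrained weights untouched, while full finetuning can adapt the representation itself.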
Datasets & Domains
DABS is organized into 7 domains: natural images, speech, English-language text, multilingual text, wearable sensors, chest x-rays, and images w/ text descriptions.
Each domain has an unlabeled dataset for pretraining and downstream datasets for transfer
7/
The goal is to find a *single* SSL algorithm that performs well across all of these domains
We kick off the challenge with two new baselines using Transformers, where the pretraining objectives are based on the input embeddings. There's a lot of headroom left!
8/
To assess real-world generalization, DABS is a *living benchmark*:
We'll be adding more domains focusing on scientific and other real-world applications
Proposed algorithms will be tested on these new domains to see how well they hold up
9/
We hope DABS helps yield new insights about why / when SSL works, and helps make it a more mature technology that can be used off-the-shelf in scientific, medical, and other high-impact fields
10/
Also, if you're a domain expert interested in adding a domain for your field (unlabeled dataset + labeled downstream tasks), please reach out!
11/
This is joint work w/ Vincent Liu, Rongfei Lu, Daniel Fein, Colin Schultz, and Noah Goodman! @StanfordAILab @stanfordnlp
Love the "data science maturity levels" in @Patterns_CP
Interesting way to contextualize research at a glance (reminds me a bit of @justsaysinmice)
Full list in thread:
1) Concept
Basic principles of a new data science output observed and reported (e.g., statement of principles, dataset, new algorithm, new theoretical concept, theoretical system infrastructure)
2) Proof-of-concept
Data science output has been formulated, implemented, and tested for one domain/problem (e.g., dataset with rich domain-specific metadata, algorithm coded up as software, principles with expanded guidance on how to implement them)
Some takeaways from @openai's impressive recent progress, including GPT-3, CLIP, and DALL·E:
[THREAD]
👇1/
1) The raw power of dataset design.
These models aren't radically new in their architecture or training algorithm
Instead, their impressive quality is largely due to careful training at scale of existing models on large, diverse datasets that OpenAI designed and collected.
2/
Why does diverse data matter? Robustness.
Can't generalize out-of-domain? You might be able to make most things in-domain by training on the internet
But this power comes w/ a price: the internet has some extremely dark corners (and these datasets have been kept private)
3/