Love the "data science maturity levels" in @Patterns_CP
Interesting way to contextualize research at a glance (reminds me a bit of @justsaysinmice)
Full list in thread:
1) Concept
Basic principles of a new data science output observed and reported (e.g., statement of principles, dataset, new algorithm, new theoretical concept, theoretical system infrastructure)
2) Proof-of-concept
Data science output has been formulated, implemented, and tested for one domain/problem (e.g., dataset with rich domain-specific metadata, algorithm coded up as software, principles with expanded guidance on how to implement them)
3) Development/pre-production
Data science output has been rolled out/validated across multiple domains/problems
4) Production
Data science output is validated, understood, and regularly used for multiple domains/problems (e.g., operational data-sharing service across institutes/countries, ML algorithm to tag images, shared data infrastructure to manage access to compute/archive resources)
5) Mainstream
Data science output is well understood and (nearly) universally adopted (e.g., the Internet, citation of articles using DOIs)
How can we choose examples for a model that induce the intended behavior?
We show how *active learning* can help pretrained models choose good examples—clarifying a user's intended behavior, breaking spurious correlations, and improving robustness!
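(For intuition, here's a minimal sketch of uncertainty sampling, one classic active-learning heuristic for picking informative examples — not necessarily the method in the paper. The pool, the `fake_proba` table, and all names are hypothetical toy stand-ins for a real model's predicted probabilities.)

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_examples(pool, predict_proba, k=2):
    """Pick the k pool items the model is most uncertain about.

    pool: list of unlabeled examples
    predict_proba: maps an example to its predicted class probabilities
    """
    ranked = sorted(pool, key=lambda x: entropy(predict_proba(x)), reverse=True)
    return ranked[:k]

# Toy pool: a "model" that is confident on some inputs, uncertain on others.
pool = ["a", "b", "c", "d"]
fake_proba = {
    "a": [0.99, 0.01],  # confident -> labeling it teaches the model little
    "b": [0.55, 0.45],  # uncertain -> informative to label
    "c": [0.90, 0.10],
    "d": [0.50, 0.50],  # maximally uncertain
}
print(select_examples(pool, fake_proba.get, k=2))  # -> ['d', 'b']
```

The idea: asking a user to label the examples the model is least sure about resolves ambiguity about the intended behavior faster than labeling random (often already-easy) examples.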
A quick thread for PhD admits thinking about potential advisors:
I see a lot of discussion about "hands-on" vs "hands-off" advisors
But I think there are at least 3 underlying dimensions here, each of which is worth considering in its own right:
👇 [THREAD]
1/
1) Directiveness—how much your advisor directs your research, in terms of the problems you work on or day-to-day activities
2/
Low directiveness can mean lots of freedom and the space to think big and chart your own path. However, it can also leave some students feeling adrift or unproductive.
3/
Some takeaways from @openai's impressive recent progress, including GPT-3, CLIP, and DALL·E:
[THREAD]
👇1/
1) The raw power of dataset design.
These models aren't radically new in their architecture or training algorithm
Instead, their impressive quality comes largely from carefully training existing models at scale on large, diverse datasets that OpenAI designed and collected.
2/
Why does diverse data matter? Robustness.
Can't generalize out-of-domain? You might be able to make most things in-domain by training on the internet
But this power comes w/ a price: the internet has some extremely dark corners (and these datasets have been kept private)
3/