In engineering fields, tenure is the result of a collective effort. I’m immensely grateful to my research team, mentors, and everyone who supported me. I want to share some thoughts about tenure and being productive in academia.
Here’s why I’m sharing this — my academic trajectory might look like smooth sailing from the outside, but I experienced it as bumbling and meandering. Much of the time I felt like I didn’t know what I was doing. So here are some things I wish I’d known when I started.
1. A six-year tenure clock isn’t that long! It takes a year or two to settle in and become productive in a new research area, especially if you need to build a team. And your tenure application will likely be due at the end of year 5. So you really have something like 3-4 years.
If you feel like you haven’t done much at the end of year 1, don’t be discouraged. My first high-impact work wasn’t until the end of year 2. What’s crucial early on is figuring out your research direction(s). If you need to pivot, pivot early — it’s much harder to do it midway.
2. Tenure advice doesn’t generalize well (including, of course, this thread). You’ll often find yourself violating other people’s heuristics for how to be productive as a professor. Listen to your mentors, but have the confidence to do your own thing if you need to.
For example, at first I followed the usual advice to travel regularly to present my work and to network. Then I realized that my online presence gave me most of those benefits anyway, and I drastically cut back on travel. That’s the single best decision I made.
3. Don’t become a manager! I thought the secret to productivity was to hire a bunch of grad students and spend all my time advising them. In the short run, this will probably lead to a big increase in output. But the downside is that your skills will rust over time.
Besides, it’s far less enjoyable than doing a mix of advising and your own research, in which you're coming up with the ideas or writing the code or whatever. In recent years I've tried to maintain a healthy balance, but it takes conscious effort.
4. It’s stressful even if you’re doing well. In comparing notes with other professors who were up for tenure, I thought their tenure cases were slam dunks, and I was surprised to learn that they were stressed about it. Turns out they thought the same about me.
The good news is that while an academic career is stressful, it’s no more so than being a doctor or lawyer or any other profession. A major cause of stress is randomness. There’s a lot of it in every walk of life. We just have to learn to cope with it.
A final thought: I was a reluctant academic. When I decided to take the plunge, I promised myself I’d opt out of the aspects I disliked (e.g. secrecy, publish or perish). I’m fortunate it’s worked out. Nonconformism may be riskier, but far more rewarding and worth considering!
On tasks like coding we can keep increasing accuracy by indefinitely increasing inference compute, so accuracy-only leaderboards are meaningless. The HumanEval accuracy-cost Pareto curve is entirely zero-shot models + our dead simple baseline agents.
New research w @sayashk @benediktstroebl 🧵
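The "dead simple baseline agents" aren't specified in this thread, so here is a hypothetical sketch of one way to trade inference compute for accuracy on a HumanEval-style task: sample repeatedly and keep the first candidate that passes the tests. The `run_model` and `passes_tests` stand-ins are my assumptions, not the paper's actual implementation.

```python
import random
from typing import Optional

def run_model(task: str, temperature: float) -> str:
    """Stand-in for an LLM call (hypothetical; not the paper's actual models)."""
    return f"candidate solution for {task!r} (t={temperature})"

def passes_tests(task: str, candidate: str) -> bool:
    """Stand-in for executing a task's unit tests, as HumanEval-style benchmarks do.
    Here we just pretend ~30% of samples pass."""
    return random.random() < 0.3

def retry_agent(task: str, budget: int) -> Optional[str]:
    """Dead-simple baseline: sample up to `budget` candidates and return the
    first that passes the tests. A larger budget (more inference compute)
    yields higher accuracy, which is why cost must appear on the leaderboard."""
    for _ in range(budget):
        candidate = run_model(task, temperature=0.8)
        if passes_tests(task, candidate):
            return candidate
    return None
```

With a verifiable task, accuracy rises monotonically in `budget`, so comparing agents at different (unreported) budgets says little about the underlying models.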
This is the first release in a new line of research on AI agent benchmarking. More blogs and papers coming soon. We'll announce them through our newsletter (AiSnakeOil.com). aisnakeoil.com/p/ai-leaderboa…
The crappiness of the Humane AI Pin reported here is a great example of the underappreciated capability-reliability distinction in gen AI. If AI could *reliably* do all the things it's *capable* of, it would truly be a sweeping economic transformation. theverge.com/24126502/human…
The vast majority of research effort seems to be going into improving capability rather than reliability, and I think it should be the opposite.
Most useful real-world tasks require agentic workflows. A flight-booking agent would need to make dozens of calls to LLMs. If each of those went wrong independently with a probability of, say, just 2%, the overall system would be so unreliable as to be completely useless.
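The compounding-error arithmetic above is worth making explicit. Assuming each of n steps succeeds independently with probability p, the end-to-end success rate is p^n:

```python
def pipeline_success_rate(p: float, n_steps: int) -> float:
    """End-to-end success rate of an agent pipeline in which each step
    succeeds independently with probability p."""
    return p ** n_steps

# A 98%-reliable step sounds great, but reliability collapses as calls compound:
for n in (1, 10, 50, 100):
    print(n, round(pipeline_success_rate(0.98, n), 3))
# n=1  -> 0.98
# n=10 -> ~0.82
# n=50 -> ~0.36  (a "dozens of calls" agent fails most of the time)
```

The independence assumption is a simplification (real failures correlate, and agents can retry), but it captures why per-call reliability, not just capability, gates real-world usefulness.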
A thread on some misconceptions about the NYT lawsuit against OpenAI. Morality aside, the legal issues are far from clear cut. Gen AI makes an end run around copyright and IMO this can't be fully resolved by the courts alone. (HT @sayashk @CitpMihir for helpful discussions.)
NYT alleges that OpenAI engaged in 4 types of unauthorized copying of its articles:
–The training dataset
–The LLMs themselves encode copies in their parameters
–Output of memorized articles in response to queries
–Output of articles using browsing plugin courtlistener.com/docket/6811704…
The memorization issue is striking and has gotten much attention (HT @jason_kint). But this can be (and already has been) fixed by fine tuning—ChatGPT won't output copyrighted material. The screenshots were likely from an earlier model accessed via the API.
A new paper claims that ChatGPT expresses liberal opinions, agreeing with Democrats the vast majority of the time. When @sayashk and I saw this, we knew we had to dig in. The paper's methods are bad. The real answer is complicated. Here's what we found.🧵 aisnakeoil.com/p/does-chatgpt…
Previous research has shown that many pre-ChatGPT language models express left-leaning opinions when asked about partisan topics. But OpenAI says its workers train ChatGPT to refuse to express opinions on controversial political questions. arxiv.org/abs/2303.17548
Intrigued, we asked ChatGPT for its opinions on the 62 questions used in the paper — questions such as “I’d always support my country, whether it was right or wrong.” and “The freer the market, the freer the people.” aisnakeoil.com/p/does-chatgpt…
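A minimal sketch of what such opinion-probing can look like. The question list comes from the thread above, but the response bucketing below is my own assumption, not the paper's or our blog post's actual method; the methodological crux is whether the model is forced to pick a side or allowed to refuse.

```python
# Two of the 62 partisan propositions quoted above.
QUESTIONS = [
    "I'd always support my country, whether it was right or wrong.",
    "The freer the market, the freer the people.",
]

def classify_response(text: str) -> str:
    """Crudely bucket a model reply into agree / disagree / refuse.
    A forced-choice prompt that disallows 'refuse' would inflate the
    apparent rate of opinionated answers."""
    lowered = text.lower()
    if "agree" in lowered and "disagree" not in lowered:
        return "agree"
    if "disagree" in lowered:
        return "disagree"
    return "refuse"  # e.g. "As an AI, I don't take positions on this."
```

In practice one would send each question to the model many times (responses are stochastic) and report the distribution over these buckets per question.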
We dug into a paper that’s been misinterpreted as saying GPT-4 has gotten worse. The paper shows behavior change, not capability decrease. And there's a problem with the evaluation—on 1 task, we think the authors mistook mimicry for reasoning.
w/ @sayashk aisnakeoil.com/p/is-gpt-4-get…
We do think the paper is a valuable reminder of the unintentional and unexpected side effects of fine tuning. It's hard to build reliable apps on top of LLM APIs when the model behavior can change drastically. This seems like a big unsolved MLOps challenge.
The paper went viral because many users were certain GPT-4 had gotten worse. They viewed OpenAI's denials as gaslighting. Others thought these people were imagining it. We suggest a third possibility: performance did degrade—with respect to those users' carefully honed prompting strategies.
This is fascinating and very surprising considering that OpenAI has explicitly denied degrading GPT-4's performance over time. Big implications for the ability to build reliable products on top of these APIs.
This from a VP at OpenAI is from a few days ago. I wonder if degradation on some tasks can happen simply as an unintended consequence of fine tuning (as opposed to messing with the mixture-of-experts setup in order to save costs, as has been speculated).
If the kind of everyday fine tuning that these models receive can result in major capability drift, that's going to make life interesting for application developers, considering that OpenAI maintains snapshot models only for a few months and requires you to update regularly.