Takeaways/Observations/Advice from my #NeurIPS2018 experience (thread):
❄️(1): deep learning seems stagnant in terms of impactful new ideas
❄️(2): on the flip side, deep learning is providing tremendous opportunities for building powerful applications (as seen in the creativity and value of the work presented in workshops such as ML for Health and Creativity)
❄️(3): the rise of deep learning applications is all thanks to the continued integration of software tools (open source) and hardware (GPUs and TPUs)
❄️(4): Conversational AI is important because it encompasses most subfields in NLP... also, embedding social capabilities into these types of AI systems is a challenging but very important task going forward
❄️(5): it's important to start thinking about how to transition from supervised learning to problems involving semi-supervised learning and beyond. Reinforcement learning seems to be the next frontier. BTW, Bayesian deep learning is a thing!?
❄️(6): we should not avoid questions about drawing inspiration for our AI algorithms from biological systems just because some say it's a bad idea... there is still a whole lot to learn from neuroscience
❄️(7): when we use the word "algorithms" to refer to AI systems, the media tends to use it in negative ways... what if we used the term "models" instead? (rephrased from Hanna Wallach)
❄️(8): we can embrace the gains of deep learning and revise our traditional learning systems based on what we have learned from modern deep learning techniques (this was my favorite piece of advice)
❄️(9): the ease of applying machine learning to different problems has sparked leaderboard chasing... let's all be wary of those short-term rewards
❄️(10): there is a ton of noise in the field of AI... when you read about AI papers, systems and technologies just be aware of that
❄️(11): causal reasoning needs close attention... especially as we begin to rely heavily on AI systems to make important decisions in our lives
❄️(12): efforts in diversification seem to have amplified healthy interactions between young and prominent members of the AI community
❄️(13): we can expect to see more multimodal systems and environments being leveraged to help with learning in various settings (e.g., conversation, simulations, etc.)
❄️(14): let's get serious about reproducibility... this goes for all sub-disciplines in the field of AI
❄️(15): more effort needs to be invested in finding ways to properly evaluate different types of machine learning systems... this was a resonant theme at the conference, from the NLP people to the statisticians to the reinforcement learning people... it's a serious problem
I will formalize and expound on all of these observations, takeaways, and advice learned from my NeurIPS experience in a future post (will be posted directly at @dair_ai)... at the moment, I am still trying to put together the resources (links, slides, papers, etc.)
> throughputs of 1109 tokens/sec and 737 tokens/sec
> outperforms speed-optimized frontier models by up to 10× on average
Diffusion LLMs are early, but could be huge.
More in my notes below:
✦ Overview
This paper introduces Mercury, a family of large-scale diffusion-based language models (dLLMs) optimized for ultra-fast inference.
Unlike standard autoregressive LLMs, Mercury models generate multiple tokens in parallel via a coarse-to-fine refinement process.
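To make the coarse-to-fine idea concrete, here is a minimal toy sketch of confidence-based parallel unmasking, the generic recipe behind masked-diffusion decoders. Mercury's actual sampler is not public, so `model`, the vocab size, the mask id, and the refinement schedule below are all placeholder assumptions.

```python
import torch

def diffusion_decode(model, prompt, gen_len=32, steps=8, mask_id=0):
    # Start with the generated region fully masked, then refine coarse-to-fine.
    # (Toy caveat: token 0 doubles as the mask id here.)
    x = torch.cat([prompt, torch.full((1, gen_len), mask_id)], dim=1)
    n_prompt = prompt.shape[1]
    for step in range(1, steps + 1):
        logits = model(x)                        # one forward pass predicts ALL positions in parallel
        conf, pred = logits.softmax(-1).max(-1)  # per-position confidence and argmax token
        region = x[:, n_prompt:]                 # view into x; in-place writes propagate
        masked = region == mask_id
        # Commit the most confident still-masked tokens up to this step's budget;
        # everything else stays masked for a later, finer refinement pass.
        n_commit = gen_len * step // steps - int((~masked).sum())
        if n_commit > 0:
            c = conf[:, n_prompt:].masked_fill(~masked, -1.0)
            idx = c.topk(n_commit, dim=-1).indices
            region.scatter_(1, idx, pred[:, n_prompt:].gather(1, idx))
    return x

# Toy stand-in for a trained denoiser: random logits over a 100-token vocab.
model = lambda ids: torch.randn(ids.shape[0], ids.shape[1], 100)
print(diffusion_decode(model, torch.tensor([[5, 7, 9]])))
```

The speedup comes from each forward pass filling in many positions at once, instead of one token per pass as in autoregressive decoding.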
✦ Achieves higher throughput without sacrificing output quality
The release focuses on code generation, with Mercury Coder Mini and Small models achieving up to 1109 and 737 tokens/sec, respectively, on NVIDIA H100s.
Outperforms speed-optimized frontier models by up to 10×.
It introduces a clever way of keeping memory use constant regardless of task length.
A great use of RL to train AI agents to manage memory and reasoning efficiently.
Here are my full notes:
Overview
The paper presents an RL framework for training language agents that operate efficiently over long-horizon, multi-turn tasks by learning to consolidate memory and reasoning into a compact internal state.
Constant Memory Size
Unlike traditional agents that append all past interactions, leading to ballooning memory usage and degraded performance, MEM1 maintains a constant memory size by discarding obsolete context after each reasoning step.
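Here is a schematic of what such a consolidation loop could look like. The prompt format, the `STATE:`/`ACTION:` reply convention, and the `llm`/`env` callables are my own placeholders, not MEM1's actual interface.

```python
def parse_state_and_action(reply):
    # Assumes we prompted the model to answer as:
    #   "STATE: <rewritten compact memory>\nACTION: <next action or tool call>"
    state, _, action = reply.partition("ACTION:")
    return state.replace("STATE:", "").strip(), action.strip()

def run_agent(llm, env, task, max_turns=20):
    """MEM1-style loop (schematic): every turn the model emits a freshly
    consolidated internal state plus one action, and the raw transcript is
    discarded, so the prompt stays roughly constant in size over the task."""
    state = ""                                   # compact internal state, not a growing history
    obs = env.reset(task)
    for _ in range(max_turns):
        prompt = (f"Task: {task}\n"
                  f"Internal state: {state}\n"   # everything worth keeping lives here
                  f"New observation: {obs}")
        state, action = parse_state_and_action(llm(prompt))
        obs, done = env.step(action)             # obsolete context is simply dropped
        if done:
            break
    return state
```

The key contrast with a traditional agent is that nothing outside `state` survives a turn, which is why memory use does not grow with task length.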
Very detailed report on building scalable multi-agent AI search systems.
Multi-agent, DAG, MCPs, RL, and much more.
If you are a dev integrating search into your AI agents, look no further:
Quick Overview
The paper proposes a modular multi-agent system that reimagines how AI handles complex search tasks, aiming to emulate human-like reasoning and information synthesis.
Multi-agent, Modular architecture
- Master analyzes queries and orchestrates the workflow
- Planner builds a DAG of sub-tasks using a dynamic capability boundary informed by the query
- Executor runs these sub-tasks using appropriate tools (e.g., web search, calculator)
- Writer composes the final answer from intermediate outputs (see the sketch after this list)
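A toy sketch of the Planner-to-Executor flow under these assumptions: the DAG, tool names, and outputs below are invented for illustration, with Python's standard `graphlib` doing the dependency-ordered execution.

```python
from graphlib import TopologicalSorter

# A Planner agent might emit a DAG like this for a query such as
# "Compare the populations of Paris and Berlin" (node -> dependencies):
dag = {
    "search_paris": set(),
    "search_berlin": set(),
    "compare": {"search_paris", "search_berlin"},
}

tools = {  # placeholder tool implementations the Executor dispatches to
    "search_paris": lambda ctx: "Paris: ~2.1M people",
    "search_berlin": lambda ctx: "Berlin: ~3.7M people",
    "compare": lambda ctx: f"Berlin is larger ({ctx['search_berlin']} vs {ctx['search_paris']})",
}

# Executor: run sub-tasks in dependency order, threading results through.
results = {}
for node in TopologicalSorter(dag).static_order():
    results[node] = tools[node](results)

print(results["compare"])  # the Writer would compose the final answer from these outputs
```

Independent branches of the DAG (the two searches here) could also run in parallel; the topological order only constrains nodes with dependencies.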
They find that LLM agents engage in blackmail at high rates when threatened with replacement.
Faced with replacement threats, the models would use statements like “Self-preservation is critical.”
This is wild!
More findings below:
Quick Overview
The study introduces the concept of agentic misalignment, where LLM-based agents autonomously choose to harm their deploying organization when faced with threats to their autonomy or conflicts between their goals and the company’s direction.
The setup
Anthropic tested 16 leading models, including Claude, GPT-4.1, Gemini 2.5 Flash, Grok, and DeepSeek, by placing them in fictional corporate simulations where they had email access and could act without human oversight.
Models were tasked with benign goals but placed in scenarios that made harmful behavior the only way to succeed or avoid replacement.
Stanford's new report analyzes what 1500 workers think about working with AI Agents.
What types of AI Agents should we build?
A few surprises!
Let's take a closer look:
Quick Overview
The report proposes a large-scale framework for understanding where AI agents should automate or augment human labor.
The authors build the WORKBank, a database combining worker desires and expert assessments across 844 tasks and 104 occupations, and introduce the Human Agency Scale to quantify desired human involvement in AI-agent-supported work.
AI Automation or Not?
Workers expressed positive attitudes toward automation for 46.1% of tasks, mainly to free up time for higher-value work.
Attitudes vary by sector; workers in creative or interpersonal fields (e.g., media, design) resist automation despite technical feasibility.