Post

How to get URL link on X (Twitter) App

On the Twitter thread, click on or icon on the bottom
Click again on or Share Via icon
Click on Copy Link to Tweet
Paste it above and click "Unroll Thread"!
More info at Twitter Help

elvis

@omarsar0

Dec 14, 2018 • 16 tweets • 3 min read • Read on X

Takeaways/Observations/Advice from my #NeurIPS2018 experience (thread):
❄️(1): deep learning seems stagnant in terms of impactful and new ideas

❄️(2): on the flip side, deep learning is providing tremendous opportunities for building powerful applications (could be seen from the amount of creativity and value of works presented in workshops such as ML for Health and Creativity)

❄️(3): the rise of deep learning applications is all thanks to the continued integration of software tools (open source) and hardware (GPUs and TPUs)

❄️(4): Conversational AI is important because it encompasses most subfields in NLP... also, embedding social capabilities into these type of AI systems is a challenging task but very important one going forward

❄️(5): it's important to start to think about how to transition from supervised learning to problems involving semi-supervised learning and beyond. Reinforcement learning seems to be the next frontier. BTW, Bayesian deep learning is a thing!?

❄️(6): we should not avoid the questions or the thoughts of inspiring our AI algorithms based on biological systems just because people are saying this is bad... there is still a whole lot to learn from neuroscience

❄️(7): when we use the word "algorithms" to refer to AI systems it seems to be used in negative ways by the media... what if we use the term "models" instead? (rephrased from Hanna Wallach)

❄️(8): we can embrace the gains of deep learning and revise our traditional learning systems based on what we have learned from modern deep learning techniques (this was my favorite piece of advice)

❄️(9): the ease of applying machine learning to different problems has sparked leaderboard chasing... let's all be careful of those short-term rewards

❄️(10): there is a ton of noise in the field of AI... when you read about AI papers, systems and technologies just be aware of that

❄️(11): causal reasoning needs to be paid close attention... especially as we begin to heavily use AI systems to make important decisions in our lives

❄️(12): efforts in diversification seems to have amplified healthy interactions between young and prominent members of the AI community

❄️(13): we can expect to see more multimodal systems and environments being used and leveraged to help with learning in various settings (e.g., conversation, simulations, etc.)

❄️(14): let's get serious about reproducibility... this goes for all sub-disciplines in the field of AI

❄️(15): more efforts need to be invested in finding ways to properly evaluate different types of machine learning systems... this was a resonant theme at the conference...from the NLP people to the statisticians to the reinforcement learning people... it's a serious problem

@dair_ai

I will formalize and expound on all of these observations, takeaways, and advice learned from my NeurIPS experience in a future post (will be posted directly at @dair_ai)... at the moment, I am still trying to put together the resources (links, slides, papers, etc.)

• • •

Missing some Tweet in this thread? You can try to force a refresh

This Thread may be Removed Anytime!

Twitter may remove this content at anytime! Save it as PDF for later use!

More from @omarsar0

elvis

@omarsar0

Jul 19

Context Rot

Great title for a report, but even better insights about how increasing input tokens impact the performance of top LLMs.

Banger report from Chroma.

Here are my takeaways (relevant for AI devs):

Context Rot

The research evaluates how state-of-the-art LLMs perform as input context length increases, challenging the common assumption that longer contexts are uniformly handled.

Testing 18 top models (including GPT-4.1, Claude 4, Gemini 2.5, Qwen3), the authors show that model reliability degrades non-uniformly even on simple tasks as input grows, what they term "context rot."

Simple tasks reveal degradation

Even basic benchmarks like semantic variants of Needle-in-a-Haystack, repeated word copying, or long QA logs (LongMemEval) expose accuracy drops as context length increases.

The decline is more dramatic for semantically ambiguous inputs or outputs that scale with length.

Read 8 tweets

elvis

@omarsar0

Jul 18

A Survey of Context Engineering

160+ pages covering the most important research around context engineering for LLMs.

This is a must-read!

Here are my notes:

The paper provides a taxonomy of context engineering in LLMs categorized into foundational components, system implementations, evaluation methodologies, and future directions.

The context engineering evolution timeline from 2020 to 2025 involves foundational RAG systems to complex multi-agent architectures.

Read 12 tweets

elvis

@omarsar0

Jul 17

Agent Leaderboard v2 is here!

> GPT-4.1 leads
> Gemini-2.5-flash excels at tool selection
> Kimi K2 is the top open-source model
> Grok 4 falls short
> Reasoning models lag behind
> No single model dominates all domains

More below:

@rungalileo introduces Agent Leaderboard v2, a domain-specific evaluation benchmark for AI agents designed to simulate real enterprise tasks across banking, healthcare, insurance, telecom, and investment.

Unlike earlier tool-calling benchmarks that saturate at 90%+ accuracy, v2 focuses on Action Completion (AC) and Tool Selection Quality (TSQ) in complex, multi-turn conversations.

Read 7 tweets

elvis

@omarsar0

Jul 14

One Token to Fool LLM-as-a-Judge

Watch out for this one, devs!

Semantically empty tokens, like “Thought process:”, “Solution”, or even just a colon “:”, can consistently trick models into giving false positive rewards.

Here are my notes:

Overview

Investigates the surprising fragility of LLM-based reward models used in Reinforcement Learning with Verifiable Rewards (RLVR).

The authors find that inserting superficial, semantically empty tokens, like “Thought process:”, “Solution”, or even just a colon “:”, can consistently trick models into giving false positive rewards, regardless of the actual correctness of the response.

"Master keys" break LLM judges

Simple, generic lead-ins (e.g., “Let’s solve this step by step”) and even punctuation marks can elicit false YES judgments from top reward models.

This manipulation works across models (GPT-4o, Claude-4, Qwen2.5, etc.), tasks (math and general reasoning), and prompt formats, reaching up to 90% false positive rates in some cases.

Read 6 tweets

elvis

@omarsar0

Jul 10

BREAKING: xAI announces Grok 4

"It can reason at a superhuman level!"

Here is everything you need to know:

Elon claims that Grok 4 is smarter than almost all grad students in all disciplines simultaneously.

100x more training than Grok 2.

10x more compute on RL than any of the models out there.

Performance on Humanity's Last Exam

Elon: "Grok 4 is post-grad level in everything!"

Read 21 tweets

elvis

@omarsar0

Jul 8

MemAgent

MemAgent-14B is trained on 32K-length documents with an 8K context window.

Achieves >76% accuracy even at 3.5M tokens!

That consistency is crazy!

Here are my notes:

Overview

Introduces an RL–driven memory agent that enables transformer-based LLMs to handle documents up to 3.5 million tokens with near lossless performance, linear complexity, and no architectural modifications.

RL-shaped fixed-length memory

MemAgent reads documents in segments and maintains a fixed-size memory updated via an overwrite mechanism.

This lets it process arbitrarily long inputs with O(N) inference cost while avoiding context window overflows.

Read 6 tweets

Support us! We are indie developers!

This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Don't want to be a Premium member but still want to support us?

Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal

Or Donate anonymously using crypto!

Ethereum

0xfe58350B80634f60Fa6Dc149a72b4DFbc17D341E copy

Bitcoin

3ATGMxNzCUFzxpMCHL5sWSt4DVtS8UqXpi copy

Thank you for your support!

Share this page!

Enter URL or ID to Unroll

elvis

Try unrolling a thread yourself!

More from @omarsar0

elvis

elvis

elvis

elvis

elvis

elvis

Did Thread Reader help you today?

Don't want to be a Premium member but still want to support us?

Send Email!