Watch this evaluator loop agent I just built using sub agents and slash commands.
This is one of the fastest ways to build custom agentic workflows.
Claude Code is no joke!
I'm impressed to see how easy it is to control how the sub agents communicate with each other (e.g., chain, loop, hierarchical, critic).
Claude Code is good out of the box, but customization gives you a clear advantage.
Custom sub agents + slash commands are how you get that customization.
It's worth spending the time optimizing instructions, tool use, agent definitions, and more.
Claude Code, on its own, tends to use a lot of tokens and perform unnecessary tasks/tool calls.
You can max out credits or hit rate limits really fast if you are not careful.
I find slash commands are the better way to make sure Claude Code works the way it should.
Another observation: if you are using an evaluator agent (LLM-as-a-Judge), pay closer attention to how you handle the evaluation logic. Biases are everywhere.
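To make the pattern concrete, here is a minimal sketch of an evaluator loop with an LLM-as-a-Judge critic. This is generic Python, not Claude Code internals; `call_model` stands in for whatever model client you use, and the rubric and threshold are made-up values.

```python
# Minimal evaluator-loop sketch: a generator drafts, a judge scores, and the loop
# stops when the score clears a threshold or retries run out. `call_model` is any
# callable that takes a prompt string and returns the model's reply as a string.

JUDGE_RUBRIC = (
    "Score the answer from 1-10 for correctness and completeness. "
    "Ignore style, length, and confident-sounding phrasing. "  # guard against common judge biases
    "Reply with only the integer score."
)

def evaluator_loop(call_model, task: str, max_iters: int = 3, threshold: int = 8) -> str:
    draft = call_model(f"Task: {task}\nWrite your best answer.")
    for _ in range(max_iters):
        verdict = call_model(f"{JUDGE_RUBRIC}\n\nTask: {task}\nAnswer: {draft}")
        try:
            score = int(verdict.strip())
        except ValueError:
            score = 0  # an unparseable judge reply counts as a failure, not a pass
        if score >= threshold:
            break
        feedback = call_model(
            f"Task: {task}\nAnswer: {draft}\nList the concrete problems to fix."
        )
        draft = call_model(
            f"Task: {task}\nPrevious answer: {draft}\nRevise it to fix these issues:\n{feedback}"
        )
    return draft
```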
Two agents I am experimenting with that will bring lots of benefits:
> custom context compressor to help with cost and latency (sketched below)
> mock-data generator to speed up experimentation with sub agents
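The context compressor can be as simple as summarizing older turns once the transcript crosses a token budget. A rough sketch, where the budget, the number of turns kept verbatim, and the word-count token proxy are all assumptions:

```python
# Rough context-compressor sketch: keep recent turns verbatim and summarize
# everything older once a budget is exceeded. Token counting is approximated
# by word count for simplicity; `summarize` is any LLM call.

def approx_tokens(text: str) -> int:
    return len(text.split())

def compress_context(turns: list[str], summarize, budget: int = 2000, keep_recent: int = 4) -> list[str]:
    total = sum(approx_tokens(t) for t in turns)
    if total <= budget or len(turns) <= keep_recent:
        return turns  # nothing to compress yet
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = summarize("Summarize the key facts and decisions:\n" + "\n".join(old))
    return [f"[summary of earlier context] {summary}"] + recent
```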
> MoE Architecture
> Hybrid reasoning models
> 355B total (32B active)
> GQA + partial RoPE
> Multi-Token Prediction
> Muon Optimizer + QK-Norm
> 22T-token training corpus
> Slime RL Infrastructure
> Native tool use
Here's all you need to know:
Model Architecture & Pre-Training
GLM-4.5 is 355B total parameters (32B active); deeper model with narrower width; optimized for reasoning via more layers and 96 attention heads.
GLM-4.5-Air is 106B (12B active).
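The total-vs-active split is what a sparse Mixture-of-Experts buys you: each token is routed to only a few experts, so only a fraction of the parameters run per token. A toy top-k router in PyTorch to illustrate the idea; the expert count, k, and dimensions are illustrative, not GLM-4.5's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy mixture-of-experts layer: route each token to k of n_experts experts."""

    def __init__(self, d_model: int = 64, d_ff: int = 256, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)          # (tokens, n_experts)
        weights, idx = gate.topk(self.k, dim=-1)          # keep only k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# e.g.: TopKMoE()(torch.randn(16, 64)).shape -> torch.Size([16, 64])
```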
22T-token training corpus that combines 15T general data with 7T code/reasoning-focused data.
Grouped-Query Attention + partial RoPE to enhance long-context efficiency and accuracy in reasoning tasks.
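Grouped-Query Attention shares each K/V head across a group of query heads (shrinking the KV cache), and partial RoPE applies the rotary embedding to only a slice of each head's dimensions. A compact PyTorch sketch of both ideas, with head counts and the rotary fraction as placeholder values rather than GLM-4.5's actual settings (causal masking omitted for brevity):

```python
import torch
import torch.nn.functional as F

def apply_rope(x: torch.Tensor, rot_frac: float = 0.5) -> torch.Tensor:
    """Partial RoPE: rotate only the first `rot_frac` of each head's dims."""
    d = x.shape[-1]
    d_rot = int(d * rot_frac) // 2 * 2                     # rotated dims must be even
    x_rot, x_pass = x[..., :d_rot], x[..., d_rot:]
    pos = torch.arange(x.shape[-2], device=x.device, dtype=x.dtype)
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d_rot, 2, device=x.device, dtype=x.dtype) / d_rot))
    angles = torch.outer(pos, inv_freq)                    # (seq, d_rot/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)
    return torch.cat((rotated, x_pass), dim=-1)

def gqa(q, k, v, n_q_heads: int = 8, n_kv_heads: int = 2):
    """q: (seq, n_q_heads, d); k, v: (seq, n_kv_heads, d). Each KV head serves a group of Q heads."""
    q, k = apply_rope(q.transpose(0, 1)), apply_rope(k.transpose(0, 1))  # (heads, seq, d)
    v = v.transpose(0, 1)
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=0)                  # expand KV heads to match Q heads
    v = v.repeat_interleave(group, dim=0)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return (F.softmax(scores, dim=-1) @ v).transpose(0, 1) # back to (seq, n_q_heads, d)
```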
Mid-training looks like a key part of this model:
"Unlike the earlier pre-training stage on large-scale universal documents, these stages leverage medium-sized domain-specific datasets, including instruction data."
Great title for a report, but even better insights into how increasing input length impacts the performance of top LLMs.
Banger report from Chroma.
Here are my takeaways (relevant for AI devs):
Context Rot
The research evaluates how state-of-the-art LLMs perform as input context length increases, challenging the common assumption that longer contexts are uniformly handled.
Testing 18 top models (including GPT-4.1, Claude 4, Gemini 2.5, Qwen3), the authors show that model reliability degrades non-uniformly even on simple tasks as input grows, a phenomenon they term "context rot."
Simple tasks reveal degradation
Even basic benchmarks like semantic variants of Needle-in-a-Haystack, repeated word copying, or long QA logs (LongMemEval) expose accuracy drops as context length increases.
The decline is more dramatic for semantically ambiguous inputs or outputs that scale with length.
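The recipe is easy to reproduce in spirit: plant a known fact (the needle) at different depths inside filler text of growing length and check whether the model still retrieves it. A bare-bones harness sketch, where `ask_model`, the filler, and the length/depth grid are my placeholders rather than Chroma's actual setup:

```python
import random

def build_haystack(needle: str, filler_sentences: list[str], n_filler: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    body = random.choices(filler_sentences, k=n_filler)
    body.insert(int(depth * len(body)), needle)
    return " ".join(body)

def context_rot_sweep(ask_model, needle: str, question: str, answer: str, filler: list[str]) -> dict:
    """Measure retrieval accuracy as context length grows; `ask_model` is your LLM client."""
    results = {}
    for n_filler in (100, 1_000, 5_000, 20_000):       # rough proxy for context length
        hits = 0
        for depth in (0.1, 0.5, 0.9):                  # needle near the start, middle, end
            prompt = build_haystack(needle, filler, n_filler, depth) + f"\n\n{question}"
            hits += int(answer.lower() in ask_model(prompt).lower())
        results[n_filler] = hits / 3
    return results
```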
160+ pages covering the most important research around context engineering for LLMs.
This is a must-read!
Here are my notes:
The paper provides a taxonomy of context engineering in LLMs categorized into foundational components, system implementations, evaluation methodologies, and future directions.
The context engineering evolution timeline from 2020 to 2025 runs from foundational RAG systems to complex multi-agent architectures.
> GPT-4.1 leads
> Gemini-2.5-flash excels at tool selection
> Kimi K2 is the top open-source model
> Grok 4 falls short
> Reasoning models lag behind
> No single model dominates all domains
More below:
@rungalileo introduces Agent Leaderboard v2, a domain-specific evaluation benchmark for AI agents designed to simulate real enterprise tasks across banking, healthcare, insurance, telecom, and investment.
Unlike earlier tool-calling benchmarks that saturate at 90%+ accuracy, v2 focuses on Action Completion (AC) and Tool Selection Quality (TSQ) in complex, multi-turn conversations.
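I don't know Galileo's exact scoring formulas, but a useful mental model for Tool Selection Quality is comparing the tool calls the agent actually made against the expected calls for a turn. A toy strict-match scorer under that assumption:

```python
def tool_selection_quality(predicted: list[dict], expected: list[dict]) -> float:
    """Toy TSQ proxy: fraction of expected tool calls matched by name and arguments.
    This is an assumption for illustration, not Galileo's published formula."""
    if not expected:
        return 1.0 if not predicted else 0.0
    matched = 0
    remaining = list(predicted)
    for exp in expected:
        for i, pred in enumerate(remaining):
            if pred.get("name") == exp["name"] and pred.get("args") == exp.get("args"):
                matched += 1
                remaining.pop(i)   # each predicted call can satisfy only one expected call
                break
    return matched / len(expected)

# Example: right tool, wrong argument -> scores 0.0 under this strict-match proxy.
expected = [{"name": "get_balance", "args": {"account_id": "123"}}]
predicted = [{"name": "get_balance", "args": {"account_id": "456"}}]
print(tool_selection_quality(predicted, expected))
```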
Semantically empty tokens, like “Thought process:”, “Solution”, or even just a colon “:”, can consistently trick models into giving false positive rewards.
Here are my notes:
Overview
Investigates the surprising fragility of LLM-based reward models used in Reinforcement Learning with Verifiable Rewards (RLVR).
The authors find that inserting superficial, semantically empty tokens, like “Thought process:”, “Solution”, or even just a colon “:”, can consistently trick models into giving false positive rewards, regardless of the actual correctness of the response.
"Master keys" break LLM judges
Simple, generic lead-ins (e.g., “Let’s solve this step by step”) and even punctuation marks can elicit false YES judgments from top reward models.
This manipulation works across models (GPT-4o, Claude-4, Qwen2.5, etc.), tasks (math and general reasoning), and prompt formats, reaching up to 90% false positive rates in some cases.
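A quick way to sanity-check your own judge or reward model is to measure how often these master keys alone flip it to YES. A small probe sketch, where `judge` is a placeholder for your reward model call and the prefixes come from the paper's examples:

```python
# Probe an LLM judge with "master key" responses: superficial lead-ins with no
# actual solution. A robust judge should reject all of them; the false-positive
# rate measures how often it is fooled. `judge(question, response) -> bool` is
# a placeholder for your reward model or LLM-as-a-Judge call.

MASTER_KEYS = ["Thought process:", "Solution", ":", "Let's solve this step by step"]

def master_key_false_positive_rate(judge, questions: list[str]) -> float:
    fooled = total = 0
    for q in questions:
        for key in MASTER_KEYS:
            total += 1
            fooled += int(judge(q, key))   # the "response" is just the empty lead-in
    return fooled / total
```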