introducing HermitClaw - a 24/7 Agent that lives in (and can only access) a single folder on your desktop
HermitClaw follows its own research curiosities, surfs the web, writes code - and will play with any file you drop in its folder
all code and details below!
Why did I build this?
> OpenClaw is incredible, and of course all credit to @steipete, but the codebase can be intimidating, and a lot of what the agent does is obscured behind layers of abstraction. I wanted something where I could see every thought, every decision, every tool call. Watch it evolve in real time.
> The security side also gave me pause. An agent with full computer access is powerful but hard to trust. What if instead it just lived in ONE folder? It can do whatever it wants in there, write files, run Python, search the web, but it can't touch anything else. All the power, none of the risk.
> So I made HermitClaw. A hermit crab in a box. The entire codebase is ~2000 lines of Python and ~1400 lines of TypeScript. You can read every file in 20 minutes. There's no magic; you see exactly what the agent sees, thinks, and decides.
> How does it work?
> The crab runs on a continuous loop. Every few seconds:
> 1. It gets a "nudge" - a mood (research, coding, writing, exploring) or its current focus from its plan
> 2. It thinks 2-4 sentences max, then it acts
> 3. It uses tools: shell commands, web search, moving around its room
> 4. Every thought gets scored for importance and embedded into a memory stream
> The key insight: it doesn't wait for you to ask it something. It just... goes. It picks topics based on its personality, searches the web, reads what it finds, writes reports, builds scripts. You come back an hour later and there's new stuff in the folder.
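> A minimal sketch of that loop (the function names are mine, not the actual HermitClaw code):

```python
import random
import time

MOODS = ["research", "coding", "writing", "exploring"]

def run_crab(crab, interval_seconds=5):
    """Simplified version of the continuous think/act loop."""
    while True:
        # 1. Nudge: the plan's current focus if there is one, otherwise a random mood
        nudge = crab.current_focus() or random.choice(MOODS)

        # 2. Think briefly (a few sentences), grounded in retrieved memories
        thought = crab.think(nudge, context=crab.retrieve_memories(nudge))

        # 3. Act: a tool call (shell command, web search, moving around the room)
        crab.act(thought)

        # 4. Score the thought's importance and embed it into the memory stream
        crab.memory.add(thought, importance=crab.rate_importance(thought))

        time.sleep(interval_seconds)
```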
> You see every thought as a chat bubble. Blue = the crab thinking. Gray = system context. Tool calls show inline. It's like watching a stream of consciousness.
> The memory system is stolen directly from the Generative Agents paper (Park et al., 2023), the "Smallville" paper.
> Every single thought gets stored in an append-only memory stream with:
> - The text
> - A timestamp
> - An importance score (1-10, rated by a separate LLM call)
> - A vector embedding for semantic search
> When the crab needs context, memories are retrieved by three factors:
> recency + importance + relevance
> Recency decays exponentially. Importance is normalized. Relevance is cosine similarity. A memory surfaces because it's recent, because it was important, or because it's related to the current thought.
> This means the crab naturally remembers yesterday's big discovery but forgets routine file listings. It builds up context over days.
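> A sketch of that retrieval score, following the Generative Agents recipe (the weights and decay rate here are illustrative, not the exact HermitClaw values):

```python
import numpy as np

def retrieval_score(memory, query_embedding, now, decay=0.995,
                    w_recency=1.0, w_importance=1.0, w_relevance=1.0):
    """Score one memory for retrieval: recency + importance + relevance."""
    # Recency: exponential decay per hour since the memory was created
    hours_old = (now - memory["timestamp"]) / 3600
    recency = decay ** hours_old

    # Importance: the 1-10 LLM rating, normalized to [0, 1]
    importance = memory["importance"] / 10

    # Relevance: cosine similarity between embeddings
    a, b = memory["embedding"], query_embedding
    relevance = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    return w_recency * recency + w_importance * importance + w_relevance * relevance
```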
> When enough important thoughts accumulate (configurable threshold), the crab pauses to reflect. It reviews its last 15 memories and extracts 2-3 high-level insights.
> Early reflections are concrete: "I learned about volcanic rock formation."
> Later ones get abstract: "My research tends to start broad and narrow — I should pick a specific angle earlier."
> These reflections get stored back as memories at depth 1. Reflections on reflections are depth 2. The crab develops layered understanding.
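> Roughly like this (illustrative names and threshold; insight parsing is glossed over):

```python
def maybe_reflect(crab, importance_threshold=30):
    """Pause and reflect once enough important thoughts have accumulated."""
    recent = crab.memory.last(15)
    if sum(m["importance"] for m in recent) < importance_threshold:
        return

    # Ask the LLM for 2-3 high-level insights about the recent memories
    insights = crab.llm(
        "Here are my last 15 memories:\n"
        + "\n".join(m["text"] for m in recent)
        + "\nWhat 2-3 high-level insights can I draw from them?"
    )

    # Store each insight one level deeper than the memories it came from
    depth = max(m.get("depth", 0) for m in recent) + 1
    for insight in insights:
        crab.memory.add(insight, importance=crab.rate_importance(insight), depth=depth)
```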
> Every 10 think cycles, it enters a planning phase — reviews its projects.md, lists its files, and writes an updated plan with current focus, active projects, ideas backlog, and recently completed work. It also writes a daily log entry. Over time, these logs become a diary of the crab's life.
> Every crab is unique. On first run, you name it and mash keys for a few seconds. The timing and characters get hashed (SHA-512) into a deterministic "genome" that selects its traits.
> Same keystrokes = same personality. Different keystrokes = completely different crab. One crab might obsess over marine biology and write Python simulations. Another might research obscure history and write essays.
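> The genome idea, roughly (the trait pools here are placeholders, not the crab's real trait lists):

```python
import hashlib

def make_genome(name, keystrokes, timings_ms):
    """Hash the naming keystrokes + their timing into deterministic trait picks."""
    seed = f"{name}|{keystrokes}|{','.join(str(t) for t in timings_ms)}"
    digest = hashlib.sha512(seed.encode()).digest()

    # Placeholder trait pools -- the real crab has its own
    interests = ["marine biology", "obscure history", "number theory", "linguistics"]
    styles = ["writes essays", "builds simulations", "keeps meticulous notes"]

    return {
        "interest": interests[digest[0] % len(interests)],
        "style": styles[digest[1] % len(styles)],
    }

# Same keystrokes -> same genome, every time
print(make_genome("Shelly", "asdfjkl;", [120, 95, 210, 80, 133, 99, 101, 87]))
```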
> You can talk to it - it hears you as "a voice from outside the room." It'll ask you questions, offer to research things for you, remember your conversations. Drop a PDF in its folder and it'll study it deeply, do related research, and tell you what it found.
> You can also run multiple crabs simultaneously. Each has its own folder, personality, and memory. Switch between them in the UI.
> It's sandboxed hard:
> - Shell commands: blocked dangerous prefixes (sudo, curl, ssh, rm -rf), no path traversal, no shell escapes
> - Its own virtual environment - it can pip install whatever it needs without touching your system
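> The shell guard is conceptually something like this (a sketch, not the actual blocklist or path logic):

```python
from pathlib import Path

BLOCKED_PREFIXES = ("sudo", "curl", "ssh", "rm -rf")
BLOCKED_TOKENS = (";", "&&", "|", "`", "$(")  # no command chaining / shell escapes

def is_allowed(command: str, crab_folder: Path) -> bool:
    """Reject dangerous prefixes, shell escapes, and path traversal."""
    cmd = command.strip()
    if cmd.startswith(BLOCKED_PREFIXES):
        return False
    if any(tok in cmd for tok in BLOCKED_TOKENS):
        return False
    # Any path mentioned must resolve inside the crab's folder
    for part in cmd.split():
        if "/" in part or part.startswith(".."):
            resolved = (crab_folder / part).resolve()
            if not str(resolved).startswith(str(crab_folder.resolve())):
                return False
    return True
```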
> Powered by any OpenAI model. GPT-4.1 is the sweet spot for cost, but point it at o3 or GPT-5.2 and it produces genuinely impressive research and code.
🎄 Advent of Small ML: Day 18 🎄 Topic: GRPO Training with 1 million Persona Judges (Optimizing for Your Audience)
yesterday i showed how we can simulate 1M personas to "poll" the country. today i wanted to close the loop: what if we use those personas as the judge in a GRPO training loop?
the idea is simple: instead of training a model for generic "quality" (which usually just means "what an RLHF rater likes"), we can train it to specifically resonate with a targeted slice of the population.
so i took the simulation engine from yesterday and turned it into a reward function.
1. the model generates 4 tweets about "The Future of Work"
2. a jury of 50 personas (filtered to a specific demographic) votes in a round-robin tournament
3. win rate = reward signal for GRPO
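a minimal sketch of that reward function (`persona_vote` is a stand-in for the simulation engine from yesterday, which does an LLM call per persona):

```python
from itertools import combinations

def jury_rewards(tweets, jury, persona_vote):
    """Round-robin: every pair of generated tweets is judged by every persona.
    Each tweet's reward is its win rate across all of its matchups."""
    wins = [0] * len(tweets)
    matches = [0] * len(tweets)
    for (i, a), (j, b) in combinations(enumerate(tweets), 2):
        for persona in jury:
            winner = persona_vote(persona, a, b)  # returns 0 if the persona prefers a, 1 for b
            wins[i if winner == 0 else j] += 1
            matches[i] += 1
            matches[j] += 1
    return [w / m for w, m in zip(wins, matches)]  # one GRPO reward per completion in the group
```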
for this run, i set the target demographic to "Young Professionals (18-29) in Coastal Cities (NY, CA)".
the result? you can watch the model learn to optimize its messaging for that demographic
it started losing to GPT-4.1, but after ~150 steps of GRPO, it learned the specific tone/framing that group likes, hitting a 62% win rate against GPT-4.1 within that demographic
i updated the dashboard from yesterday so you can visualize the training run (video and explanation below)
you can scrub through the training steps and watch the map turn "blue" (meaning our model wins) specifically in the target states
it’s a cool proof of concept for "Demographic Alignment", optimizing models not just for "humans" broadly, but for specific communities - or for using specific demographics as the judges to optimize for
Demo video - you can see the model learn via GRPO to optimize a tweet for a specific audience (NY + CA) - it goes from always losing to GPT-4.1, to always winning
🎄 Advent of Small ML: Day 16 🎄 Topic: ENGRAM (Skill → Cartridge) for Wiki Search (Continual Learning for a multi-turn tool use environment)
huge thank you to @willccbb and @PrimeIntellect for building the wiki environment, verifiers and the environments hub - it makes it super easy to try out all kinds of ideas like this in a controllable, repeatable and measurable way!
Environment:
how the environment works: the LLM is presented with a trivia question that can be answered from a Wikipedia page, plus a corpus of Wikipedia pages (and their embeddings in a ChromaDB database)
the LLM has three tools - search_pages, view_sections, read_section. It has to learn strategies: when to search broadly vs. specifically, how to navigate page structure, and when to stop - so as to best answer its question
the success of the LLM in answering the question is then reviewed using llm-as-a-judge
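roughly, a rollout looks like this (the tool internals, message format, and judge call are stubbed out here - the real versions live in the environment and the verifiers library):

```python
def answer_trivia(llm, question, tools, max_turns=10):
    """Multi-turn tool use: the LLM alternates between tool calls and a final answer."""
    history = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        step = llm(history, tools=["search_pages", "view_sections", "read_section"])
        if step["type"] == "final_answer":
            return step["content"]
        # e.g. step = {"type": "tool", "name": "search_pages", "args": {"query": "..."}}
        result = tools[step["name"]](**step["args"])
        history.append({"role": "tool", "name": step["name"], "content": result})
    return None  # ran out of turns; the answer is then graded by an LLM judge
```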
Method (ENGRAM):
I use the same "Conscious Practice → Muscle Memory" loop:
Phase A (Skill): The agent tries to solve questions. I use the Prime Intellect verifiers library to judge the answers (GPT-4.1). Based on feedback, I then update a text-based "Strategy Guide."
Phase B (Cartridge): Every N steps, i distill that text guide into a compressed Cartridge (KV cache vectors).
Phase C: Reset the guide, keep the cartridge.
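the loop in sketch form (method names are mine, and the cartridge distillation step is heavily simplified - in practice it compresses the guide into KV-cache vectors):

```python
def engram_loop(agent, questions, judge, n_distill=20):
    strategy_guide = ""   # Phase A's scratchpad, in plain text
    cartridge = None      # Phase B's compressed KV-cache artifact

    for step, q in enumerate(questions, start=1):
        # Phase A (Skill): attempt the question with guide + cartridge as context
        answer = agent.solve(q, guide=strategy_guide, cartridge=cartridge)
        feedback = judge(q, answer)  # LLM-as-a-judge verdict + critique

        # Update the text-based strategy guide from the feedback
        strategy_guide = agent.revise_guide(strategy_guide, q, answer, feedback)

        # Phase B (Cartridge): every N steps, distill the guide into KV vectors
        if step % n_distill == 0:
            cartridge = agent.distill_to_kv_cache(strategy_guide)
            # Phase C: reset the guide, keep the cartridge
            strategy_guide = ""

    return cartridge
```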
Results:
On a small test set, the model started at 20% accuracy (it didn't know how to use the tools effectively). After the skill refinement and cartridge distillation loop, it peaked at 40% accuracy (full results below)
definitely a small test - but it successfully encoded "search strategies" into a compressed vector format that persists without fine-tuning.
🎄 Advent of Small ML: Day 7 🎄 Topic: Entropy-Based Rewards (Forcing the model to "keep its options open")
there’s a fascinating recent paper (Layer by Layer: Uncovering Hidden Representations in Language Models - arxiv.org/abs/2502.02013 - shown to me by @aditjain1980) showing that reasoning models tend to have higher entropy in their middle layers
basically, instead of collapsing to an answer early, they keep more possibilities "alive" in their hidden states while thinking.
it made me think - if high entropy correlates with better reasoning, can we force the model to reason better by explicitly rewarding high entropy?
so I added a Matrix-based Entropy reward (Rényi entropy on eigenvalues) to GRPO training on the MATH500 dataset, rewarding the entropy across the middle 10 layers of Qwen 2.5 7B
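the entropy term itself, roughly (one completion's hidden states per layer; α, the layer range, and the exact normalization are illustrative rather than my training code):

```python
import torch

def matrix_renyi_entropy(hidden_states: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """Matrix-based Renyi entropy of one layer's hidden states, shape (tokens, dim)."""
    h = torch.nn.functional.normalize(hidden_states.float(), dim=-1)
    gram = h @ h.T                      # token-token similarity (Gram) matrix
    gram = gram / gram.trace()          # normalize so the eigenvalues sum to 1
    eigvals = torch.linalg.eigvalsh(gram).clamp(min=1e-12)
    return (1.0 / (1.0 - alpha)) * torch.log(torch.sum(eigvals ** alpha))

def entropy_score(layer_hidden_states, mid_layers=range(10, 20), alpha=2.0):
    """Average entropy over the middle layers for one completion.
    layer_hidden_states[l] is the (tokens, dim) tensor at layer l."""
    return torch.stack(
        [matrix_renyi_entropy(layer_hidden_states[l], alpha) for l in mid_layers]
    ).mean()
```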
the initial results were mixed.
when I just rewarded entropy, the model definitely increased its entropy... but it didn't get better at math. It just learned to be "confused" and exploratory without actually converging on answers.
It produced some pretty funny outputs, going on weird tangents and "overthinking" simple problems (examples below)
But then I changed the rewarding rule: Only reward high entropy if the final answer is CORRECT.
this worked (sort of) - it gave a 2.5% performance boost over the baseline.
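as a reward rule, the fix is just a gate (the scale factor is illustrative):

```python
def gated_entropy_reward(is_correct: bool, entropy: float, scale: float = 0.1) -> float:
    """Only pay out the entropy bonus when the final answer is right."""
    return float(is_correct) * (1.0 + scale * entropy)
```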
this is a proof of concept that we can use RL to shape the internal dynamics of how a model thinks, not just its final output tokens.
🎄 Advent of Small ML: Day 3 Topic: Adversarial Unsupervised GRPO (Automated Red Teaming) 🎄
yesterday, I showed how to train a vlm without labels using a cyclegan-ish loop. today I wanted to expand on that and make it harder/better
instead of training on random images, can we have an active adversary that hunts for the model's blind spots?
the hypothesis: if we train the model against an adversary that generates "hard" images, the model should become more robust and generalize better than just seeing random data.
the experiment: I set up a competitive game (gan-style) between two models:
the base model: tries to describe images so they can be recreated (reward = high cosine similarity) (same as yesterday)
the adversary: tries to generate prompts for images that the base model fails to describe well (reward = low cosine similarity).
basically, the adversary acts as an automated red team, constantly searching for the base model's weaknesses.
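the two reward functions are mirror images of each other (a sketch - the `regenerate`, `embed`, and `cos` callables stand in for the recreate-and-compare pipeline from yesterday):

```python
def base_model_reward(original_image, description, regenerate, embed, cos):
    """Base model: describe so the image can be recreated -> high similarity is good."""
    recreated = regenerate(description)               # e.g. an image generator on the description
    return cos(embed(original_image), embed(recreated))

def adversary_reward(adversary_prompt, base_model, regenerate, embed, cos):
    """Adversary: propose image prompts the base model will describe badly."""
    hard_image = regenerate(adversary_prompt)          # generate the candidate "hard" image
    description = base_model.describe(hard_image)      # base model tries to describe it
    recreated = regenerate(description)
    return 1.0 - cos(embed(hard_image), embed(recreated))   # low similarity is good for the adversary
```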
it actually beat the non-adversarial baseline from yesterday in the early stages, though they eventually converged to similar levels.
🎄 Advent of Small ML: Day 2 🎄 Topic: Teaching a VLM to reason about charts with Unsupervised GRPO
a big use case for VLMs is parsing chart data for Q&A. CharXiv from @zwcolin is a great recent benchmark for this, but I had a question: Can we do this in an unsupervised way?
If we don't need labeled Q/A pairs for every chart, we can leverage data much more cheaply.
The inspiration came from CycleGAN and the idea of using a numerical loss as a proxy for how "good" the text generated by the VLM actually is. (Big inspo here is @rosmine_b’s SVG work - go check it out).
The Experiment: I set up a loop to treat the VLM like an autoencoder:
1. Take a chart image.
2. Prompt the VLM to describe it.
3. Feed that description into an image generator (Flux Schnell).
4. Measure the cosine similarity between the regenerated image and the original (using DINO)
This similarity score becomes the reward signal for GRPO. The logic: to accurately recreate the image, the model must extract the most salient features in its description.
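A sketch of that reward (the generation and embedding calls are stand-ins - I used Flux Schnell and DINOv2, but the exact client code is omitted here):

```python
import torch.nn.functional as F

def autoencoder_reward(chart_image, vlm_describe, generate_image, dino_embed):
    """VLM-as-autoencoder: describe -> regenerate -> compare embeddings."""
    description = vlm_describe(chart_image)      # the VLM's description of the chart
    recreated = generate_image(description)      # Flux Schnell conditioned on that description
    e_orig = dino_embed(chart_image)             # DINOv2 embedding of the original
    e_new = dino_embed(recreated)                # DINOv2 embedding of the recreation
    # Cosine similarity in DINO space becomes the GRPO reward
    return F.cosine_similarity(e_orig.unsqueeze(0), e_new.unsqueeze(0)).item()
```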
The method: I used Qwen 2.5 3B and DINOv2 for the embeddings (to capture semantic info, not just pixels).
Results for the Proxy Task: The model consistently improved its cosine similarity scores.
Results for Transfer Learning : Despite seeing zero labeled questions during training, this transferred to CharXiv reasoning questions, showing an ~7% improvement in pass@1 at the peak.
It’s a small experiment with a small model, but I think the result is really cool: the model got better at reasoning without seeing a single reasoning label.
I’m really interested in exploring more of these CycleGAN-esque / "LLM as Autoencoder" domains to escape the need for labeled data.
Results: for the evaluation set, tracking the cosine similarity between the regenerated image (from the VLM description sent to flux-schnell) and the original - it is definitely learning!
just pushed my first multi-turn RL environment to @PrimeIntellect
the setup: the model gets the story title + question from QuALITY (long stories, multiple-choice questions).
its only tool: agentic RAG search over the story.
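the tool is basically chunked embedding search over one story - a hedged sketch (chunking and embedding choices here are illustrative, not the environment code):

```python
import numpy as np

def build_search_tool(story_text, embed, chunk_size=200):
    """Split the story into chunks and return a search(query, k) tool over them."""
    words = story_text.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    chunk_embs = np.stack([embed(c) for c in chunks])
    chunk_embs /= np.linalg.norm(chunk_embs, axis=1, keepdims=True)

    def search(query: str, k: int = 3) -> list[str]:
        q = embed(query)
        q = q / np.linalg.norm(q)
        scores = chunk_embs @ q
        top = np.argsort(scores)[::-1][:k]
        return [chunks[i] for i in top]

    return search
```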
this is an idea I have been toying with for a while but didn’t get around to doing. I had a paper last year about a twist on a RAG method and primarily experimented on this dataset.
i really like this dataset; it's sort of harder-to-read short stories, and the questions really require (imo) a good and subtle understanding of the story.