A more cost-effective and accurate way to find security vulnerabilities in complex codebases, based on a new architecture: Agentic MapReduce.
In testing, Devin Security Swarm found 36 of 50 real-world GHSA vulnerabilities at 30% lower cost per finding than the next most accurate alternative.
We built a new architecture for whole-codebase reasoning that we’re calling Agentic MapReduce.
Security scanning is different from most coding tasks: a report is only trustworthy if the whole codebase is considered. But most agentic systems struggle to scale reasoning across large repos.
Devin maps relevant signals across the repo, fans out focused agents over bounded shards, reduces their findings into one report, then verifies serious vulnerabilities in isolated sandboxes before marking them confirmed.
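The map, fan-out, reduce, and verify flow described above can be sketched roughly as follows. This is a minimal illustration, not Devin's implementation: `Finding`, `shard`, `scan_repo`, the severity scale, and the toy `analyze`/`verify` stand-ins are all hypothetical names, and a real system would run the per-shard agents concurrently and verify inside isolated sandboxes.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Finding:
    path: str
    issue: str
    severity: str  # hypothetical scale: "low", "medium", "high"
    confirmed: bool = False


def shard(paths: list[str], size: int) -> list[list[str]]:
    """Split the file list into bounded shards of at most `size` files."""
    return [paths[i:i + size] for i in range(0, len(paths), size)]


def scan_repo(
    paths: list[str],
    analyze: Callable[[list[str]], list[Finding]],
    verify: Callable[[Finding], bool],
    shard_size: int = 2,
) -> list[Finding]:
    # Fan out: one focused "agent" call per shard (sequential here for simplicity).
    raw = [f for s in shard(paths, shard_size) for f in analyze(s)]

    # Reduce: merge duplicate findings into a single report keyed by (path, issue).
    merged: dict[tuple[str, str], Finding] = {}
    for f in raw:
        merged.setdefault((f.path, f.issue), f)

    # Verify: only serious findings are checked, and only verified ones are confirmed.
    report = []
    for f in merged.values():
        if f.severity == "high" and verify(f):
            f = replace(f, confirmed=True)
        report.append(f)
    return report


def toy_analyze(shard_paths: list[str]) -> list[Finding]:
    # Stand-in for an agent: flags any file named loader.py (assumption for the demo).
    return [
        Finding(p, "unsafe deserialization", "high")
        for p in shard_paths
        if p.endswith("loader.py")
    ]


def toy_verify(finding: Finding) -> bool:
    # Stand-in for a sandboxed exploit check; always succeeds in this demo.
    return True


report = scan_repo(["a.py", "loader.py", "b.py"], toy_analyze, toy_verify)
```

Keeping each shard bounded is what lets every agent reason over a small context, while the reduce step restores the whole-codebase view.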
The result is simultaneously more efficient and more accurate than other tools. We evaluated a variety of security scanning tools on a dataset of 50 GHSA vulnerabilities across 14 languages including Go, Rust, Python, Ruby, Java, C#, JavaScript, C, Swift, Dart, and Elixir. The dataset spans open-source repos of various sizes and many software categories.
Beyond excelling on our eval, Devin Security Swarm also found critical vulnerabilities that other tools missed, like a PHP sandbox bypass via template injection, an argument injection through metadata value parsing, and an overly broad deserialization surface.
Security Swarm is a new pillar of Devin for Security: a suite of tools to help you find vulnerabilities, validate their exploitability at runtime, and ship remediation PRs.
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is the first to measure: would you actually merge this code?
20+ world-class open-source developers built realistic coding tasks on repos they maintain. They define what “mergeable” means in their repo.
What does it take to measure mergeability? We use a mix of unit tests, rubrics and novel verifiers to assess correctness, test quality, scope discipline, style, and adherence to codebase standards.
FrontierCode was built in close partnership with the expert maintainers of 36 flagship open-source repositories, like @smilingnosrati, CEO & Tech Lead @CeleryOrg (29k stars), and Martin McKeaveney, CTO of @Budibase (28k stars).
Maintainers invested more than 40 hours per task, undergoing multiple rounds of iteration to ensure that any PR that satisfies these standards would actually be merged.
First, instead of presenting diffs alphabetically and file-by-file, Devin Review groups related changes together and orders them logically. Each group comes with a clear description of what’s going on. Devin Review also intelligently detects copied and moved code, separating signal from noise. cognition.ai/blog/devin-rev…
Devin Review includes a bug-catching agent that labels potential issues by confidence and severity. It will also flag decisions and patterns that could be bad, even if they aren’t bugs, helping you stop slop.
Red: pay attention. Orange: take a look. Gray: FYI
Our research interns present:
Kevin-32B = K(ernel D)evin
It's the first open model trained using RL for writing CUDA kernels. Starting from QwQ-32B, we implemented multi-turn RL using GRPO on the KernelBench dataset.
It outperforms top reasoning models (o3 & o4-mini)! 🧵
We train on a subset of 180 PyTorch -> CUDA conversion tasks from KernelBench. It's a nice RL environment because we have immediate code execution feedback.
During training, we give the model 4 refinement steps. In each step, the model proposes a kernel. Then we evaluate correctness & performance and inject the environment feedback in the next step.
For more details on how we made GRPO work in a multi-turn setting read our blogpost (linked below)!
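The refinement loop described above (propose a kernel, evaluate it, feed the result into the next step) might look roughly like this. It is only a sketch of the rollout structure, not the GRPO training code: `StepResult`, `refine`, the feedback string format, and the toy `propose`/`evaluate` stand-ins are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    kernel: str
    correct: bool
    speedup: float  # assumed convention: 0.0 when the kernel is incorrect


def refine(
    task: str,
    propose: Callable[[str, str], str],
    evaluate: Callable[[str], StepResult],
    steps: int = 4,
) -> list[StepResult]:
    """Run `steps` refinement turns, injecting environment feedback each turn."""
    feedback = ""  # the first turn sees no feedback
    history: list[StepResult] = []
    for _ in range(steps):
        kernel = propose(task, feedback)   # model proposes a kernel
        result = evaluate(kernel)          # environment checks correctness & speed
        history.append(result)
        # Hypothetical feedback format passed into the next turn's prompt.
        feedback = f"correct={result.correct}, speedup={result.speedup:.2f}x"
    return history


def toy_propose(task: str, feedback: str) -> str:
    # Stand-in for the model: "fixes" the kernel once it has seen any feedback.
    return "fixed_kernel" if feedback else "buggy_kernel"


def toy_evaluate(kernel: str) -> StepResult:
    # Stand-in for compiling and benchmarking a CUDA kernel.
    if kernel == "fixed_kernel":
        return StepResult(kernel, True, 1.5)
    return StepResult(kernel, False, 0.0)


history = refine("matmul", toy_propose, toy_evaluate)
best = max(history, key=lambda r: r.speedup)
```

In this toy run the first turn fails and later turns succeed, which is the kind of self-correction the multi-turn setup is meant to reward.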
We ablate two different ways of training:
- Single-turn RL (training on just the first step)
- Multi-turn RL (training on four refinement steps)
When evaluated on performance (= speedup of CUDA kernels over PyTorch) we see a significant improvement from multi-turn training.
The model learns how to refine itself more effectively!
(All models are evaluated on 4 & 8 refinement steps, i.e. same amount of compute)
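As a small worked example of the performance metric mentioned above, speedup can be read as the PyTorch runtime divided by the CUDA kernel runtime. The function name, the millisecond units, and the timings below are illustrative assumptions, not measured results.

```python
def speedup(pytorch_ms: float, kernel_ms: float) -> float:
    """Speedup of a CUDA kernel over the PyTorch baseline (values > 1 are faster)."""
    if pytorch_ms <= 0 or kernel_ms <= 0:
        raise ValueError("runtimes must be positive")
    return pytorch_ms / kernel_ms


# Hypothetical timings: a baseline of 2.0 ms and a kernel of 1.0 ms.
example = speedup(2.0, 1.0)
```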
Just tag Devin to fix frontend bugs, create first-draft PRs for backlog tasks, make refactors, and more.
Start building with Devin below:
1/5 Devin is built to collaborate with engineering teams and starts at $500/month. Here’s how some of the best teams are using Devin today:
2/5 We worked with Devin to contribute to popular open source repos. Here is one example of a Devin session that triages, solves, and tests a fix for an issue in Anthropic’s MCP: app.devin.ai/sessions/26695…