Cognition

@cognition

Jul 1 • 6 tweets • 3 min read

Introducing Devin Security Swarm

A more cost-effective and accurate way to find security vulnerabilities in complex codebases, based on a new architecture: Agentic MapReduce.

In testing, Devin Security Swarm found 36 of 50 real-world GHSA vulnerabilities at 30% lower cost per finding than the next most accurate alternative.

We built a new architecture for whole-codebase reasoning that we’re calling Agentic MapReduce.

Security scanning is different from most coding tasks: a report is only trustworthy if the whole codebase is considered. But most agentic systems struggle to scale reasoning across large repos.

Devin maps relevant signals across the repo, fans out focused agents over bounded shards, reduces their findings into one report, then verifies serious vulnerabilities in isolated sandboxes before marking them confirmed.
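
The flow described above (shard the repo, fan out one worker per shard, reduce into a single report, then verify the serious findings) can be sketched roughly as follows. This is a toy illustration only, not Cognition's implementation: `Finding`, `scan_shard`, `verify`, and `security_swarm` are hypothetical names, the "agent" is a plain string match, and verification is a stub that confirms everything.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    path: str
    issue: str
    severity: str  # "low" or "high" in this toy example


def shard(items, size):
    """Split a list into bounded shards of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def scan_shard(files):
    """Stand-in for one focused agent: flag files whose text contains 'eval('."""
    return [Finding(path, "use of eval", "high")
            for path, text in files if "eval(" in text]


def reduce_findings(per_shard):
    """Merge per-shard results into one de-duplicated report, sorted by path."""
    merged = {f for findings in per_shard for f in findings}
    return sorted(merged, key=lambda f: f.path)


def verify(finding):
    """Placeholder for sandbox verification (assumption: always confirms)."""
    return True


def security_swarm(repo, shard_size=2):
    """repo maps path -> file text; returns (finding, confirmed) pairs."""
    files = sorted(repo.items())
    with ThreadPoolExecutor() as pool:  # fan out one worker per shard
        per_shard = list(pool.map(scan_shard, shard(files, shard_size)))
    report = reduce_findings(per_shard)
    # Only serious findings go through the verification step.
    return [(f, f.severity == "high" and verify(f)) for f in report]
```

Calling `security_swarm({"a.py": "x = eval(s)", "b.py": "print(1)"})` returns one confirmed finding for `a.py`.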

The result is simultaneously more efficient and more accurate than other tools. We evaluated a variety of security scanning tools on a dataset of 50 GHSA vulnerabilities across 14 languages including Go, Rust, Python, Ruby, Java, C#, JavaScript, C, Swift, Dart, and Elixir. The dataset spans open source repos of various sizes and of many software categories.

Beyond excelling on our eval, Devin Security Swarm also found critical vulnerabilities that other tools missed, like a PHP sandbox bypass via template injection, an argument injection through metadata value parsing, and an overly broad deserialization surface.

Security Swarm is a new pillar of Devin for Security: a suite of tools to help you find vulnerabilities, validate their exploitability at runtime, and ship remediation PRs.

Learn more and try it today at:

devin.ai/security

We’re also publishing extensive documentation and technical materials about Agentic MapReduce, including a deep-dive on our evals.

Read our announcement: cognition.com/blog/introduci…

Learn about Agentic MapReduce: devin.ai/blog/agentic-m…

Check out the evals: devin.ai/blog/security-…

More from @cognition

Cognition

@cognition

Jul 8

Introducing SWE-1.7, the most capable model we’ve trained yet.

It scores within a few points of the strongest frontier models at a fraction of the cost, and is now available at 1000 tok/s.

RL is not hitting its limit: after refining our recipe, we keep seeing gains as we scale.

SWE-1.7 was built on broad improvements in our RL pipeline, on top of a Kimi K2.7 base model.

Our proprietary benchmark FrontierCode evaluates whether a model makes code you’d actually want to merge.

SWE-1.7 advances the cost-performance Pareto curve on FrontierCode with a score of 42.3% and cost per task of $1.97 on our Main set.

RL runs are notoriously unstable, and two well-known issues predominate:

1. Entropy collapse, where policies stop exploring and plateau, and
2. Numerical instability due to drift between the trainer and inference engines.

Without modification, these problems tend to collapse long RL runs.
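
As a rough illustration of the first issue, entropy collapse is commonly tracked by watching the Shannon entropy of the policy's action distribution fall toward zero. The sketch below is a generic diagnostic under that assumption, not the recipe used for SWE-1.7; `softmax` and `entropy` are illustrative helpers.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = entropy(softmax([0.0, 0.0, 0.0, 0.0]))   # policy still exploring
peaked = entropy(softmax([10.0, 0.0, 0.0, 0.0]))   # policy nearly deterministic
```

Here `uniform` equals `math.log(4)`, the maximum for four actions, while `peaked` is close to zero, which is the signature of a policy that has stopped exploring.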

Cognition

@cognition

Jun 8

Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.

Models write sloppy code that works but isn’t maintainable. Our eval is the first to measure: would you actually merge this code?

20+ world-class open-source developers built realistic coding tasks on repos they maintain. They define what “mergeable” means in their repo.

What does it take to measure mergeability? We use a mix of unit tests, rubrics and novel verifiers to assess correctness, test quality, scope discipline, style, and adherence to codebase standards.

FrontierCode was built in close partnership with the expert maintainers of 36 flagship open-source repositories, like @smilingnosrati, CEO & Tech Lead @CeleryOrg (29k stars), and Martin McKeaveney, CTO of @Budibase (28k stars).

Maintainers invested more than 40 hours per task, undergoing multiple rounds of iteration to ensure that any PR that satisfies these standards would actually be merged.

Cognition

@cognition

Jan 21

Meet Devin Review: a reimagined interface for understanding complex PRs.

Code review tools today don’t actually make it easier to read code. Devin Review builds your comprehension and helps you stop slop.

Try without an account: devinreview.com

More below 👇

Full breakdown: cognition.ai/blog/devin-rev…

First, instead of presenting diffs alphabetically and file-by-file, Devin Review groups related changes together and orders them logically. Each group comes with a clear description of what’s going on. Devin Review also intelligently detects copied and moved code, separating signal from noise.

Devin Review includes a bug catching agent that labels potential issues by confidence and severity. It will also flag decisions / patterns that could be bad, even if they aren’t bugs, helping you stop slop.

Red: pay attention. Orange: take a look. Gray: FYI

Cognition

@cognition

May 6, 2025

Our research interns present:
Kevin-32B = K(ernel D)evin

It's the first open model trained using RL for writing CUDA kernels. We implemented multi-turn RL using GRPO (based on QwQ-32B) on the KernelBench dataset.

It outperforms top reasoning models (o3 & o4-mini)! 🧵

We train on a subset of 180 PyTorch -> CUDA conversion tasks from KernelBench. It's a nice RL environment because we have immediate code execution feedback.

During training, we give the model 4 refinement steps. In each step, the model proposes a kernel. Then we evaluate correctness & performance and inject the environment feedback in the next step.
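
The refinement loop described above can be sketched as a small driver: propose a kernel, evaluate it, and feed the result into the next proposal. This is a minimal sketch under stated assumptions, not the Kevin-32B training code: `refine`, `toy_propose`, and `toy_evaluate` are hypothetical stand-ins for the model and the KernelBench evaluation, and the feedback fields are invented for illustration.

```python
def refine(propose, evaluate, steps=4):
    """Run `steps` refinement turns, passing each turn's feedback to the next."""
    feedback = None  # the first proposal sees no environment feedback
    history = []
    for _ in range(steps):
        kernel = propose(feedback)
        feedback = evaluate(kernel)  # correctness and performance signal
        history.append((kernel, feedback))
    return history


def toy_propose(feedback):
    """Toy 'model': start at version 0, then bump the version each turn."""
    return 0 if feedback is None else feedback["version"] + 1


def toy_evaluate(kernel):
    """Toy 'environment': later versions are correct and slightly faster."""
    return {"version": kernel,
            "correct": kernel >= 2,
            "speedup": 1.0 + 0.1 * kernel}


history = refine(toy_propose, toy_evaluate, steps=4)
```

With the toy functions, `history` holds four (kernel, feedback) pairs for versions 0 through 3, and only the last two are marked correct.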

For more details on how we made GRPO work in a multi-turn setting read our blogpost (linked below)!

We ablate two different ways of training:
- Single-turn RL (training on just the first step)
- Multi-turn RL (training on four refinement steps)

When evaluated on performance (= speedup of CUDA kernels over PyTorch) we see a significant improvement from multi-turn training.

The model learns how to refine itself more effectively!

(All models are evaluated on 4 & 8 refinement steps, i.e. same amount of compute)

Cognition

@cognition

Apr 25, 2025

Project DeepWiki

Up-to-date documentation you can talk to, for every repo in the world.

Think Deep Research for GitHub – powered by Devin.

It’s free for open-source, no sign-up!
Visit deepwiki.com or just swap github → deepwiki on any repo URL.
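
The github → deepwiki swap can be done by hand, or with a small helper like the sketch below. `to_deepwiki` is a hypothetical name, and the helper assumes the only change needed is replacing the `github.com` host with `deepwiki.com`, as the post describes.

```python
from urllib.parse import urlsplit, urlunsplit


def to_deepwiki(url):
    """Rewrite a github.com repo URL to the same path on deepwiki.com."""
    parts = urlsplit(url)
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"not a GitHub URL: {url}")
    # Keep scheme, path, query, and fragment; swap only the host.
    return urlunsplit(parts._replace(netloc="deepwiki.com"))
```

For a placeholder repo, `to_deepwiki("https://github.com/owner/repo")` gives `"https://deepwiki.com/owner/repo"`.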

Go to deepwiki.com to explore wikis for the most popular open source repos.

Turn on Deep Research for agent-powered in-depth answers (vid sped up).

Don't see your repo? We're happy to index any public GitHub repo for you (watch how).

To get wikis for private repos, sign up for a Devin account at devin.ai.