The Adversarial Robustness Toolbox (ART) = a Python library for defending deep learning models against adversarial security attacks
Thread⬇️
GANs = the most popular form of generative models.
Adversarial attacks come in two threat models:
+White Box Attacks: the adversary has full knowledge of the model – architecture, parameters, and the training algorithm
+Black Box Attacks: the adversary has no internal knowledge and can only query the model's inputs and outputs
2/⬇️
The goal of ART = to provide a framework to evaluate and improve the robustness of neural networks.
The current version of ART focuses on four types of adversarial attacks:
+evasion
+inference
+extraction
+poisoning
3/⬇️
ART is a generic Python library. It provides native integration with several deep learning frameworks such as @TensorFlow, @PyTorch, #Keras, @ApacheMXNet
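As a quick taste of the API, here is a minimal sketch of a white-box evasion attack using ART's FastGradientMethod against a toy PyTorch classifier (the model and random data are illustrative placeholders, not from the thread):
```python
# Minimal sketch: crafting adversarial examples with ART's
# FastGradientMethod (FGSM) against a toy PyTorch classifier.
# The model and data below are illustrative placeholders.
import numpy as np
import torch
import torch.nn as nn

from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import FastGradientMethod

# Toy 10-class classifier over 28x28 inputs
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
classifier = PyTorchClassifier(
    model=model,
    loss=nn.CrossEntropyLoss(),
    optimizer=torch.optim.Adam(model.parameters(), lr=1e-3),
    input_shape=(1, 28, 28),
    nb_classes=10,
    clip_values=(0.0, 1.0),
)

x = np.random.rand(8, 1, 28, 28).astype(np.float32)  # stand-in test images

# White-box evasion: perturb inputs within an eps-ball to flip predictions
attack = FastGradientMethod(estimator=classifier, eps=0.1)
x_adv = attack.generate(x=x)

print(classifier.predict(x).argmax(1), classifier.predict(x_adv).argmax(1))
```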
If you'd like a more in-depth look at ART, click the link below. It takes you to TheSequence Edge#7, our educational newsletter. thesequence.substack.com/p/edge7 5/5
This Google paper presented at #NeurIPS2025 is a true gem.
In their search for a better backbone for sequence models, they:
• Reframe Transformers & RNNs as associative memory systems driven by attentional bias
• Reinterpret "forgetting" as retention regularization, not as erasure
• Combine these insights into Miras – a unified framework for designing next-gen sequence architectures
From this perspective, they introduce 3 new models, Moneta, Yaad, and Memora, that:
- Beat Transformers, Mamba2, DeltaNet, and hybrids across key benchmarks
- Scale better to long contexts
- Deliver state-of-the-art recall on needle-in-a-haystack tests
Here are the details (really worth exploring):
Transformers traditionally dominate because they scale well, but they become slow and expensive on long sequences, since attention cost grows quadratically with sequence length.
Google's key idea draws from human attentional bias – our natural habit of focusing more on certain things than others.
1. Associative memory view
Google researchers show that Transformers, Titans, and RNNs can all be seen as associative memories that learn key→value mappings guided by an internal objective (the attentional bias).
This objective decides:
- what kind of memory the model builds
- what it should prioritize
Learning these mappings becomes a form of meta-learning.
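To make the associative-memory view concrete, here is a toy sketch (my own illustration, not Miras code): a matrix memory M learns key→value mappings by online gradient descent on an ℓ2 objective ||M k − v||², the lens under which linear attention and DeltaNet-style updates also fall.
```python
# Toy associative memory: learn key->value mappings online by
# gradient descent on the l2 objective ||M @ k - v||^2.
# Illustrative only -- Miras generalizes both the objective
# ("attentional bias") and the regularizer ("retention") beyond
# this l2 / weight-decay special case.
import numpy as np

d = 16
M = np.zeros((d, d))  # the memory: a single matrix

def write(M, k, v, lr=0.9, retention=0.995):
    # Gradient step on 0.5 * ||M k - v||^2; the retention factor
    # (weight decay) is the "forgetting as regularization" knob.
    err = M @ k - v
    return retention * M - lr * np.outer(err, k)

def read(M, k):
    return M @ k  # retrieval = applying the learned mapping

rng = np.random.default_rng(0)
keys = rng.standard_normal((32, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit keys keep steps stable
vals = rng.standard_normal((32, d))

for k, v in zip(keys, vals):  # one sequence = one online learning run
    M = write(M, k, v)

# Recent pairs are recalled best: the last write suffered no decay after it
print(np.linalg.norm(read(M, keys[-1]) - vals[-1]))
```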
Google Cloud AI Research introduced Supervised Reinforcement Learning (SRL), a new training method that addresses key weaknesses of SFT and RLVR.
The main idea: it treats problem-solving as a sequence of logical actions.
Here is how it works:
What's the problem with common methods?
- Reinforcement Learning with Verifiable Rewards (RLVR) struggles on hard problems where the model rarely samples a correct answer, leaving it with no reward signal to learn from.
- Supervised Fine-Tuning (SFT) imitates expert answers too rigidly, token by token, which generalizes poorly.
@googlecloud AI Research proposes to fix both problems with SRL.
SRL trains the model to generate an internal reasoning monologue before deciding on each action. It also gives smoother feedback based on how closely each action matches expert examples from the SFT dataset.
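The thread doesn't reproduce the paper's exact reward, but the shape of the idea can be sketched: score each generated action by its similarity to the expert's action at that step, so even trajectories with a wrong final answer get dense feedback. A hypothetical version with a simple string-similarity measure:
```python
# Hypothetical sketch of SRL-style dense rewards: each predicted
# action is scored by its similarity to the expert action at that
# step, so partially-correct trajectories still receive feedback.
# The similarity measure and example steps are illustrative only.
from difflib import SequenceMatcher

def action_reward(predicted: str, expert: str) -> float:
    # Smooth reward in [0, 1] instead of a binary right/wrong signal
    return SequenceMatcher(None, predicted, expert).ratio()

expert_steps = ["isolate x on the left side", "divide both sides by 3", "x = 4"]
model_steps  = ["move terms so x is alone on the left", "divide by 3", "x = 4"]

rewards = [action_reward(p, e) for p, e in zip(model_steps, expert_steps)]
print(rewards)                      # per-step feedback in [0, 1]
print(sum(rewards) / len(rewards))  # trajectory-level score
```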
Next up: QeRL (Quantization-enhanced Reinforcement Learning). Its key innovation is Adaptive Quantization Noise (AQN): QeRL turns quantization noise into an exploration tool, adjusting it on the fly during RL.
Here are the details:
1. QeRL builds two RL algorithms for LLMs:
- GRPO: samples multiple answers per prompt, scores them with rule-based rewards, and updates the model using each answer's advantage relative to the group average (see the sketch after this list).
- Dynamic Sampling Policy Optimization (DAPO): removes limits on how much the model can vary during training so that it can discover more diverse solutions.
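For reference, GRPO's group-relative update can be sketched like this (the standard group-normalized advantage; a toy illustration, not QeRL's code):
```python
# Sketch of GRPO's group-relative advantages: sample G answers per
# prompt, score each with a rule-based reward, and normalize within
# the group so the policy gradient pushes above-average answers up.
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # rewards: shape (G,), one scalar per sampled answer for a prompt
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. 4 sampled answers, rule-based reward = 1 if the final answer is correct
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # correct answers get positive advantage
```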
Upon this, QeRL adds quantization.
2. QeRL uses:
- Quantization (NVFP4) – makes model computations smaller and faster.
- Low-Rank Adaptation (LoRA) – fine-tunes a small set of adapter weights instead of touching every parameter (see the sketch below).
This cuts memory use and speeds up RL training, while reaching the same quality as full fine-tuning.
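A minimal sketch of the LoRA side (the generic low-rank adapter idea, not QeRL's exact module): the frozen base weight W is augmented with a trainable low-rank product B @ A, so only r·(d_in + d_out) parameters are updated.
```python
# Minimal LoRA sketch: freeze the base weight and train only the
# low-rank update W + (alpha/r) * B @ A. A generic illustration,
# not QeRL's exact implementation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen (and can be quantized)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 512 = 8192 vs 512 * 512 = 262144 in the base layer
```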
1. TRM is built on the idea of the Hierarchical Reasoning Model (HRM).
HRM uses 2 small neural networks working together, each at its own rhythm, to successfully solve hard problems like Sudoku, mazes, and ARC-AGI puzzles, though it’s tiny (27 million parameters).
TRM is a simpler, smaller alternative to HRM.
2. No more complex math:
HRM depends on a mathematical “fixed-point” assumption to simplify gradients, assuming that its recursive loops converge to a stable state.
TRM, by contrast, just runs the full recursion several times and backpropagates through all the steps.
This removes the need for theoretical constraints and gives a huge boost in generalization: 56.5% → 87.4% on Sudoku-Extreme.
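In pseudocode terms, a TRM-style training step looks roughly like this (a simplified sketch of "run the full recursion and backprop through every step"; the network, dimensions, and data are illustrative, and the real model recurses over both a latent state and an answer embedding):
```python
# Simplified sketch of TRM-style training: run the recursion for a
# fixed number of steps and backpropagate through ALL of them,
# instead of relying on a fixed-point gradient assumption as HRM does.
import torch
import torch.nn as nn

d = 64
net = nn.GRUCell(d, d)          # stand-in for the tiny recursive core
head = nn.Linear(d, d)          # maps the latent state to an answer
opt = torch.optim.Adam(list(net.parameters()) + list(head.parameters()), lr=1e-3)

x = torch.randn(32, d)          # embedded puzzle input
y = torch.randn(32, d)          # target answer embedding
z = torch.zeros(32, d)          # latent reasoning state

for _ in range(6):              # full recursion, no detach between steps
    z = net(x, z)

loss = nn.functional.mse_loss(head(z), y)
loss.backward()                 # gradients flow through every recursion step
opt.step()
```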
Retrieval-of-Thought (RoT) makes reasoning models faster by reusing earlier reasoning steps as templates.
These steps are stored in a “thought graph” that shows both their order and meaning.
As a result, RoT:
- reduces output tokens by up to 40%
- speeds up inference by 82%
- lowers cost by 59%
All without losing accuracy.
Here is how it works:
RoT works by:
- Storing reasoning steps as nodes in a “thought graph.”
- Retrieving relevant steps when a new problem comes in.
- Assembling a dynamic template from those steps to guide the model.
Let’s take it step by step
1. Building the "thought graph"
Researchers collected a large set of reasoning templates (3.34k). Each step in these templates became a node in the graph, with metadata like topic tags: algebra, geometry, etc.
- Sequential edges connect steps in the natural order within a template.
- Semantic edges connect steps that mean similar things across different templates.
So this graph acts like a memory bank of reasoning fragments.
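Here is a hypothetical miniature of that structure (the data and retrieval logic are invented for illustration): nodes are reasoning steps, sequential edges preserve within-template order, and semantic edges link similar steps across templates.
```python
# Miniature "thought graph" sketch: reasoning steps as nodes, with
# sequential edges (order within a template) and semantic edges
# (similar steps across templates). All data here is illustrative.
from dataclasses import dataclass, field

@dataclass
class StepNode:
    text: str
    tags: set                                       # topic metadata, e.g. {"algebra"}
    seq_next: list = field(default_factory=list)    # sequential edges
    semantic: list = field(default_factory=list)    # semantic edges

# Two tiny templates sharing a semantically similar step
a1 = StepNode("expand both sides", {"algebra"})
a2 = StepNode("collect like terms", {"algebra"})
b1 = StepNode("group matching terms", {"algebra"})
b2 = StepNode("solve for the variable", {"algebra"})

a1.seq_next.append(a2)
b1.seq_next.append(b2)
a2.semantic.append(b1)  # "collect like terms" ~ "group matching terms"

def retrieve(start: StepNode, query_tags: set) -> list:
    # Walk sequential edges, hopping across semantic edges when tags
    # match, to assemble a step template for the new problem.
    steps, node, seen = [], start, set()
    while node and id(node) not in seen:
        seen.add(id(node))
        steps.append(node.text)
        hops = node.seq_next + [s for s in node.semantic if s.tags & query_tags]
        node = hops[0] if hops else None
    return steps

print(retrieve(a1, {"algebra"}))  # a 4-step template stitched from both templates
```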
1. Intern-S1: A scientific multimodal foundation model by Shanghai AI Lab (open-source)
This is a 241B-parameter multimodal Mixture-of-Experts model with 28B active parameters, optimized for scientific reasoning:
- Trained on 5T tokens (2.5T scientific)
- Supports text, images, molecular structures, and time-series data.
- Has a dynamic tokenizer and Mixture-of-Rewards RL framework
- Outperforms both open- and closed-source models on MatBench, ChemBench, etc.
2. A 9B hybrid Mamba-Transformer LLM optimized for reasoning:
- 3–6× higher throughput than Qwen3-8B
- Matches or exceeds its accuracy across benchmarks like MATH (80.5), BFCLv3, RULER-128k, AIME24
- FP8 pretraining on 20T tokens with 128k context
- Runs on a single 22GB A10G GPU