The Adversarial Robustness Toolbox (ART) = a framework for evaluating and defending deep learning models against adversarial security attacks
Thread⬇️
GANs = the most popular form of generative models.
Adversarial attacks come in two threat models:
+White Box Attacks: The adversary has full access to the model, including the training environment and knowledge of the training algorithm
+Black Box Attacks: The adversary can only query the model and has no additional knowledge of its internals
2/⬇️
The goal of ART = to provide a framework to evaluate the robustness of a neural network.
The current version of ART focuses on four types of adversarial attacks:
+evasion
+inference
+extraction
+poisoning
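For instance, an evasion attack with ART looks roughly like this. The toy model and random data below are placeholders, but PyTorchClassifier and FastGradientMethod are ART's actual API:

```python
# Minimal sketch: wrap a (toy) PyTorch model in ART and run an evasion attack.
import numpy as np
import torch
import torch.nn as nn
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import FastGradientMethod

# Placeholder MNIST-sized classifier; any trained model would go here.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

classifier = PyTorchClassifier(
    model=model,
    loss=nn.CrossEntropyLoss(),
    optimizer=torch.optim.Adam(model.parameters(), lr=1e-3),
    input_shape=(1, 28, 28),
    nb_classes=10,
    clip_values=(0.0, 1.0),
)

x_test = np.random.rand(16, 1, 28, 28).astype(np.float32)  # stand-in test images
attack = FastGradientMethod(estimator=classifier, eps=0.1)  # white-box evasion attack
x_adv = attack.generate(x=x_test)                           # crafted adversarial inputs

clean_preds = classifier.predict(x_test).argmax(axis=1)
adv_preds = classifier.predict(x_adv).argmax(axis=1)
print(f"predictions flipped on {np.mean(clean_preds != adv_preds):.0%} of inputs")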
3/⬇️
ART is a generic Python library. It provides native integration with several deep learning frameworks such as @TensorFlow, @PyTorch, #Keras, @ApacheMXNet
If you'd like concentrated coverage of ART, click the link below. You'll be taken to TheSequence Edge#7, our educational newsletter. thesequence.substack.com/p/edge7 5/5
👉 Mix various data types together: text next to images, video frames after captions, then webpages, etc. This way the model learns to connect what it reads with what it sees.
ByteDance proposed and implemented this idea in their BAGEL, a new open-source multimodal model.
Here's how it works:
Architecture:
BAGEL is one giant Transformer with two separate experts inside:
- Understanding expert handles text and ViT image tokens.
- Generation expert handles the VAE image-creation tokens.
These experts are placed side-by-side in every layer and "look" at the same sequence, but each focuses on its own job.
There are 2 image pipelines (sketched below):
- Vision Transformer (ViT) for understanding pictures: it turns raw pixels into tokens the model can reason about.
- VAE + diffusion for generating pictures: it compresses an image to a small latent grid, then refines noise into a final image.
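Here's a rough sketch of how such a two-expert layer could be wired (toy dimensions, simplified routing; not BAGEL's actual code): both experts share one attention pass over the full sequence, and each token's feed-forward step is routed by token type.

```python
# Simplified two-expert Transformer layer: shared attention over the whole
# mixed-modality sequence, with per-token routing to one of two FFN "experts".
import torch
import torch.nn as nn

class TwoExpertLayer(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One FFN per job, placed side by side in the same layer.
        self.understand_ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.generate_ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor, is_gen_token: torch.Tensor) -> torch.Tensor:
        # Shared attention: both experts "look" at the same full sequence.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Route each token: text/ViT tokens -> understanding expert,
        # VAE latent tokens -> generation expert. (For clarity this sketch
        # computes both FFNs and selects; a real implementation would route.)
        y = self.norm2(x)
        out = torch.where(
            is_gen_token.unsqueeze(-1), self.generate_ffn(y), self.understand_ffn(y)
        )
        return x + out

layer = TwoExpertLayer()
tokens = torch.randn(2, 10, 512)               # batch of mixed-modality tokens
is_gen = torch.zeros(2, 10, dtype=torch.bool)  # mark last 4 positions as VAE tokens
is_gen[:, 6:] = True
print(layer(tokens, is_gen).shape)             # torch.Size([2, 10, 512])
```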
.@sama's interview at @sequoia AI Ascent is full of insights on:
- How OpenAI came to ChatGPT
- Its aim to be the “core AI subscription”
- AI as an operating system
- What the ideal smart model is
- Main future goals
Here is an outline of his talk with the key ideas:
1. Past milestones and directions
- The first consumer product was the DALL·E API
- OpenAI also tried building a robot hand
- One person, and then a whole team, got excited about building LLMs with unsupervised learning. That work started with GPT-1 and GPT-2; then GPT-3 showed something really cool.
2. The hint to ChatGPT:
Scaling from GPT-3 to GPT-4 required massive funding, which pushed the transition from pure research to a sustainable business model.
This led OpenAI to release GPT-3 via an API, which had limited commercial success but revealed a key insight:
👉 People enjoyed chatting with the model, even when it wasn’t great at conversation.
1. Agents as first-class business & M365 entities:
The new Microsoft 365 Copilot unifies chat, search, notebooks, and tools like “Researcher” and “Analyst.” With Copilot Tuning, businesses can tailor agents to their own knowledge, language, and brand voice.
2. Know your agents
Microsoft Entra Agent ID gives every AI agent a unique, verifiable identity, so you know what access they have and which actions they're allowed to take.
Our top 9:
▪️ Beyond 'Aha!'
▪️ J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
▪️ The CoT Encyclopedia
▪️ System Prompt Optimization with Meta-Learning
▪️ Parallel Scaling Law for LMs
▪️ Insights into DeepSeek-V3
▪️ QuXAI: Explainers for Hybrid Quantum Machine Learning Models
▪️ AttentionInfluence
▪️ MLE-Dojo
More papers worth your attention:
▪️ Learning from Peers in Reasoning Models
▪️ WorldPM
▪️ Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent
▪️ Learning Dynamics in Continual Pre-Training for LLMs
▪️ Memorization-Compression Cycles Improve Generalization
▪️ DanceGRPO
▪️ Unified Continuous Generative Model
▪️ Depth Anything with Any Prior
▪️ MetaUAS
🧵
1. Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models
Proposes aligning models with meta-reasoning abilities (deduction, induction, abduction) to improve reasoning reliability and performance
Designing models and hardware together: is this the new shift toward the most cost-efficient models?
This idea is at work in DeepSeek-V3, which was trained on just 2,048 NVIDIA H800 GPUs.
A new paper from @deepseek_ai explains how DeepSeek-V3 works through its key innovations:
- Multi-head Latent Attention (MLA)
- Mixture of Experts (MoE)
- FP8 mixed-precision training
- Multi-Plane Network Topology
🧵
1. Multi-head Latent Attention (MLA)
MLA compresses the KV cache down to 70 KB per token, while other models like LLaMA-3.1 and Qwen2.5 need 7x more.
Thanks to this, DeepSeek-V3:
- Handles long conversations
- Runs on limited hardware
- Makes inference cheaper and more scalable
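A minimal sketch of the caching trick behind MLA (toy sizes; real MLA also handles rotary position embeddings separately, and this is not DeepSeek's code): cache one small latent per token and up-project it back to K and V only when attention needs them.

```python
# MLA-style KV compression: cache a small per-token latent instead of full K/V.
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 1024, 128, 8, 128  # toy dimensions

down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress to latent
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct keys
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct values

h = torch.randn(1, 16, d_model)  # hidden states for 16 tokens
c_kv = down_kv(h)                # (1, 16, 128) -- this latent is ALL we cache
k = up_k(c_kv).view(1, 16, n_heads, d_head)  # rebuilt on the fly at attention time
v = up_v(c_kv).view(1, 16, n_heads, d_head)

full_cache = 2 * n_heads * d_head  # floats per token for standard K+V caching
mla_cache = d_latent               # floats per token for the latent
print(f"cache is {full_cache / mla_cache:.0f}x smaller")  # 16x with these toy sizes
```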
2. Apart from MLA there are some other tricks to reduce the size of the KV cache:
- Shared KV (GQA/MQA): Multiple query heads share a single set of KV pairs (see the sketch after this list).
- Windowed KV: Keeps only recent tokens and drops the old ones, at the cost of long-range memory.
- Quantization: Store data in lower bit formats with minimal accuracy loss.
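Here's a generic sketch of the shared-KV idea (GQA with made-up sizes, not any particular model's code): 8 query heads share 2 cached KV heads, so the cache shrinks 4x.

```python
# Grouped-query attention: groups of query heads share one cached KV head.
import torch

n_q_heads, n_kv_heads, d_head, seq = 8, 2, 64, 16
group = n_q_heads // n_kv_heads  # 4 query heads per shared KV head

q = torch.randn(1, n_q_heads, seq, d_head)
k = torch.randn(1, n_kv_heads, seq, d_head)  # only 2 KV heads are ever cached
v = torch.randn(1, n_kv_heads, seq, d_head)

# Expand the shared KV heads so each query head sees its group's K/V.
k_exp = k.repeat_interleave(group, dim=1)    # (1, 8, seq, d_head)
v_exp = v.repeat_interleave(group, dim=1)

attn = torch.softmax(q @ k_exp.transpose(-2, -1) / d_head**0.5, dim=-1)
out = attn @ v_exp
print(out.shape)  # torch.Size([1, 8, 16, 64]) -- full output, 4x smaller cache
```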