The Adversarial Robustness Toolbox (ART) = a framework for evaluating and defending deep learning models against adversarial attacks
Thread⬇️
Adversarial attacks come in two broad flavors:
+White Box Attacks: the adversary has full access to the model and its training environment, including knowledge of the training algorithm
+Black Box Attacks: the adversary can only query the model and has no additional knowledge of its internals
2/⬇️
The goal of ART = to provide a framework to evaluate the robustness of a neural network.
The current version of ART focuses on four types of adversarial attacks:
+evasion
+inference
+extraction
+poisoning
3/⬇️
ART is a generic Python library. It provides native integration with several deep learning frameworks such as @TensorFlow, @PyTorch, #Keras, and @ApacheMXNet.
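To make this concrete, here's a minimal sketch of an evasion attack with ART against a toy PyTorch model (the model and data are placeholders; exact signatures may vary across ART versions):

```python
import numpy as np
import torch.nn as nn
import torch.optim as optim
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import FastGradientMethod

# Toy classifier standing in for a real trained model
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

# Wrap the model in an ART estimator so attacks can query it
classifier = PyTorchClassifier(
    model=model,
    loss=nn.CrossEntropyLoss(),
    optimizer=optim.Adam(model.parameters(), lr=1e-3),
    input_shape=(1, 28, 28),
    nb_classes=10,
)

# Craft adversarial examples with the Fast Gradient Method (an evasion attack)
x_test = np.random.rand(16, 1, 28, 28).astype(np.float32)
attack = FastGradientMethod(estimator=classifier, eps=0.1)
x_adv = attack.generate(x=x_test)

# Compare predictions on clean vs. adversarial inputs to gauge robustness
clean = classifier.predict(x_test).argmax(axis=1)
adv = classifier.predict(x_adv).argmax(axis=1)
print(f"Predictions flipped on {np.mean(clean != adv):.0%} of inputs")
```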
If you'd like a more in-depth look at ART, click the link below to read TheSequence Edge#7, our educational newsletter. thesequence.substack.com/p/edge7 5/5
▪️ LM2:
- Uses a Transformer architecture with a memory module to improve long-context reasoning.
- Outperforms RMT by 37.1% and excels in multi-hop inference.
▪️ NatureLM:
- Trained across scientific domains.
- Enhances tasks like SMILES-to-IUPAC translation and CRISPR RNA design for cross-domain applications.
▪️ Goedel-Prover:
- Advances formal proof generation.
- Achieves 57.6% Pass@32 on miniF2F using expert iteration and statement formalizers.
Find the links below👇
1. LM2: Large Memory Models by Convergence Labs Ltd.
Our top 7
▪️ Matryoshka Quantization
▪️ LLM Pretraining with Continuous Concepts
▪️ LLMs can easily learn to reason from demonstrations
▪️ Forget what you know about LLMs evaluations – LLMs are like a chameleon
▪️ Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
▪️ Hephaestus
▪️ SynthDetoxM Dataset
▪️ The Curse of Depth in LLMs
▪️ InfiniteHiP
▪️ Distillation Scaling Laws
▪️ TransMLA: Multi-Head Latent Attention
▪️ Logical reasoning in LLMs: A survey
▪️ ReasonFlux
▪️ How Stanford’s s1 surpasses DeepSeek-R1
▪️ The Stochastic Parrot on LLM’s Shoulder
▪️ Training LMs for Social Deduction with Multi-Agent RL
▪️ Towards Internet-scale training for agents
▪️ WorldGUI
▪️ CoSER: Coordinating LLM-Based Persona Simulation
▪️ Scaling Pre-training to One Hundred Billion Data for VLMs
▪️ Adapting Language-Specific LLMs to Reasoning Models
🧵
1. Matryoshka Quantization from @GoogleDeepMind
Introduces MatQuant, a multi-scale quantization technique that trains a single model whose int8 weights nest int4 and int2 representations, so one checkpoint can be served at multiple precisions (or mix precisions across layers) for efficient deployment.
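A toy sketch of the nesting idea (illustrative only; MatQuant's actual recipe co-trains the precisions with learned scales): the most significant bits of an int8 weight code double as its int4 and int2 codes.

```python
import numpy as np

def slice_msbs(w_int8: np.ndarray, target_bits: int) -> np.ndarray:
    """Derive a lower-precision code by keeping the top `target_bits` bits.

    Assumes unsigned 8-bit codes in [0, 255]; e.g. 0b10110110 -> 0b1011
    for int4. This is the Matryoshka-style nesting, minus the training.
    """
    return w_int8 >> (8 - target_bits)

w8 = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)
w4 = slice_msbs(w8, 4)  # nested int4 weights
w2 = slice_msbs(w8, 2)  # nested int2 weights
print(w8[0], w4[0], w2[0])
```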
1. Model Distillation guide from @OpenAI
2. Knowledge Distillation tutorial by @PyTorch
3. Jetson Introduction to Knowledge Distillation by @nvidia
4. Tutorial on Knowledge Distillation with @kerasteam
5. @huggingface's guides:
- Knowledge Distillation
- Knowledge Distillation for Computer Vision
Save this list and check out the details on each resource below 👇
1. Model Distillation guide from @OpenAI
Explains the process step by step, including:
- storing outputs from a large model
- evaluating both the large and small models
- creating training data for the small model
- assessing the fine-tuned small model
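A minimal sketch of the first data-collection steps in that workflow (the prompts and file name are hypothetical; the guide itself also covers OpenAI's stored-completions feature):

```python
import json
from openai import OpenAI

client = OpenAI()
prompts = [
    "Summarize photosynthesis in one sentence.",  # hypothetical examples
    "Explain overfitting to a beginner.",
]

# Store outputs from the large "teacher" model as fine-tuning examples
with open("distillation_data.jsonl", "w") as f:
    for prompt in prompts:
        response = client.chat.completions.create(
            model="gpt-4o",  # teacher model
            messages=[{"role": "user", "content": prompt}],
        )
        answer = response.choices[0].message.content
        f.write(json.dumps({"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]}) + "\n")

# The JSONL file can then be uploaded and used to fine-tune a smaller
# student model, e.g. via client.files.create(...) and
# client.fine_tuning.jobs.create(...)
```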
2. Knowledge Distillation tutorial by @PyTorch covers:
• Extracting hidden representations for further calculations
• Modifying PyTorch training loops to include additional losses
• Enhancing lightweight models using complex models as teachers
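The "additional losses" part usually boils down to a temperature-scaled soft-label term. A generic sketch (not the tutorial's exact code):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-label KL term.

    The KL term pulls the student's softened distribution toward the
    teacher's; the T**2 factor rescales its gradients to match.
    """
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * hard + (1 - alpha) * soft

# Toy usage: batch of 8 examples, 10 classes
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```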
Distillation involves using a large teacher model to train a smaller student one.
But can we predict a distilled model’s performance based on teacher quality, student size, data volume, etc.?
@Apple and @UniofOxford explored this and developed distillation scaling laws.
Here are the key takeaways👇
1. A good teacher doesn’t always mean a better student:
If a teacher is too strong, the student might struggle to learn from it, leading to worse performance.
This is called the capacity gap: the student isn't powerful enough to properly mimic the teacher.
2. The distillation scaling law predicts how well a student model will perform based on three key factors:
- Student model's size
- The number of training tokens
- The teacher’s size and quality
This law follows a power-law relationship: performance improves in a predictable way, but only up to a point, after which adding more resources yields diminishing returns.
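As a toy illustration of that saturating behavior (made-up functional form and coefficients, not the paper's fitted law):

```python
# Hypothetical power law: student loss decays with student size N and
# distillation tokens D, floored by an irreducible term L0.
def student_loss(N, D, a=400.0, alpha=0.34, b=410.0, beta=0.28, L0=1.7):
    return a * N ** (-alpha) + b * D ** (-beta) + L0

for N in [1e7, 1e8, 1e9, 1e10]:
    print(f"N={N:.0e}: predicted loss = {student_loss(N, D=1e10):.2f}")
# Each 10x increase in student size buys a smaller loss reduction:
# predictable improvement that flattens out, as described above.
```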
Sliding Tile Attention (STA) speeds up video generation by up to 3.53×.
It focuses only on small, relevant regions at a time and moves across the video in a sliding pattern.
STA processes larger chunks (tiles) at once, making it faster and more hardware-efficient.
Here's how it works:
First, what's wrong with current methods?
3D attention, which is generally used in Diffusion Transformers (DiTs), processes all video frames at once and treats every pixel separately. This consumes a huge amount of computing power, about 70% of the total effort.
The problem with traditional Sliding Window Attention (SWA) is that its per-token windows create "mixed blocks" that are only partially attended, which are inefficient for GPU block-sparse kernels.
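A toy sketch of the difference (mask construction only, not the STA kernel): tile-aligned windows yield attention blocks that are entirely kept or entirely skipped, which block-sparse GPU kernels handle efficiently.

```python
import numpy as np

def tile_sliding_mask(num_tokens, tile, window_tiles):
    """Sliding window at tile granularity: every (tile x tile) block is
    all-ones or all-zeros, so there are no mixed blocks."""
    n_tiles = num_tokens // tile
    mask = np.zeros((num_tokens, num_tokens), dtype=bool)
    for q in range(n_tiles):
        lo, hi = max(0, q - window_tiles), min(n_tiles, q + window_tiles + 1)
        mask[q*tile:(q+1)*tile, lo*tile:hi*tile] = True
    return mask

def token_sliding_mask(num_tokens, radius):
    """Per-token sliding window: window edges cut through blocks, creating
    the partially filled 'mixed blocks' that waste GPU work."""
    idx = np.arange(num_tokens)
    return np.abs(idx[:, None] - idx[None, :]) <= radius

print(tile_sliding_mask(16, tile=4, window_tiles=1).sum(),
      token_sliding_mask(16, radius=5).sum())
```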