Happy to finally announce the stable release of DSRs. Over the past few months, I've been building DSRs with incredible support and contributions from folks like Maguire Papay, @tech_optimist, and @joshmo_dev.
A big shout out to @lateinteraction and @ChenMoneyQ who were the first people to hear my frequent rants on this!! Couldn't have done this without all of them.
DSRs originally started as a passion project to explore true compilation, and as it progressed I saw it becoming something more. I can't wait to see what the community builds with it.
DSRs is a 3-phase project:
1. API Stabilization. We're nearly done with this; it was mostly a matter of implementing the API design. We kept the DSPy style in mind and stayed close to it so onboarding is easier, while also trying to make it a bit more idiomatic and intuitive!
2. Performance Optimization, with benchmarking against DSPy. With the API design finalized, we want to benchmark performance against DSPy and improve on every front: lower latency and better templates and optimizers in DSRs.
3. True Module Compilation. Why optimize just a signature when you can optimize and fuse much more? That's the idea behind the final phase of DSRs: a true LLM workflow compiler. More on this after Phase 2.
Really grateful to @PrimeIntellect for offering compute to drive the Phase 2 and 3 experimentation! Big shoutout to them and @johannes_hage for this!!!
But what is DSRs? What does it offer? Let's see.
[1] Faster DataLoaders
Much like DSPy, DSRs uses Example and Prediction as the I/O currency in workflows, but DSRs is much stricter about it.
To make this easier, we provide dataloaders to load data from CSV, JSON, Parquet, and HF as a vector of Examples.
[2] Signatures
DSRs provides you 2 ways to define signatures: inline with a macro, and struct-based with an attribute macro. With the attribute macro you define your signature as a struct in "DSPy syntax"; with the macro_rules signature macro you define it via an einsum-like notation.
Signatures are the only point of change for task structure. That means you don't have separate CoT predictors; instead, you pass that as an argument to the macro.
[3] Modules
Modules in DSRs define the flow of the LLM workflow you are designing. You can configure evaluation and optimization individually for each module, via traits like Evaluator and Optimizable that connect to the optimizer and define the process for that module.
[4] Predictors
Predictors are not Modules in DSRs; rather, they are the only entity bound to a single signature, and they invoke the LLM call via Adapters. Currently we only have Predict, but we plan to add Refine and ReAct soon.
[5] Evaluator
Evaluator is defined as a trait to be implemented by the module you wish to evaluate. You define the metric methods and call evaluate over an example vector to get the result.
[6] Optimization
Optimization is much more granular in DSRs: you can free up individual components of a Module for optimization. By default everything is unoptimizable; to tag a component as optimizable, you mark it with `parameter` and derive the Optimizable trait. We support nested parameters too.
We provide COPRO right now; the optimizers are still quite experimental. With compute now available, we'll test and iterate on this more thoroughly and add support for more optimizers.
Stay tuned for more updates, and much more frequent ones. We have examples in the repo to get you up to speed, and a docs site is releasing soon!!
There has been a lot of work on pretrain-finetune style models in NLP, something that CV lacks right now. Can we train a convolutional model BERT-style? If not, what's stopping us? How do we make it better?
SparK: Designing BERT For ConvNets, a 🧵
[1] Motivation
A lot of work has gone into carrying the success of BERT in LMs over to image models, especially ViTs, but while there has been some progress, we have yet to find an efficient adaptation of BERT pretraining to convolutional models. This paper explores the reasons and the fixes.
[2] Masked Image Modelling (MIM)
In BERT pretraining we do Masked Language Modelling, in which we replace some words with a [MASK] token and let BERT predict those words. In MIM, we mask the image patches, pass them to a student ViT, and get its representation from the MIM head.
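To make the masking step concrete, here's a minimal, hypothetical PyTorch sketch of random patch masking; the patch size and mask ratio are arbitrary picks for illustration, not SparK's actual settings:

```python
import torch

def random_patch_mask(images, patch_size=16, mask_ratio=0.6):
    """Zero out a random subset of non-overlapping patches in a batch of images."""
    b, c, h, w = images.shape
    ph, pw = h // patch_size, w // patch_size        # patch grid dimensions
    num_patches = ph * pw
    num_masked = int(num_patches * mask_ratio)

    noise = torch.rand(b, num_patches)
    rank = noise.argsort(dim=1).argsort(dim=1)       # random ranking of patches per image
    mask = rank < num_masked                         # True = this patch gets masked

    # expand the patch-level mask to pixel level and blank out the masked patches
    mask = mask.view(b, 1, ph, pw)
    mask = mask.repeat_interleave(patch_size, dim=2).repeat_interleave(patch_size, dim=3)
    return images * (~mask), mask

imgs = torch.randn(4, 3, 224, 224)
masked_imgs, mask = random_patch_mask(imgs)          # the masked view is what the student sees
```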
A lot of importance is given to floating-point precision, and techniques like mixed precision show that you can still perform well in a lower-precision setting. But do you even need floating point, or can you go lower?
Quantization, a 🧵
[1] Quantization: What, Why & How
Quantization is a model compression technique where you convert model precision from float32 to a lower precision like int8. Why? Well, it makes the model smaller and faster to run, which is exactly what model compression is about.
But how do you map float32 to int? By scaling and rounding! We have a mapping function, M(x), that maps a float32 value to an int.
M(x) = round(x / S) + Z
We also have an inverse function, M'(x), to map the quantized weights w_i back to float32.
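As a rough illustration of M and its inverse, here's a hypothetical NumPy sketch of affine int8 quantization, where the scale S and zero point Z are derived from the tensor's range (a toy example, not any particular library's implementation):

```python
import numpy as np

def quantize(x, num_bits=8):
    """Affine quantization: q = round(x / S) + Z, clipped to the int range."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    S = (x.max() - x.min()) / (qmax - qmin)            # scale
    Z = int(round(qmin - x.min() / S))                 # zero point
    q = np.clip(np.round(x / S) + Z, qmin, qmax).astype(np.int8)
    return q, S, Z

def dequantize(q, S, Z):
    """Inverse mapping M'(q) = S * (q - Z), back to float32."""
    return (S * (q.astype(np.float32) - Z)).astype(np.float32)

w = np.random.randn(4, 4).astype(np.float32)
q, S, Z = quantize(w)
w_hat = dequantize(q, S, Z)
print(np.abs(w - w_hat).max())   # small reconstruction error from rounding
```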
The bigger the model, the better it is! Overparameterization for the win! Well, that's a common notion, but is it really the case? Can we train a model that is smaller but better than a bigger one?
Knowledge Distillation, a 🧵
[1] Not so Recent Discovery
Geoffrey Hinton and team wrote the paper in 2015, and that paper is special, but it wasn't the first time someone dwelled on this topic. Back in 2006, Caruana and colleagues showed how you can transfer the knowledge of an ensemble into a single model.
[2] Caruana's Approach
📌 The aim of an ML model is to find the function that best fits the data. The data comes from some distribution, so in an ideal scenario there is a ground-truth function behind it as well.
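To make the knowledge-transfer idea concrete, here's a toy, hypothetical PyTorch sketch (closer to soft-target matching than Caruana's exact 2006 setup): a small student is trained to mimic the averaged predictions of an ensemble of teachers. All shapes and hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

teachers = [torch.nn.Linear(20, 5) for _ in range(3)]   # stand-in for a trained ensemble
student = torch.nn.Linear(20, 5)                        # the smaller model we want to keep
opt = torch.optim.SGD(student.parameters(), lr=0.1)

x = torch.randn(256, 20)                                # (possibly unlabeled) inputs

with torch.no_grad():
    # The ensemble's averaged soft predictions become the student's training target.
    soft_targets = torch.stack([F.softmax(t(x), dim=-1) for t in teachers]).mean(0)

for _ in range(100):
    opt.zero_grad()
    loss = F.kl_div(F.log_softmax(student(x), dim=-1), soft_targets, reduction="batchmean")
    loss.backward()
    opt.step()
```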
In NLP, the first thing I ever learned was the tokenization of sentences. However, for a long time I thought of tokenization as just breaking sentences into "words". While that might be partly true, is that all there is to it? Or can we do better?
Subword Tokenization, a 🧵
[1] Fixing the Definition
Let's clarify the terms first: a token is anything that represents a word or a part of it. That means even characters can be tokens. In fact, their use has been demonstrated several times in research papers.
One advantage of char-level tokenization is the fixed vocab size, which eliminates the problem of exploding vocab size in word tokenization. Not just that, it also, sort of, resolves the issue of OOV words. However, it has its own problems.
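A tiny, made-up example of that trade-off (toy corpus, plain Python):

```python
# Compare vocabulary behavior for word-level vs character-level tokenization.
corpus = [
    "subword tokenization balances vocabulary size and coverage",
    "character tokenizers never see out-of-vocabulary words",
]

word_vocab = {w for sentence in corpus for w in sentence.split()}
char_vocab = {c for sentence in corpus for c in sentence}

print(len(word_vocab))   # grows with every new word in the corpus
print(len(char_vocab))   # stays bounded by the alphabet (plus space/punctuation)

# An unseen word is OOV at the word level but trivially representable character by character.
print("tokenizer" in word_vocab)                   # False -> OOV for the word tokenizer
print(all(c in char_vocab for c in "tokenizer"))   # True  -> covered by the char vocab
```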
AlphaTensor's release showed us a unique use case of RL for algorithm discovery and the community seems to be really thrilled about it. But how does it work? What exactly did they deliver? What does this mean?
AlphaTensor, a 🧵
[1] MatMul as (a Set of) Scalar Operations
Normally, MatMul is nothing but a set of multiplications and additions among rows and columns. Take multiplying two 2x2 matrices: normally you'll be doing 8 multiplications. With Strassen's algorithm, however, you'll be doing only 7.
Basically, a matrix is a collection of scalars. Hence, if we want to find the elements of the resultant matrix, we can do so by representing each element as a set of operations among the scalars of those 2 matrices.
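For the 2x2 case, here's a small NumPy sketch of Strassen's 7-multiplication recipe (variable names are just illustrative):

```python
import numpy as np

def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with Strassen's 7 multiplications instead of the usual 8."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B

    # the 7 scalar products
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)

    # recombine them into the result with only additions/subtractions
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4,           m1 - m2 + m3 + m6]])

A, B = np.random.randn(2, 2), np.random.randn(2, 2)
print(np.allclose(strassen_2x2(A, B), A @ B))   # True
```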
Want to train on a larger batch size but can't because of the memory limit? Don't want to buy/rent a GPU for a few extra batches? What if you could train with a higher batch size on the same setup 😏
Gradient Accumulation, a 🧵
As complicated as it may sound, it's really simple. Let's say you are training with batch size 4; usually you'll clear the gradients by calling the zero_grad() method after every backward() call. But why?
That's because PyTorch accumulates (sums) the gradients on every backward() call. Since you've already updated the weights of the model via step(), if you now call backward() without clearing them out, the gradient will be an accumulation of the previous and current batches.
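Here's a minimal, hypothetical PyTorch sketch of gradient accumulation with a placeholder model and toy data, accumulating 4 micro-batches per optimizer step:

```python
import torch

model = torch.nn.Linear(10, 1)                     # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

# toy micro-batches of size 4; the effective batch becomes 4 * accumulation_steps
data = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(16)]
accumulation_steps = 4

opt.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y)
    (loss / accumulation_steps).backward()         # scale so gradients average instead of sum
    if (step + 1) % accumulation_steps == 0:
        opt.step()                                 # one update per accumulation window
        opt.zero_grad()                            # then clear the accumulated gradients
```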