Herumb Shandilya
Research @StanfordCRFM @HazyResearch | Building DSRs | MSCS, ColBERT, DSPy @Stanford | Learning @forai_ml | Chill @CreworkHQ
Sep 10 11 tweets 5 min read
DSRs, @DSPyOSS for Rust is here🚀

Happy to finally announce the stable release of DSRs. Over the past few months, I’ve been building DSRs with incredible support and contributions from Maguire Papay, @tech_optimist, and @joshmo_dev.

A big shout out to @lateinteraction and @ChenMoneyQ who were the first people to hear my frequent rants on this!! Couldn't have done this without all of them.

DSRs originally started as a passion project to explore true compilation, and as it progressed I saw it becoming something more. I can’t wait to see what the community builds with it.

DSRs is a 3 phase project:

1. API Stabilization. We are nearly done with this, and it was mostly about implementing the API design. We kept DSPy's style in mind and tried to stay close to it so it's easier to onboard, and while at it we tried to make it a bit more idiomatic and intuitive!

2. Performance Optimisation, with benchmarking vs DSPy. With the API design finalized, we want to benchmark LLM performance against DSPy and improve performance on every front. We'll improve latency and refine the templates and optimizers in DSRs.

3. True Module Compilation. Why should you only optimize a signature when you can optimize and fuse much more? That's the idea behind the final phase of DSRs: a true LLM workflow compiler. More on this after Phase 2.

Really grateful to @PrimeIntellect for offering compute to drive the Phase 2 and 3 experimentation! Big shoutout to them and @johannes_hage!!!

But what is DSRs? What does it offer? Let's see. [1] Faster DataLoaders

DSRs, much like DSPy, uses Example and Prediction as the I/O currency in workflows, but is much stricter about them.

To make this easier, we provide dataloaders to load data from CSV, JSON, Parquet, and HF datasets as a vector of Examples.
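To make the idea concrete, here's a rough sketch of the same pattern on the Python/DSPy side, since DSRs mirrors DSPy's Example type (the CSV path and column names are made up for illustration; DSRs' Rust API differs in the details):

```python
import pandas as pd
import dspy

# Hypothetical CSV with "question" and "answer" columns
rows = pd.read_csv("qa.csv").to_dict(orient="records")

# The same "vector of Examples" idea that DSRs' dataloaders return,
# expressed here with DSPy's Example type
examples = [
    dspy.Example(question=r["question"], answer=r["answer"]).with_inputs("question")
    for r in rows
]
```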
Jan 19, 2023 18 tweets 4 min read
There has been a lot of work on pretrain-finetune style models in NLP, something that CV lacks right now. Can we train convolution models BERT-style? If not, then what's stopping us? And how do we make it better?

SparK: Designing BERT For ConvNets, a 🧵 [1] Motivation

A lot of work has gone into carrying the success of BERT in LMs over to image models, especially ViTs, and while there has been some progress, we have yet to find an efficient adaptation of BERT pretraining for convolution models. This paper explores the reasons why and how to fix them.
Nov 14, 2022 15 tweets 4 min read
Floating-point precision is given a lot of importance, yet techniques like mixed precision show that you can still perform well in a lower-precision setting. But do you need floating point at all, or can you go lower?

Quantization, a 🧵 [1] Quantization: What, Why & How

Quantization is a model compression technique where you convert the model's precision from float32 to a lower precision like int8. Why? Well, it makes the model smaller and makes inference faster as well, basically helping you compress the model.
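For instance, here's a minimal sketch using PyTorch's dynamic quantization (the model below is just a stand-in; in practice you'd re-evaluate accuracy after quantizing):

```python
import torch
import torch.nn as nn

# A stand-in float32 model
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Dynamic quantization: weights of the listed layer types are stored as int8
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized)  # the Linear layers are now dynamically quantized modules
```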
Oct 27, 2022 19 tweets 5 min read
The bigger the model, the better it is! Overparameterization for the win! Well, those are common notions, but is that really the case? Can we train a model that is smaller yet rivals a bigger one?

Knowledge Distillation, a 🧵 [1] Not so Recent Discovery

Geoffrey Hinton and team wrote the distillation paper in 2015, and while that paper is special, it wasn't the first time someone dwelled on this topic. Back in 2006, Caruana showed us how you can transfer the knowledge of an ensemble into a single model.
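The core recipe from the 2015 paper, roughly sketched in PyTorch (the temperature T and mixing weight alpha here are illustrative hyperparameters, not prescribed values):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soften teacher and student distributions with temperature T
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between the softened distributions, scaled by T^2 as in Hinton et al. (2015)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    # Ordinary cross-entropy against the hard labels
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```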
Oct 12, 2022 15 tweets 3 min read
In NLP, the first thing I ever learned was the tokenization of sentences. However, for a long time I thought of tokenization as breaking sentences into "words". While that might be partly true, is that all there is to it? Or can we do better?

Subword Tokenization, a 🧵 [1] Fixing the Definition

Let's clarify the terms first: a token is anything that represents a word or a part of one. That means even characters can be tokens. In fact, character-level tokens have been used several times in research papers.
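For example, a WordPiece tokenizer from Hugging Face splits rarer words into subword pieces while single characters remain valid tokens (the exact splits shown in the comments are illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tok.tokenize("tokenization"))  # e.g. ['token', '##ization'] — a word split into subwords
print(tok.tokenize("a"))             # ['a'] — a single character can be a token too
```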
Oct 10, 2022 19 tweets 4 min read
AlphaTensor's release showed us a unique use case of RL for algorithm discovery and the community seems to be really thrilled about it. But how does it work? What exactly did they deliver? What does this mean?

AlphaTensor, a 🧵 [1] MatMul as (a set of) Scalar Operations

Normally, MatMul is nothing but a set of multiplications and additions over rows and columns. Take 2x2 matrices as an example: normally you'd do 8 multiplications, but with Strassen's algorithm you only need 7.
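Here's a small numpy sketch of Strassen's 7-multiplication trick for the 2x2 case, checked against the ordinary matmul:

```python
import numpy as np

def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with 7 scalar multiplications (Strassen, 1969)."""
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4,           m1 - m2 + m3 + m6]])

A, B = np.random.rand(2, 2), np.random.rand(2, 2)
assert np.allclose(strassen_2x2(A, B), A @ B)  # same result, one fewer multiplication
```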
Oct 7, 2022 9 tweets 2 min read
Want to train on a larger batch size but can't because of the memory limit? Don't want to buy/rent a GPU for a few extra batches? What if you could train with higher batch sizes on the same setup 😏

Gradient Accumulation, a 🧵 As complicated as it may sound, it's really simple. Let's say you are training on batch size 4; usually you'll clear the gradients with the zero_grad() method after every backward() call. But why?
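A minimal PyTorch-style sketch of the idea (model, loader, loss_fn, and optimizer are assumed to already exist; accumulating over 4 steps is just an example):

```python
accum_steps = 4  # effective batch size = loader batch size * accum_steps

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps  # scale so gradients average over accum_steps
    loss.backward()                            # gradients keep adding up across iterations
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # update only every accum_steps batches
        optimizer.zero_grad()                  # ...and only then clear the gradients
```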
Oct 5, 2022 10 tweets 4 min read
Gradient Descent has cemented its position by now; after all, it's a decent way to optimize a function. But are gradients the only way to optimize? Genetic Algorithms are a popular alternative, but first, let's learn about a simple yet powerful one...

Particle Swarm Optimization, a 🧵 PSO is a very simple technique to optimize a function without having to calculate its gradients w.r.t. the params first. Such algorithms, where you optimize a function without gradients, come under Gradient-Free Optimization.
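A bare-bones numpy sketch of PSO minimizing the sphere function (swarm size, inertia w, and the c1/c2 coefficients are typical illustrative values, not tuned settings):

```python
import numpy as np

def pso(f, dim=2, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5):
    """Minimize f with a basic particle swarm: no gradients needed."""
    x = np.random.uniform(-5, 5, (n_particles, dim))   # particle positions
    v = np.zeros_like(x)                               # particle velocities
    pbest, pbest_val = x.copy(), np.apply_along_axis(f, 1, x)
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(iters):
        r1, r2 = np.random.rand(*x.shape), np.random.rand(*x.shape)
        # inertia + pull toward personal best + pull toward global best
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        vals = np.apply_along_axis(f, 1, x)
        better = vals < pbest_val
        pbest[better], pbest_val[better] = x[better], vals[better]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

# Example: minimize f(x) = sum(x_i^2); the optimum is at the origin
print(pso(lambda p: np.sum(p ** 2)))
```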
Oct 3, 2022 17 tweets 4 min read
Recently Make-a-Video blew everyone's mind by showing something seemingly unreal, i.e. Text2Video (T2V). So what are they doing? How does it differ from previous work? More importantly, what direction is this heading in now?

Make-a-Video, a 🧵 [1] The Inspiration

Text2Image has shown decent results, and there's a good amount of data for it. However, data at the same scale mapping text to videos just isn't there, which proves to be a hindrance to building equally good T2V models.
Oct 1, 2022 18 tweets 4 min read
When it comes to Deep Learning, gradients are held in high regard, but how does a computer calculate them? Or maybe a better question is, how does PyTorch calculate them? Let's answer both.

But first, let's see what methods we have at hand

Differentiation using Computers, a 🧵 [1] Manual Differentiation

Well, the most straightforward answer would be to not let computers calculate derivatives. Instead, we do that part ourselves and let them know what it is.

However, this opens the gates to human error and is simply too laborious. Denied!
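As a tiny illustration, manual differentiation just means working the derivative out on paper and hard-coding it (the function below is made up for the example):

```python
import math

def f(x):
    return math.sin(x) * x ** 2

def df_dx(x):
    # d/dx [sin(x) * x^2] = cos(x) * x^2 + 2x * sin(x), worked out by hand
    return math.cos(x) * x ** 2 + 2 * x * math.sin(x)

print(f(1.0), df_dx(1.0))  # the computer only evaluates what we derived ourselves
```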
Sep 29, 2022 10 tweets 3 min read
We've all been using GPUs for deep learning, and I'm pretty sure someone is training a model at this very moment on some GPU instance. But have you ever wondered how it works?

GPU in Deep Learning, a 🧵 [1] GPU Architecture
Where CPUs aim to solve stuff faster, GPUs aim to solve more stuff at once, something people call high throughput and parallelization. But what makes this possible? A GPU has many smaller cores that can execute instructions in parallel. How?
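As a rough illustration with PyTorch (assuming a CUDA GPU is available), the same matmul can be dispatched across thousands of GPU threads at once:

```python
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

c_cpu = a @ b                      # runs on a handful of CPU cores

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    c_gpu = a_gpu @ b_gpu          # same op, launched as many parallel GPU threads
    torch.cuda.synchronize()       # GPU kernels run asynchronously; wait for completion
```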
Sep 27, 2022 15 tweets 3 min read
So everyone is talking about Whisper by @OpenAI, but what exactly is it? Let's take a look at what it is all about and what purpose it serves.

Whisper, a 🧵 [1] The Problem
We're all familiar with self-supervised models like Wav2Vec2, which have shown they can utilize large sets of unlabelled audio data to learn high-level audio representations.
Sep 26, 2022 8 tweets 4 min read
Tired of writing the same code again and again for your ML pipeline? Let me tell you about @EinblickAI. Einblick is Notebooks on steroids, really! Its interface is like those Unreal Engine blueprints, which makes it easy for beginners as well. Let's take a look! You start by creating a new Canvas. Canvas is to @EinblickAI as Notebook is to Jupyter. Once you create a Canvas, you'll see the following interface: