Chomba Bupe
Tech entrepreneur | machine intelligence https://t.co/zzD5ZNb0OW https://t.co/h0mJxdVxQq
Jul 12 5 tweets 1 min read
Yep exactly this: Learning next sequence prediction doesn't recover the nature of the process that generated that sequence. See:
May 31 26 tweets 5 min read
I looked deeply into the mathematical formulation of diffusion models (image & video generators) & they are inherently doing stochastic gradient-based pixel blending, nothing more.

The explanation that follows will be a bit technical, so buckle up. Let's look at the problem of image blending first, then work up to diffusion models.

Image blending in a nutshell tries to mix two images S1 & S2 to produce an image S3.

Like:

S3 = S1*M1 + S2*M2

Where M1 & M2 are blending masks.
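
A minimal NumPy sketch of that blend (the names S1, S2, M1, M2 follow the formula above; assuming the masks are per-pixel weights in [0, 1]):

```python
import numpy as np

def blend(S1: np.ndarray, S2: np.ndarray, M1: np.ndarray, M2: np.ndarray) -> np.ndarray:
    """Blend two images with per-pixel masks: S3 = S1*M1 + S2*M2."""
    return S1 * M1 + S2 * M2

# Toy example: two 4x4 grayscale images mixed 70/30 everywhere.
S1 = np.random.rand(4, 4)
S2 = np.random.rand(4, 4)
M1 = np.full((4, 4), 0.7)
M2 = 1.0 - M1          # complementary mask so the weights sum to 1 per pixel
S3 = blend(S1, S2, M1, M2)
print(S3.shape)
```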
Mar 30 15 tweets 3 min read
The evidence is weak that the model "thinks" ahead of time.

They probed the model with a simple rhyming couplet:

"He saw a carrot and had to grab it."

There was evidence the model had already considered "rabbit" even though the word appears much later in the output sequence. There is a simple explanation.

For a decoder-only transformer model:

Half of the self-attention + feed-forward network (FFN) blocks act like encoders while the other half act like decoders.
Feb 4 4 tweets 1 min read
Testing large language models (LLMs) on IQ tests is like testing a search engine on IQ questions it has already indexed answers to: it simply does a lookup & randomly guesses at any other questions not in the training set.

It says nothing about the model's intelligence. The baseline should be a pure lookup function over the actual LLM's training set; compare its performance to the LLM & see by how much the LLM beats a search over its own training set.

But of course no one else has access to the training data of these LLMs.
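
A sketch of the kind of lookup baseline being suggested here, assuming a hypothetical table of (question, answer) pairs drawn from the training set & multiple-choice questions; all names & data are illustrative:

```python
import random

def lookup_baseline(question: str, train_qa: dict[str, str], choices: list[str]) -> str:
    """Answer from memory if the question was seen in training, else guess randomly."""
    if question in train_qa:
        return train_qa[question]          # pure lookup over the training set
    return random.choice(choices)          # random guess for unseen questions

# Hypothetical usage: compare this baseline's accuracy against the LLM's.
train_qa = {"What is 2 + 2?": "4"}
print(lookup_baseline("What is 2 + 2?", train_qa, ["3", "4", "5"]))
print(lookup_baseline("Unseen question?", train_qa, ["A", "B", "C", "D"]))
```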
Jan 22 7 tweets 2 min read
Folks misunderstand the term "stochastic parrot":

It stems from the fact that during token sampling from a distribution P(next tokens | previous tokens), the output token is picked stochastically, with the odds of being picked scaled by the predicted probability under P. For example, for a forward pass through P(), usually a decoder-only transformer model, the output is R, an array of numbers the size of the token vocabulary.

R is normalized into a probability distribution using softmax, whose entries range between 0.0 & 1.0 & sum to 1.0.
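
A minimal sketch of that sampling step, with a made-up 4-entry vocabulary standing in for a real model's output R:

```python
import numpy as np

def sample_next_token(R: np.ndarray, rng: np.random.Generator) -> int:
    """Softmax the raw scores R (one per vocabulary entry) and sample a token id."""
    exp = np.exp(R - R.max())          # subtract max for numerical stability
    P = exp / exp.sum()                # entries in [0, 1], summing to 1.0
    return rng.choice(len(P), p=P)     # stochastic pick, odds scaled by probability

rng = np.random.default_rng(0)
R = np.array([2.0, 1.0, 0.5, -1.0])   # toy "logits" for a 4-token vocabulary
print(sample_next_token(R, rng))
```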
Jan 1 17 tweets 3 min read
Contrary to popular belief, machine learning (ML) is not that difficult to pick up:

You just need to polish up on some basics though.

- Linear algebra: Vectors & Matrices

- Numerical optimization: Gradient descent

- Differential calculus

- Probability & information theory

Here is how I learned machine learning (ML) using a hands-on approach:

First, I need to mention that I came from an engineering background, where I was taught engineering maths, but I didn't use much of that in learning ML, only the above-mentioned prerequisites.
Dec 20, 2024 7 tweets 2 min read
Reminder:

There is a huge difference between "can solve anything" & "can learn anything".

A computer can run any program that solves anything but can it learn anything?

No.

Turing completeness is useless on its own. The problem of learnability - the ability to learn - is the major bottleneck in machine intelligence.

We have powerful architectures, like the central processing units (CPUs) in modern digital machines, that can implement any practical computational function.

But CPUs don't learn.
Nov 15, 2024 16 tweets 3 min read
A computational graph of the attention mechanism is as shown below.

In deep learning the weighting coefficients are "soft"; they don't drop to zero. For large context sizes these numerous small non-zero weights can contribute a significant amount to the resultant vector. [Image: graphical representation of the attention equation in deep learning.] Essentially, they start to interfere with the representation, ie you have small weights - not zero - for things the model is not "interested" in, & if the context size is large, even these small weights will eventually affect the output.

That's another downside of soft attention.
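
A small numerical sketch of that dilution effect, using made-up attention scores (one high-scoring "relevant" key against many low-scoring ones):

```python
import numpy as np

def tail_mass(n_context: int, relevant_score: float = 5.0, irrelevant_score: float = 0.0) -> float:
    """Total softmax weight that goes to the 'uninteresting' keys."""
    scores = np.full(n_context, irrelevant_score)
    scores[0] = relevant_score                 # the one key the model is "interested" in
    w = np.exp(scores - scores.max())
    w /= w.sum()                               # soft weights: small but never exactly zero
    return float(w[1:].sum())

for n in (16, 256, 4096):
    print(n, round(tail_mass(n), 3))
# As the context grows, the many small non-zero weights claim a growing
# share of the resultant vector, drowning out the relevant key.
```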
Nov 14, 2024 11 tweets 3 min read
The reason why deep neural nets (DNN) don't generalize out of distribution is that they generate hyperplanes to tile the representation space.

Hyperplanes in artificial neural networks are used to classify a point based on which side of a given hyperplane it falls on & how far from it. To confine a data point in N-dimensional space you need at least N hyperplanes; one hyperplane can only confine a point along its normal line.
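
A minimal sketch of that side-&-distance test against a single hyperplane w·x + b = 0 (the 2-D example is illustrative):

```python
import numpy as np

def side_and_distance(x: np.ndarray, w: np.ndarray, b: float) -> tuple[int, float]:
    """Which side of the hyperplane w.x + b = 0 the point x falls on, and how far."""
    signed = (np.dot(w, x) + b) / np.linalg.norm(w)   # signed distance along the normal
    return int(np.sign(signed)), abs(signed)

# Toy 2-D example: the line x + y - 1 = 0 acting as the "hyperplane".
w, b = np.array([1.0, 1.0]), -1.0
print(side_and_distance(np.array([2.0, 2.0]), w, b))   # positive side
print(side_and_distance(np.array([0.0, 0.0]), w, b))   # negative side
```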
Nov 14, 2024 9 tweets 2 min read
The attention mechanism as used in deep learning (DL) - like in transformer models - is based on vector superposition, ie weighted averaging of vectors within a context.

For large enough contexts, the resultant vectors begin to resemble each other due to the law of large numbers (LLN: https://en.wikipedia.org/wiki/Law_of_large_numbers?wprov=sfla1). That means models lose representation power, ie they can't easily tell the difference between large text inputs, because the superposition vectors all approach the mean of the distribution: when you average too many vectors, the difference between the sums becomes very small.
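
A quick numerical sketch of that collapse, under the simplifying assumption that token vectors are i.i.d. Gaussian draws:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512

def context_vector(n_tokens: int) -> np.ndarray:
    """Plain (unweighted) superposition: average n token vectors from one distribution."""
    return rng.normal(size=(n_tokens, dim)).mean(axis=0)

# Two different contexts of the same length: as the context grows, both averages
# collapse toward the distribution mean, so the gap between them shrinks.
for n in (4, 64, 4096):
    gap = np.linalg.norm(context_vector(n) - context_vector(n))
    print(n, round(float(gap), 3))
```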
May 7, 2024 14 tweets 3 min read
AI is currently dominated by machine learning (ML) approaches based on function approximation, which fit functions to lots of data the way you fit a curve to pass through data points on a graph.

The fitted curve/function is then used in place of the original data. In deep learning the function f() is a composition of many smaller functions g1(), g2(), ..., gL(), arranged in layers such that each function feeds from the ones before it & feeds into the ones after it.

Where L = number of layers.

The larger L is the deeper the model is.
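
A minimal sketch of that layered composition, with tiny tanh layers standing in for the g_i():

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(d_in: int, d_out: int):
    """One small function g_i: an affine map followed by a nonlinearity."""
    W, b = rng.normal(size=(d_out, d_in)) * 0.1, np.zeros(d_out)
    return lambda x: np.tanh(W @ x + b)

L = 4                                            # number of layers = depth
layers = [make_layer(8, 8) for _ in range(L)]    # g1, g2, ..., gL

def f(x: np.ndarray) -> np.ndarray:
    """f = gL(... g2(g1(x)) ...): each layer feeds from the ones before it."""
    for g in layers:
        x = g(x)
    return x

print(f(rng.normal(size=8)).shape)
```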
Apr 24, 2024 7 tweets 2 min read
The most misunderstood theorem in machine learning (ML) is the universal approximation theorem:

"What must be stressed, is that while some functions can be arbitrarily well approximated in a region ... the approximated functions do not extrapolate outside of the region." Universal approximation theorem Feed-forward neural network with a 1 hidden layer can approximate continuous functions  Artificial neural networks are combinations of multiple simple mathematical functions that implement more complicated functions from (typically) real-valued vectors to real-valued vectors. The spaces of multivariate functions that can be implemented by a network are determined by the structure of the network, the set of simple functions, and its multiplicative parameters. A great deal of theoretical work has gone into characterizing these function spaces. The universal aporoximation theorem is easy to prove:

In a given region, a < x < b, you can model any function:

f(x)

By a lookup table of k bins where each bin is of size:

s = (b - a) / k

The more bins, the smoother & closer the approximation.

Each bin k simply holds a value y_{k}.
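
A minimal sketch of that proof idea: approximate sin(x) on [a, b] with a lookup table of k bins & watch the error shrink as k grows (the choice of sin is purely illustrative):

```python
import numpy as np

def fit_lookup_table(f, a: float, b: float, k: int) -> np.ndarray:
    """Store one value y_k per bin: the target function sampled at each bin centre."""
    s = (b - a) / k                              # bin size s = (b - a) / k
    centres = a + s * (np.arange(k) + 0.5)
    return f(centres)

def approx(table: np.ndarray, a: float, b: float, x: np.ndarray) -> np.ndarray:
    """Piecewise-constant approximation: look up the bin that x falls into."""
    k = len(table)
    idx = np.clip(((x - a) / (b - a) * k).astype(int), 0, k - 1)
    return table[idx]

a, b = 0.0, np.pi
x = np.linspace(a, b, 1000)
for k in (4, 32, 256):                           # more bins -> finer approximation
    table = fit_lookup_table(np.sin, a, b, k)
    print(k, round(float(np.max(np.abs(approx(table, a, b, x) - np.sin(x)))), 4))
```

Outside [a, b] the table has nothing to look up, which is exactly the extrapolation problem the quote points at.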
Apr 21, 2024 8 tweets 2 min read
The ability of language models to write coherent text can be explained as each token developing some affinity for other tokens.

During next-token prediction each of the previous tokens "votes" for the tokens it has high affinity for, which results in a coherent continuation. This is an example of a bag-of-tokens approach where individual tokens in a set directly vote for the tokens they have high affinity for.

This simple model can mimic some properties observed in large language models (LLMs).
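
A toy sketch of that bag-of-tokens voting, with a made-up 5-token vocabulary & a random affinity matrix purely for illustration:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
V = len(vocab)

rng = np.random.default_rng(0)
affinity = rng.random((V, V))        # affinity[i, j]: how much token i "likes" token j

def vote_next(context_ids: list[int]) -> int:
    """Bag-of-tokens vote: each context token adds its affinity row, highest total wins."""
    votes = affinity[context_ids].sum(axis=0)
    return int(votes.argmax())

context = [vocab.index(t) for t in ["the", "cat", "sat"]]
print(vocab[vote_next(context)])     # the continuation the context tokens voted for
```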
Apr 6, 2024 13 tweets 3 min read
Information processing is too abstract & vague.

Algorithms can solve the same problems but with different space & time complexity.

One algorithm can require lots of space, like RAM, while another requires less.

You want to look for space-time-efficient algorithms. That is, given infinite data & compute, all algorithms collapse into merely looking up the corresponding response.

If you had enough data, you would simply look up the response. This is called memoization in dynamic programming: you don't need to recompute, you just look it up.
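
A minimal sketch of that idea using memoization in plain Python (the Fibonacci example is illustrative, not anything LLM-specific):

```python
from functools import lru_cache

@lru_cache(maxsize=None)           # memoization: cache results instead of recomputing
def fib(n: int) -> int:
    """Naive recursion is exponential in time; the cache trades space for time."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(80))                     # fast, because repeated subproblems become lookups
```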
Mar 27, 2024 6 tweets 2 min read
Language models are trained in two phases:

1 - Large scale pretraining on next token prediction.

2 - Task-specific finetuning like instruction following, ie InstructGPT etc.

How is it surprising?

Articles make it seem like LLMs are entirely self-supervised when they are not. Without task-specific finetuning from human feedback, the model in its raw form underperforms even though it was trained on extremely large-scale data.

It's the subsequent finetuning phases that align the model better towards the intended behavior.

Even that isn't perfect, nothing is.
Mar 2, 2024 10 tweets 2 min read
A deep neural net (DNN) is just an arrangement of simple nodes - the perceptrons from the 40s, with various nonlinearities - in layers one atop the other.

AI companies are hoping that if they can mega size these DNNs by using lots of compute & data they can solve intelligence.
The foundation for DNNs was laid out in the 80s.

Gradient descent (GD) was around even much earlier than that; stochastic gradient descent (SGD) is a variant of GD.
Feb 18, 2024 11 tweets 2 min read
Here is how stupid the claim that OpenAI's Sora is a data-driven physics engine is:

It's like gathering data about planetary motion, feeding it into a model that predicts where the planets will be, & concluding that the model has internally recovered the general theory of relativity. It took Einstein years to derive the equations for the theory of gravity; if someone thinks that stochastic gradient descent + backpropagation is a little Einstein figuring things out during model training, then their understanding of machine learning is questionable.
Feb 16, 2024 27 tweets 6 min read
OpenAI's goal is to build a world simulator - one that learns the dynamics of the 3D real world - from videos & language descriptions.

They claim the work indicates that training on ever larger datasets is a promising direction for learning such world models.

But they are wrong: the goal is that, given enough videos, the model can learn the physics of objects, occlusions, collisions, reflections, shadows etc.

It's difficult to recover hard rules from such an approach, ie you can't learn much about the behavior of physical objects from videos alone.
Feb 15, 2024 15 tweets 2 min read
You can see interpolation, disocclusion & compression artifacts.

That only means one thing - it's remixing content from the training dataset. The important question is: what did they train this thing on?

Did they ask for consent? Obviously not.
Jan 4, 2024 11 tweets 3 min read
Those timeline predictions for when artificial intelligence will surpass humans at all tasks are shrinking not because AI is advancing fast, but because most AI researchers/engineers making those predictions don't understand the challenges involved in surpassing human cognitive abilities. If you asked around, some folks working in computer vision think the machines' good performance on toy benchmarks like ImageNet implies that these systems beat human visual capabilities, when that's very far from the truth.

Reality is the best benchmark for intelligence.
Dec 31, 2023 9 tweets 2 min read
An explanation for why GPT-4 is degrading:

"... we find that on datasets released before the LLM
training data creation date, LLMs perform surprisingly better than on datasets released after"

New tasks are drifting away from what GPT-4 was trained on. [Screenshot: abstract of a paper on task contamination in zero-shot & few-shot LLM evaluation.] This is the fate of all machine learning (ML) models that don't have continuous learning capabilities: ML model weights get frozen after training, but the input distribution continuously drifts, & without the model continuously adapting to that change, it slowly degrades.