Chomba Bupe
Tech entrepreneur | machine intelligence https://t.co/zzD5ZNb0OW https://t.co/h0mJxdVxQq
Jul 12 5 tweets 1 min read
Yep exactly this: Learning next sequence prediction doesn't recover the nature of the process that generated that sequence. See:
May 31 26 tweets 5 min read
I looked deeply into the mathematical formulation of diffusion models (image & video generators) & they are inherently doing stochastic gradient-based pixel blending, nothing more.

The explanation that follows will be a bit technical, so buckle up. Let's look at the problem of image blending first, then work up to diffusion models.

Image blending in a nutshell tries to mix two images S1 & S2 to produce an image S3.

Like:

S3 = S1*M1 + S2*M2

Where M1 & M2 are blending masks.
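
A minimal NumPy sketch of that blend (the names S1, S2, M1, M2 follow the formula above; assuming the masks are per-pixel weights in [0, 1]):

```python
import numpy as np

def blend(S1: np.ndarray, S2: np.ndarray, M1: np.ndarray, M2: np.ndarray) -> np.ndarray:
    """Blend two images with per-pixel masks: S3 = S1*M1 + S2*M2."""
    return S1 * M1 + S2 * M2

# Toy example: two 4x4 grayscale images mixed 70/30 everywhere.
S1 = np.random.rand(4, 4)
S2 = np.random.rand(4, 4)
M1 = np.full((4, 4), 0.7)
M2 = 1.0 - M1          # complementary mask so the weights sum to 1 per pixel
S3 = blend(S1, S2, M1, M2)
print(S3.shape)
```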
Mar 30 15 tweets 3 min read
The evidence is weak that the model "thinks" ahead of time.

They probed the model with a simple rhyming couplet:

"He saw a carrot and had to grab it."

There was evidence the model had already considered "rabbit" even though the word appears much later in the output sequence. There is a simple explanation.

For a decoder-only transformer model:

Half of the self-attention + feed-forward network (FFN) blocks act like encoders while the other half act like decoders.
Feb 4 4 tweets 1 min read
Testing large language models (LLMs) on IQ tests is like testing a search engine on IQ questions it has already indexed answers to: it simply does a lookup & randomly guesses at any other questions not in the training set.

It says nothing about the model's intelligence. The baseline should be a pure lookup function over the actual LLM's training set; compare its performance to the LLM & see by how much the LLM beats a search over its own training set.

But of course no one else has access to the training data of these LLMs.
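
A sketch of the kind of lookup baseline being suggested here, assuming a hypothetical table of (question, answer) pairs drawn from the training set & multiple-choice questions; all names & data are illustrative:

```python
import random

def lookup_baseline(question: str, train_qa: dict[str, str], choices: list[str]) -> str:
    """Answer from memory if the question was seen in training, else guess randomly."""
    if question in train_qa:
        return train_qa[question]          # pure lookup over the training set
    return random.choice(choices)          # random guess for unseen questions

# Hypothetical usage: compare this baseline's accuracy against the LLM's.
train_qa = {"What is 2 + 2?": "4"}
print(lookup_baseline("What is 2 + 2?", train_qa, ["3", "4", "5"]))
print(lookup_baseline("Unseen question?", train_qa, ["A", "B", "C", "D"]))
```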
Jan 22 7 tweets 2 min read
Folks misunderstand the term "stochastic parrot":

It stems from the fact that during token sampling from a distribution P(next tokens | previous tokens), the output token is picked stochastically, with the odds of being picked scaled by the predicted probability under P. For example, for a forward pass through P(), usually a decoder-only transformer model, the output is R, an array of numbers the size of the token vocabulary.

R is normalized into a probability distribution using softmax, whose entries range between 0.0 & 1.0 & sum to 1.0.
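
A minimal sketch of that sampling step, with a made-up 4-entry vocabulary standing in for a real model's output R:

```python
import numpy as np

def sample_next_token(R: np.ndarray, rng: np.random.Generator) -> int:
    """Softmax the raw scores R (one per vocabulary entry) and sample a token id."""
    exp = np.exp(R - R.max())          # subtract max for numerical stability
    P = exp / exp.sum()                # entries in [0, 1], summing to 1.0
    return rng.choice(len(P), p=P)     # stochastic pick, odds scaled by probability

rng = np.random.default_rng(0)
R = np.array([2.0, 1.0, 0.5, -1.0])   # toy "logits" for a 4-token vocabulary
print(sample_next_token(R, rng))
```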
Jan 1 17 tweets 3 min read
Contrary to popular belief, machine learning (ML) is not that difficult to pick up:

You just need to polish up on some basics though.

- Linear algebra: Vectors & Matrices

- Numerical optimization: Gradient descent

- Differential calculus

- Probability & information theory

Here is how I learned machine learning (ML) using a hands-on approach:

First, I need to mention that I came from an engineering background, where I was taught engineering maths, but I didn't use much of that in learning ML, only the above-mentioned prerequisites.
Dec 20, 2024 7 tweets 2 min read
Reminder:

There is a huge difference between "can solve anything" & "can learn anything".

A computer can run any program that solves anything but can it learn anything?

No.

Turing completeness is useless on its own. The problem of learnability - the ability to learn - is the major bottleneck in machine intelligence.

We have powerful architectures, like the central processing units (CPUs) in modern digital machines, that can implement any practical computational function.

But CPUs don't learn.
Nov 15, 2024 16 tweets 3 min read
A computational graph of the attention mechanism is as shown below.

In deep learning the weighting coefficients are "soft"; they don't drop to zero. For large context sizes these numerous small non-zero weights can contribute a significant amount to the resultant vector. [Image: graphical representation of the attention equation in deep learning.] Essentially, they start to interfere with the representation, ie you have small weights - not zero - for things the model is not "interested" in, & if the context size is large, even these small weights will eventually affect the output.

That's another downside of soft attention.
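
A small numerical sketch of that dilution effect, using made-up attention scores (one high-scoring "relevant" key against many low-scoring ones):

```python
import numpy as np

def tail_mass(n_context: int, relevant_score: float = 5.0, irrelevant_score: float = 0.0) -> float:
    """Total softmax weight that goes to the 'uninteresting' keys."""
    scores = np.full(n_context, irrelevant_score)
    scores[0] = relevant_score                 # the one key the model is "interested" in
    w = np.exp(scores - scores.max())
    w /= w.sum()                               # soft weights: small but never exactly zero
    return float(w[1:].sum())

for n in (16, 256, 4096):
    print(n, round(tail_mass(n), 3))
# As the context grows, the many small non-zero weights claim a growing
# share of the resultant vector, drowning out the relevant key.
```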
Nov 14, 2024 11 tweets 3 min read
The reason why deep neural nets (DNN) don't generalize out of distribution is that they generate hyperplanes to tile the representation space.

Hyperplanes in artificial neural networks are used to classify a point based on which side of a given hyperplane it falls on & how far from it. To confine a data point in N-dimensional space you need at least N hyperplanes; one hyperplane can only confine a point along its normal line.
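
A minimal sketch of that side-&-distance test against a single hyperplane w·x + b = 0 (the 2-D example is illustrative):

```python
import numpy as np

def side_and_distance(x: np.ndarray, w: np.ndarray, b: float) -> tuple[int, float]:
    """Which side of the hyperplane w.x + b = 0 the point x falls on, and how far."""
    signed = (np.dot(w, x) + b) / np.linalg.norm(w)   # signed distance along the normal
    return int(np.sign(signed)), abs(signed)

# Toy 2-D example: the line x + y - 1 = 0 acting as the "hyperplane".
w, b = np.array([1.0, 1.0]), -1.0
print(side_and_distance(np.array([2.0, 2.0]), w, b))   # positive side
print(side_and_distance(np.array([0.0, 0.0]), w, b))   # negative side
```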
Nov 14, 2024 9 tweets 2 min read
The attention mechanism as used in deep learning (DL) - like in transformer models - is based on vector superposition, ie weighted averaging of vectors within a context.

For large enough contexts, the resultant vectors begin to resemble each other due to the law of large numbers (LLN: https://en.wikipedia.org/wiki/Law_of_large_numbers?wprov=sfla1). That means models lose representation power, ie they can't easily tell the difference between large text inputs, because the superposition vectors all approach the mean of the distribution: when you average too many vectors, the difference between the sums becomes very small.
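
A quick numerical sketch of that collapse, under the simplifying assumption that token vectors are i.i.d. Gaussian draws:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512

def context_vector(n_tokens: int) -> np.ndarray:
    """Plain (unweighted) superposition: average n token vectors from one distribution."""
    return rng.normal(size=(n_tokens, dim)).mean(axis=0)

# Two different contexts of the same length: as the context grows, both averages
# collapse toward the distribution mean, so the gap between them shrinks.
for n in (4, 64, 4096):
    gap = np.linalg.norm(context_vector(n) - context_vector(n))
    print(n, round(float(gap), 3))
```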
May 7, 2024 14 tweets 3 min read
AI is currently dominated by machine learning (ML) approaches based on function approximation, which fit functions to lots of data the way you fit a curve to pass through data points on a graph.

The fitted curve/function is then used in place of the original data. In deep learning the function f() is a composition of many smaller functions g1(), g2(), ..., gL(), arranged in layers such that each function feeds from the ones before it & feeds into the ones after it.

Where L = number of layers.

The larger L is the deeper the model is.
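
A minimal sketch of that layered composition, with tiny tanh layers standing in for the g_i():

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(d_in: int, d_out: int):
    """One small function g_i: an affine map followed by a nonlinearity."""
    W, b = rng.normal(size=(d_out, d_in)) * 0.1, np.zeros(d_out)
    return lambda x: np.tanh(W @ x + b)

L = 4                                            # number of layers = depth
layers = [make_layer(8, 8) for _ in range(L)]    # g1, g2, ..., gL

def f(x: np.ndarray) -> np.ndarray:
    """f = gL(... g2(g1(x)) ...): each layer feeds from the ones before it."""
    for g in layers:
        x = g(x)
    return x

print(f(rng.normal(size=8)).shape)
```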
Apr 24, 2024 7 tweets 2 min read
The most misunderstood theorem in machine learning (ML) is the universal approximation theorem:

"What must be stressed, is that while some functions can be arbitrarily well approximated in a region ... the approximated functions do not extrapolate outside of the region." Universal approximation theorem Feed-forward neural network with a 1 hidden layer can approximate continuous functions  Artificial neural networks are combinations of multiple simple mathematical functions that implement more complicated functions from (typically) real-valued vectors to real-valued vectors. The spaces of multivariate functions that can be implemented by a network are determined by the structure of the network, the set of simple functions, and its multiplicative parameters. A great deal of theoretical work has gone into characterizing these function spaces. The universal aporoximation theorem is easy to prove:

In a given region, a < x < b, you can model any function:

f(x)

By a lookup table of k bins where each bin is of size:

s = (b - a) / k

The more bins, the smoother & closer the approximation.

Each bin k simply holds a value y_{k}.
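
A minimal sketch of that proof idea: approximate sin(x) on [a, b] with a lookup table of k bins & watch the error shrink as k grows (the choice of sin is purely illustrative):

```python
import numpy as np

def fit_lookup_table(f, a: float, b: float, k: int) -> np.ndarray:
    """Store one value y_k per bin: the target function sampled at each bin centre."""
    s = (b - a) / k                              # bin size s = (b - a) / k
    centres = a + s * (np.arange(k) + 0.5)
    return f(centres)

def approx(table: np.ndarray, a: float, b: float, x: np.ndarray) -> np.ndarray:
    """Piecewise-constant approximation: look up the bin that x falls into."""
    k = len(table)
    idx = np.clip(((x - a) / (b - a) * k).astype(int), 0, k - 1)
    return table[idx]

a, b = 0.0, np.pi
x = np.linspace(a, b, 1000)
for k in (4, 32, 256):                           # more bins -> finer approximation
    table = fit_lookup_table(np.sin, a, b, k)
    print(k, round(float(np.max(np.abs(approx(table, a, b, x) - np.sin(x)))), 4))
```

Outside [a, b] the table has nothing to look up, which is exactly the extrapolation problem the quote points at.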
Apr 21, 2024 8 tweets 2 min read
The ability of language models to write coherent text can be explained as each token developing some affinity for other tokens.

During next-token prediction each of the previous tokens "votes" for the tokens it has high affinity for, which results in a coherent continuation. This is an example of a bag-of-tokens approach where individual tokens in a set directly vote for the tokens they have high affinity for.

This simple model can mimic some properties observed in large language models (LLMs).
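
A toy sketch of that bag-of-tokens voting, with a made-up 5-token vocabulary & a random affinity matrix purely for illustration:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
V = len(vocab)

rng = np.random.default_rng(0)
affinity = rng.random((V, V))        # affinity[i, j]: how much token i "likes" token j

def vote_next(context_ids: list[int]) -> int:
    """Bag-of-tokens vote: each context token adds its affinity row, highest total wins."""
    votes = affinity[context_ids].sum(axis=0)
    return int(votes.argmax())

context = [vocab.index(t) for t in ["the", "cat", "sat"]]
print(vocab[vote_next(context)])     # the continuation the context tokens voted for
```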
Apr 6, 2024 13 tweets 3 min read
Information processing is too abstract & vague.

Algorithms can solve the same problems but with different space & time complexity.

One algorithm can require lots of space, like RAM, while another requires less.

You want to look for space-time-efficient algorithms. That is, given infinite data & compute, all algorithms collapse into merely looking up the corresponding response.

If you had enough data, you would simply look up the response. This is called memoization in dynamic programming: you don't need to recompute, you just look it up.
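
A minimal sketch of that idea using memoization in plain Python (the Fibonacci example is illustrative, not anything LLM-specific):

```python
from functools import lru_cache

@lru_cache(maxsize=None)           # memoization: cache results instead of recomputing
def fib(n: int) -> int:
    """Naive recursion is exponential in time; the cache trades space for time."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(80))                     # fast, because repeated subproblems become lookups
```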
Mar 27, 2024 6 tweets 2 min read
Language models are trained in two phases:

1 - Large scale pretraining on next token prediction.

2 - Task-specific finetuning like instruction following, ie InstructGPT etc.

How is it surprising?

Articles make it seem like LLMs are entirely self-supervised when they are not. Without task-specific finetuning from human feedback, the model in its raw form underperforms even though it was trained on extremely large-scale data.

It's the subsequent finetuning phases that align the model better towards the intended behavior.

Even that isn't perfect, nothing is.
Mar 2, 2024 10 tweets 2 min read
A deep neural net (DNN) is just an arrangement of simple nodes - the perceptrons from the 40s, with various nonlinearities - in layers one atop the other.

AI companies are hoping that if they can mega size these DNNs by using lots of compute & data they can solve intelligence.
The foundation for DNNs was laid out in the 80s.

Gradient descent (GD) was around even much earlier than that; stochastic gradient descent (SGD) is a variant of GD.
Feb 18, 2024 11 tweets 2 min read
Here is how stupid the claim that OpenAI's Sora is a data-driven physics engine is:

It's like gathering data about planetary motion, feeding it into a model that predicts where the planets will be, & concluding that the model has internally recovered the general theory of relativity. It took Einstein years to derive the equations for the theory of gravity; if someone thinks that stochastic gradient descent + backpropagation is a little Einstein figuring things out during model training, then their understanding of machine learning is questionable.
Feb 16, 2024 27 tweets 6 min read
OpenAI's goal is to build a world simulator - one that learns the dynamics of the 3D real world - from videos & language descriptions.

They claim the work indicates that training on ever larger datasets is a promising direction for learning such world models.

But they are wrong: the goal is that, given enough videos, the model can learn the physics of objects, occlusions, collisions, reflections, shadows etc.

It's difficult to recover hard rules from such an approach, ie you can't learn much about the behavior of physical objects from videos alone.
Feb 15, 2024 15 tweets 2 min read
You can see interpolation, disocclusion & compression artifacts.

That only means one thing - it's remixing content from the training dataset. The important question is: what did they train this thing on?

Did they ask for consent? Obviously not.
Jan 4, 2024 11 tweets 3 min read
Those timeline predictions for when artificial intelligence will surpass humans at all tasks are shrinking not because AI is advancing fast, but because most AI researchers/engineers making those predictions don't understand the challenges involved in surpassing human cognitive abilities. If you asked around, some folks working in computer vision think the machines' good performance on toy benchmarks like ImageNet implies that these systems beat human visual capabilities, when that's very far from the truth.

Reality is the best benchmark for intelligence.
Dec 31, 2023 9 tweets 2 min read
An explanation for why GPT-4 is degrading:

"... we find that on datasets released before the LLM
training data creation date, LLMs perform surprisingly better than on datasets released after"

New tasks are drifting away from what GPT-4 was trained on. [Screenshot: abstract of a paper on task contamination in zero-shot & few-shot LLM evaluation.] This is the fate of all machine learning (ML) models that don't have continuous learning capabilities: ML model weights get frozen after training, but the input distribution continuously drifts, & without the model continuously adapting to that change, it slowly degrades.