Tesla's self-driving AI solves an insane problem: compress 2 BILLION input tokens (7-8 cameras × 5MP × 30 seconds) down to just 2 output tokens—steering and acceleration. Here's how they actually made it work 🧵
The system uses a single large end-to-end neural network. Pixels and sensor data go in → steering and acceleration come out. No explicit perception modules. Raw video streams directly to actions. This approach has been powering Tesla's FSD for years now.
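To make the shape of that concrete, here's a toy PyTorch sketch of the pixels-in, controls-out idea. This is not Tesla's architecture or scale, just the structural pattern the thread describes:

```python
# Minimal sketch of the end-to-end idea (not Tesla's code):
# raw camera frames in, two control values out, one differentiable module.
import torch
import torch.nn as nn

class EndToEndPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        # Toy vision backbone standing in for a much larger video encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Head maps the compressed scene representation to 2 control outputs.
        self.head = nn.Linear(64, 2)  # [steering, acceleration]

    def forward(self, frames):          # frames: (batch, 3, H, W)
        return self.head(self.encoder(frames))

policy = EndToEndPolicy()
controls = policy(torch.randn(1, 3, 256, 256))  # -> tensor of shape (1, 2)
```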
Why end-to-end? Because codifying human values in code is nearly impossible. When should you brake? How hard? It depends on speed, comfort, situation—no "one objective value." You can't write these alignment preferences in explicit rules.
Example: puddle on a bidirectional road. Do you drive through it or enter the oncoming lane? How do you code "pain of puddle" vs "risk of oncoming traffic"? You can't. But looking at the scene end-to-end, the right choice becomes obvious.
Watch this: chickens crossing the road. The car waits patiently for the last one to cross. No collision risk, but it understands their INTENT to cross—it's just the right thing to do. This is intelligence beyond obstacle detection.
Same system, different scenario: geese standing still on the road (not crossing). The car changes its mind, backs up, and goes around them. The engineer's reaction: "This is crazy." Different context → different behavior. Hard to write in explicit code.
With modular systems, the perception→planning interface is "ill-defined" and "very lossy"—critical info gets lost in translation. End-to-end neural networks flow gradients from pixels directly to actions. Nothing gets lost.
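Here's what "gradients flow from pixels to actions" means mechanically, in a toy example with made-up shapes: a single loss on the final controls updates even the earliest perception weights, so no hand-designed interface sits in the way.

```python
# Sketch: with one loss on the final action, gradients reach every layer,
# including the earliest "perception" weights. Toy model, invented shapes.
import torch
import torch.nn as nn

perception = nn.Conv2d(3, 8, 3)           # stands in for the vision stack
control = nn.Linear(8, 2)                  # stands in for the planner/head
frames = torch.randn(4, 3, 64, 64)
target = torch.randn(4, 2)                 # logged human steering/accel

features = perception(frames).mean(dim=(2, 3))        # (4, 8) pooled features
loss = nn.functional.mse_loss(control(features), target)
loss.backward()

# The perception weights receive gradient signal from the action loss alone:
print(perception.weight.grad.abs().sum() > 0)          # True
```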
Tesla's core belief: "This is the right path to solving robotics as opposed to modular brittle systems." They're making a fundamental bet that end-to-end neural networks are how you build intelligent machines at scale.
But this approach has massive challenges. Challenge #1: the curse of dimensionality.
The input is roughly 2 billion tokens of raw multi-camera video; the model must compress all of that down to just 2 output tokens: steering and acceleration.
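Back-of-envelope on that compression, taking the thread's round numbers at face value (the exact tokenization isn't public):

```python
# ~2 billion input tokens (7-8 cameras x 5 MP x 30 s of video)
# distilled into 2 control values.
input_tokens = 2_000_000_000   # figure quoted in the thread
output_tokens = 2              # steering + acceleration

print(f"~{input_tokens // output_tokens:,}-to-1 compression")  # ~1,000,000,000-to-1
```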
Tesla's advantage: access to the "Niagara Falls of data" from their massive fleet. You don't want spurious correlations—you want the RIGHT correlations explaining why those 2 output tokens are correct. Data is their competitive moat.
The fleet collects 500 YEARS of driving data every day—more than they can even store. But they don't use it all. They refine it down to the essential scenarios that cover the full spectrum of driving. Quality over quantity.
How? Triggers catch rare scenarios when drivers encounter them naturally—weird intersections, animals, construction zones. "You can't stage this easily" because it requires real-world state space. The fleet gives them unique access to corner cases.
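A hypothetical sketch of what a fleet trigger could look like: a cheap on-vehicle check that flags a clip for upload when something rare happens. Every field and name here is invented for illustration, not Tesla's telemetry schema.

```python
# Hypothetical "trigger": flag a clip for upload if it contains a rare event.
from dataclasses import dataclass

@dataclass
class FrameSummary:
    hard_brake: bool          # deceleration above a threshold
    rare_object: bool         # e.g. an animal or debris class fired
    driver_takeover: bool     # human overrode the system
    map_mismatch: bool        # scene disagrees with expectations

def should_upload(clip: list[FrameSummary]) -> bool:
    """Flag clips containing any rare event worth adding to the dataset."""
    return any(
        f.hard_brake or f.rare_object or f.driver_takeover or f.map_mismatch
        for f in clip
    )

# A clip with a single rare-object frame gets flagged:
clip = [FrameSummary(False, False, False, False),
        FrameSummary(False, True, False, False)]
print(should_upload(clip))  # True
```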
Result: the car ahead spins out. Tesla's system predicts it will hit the barrier and bounce back—a SECOND-ORDER physics effect. It starts braking at 4 m/s² during the initial spin, before the collision. "It did not wait." Proactively safe.
Challenge #2: how do you debug an end-to-end system?
The same model can be prompted to predict 3D occupancy, objects, traffic lights, road boundaries, even plain language explanations. Just because it's end-to-end doesn't mean it's a black box.
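A rough sketch of the "one model, many promptable outputs" idea: the same shared features get decoded into different outputs depending on the query. Toy code with invented sizes, not Tesla's implementation.

```python
# One trunk, many promptable heads: control is always produced,
# the debugging outputs only on demand.
import torch
import torch.nn as nn

class PromptableDecoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.trunk = nn.Linear(128, dim)          # stands in for the big video model
        self.heads = nn.ModuleDict({
            "occupancy": nn.Linear(dim, 16),      # toy 3D-occupancy logits
            "objects": nn.Linear(dim, 8),         # toy box/attribute outputs
            "language": nn.Linear(dim, 32),       # toy token logits for explanations
            "control": nn.Linear(dim, 2),         # steering + acceleration
        })

    def forward(self, features, query: str):
        return self.heads[query](torch.relu(self.trunk(features)))

model = PromptableDecoder()
feats = torch.randn(1, 128)
control = model(feats, "control")      # always needed for driving
occupancy = model(feats, "occupancy")  # queried on demand for debugging
```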
Traditional Gaussian splatting breaks down on novel views and takes tens of minutes. Tesla's variant is "ridiculously fast," works better with limited camera views, and maintains structure. No COLMAP needed. It's used to debug "is it safely avoiding obstacles?"
The same model can produce reasoning tokens when needed—explaining decisions in plain language. It doesn't reason for every action (latency cost), but "wherever it's needed, it could reason longer" to produce the right answer.
Challenge #3: evaluation—"the most difficult of these three problems."
Open-loop performance can look amazing but not translate to real driving. Random fleet samples = boring highway driving. You need balanced eval sets. "Extremely important but very tedious work."
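A small sketch of what building a balanced eval set can look like under the hood: sample per scenario bucket instead of uniformly over the fleet. Bucket names and counts are made up.

```python
# Stratified sampling so rare scenarios aren't drowned out by highway miles.
import random

def balanced_eval_set(clips, per_bucket=100, seed=0):
    """clips: list of (scenario_tag, clip_id). Returns an evenly mixed subset."""
    rng = random.Random(seed)
    buckets = {}
    for tag, clip_id in clips:
        buckets.setdefault(tag, []).append(clip_id)
    chosen = []
    for tag, ids in buckets.items():
        rng.shuffle(ids)
        chosen.extend((tag, i) for i in ids[:per_bucket])
    return chosen

clips = [("highway_cruise", i) for i in range(10_000)] + \
        [("unprotected_left", i) for i in range(40)] + \
        [("animal_on_road", i) for i in range(12)]
subset = balanced_eval_set(clips, per_bucket=50)
```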
The solution: a world model simulator. It generates EIGHT simultaneous 5-megapixel video streams (front, sides, rear) for over a minute—all action-conditioned, all consistent. Vehicle rims, traffic lights, everything coherent. From one neural network.
How do you train it? "You don't need optimal driving... any kind of trash driving is good enough." State-action pairs collected for free from the fleet. It needs to simulate edge cases, not perfect driving. Brilliant inversion of the problem.
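A toy sketch of why sub-par driving still trains a world model: the objective is "predict the next frame given whatever action was taken," so any logged (state, action, next state) triple is a training example. Shapes and sizes here are invented.

```python
# Next-state prediction conditioned on the action: the data doesn't need
# to show good driving, only what happened after each action.
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    def __init__(self, frame_dim=256, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + action_dim, 512), nn.ReLU(),
            nn.Linear(512, frame_dim),
        )

    def forward(self, frame_latent, action):
        return self.net(torch.cat([frame_latent, action], dim=-1))

world = TinyWorldModel()
opt = torch.optim.Adam(world.parameters(), lr=1e-3)

# (state, action, next_state) triples come "for free" from fleet logs,
# regardless of whether the human drove well.
state, action, next_state = torch.randn(8, 256), torch.randn(8, 2), torch.randn(8, 256)
loss = nn.functional.mse_loss(world(state, action), next_state)
opt.zero_grad(); loss.backward(); opt.step()
```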
With lower test-time compute, it runs in REAL-TIME. You can steer, brake, accelerate through a fully generated 8-camera world at 5MP. It responds like reality. Videos run 6+ minutes with consistent generation. "Quite powerful" for evaluation + RL training.
Take a year-old failure, replay it with your latest network, see if it's fixed. Example: was too close to a pedestrian → new model offsets way earlier. You don't need new real-world miles to verify improvements on known issues.
Or inject adversarial events: condition one vehicle to cut across your path while keeping the rest of the scene consistent. Systematically test corner cases without real-world danger. Synthetic safety testing at scale.
The world model enables closed-loop reinforcement learning: "let the car drive and verify that it doesn't collide with anything for a very long time." Train and improve the policy in simulation, deploy to reality. No real-world risk during training.
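A minimal sketch of that closed loop: the policy drives, the learned simulator generates the consequences, and the score is simply how long it survives without contact. Every piece here is a placeholder, and the actual policy-update step is omitted.

```python
# Closed-loop rollout inside a learned simulator with a "don't collide" reward.
import torch

def closed_loop_return(policy, world_model, collided, start_state, horizon=600):
    """Roll the policy out in the learned simulator; reward collision-free steps."""
    state, total_reward = start_state, 0.0
    with torch.no_grad():                       # evaluation-style rollout
        for _ in range(horizon):
            action = policy(state)              # steering + acceleration
            state = world_model(state, action)  # the world model reacts to the action
            if collided(state):                 # any contact ends the episode
                return total_reward - 100.0
            total_reward += 1.0
    return total_reward

# Toy stand-ins so the loop runs end to end:
policy = torch.nn.Linear(256, 2)
world_model = lambda s, a: s + 0.05 * a.mean() + 0.01 * torch.randn_like(s)
collided = lambda s: bool(s.abs().max() > 10)
print(closed_loop_return(policy, world_model, collided, torch.randn(256)))
```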
Key insight from Q&A: "The main premise of end-to-end is that gradients must flow end-to-end." You can have auxiliary outputs, different architectures, various output spaces—everything else is empirical. Gradient flow is the ONE non-negotiable rule.
Tesla does use sensor-specific tokenization for efficiency. End-to-end doesn't mean no modularity—it means gradients flow through everything. "Some level of modularity still" exists. The architecture is flexible as long as learning is end-to-end.
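A sketch of "modular architecture, end-to-end learning": separate tokenizers per sensor, one shared trunk, one loss. Module names and sizes are invented for illustration.

```python
# Each sensor gets its own tokenizer, but everything feeds one trunk and
# trains under one loss, so gradients still flow through every module.
import torch
import torch.nn as nn

class MultiSensorPolicy(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.camera_tok = nn.Linear(300, d)     # stands in for a video tokenizer
        self.kinematics_tok = nn.Linear(6, d)   # speed, accel, steering angle, ...
        self.nav_tok = nn.Linear(10, d)         # route / navigation features
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.control = nn.Linear(d, 2)

    def forward(self, cam, kin, nav):
        tokens = torch.stack(
            [self.camera_tok(cam), self.kinematics_tok(kin), self.nav_tok(nav)], dim=1
        )
        return self.control(self.trunk(tokens).mean(dim=1))

model = MultiSensorPolicy()
out = model(torch.randn(1, 300), torch.randn(1, 6), torch.randn(1, 10))
# One loss on `out` sends gradients into every tokenizer: modular, but end-to-end.
```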
Another key distinction: perception can use open-loop eval (your prediction doesn't affect the scenario). But action needs closed-loop eval (your prediction affects the environment). Use the right evaluation tool for what you're measuring.
Where this is all heading: Tesla launched their robotaxi service in Austin and the Bay Area (June-July 2025). In Austin, cars operating below 40 mph run with no one in the driver's seat. Not a demo—a service you can hail. Cameras, neural networks, real-world deployment.
Next: Cybercab. Purpose-built robotaxi, 2 seats. It will have "the lowest cost of transportation across even public transportation." All powered by these same neural networks. The approach scales across vehicle platforms, locations, and weather.
And it's not just cars. "The same technology we developed for self-driving transfers most seamlessly to other forms of robots too." The world model works for Optimus humanoid robots. Same neural network, just add Optimus data—it generalizes across form factors.
Tesla is all in on robotics. "The entire company is focused on producing intelligent useful large scale robots for helping everyone in the world."
This isn't just about self-driving cars. It's about solving robotics, end-to-end. 🤖
I created this thread because most people will read something short but won't watch a 28-minute video. Here's the source info:
Video: Tesla ICCV 2025 Foundational Model for FSD - Ashok Elluswamy
Date: October 22, 2025