Most of the critical FSD bits are missing from the normal firmware. These outputs aren't normally running, but with some tricks we can enable them.
This seems to be the general solution to handling unpredictable scenarios such as the Seattle monorail pillars or overhanging shrubbery.
The nets predict the location of static objects in the space around the vehicle via a dense grid of probabilities.
The output is a 384x255x12 dense grid of probabilities. Each cube seems to be ~0.33 meters on a side, and the grid currently covers predictions out to ~100 meters in front of the vehicle.
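For intuition, here is a minimal sketch of how such a grid could map to real-world coordinates. The shape and ~0.33 m cube size come from the observed outputs; the axis ordering and grid origin are my assumptions:

```python
import numpy as np

VOXEL_SIZE_M = 0.33            # observed cube size
GRID_SHAPE = (384, 255, 12)    # observed output shape

# Stand-in for one frame of occupancy probabilities.
occupancy = np.zeros(GRID_SHAPE, dtype=np.float32)

def voxel_to_vehicle_frame(ix: int, iy: int, iz: int):
    """Map a voxel index to approximate meters in the vehicle frame.
    Axis order (x forward, y lateral, z up) and the centered lateral
    origin are assumptions, not confirmed from the firmware."""
    x = ix * VOXEL_SIZE_M                          # forward extent, assuming grid starts at the car
    y = (iy - GRID_SHAPE[1] // 2) * VOXEL_SIZE_M   # lateral, car at center
    z = iz * VOXEL_SIZE_M                          # height
    return x, y, z
```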
The voxel output is similar to the previous single camera depth model, but given the birdseye view treatment.
Before this, Tesla would have to manually label obstacles like these in the training set to ensure the car doesn't run into them.
Here's a full intersection; the outputs seem quite reasonable in all directions. You can see the 4 buildings on each side, the curbs ahead, as well as the trees by the side of the road.
The uploaded voxel frames are every half second for practicality reasons (in the car it's much higher FPS)
I suspect they're taking the same offline 3D models they use to label the birdseye view training data (as seen during AI day) and converting them to voxel data to train a net.
It's a very clever solution, kudos to the engineers who worked on this.
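If that's right, the label-generation step could be as simple as quantizing reconstructed points into the grid. A rough sketch of that conversion (grid layout assumptions as above; this is my guess, not Tesla's actual tooling):

```python
import numpy as np

def pointcloud_to_voxel_labels(points: np.ndarray,
                               grid_shape=(384, 255, 12),
                               voxel_size=0.33) -> np.ndarray:
    """Quantize an (N, 3) vehicle-frame point cloud (meters) into a
    binary occupancy grid usable as a training target."""
    occupancy = np.zeros(grid_shape, dtype=np.float32)
    idx = np.floor(points / voxel_size).astype(int)
    idx[:, 1] += grid_shape[1] // 2   # assume car centered on the lateral axis
    # Keep only points that land inside the grid bounds.
    in_bounds = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx = idx[in_bounds]
    occupancy[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return occupancy
```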
I'm very curious what the model architecture looks like and how much it differs from the other birdseye view nets.
The 3D convolutional NNs used here are similar to what could potentially be used to merge radar with vision if Tesla can get access to the raw Conti radar data.
The 3D birdseye view is a fair bit lower resolution than LIDAR but very impressive and achieves much of the same purpose.
These models are outputting probabilities so you can see where the model is confident vs not.
I don't quite know what the scale is here, but a 75% threshold seems to work pretty well. For all these renders I only show voxels above the target threshold.
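The filtering itself is trivial; for these renders it amounts to something like:

```python
import numpy as np

def visible_voxels(occupancy: np.ndarray, threshold: float = 0.75):
    """Return indices and probabilities of voxels above the render threshold."""
    ix, iy, iz = np.where(occupancy > threshold)
    return ix, iy, iz, occupancy[ix, iy, iz]
```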
Rendering voxel data in a browser is pretty tricky so if anyone wants to help with more advanced visualizations let me know :)
I see why Elon said they were having issues visualizing it
@aelluswamy's talk at CVPR has a lot of very impressive improvements to Tesla's 3D voxel models. There are some subtle but very important things in the slides that I'm excited to incorporate into my own models. ⬇️
1) Image positional encoding: This adds in an x/y position encoding to each of the image space features. This should make it easier for the transformer to go from image space to 3D
It seems like a hybrid between a traditional CNN and ViT
ViT uses patches of the images encoded with a position before feeding them through a transformer. Using a position encoding with a traditional CNN seems like a nice efficiency trade-off and likely makes the per-camera encoder simpler.
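One simple way to realize the x/y encoding idea is to append normalized coordinate channels to the CNN feature maps before the transformer (essentially the CoordConv trick). This is my own sketch; the talk doesn't show whether Tesla's encoding is fixed or learned, added or concatenated:

```python
import torch

def add_xy_position_encoding(features: torch.Tensor) -> torch.Tensor:
    """Append normalized x/y coordinate channels to (B, C, H, W) features,
    so the transformer knows where each image-space feature came from."""
    b, _, h, w = features.shape
    ys = torch.linspace(-1.0, 1.0, h, device=features.device)
    xs = torch.linspace(-1.0, 1.0, w, device=features.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")    # (H, W) each
    pos = torch.stack([grid_x, grid_y]).expand(b, -1, -1, -1)  # (B, 2, H, W)
    return torch.cat([features, pos], dim=1)                   # (B, C+2, H, W)
```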
Curious what I've been up to in the past 6 months? 😅
I've been working on a novel approach to depth and occupancy understanding for my FSD models!
It's much simpler than existing techniques and directly learns the 3D representation ⬇️
I posted the full write-up about a month ago, and I've had a number of PhD students, companies, and labs ask to collaborate on papers/projects, so I think it's state of the art 🙂
In my last post I was doing a multi-stage pipeline to train the models:
1) train an image space depth model from the main camera 2) generate a point cloud from an entire video 3) convert it to cubes 4) train a voxel model using multiple cameras (step 2 is sketched below)
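Step 2 is the standard pinhole unprojection. A minimal sketch (the intrinsics here are placeholders, not the actual camera calibration):

```python
import numpy as np

def depth_to_pointcloud(depth: np.ndarray, fx: float, fy: float,
                        cx: float, cy: float) -> np.ndarray:
    """Unproject an (H, W) depth map into an (N, 3) camera-frame point
    cloud using pinhole camera intrinsics."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```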
When looking at this data there are two main things to consider: the static world around the vehicle and the dynamic objects in the scene such as cars or people.
For static objects, information from the forward-facing cameras can compensate for the lack of info on the repeaters.
Here's a static scene in low light. With the blinker off the curb is too dark to see. The blinker actually helps since it provides light
The nearby signs and the further away barriers are mostly washed out but since they're static they can be remembered
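I don't know how Tesla implements the "remembering", but even a simple running blend of successive grids captures the idea, assuming each frame is first re-aligned for ego motion:

```python
import numpy as np

def accumulate_static(prev_grid: np.ndarray, new_grid: np.ndarray,
                      alpha: float = 0.9) -> np.ndarray:
    """Exponential moving average of occupancy: static objects seen in
    earlier frames keep contributing even when washed out in the current one.
    Assumes prev_grid has already been shifted to compensate for ego motion."""
    return alpha * prev_grid + (1.0 - alpha) * new_grid
```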
Curious what Tesla means by uprevving their static obstacle neural nets?
Let's see how the Tesla FSD Beta 10.5 3D Voxel nets compare to the nets from two months ago.
The new captures are from the same area as the old ones so we can directly compare the outputs
1/N
This first example is a small pedestrian crosswalk sign in the middle of the road. It's about 1 foot (~0.3 m) wide, so it should show up as roughly a single voxel in the nets.
Under the old nets it shows up as a large blob with an incorrect depth. Under the new nets it's much better.
Under the old nets the posts show up as huge blobs and disappear when the car gets close to them. The probabilities seem fairly consistent no matter how far away the sign is, even though up close they should be more confident.
The recent firmware updates give some insight into how Tesla is going to replace radar, plus some nifty ML model techniques.
⬇️ Thread
From the binaries we can see that they've added velocity and acceleration outputs. These predictions, in addition to the existing xyz outputs, give much of the same information that radar traditionally provides (distance + velocity + acceleration).
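As a sketch of why those outputs cover radar's job, here is a toy per-object record with the new fields. The names and layout are mine, not the actual binary format:

```python
from dataclasses import dataclass

@dataclass
class ObjectTrack:
    # Existing position outputs (meters, vehicle frame).
    x: float
    y: float
    z: float
    # Newly added outputs seen in the binaries.
    vx: float; vy: float; vz: float   # velocity, m/s
    ax: float; ay: float; az: float   # acceleration, m/s^2

def radar_like_measurement(t: ObjectTrack):
    """Recover the distance/velocity/acceleration triple radar provides."""
    distance = (t.x**2 + t.y**2 + t.z**2) ** 0.5
    return distance, (t.vx, t.vy, t.vz), (t.ax, t.ay, t.az)
```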
For autosteer on city streets, you need to know the velocity and acceleration of cars in all directions, but radar only points forward. If vision is accurate enough to make a left turn, radar is probably unnecessary for the most part.