Understanding depth and the 3D space around the car is critical for driving, and since there are self-supervised techniques I can skip the data labeling
To start off I need a way to get depth from the camera footage
I started by training a model with monodepth2 as a base. Monodepth2 isn't the most cutting-edge monocular depth estimation model, but it's easy to train and fairly small while still producing reasonable results
It uses pairs of consecutive frames to learn depth
Structure from motion learns the depth by predicting (b) the motion of the camera and then (a) projecting the depth from two consecutive video frames and ensuring that they match
This works quite well for static objects and just requires the main camera feed from the vehicle
Since the training process assumes that everything is static, you get issues when dealing with dynamic objects like cars. For learning the static terrain, though, it's not a problem since we can use multiple frames to filter out the vehicles
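A minimal NumPy sketch of the reprojection step described above, not monodepth2's actual (PyTorch) code. The names `K` (camera intrinsics) and `T_target_to_source` (the predicted 4x4 camera motion) are assumptions for illustration. Training samples the neighbouring frame at the returned coordinates and penalizes the photometric difference, which is only valid where the scene is static.

```python
import numpy as np

def reproject_pixels(depth, K, T_target_to_source):
    """Map each target-frame pixel to its location in the source frame.

    depth: (h, w) predicted depth for the target frame
    K: (3, 3) camera intrinsics (hypothetical values in the example)
    T_target_to_source: (4, 4) predicted camera motion between the frames
    """
    h, w = depth.shape
    # Homogeneous pixel grid, shape (3, h*w)
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
    # Back-project pixels to 3D points using the predicted depth
    points = (np.linalg.inv(K) @ pixels) * depth.ravel()
    # Move the points into the source camera frame using the predicted pose
    points_h = np.vstack([points, np.ones(h * w)])
    points_src = (T_target_to_source @ points_h)[:3]
    # Project back onto the source image plane
    proj = K @ points_src
    uv = proj[:2] / proj[2]
    return uv.T.reshape(h, w, 2)

K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.5], [0.0, 0.0, 1.0]])
depth = np.full((3, 4), 5.0)
coords = reproject_pixels(depth, K, np.eye(4))  # no motion: pixels map to themselves
```

With an identity pose every pixel lands back on itself. A moving car breaks the assumption because its pixels shift for reasons the camera motion alone can't explain.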
Tesla's monocular depth I've shown before most likely uses stereoscopic training which avoids the issue since it probably uses the main and fisheye cameras at exactly the same time so everything is "static"
With the depth model, I was able to project out each frame of the vehicle using the vehicle speed
This gives me a full 3D reconstruction of the video clips!
There's a little bit of filtering to discard inaccurate points far from the car but not much
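A rough sketch of how per-frame depth maps could be stitched into one point cloud using only vehicle speed. It assumes the car drives straight along the camera's forward (z) axis and ignores steering, which is a simplification and not necessarily what was done here. `max_range_m` is a hypothetical cutoff standing in for the far-point filtering, and `K` is an assumed intrinsics matrix.

```python
import numpy as np

def backproject(depth, K):
    """Turn an (h, w) depth map into (h*w, 3) camera-space points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
    return ((np.linalg.inv(K) @ pixels) * depth.ravel()).T

def accumulate_clip(depths, speeds_mps, K, fps=36.0, max_range_m=40.0):
    """Merge per-frame points into one cloud, offsetting by distance travelled."""
    clouds = []
    travelled_m = 0.0
    for depth, speed in zip(depths, speeds_mps):
        pts = backproject(depth, K)
        # Far-away depth estimates are the least reliable, so drop them
        pts = pts[pts[:, 2] < max_range_m]
        # Straight-line assumption: shift points forward by distance driven so far
        clouds.append(pts + np.array([0.0, 0.0, travelled_m]))
        travelled_m += speed / fps
    return np.concatenate(clouds)

K = np.array([[100.0, 0.0, 1.0], [0.0, 100.0, 1.0], [0.0, 0.0, 1.0]])
frames = [np.full((2, 2), 10.0), np.full((2, 2), 10.0)]
cloud = accumulate_clip(frames, speeds_mps=[36.0, 36.0], K=K)
```

At 36 m/s and 36 frames per second the car moves 1 m per frame, so the second frame's points sit 1 m further along z than the first's.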
The projection is actually quite good with just the main camera. If I were to project all the cameras there'd be more detail to the sides of the vehicle
@threejs is a champ and renders the 24M points on my laptop with no issue! @mrdoob
If you point the camera from above you can easily see the entire road surface to label bird's-eye view maps such as Tesla uses in their vehicles
Much easier to label a bird's-eye reconstruction like this than it is to label lines for each frame at 36 frames per second
I didn't feel like labeling so I took this pixel data and bucketed it into a voxel representation around the vehicle
This was one of the more painful steps: I had to write this transformation from scratch and it needs to handle millions of points per clip
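One way the bucketing could look, as a vectorized NumPy sketch and not the code actually used here. `origin`, `voxel_size` and `grid_shape` are hypothetical parameters. Avoiding a Python loop over points is what keeps this workable for millions of points per clip.

```python
import numpy as np

def voxelize(points, origin, voxel_size, grid_shape):
    """Mark every voxel that contains at least one point as occupied.

    points: (n, 3) array of xyz positions relative to the vehicle
    origin: xyz of the grid's minimum corner
    voxel_size: edge length of one cubic voxel, same units as points
    grid_shape: number of voxels along each axis
    """
    # Integer voxel index of each point
    idx = np.floor((points - np.asarray(origin)) / voxel_size).astype(np.int64)
    # Discard points that fall outside the grid
    in_bounds = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx = idx[in_bounds]
    grid = np.zeros(grid_shape, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

points = np.array([
    [0.1, 0.1, 0.1],   # voxel (0, 0, 0)
    [0.9, 0.1, 0.1],   # voxel (1, 0, 0)
    [5.0, 5.0, 5.0],   # outside the grid
    [-1.0, 0.0, 0.0],  # outside the grid
])
grid = voxelize(points, origin=(0.0, 0.0, 0.0), voxel_size=0.5, grid_shape=(4, 4, 4))
```

A count per voxel (for example via `np.add.at`) instead of a boolean would allow thresholding out voxels hit by only a few noisy points.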
I trained a model using this data to predict the 3D voxel grid around the vehicle from the main, left/right pillar and left/right bumper cameras
The training data is fairly rough but the model seems to capture the coarse detail. Though, there's likely overfitting since I only have ~15.2k frames/voxel training examples which is only about 7 minutes of footage
Here's the architecture I ended up using. It's loosely modeled off of the architecture presented at Tesla AI Day.
Key bits:
* encodes using depth encoder used to generate the point clouds
* BiFPNs to encode the features
* a transformer for the largest two feature sizes
I'm sure there's a cleaner architecture (I'm far from a CV/transformer expert) but it seems to work fairly well and gets a 97.5% train accuracy
Overall, I'm pretty happy for a two-week side project 🙂
Thanks to everyone who helped! @greentheonly, Sherman and Sid
• • •
Curious what Tesla means by upreving their static obstacle neural nets?
Let's see how the Tesla FSD Beta 10.5 3D Voxel nets compare to the nets from two months ago.
The new captures are from the same area as the old ones so we can directly compare the outputs
1/N
This first example is a small pedestrian crosswalk sign in the middle of the road. It's about 1 foot wide so it should show up as 1 pixel in the nets.
Under the old nets it shows up as a large blob with an incorrect depth. Under the new nets it's much better.
Under the old nets the posts show up as huge blobs and disappear when the car gets close to them. The probabilities seem fairly consistent no matter how far away the sign is, even though up close the nets should be more confident.
Most of the critical FSD bits are missing in the normal firmware. These outputs aren't normally running but with some tricks we can enable it.
This seems to be the general solution to handling unpredictable scenarios such as the Seattle monorail pillars or overhanging shrubbery.
The nets predict the location of static objects in the space around them via a dense grid of probabilities.
The output is a 384x255x12 dense grid of probabilities. Each cube seems to be ~0.33 meters and currently outputs predictions ~100 meters in front of the vehicle.
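Taking those numbers at face value, a quick calculation gives the physical size of the grid. Which axis points forward, sideways or up isn't stated, so the axis order here is an assumption, and `grid_extent_m` is just an illustrative helper.

```python
def grid_extent_m(shape, voxel_size_m):
    """Physical extent in meters of each grid axis for cubic voxels."""
    return tuple(round(n * voxel_size_m, 2) for n in shape)

extent = grid_extent_m((384, 255, 12), 0.33)
print(extent)
```

That works out to roughly 127 x 84 x 4 meters, so the long axis is in the same ballpark as the ~100 meter forward range.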
We recently got some insight into how Tesla is going to replace radar in the recent firmware updates + some nifty ML model techniques
⬇️ Thread
From the binaries we can see that they've added velocity and acceleration outputs. These predictions in addition to the existing xyz outputs give much of the same information that radar traditionally provides
(distance + velocity + acceleration).
For autosteer on city streets, you need to know the velocity and acceleration of cars in all directions but radar is only pointing forward. If it's accurate enough to make a left turn, radar is probably unnecessary for the most part.
Got a sample of the Tesla Insurance telemetry data. The insurance records are on a per drive basis. Here are the fields:
* Unique Drive ID
* Record Version
* Car Firmware Version
* Driver Profile Name
* Start / End Time
* Drive Duration
* Start / End Odometer
(1/2)
* # of Autopilot Strikeouts
* # of Forward Collision Warnings
* # of Lane Departure Warnings
* # of ABS activations (All & User)
* Time spent within 1s of car in front
* Time spent within 3s of car in front
* Acceleration Variance
* Service Mode
* Delivered
(2/2)
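How Tesla computes the "time spent within 1s/3s of car in front" fields isn't visible in the records. A plausible guess is time headway, the gap to the lead car divided by your own speed. The sketch below assumes that definition plus a hypothetical fixed-rate sample format of `(gap_m, speed_mps)` pairs.

```python
def time_within_headway(samples, threshold_s, dt_s):
    """Total seconds during which time headway was below threshold_s.

    samples: iterable of (gap_m, speed_mps); gap_m is None when no lead car
    dt_s: time between samples (assumed constant)
    """
    total_s = 0.0
    for gap_m, speed_mps in samples:
        # Headway is undefined without a lead car or when stationary
        if gap_m is None or speed_mps <= 0:
            continue
        if gap_m / speed_mps < threshold_s:
            total_s += dt_s
    return total_s

samples = [(10.0, 20.0), (50.0, 20.0), (100.0, 20.0), (None, 20.0), (5.0, 0.0)]
within_1s = time_within_headway(samples, threshold_s=1.0, dt_s=0.5)
within_3s = time_within_headway(samples, threshold_s=3.0, dt_s=0.5)
```

Here the headways are 0.5 s, 2.5 s and 5 s, so one sample counts toward the 1 s bucket and two toward the 3 s bucket.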
There's a lot of basic stuff which insurance companies can get via companion apps/dongles, but there are also a lot of deep insights into driver behavior which Tesla can get but others cannot.
I bet a lot of insurance companies would love to get their hands on this kind of data