Tristan
Machine Learning + Reverse Engineering + Software + Security SWE @pytorch, tweets are personal opinions https://t.co/419A7MoFX7

Jan 14, 2022, 16 tweets

I spent some time over my 2 week holiday creating my own self driving models from the ground up in PyTorch 🙂

Open source self driving anyone?

Check out the full write-up at: fn.lc/post/diy-self-…

I'll be summarizing it below ⬇️ 1/n

These were trained from the raw footage without any Tesla NNs or outputs. It's more fun this way and makes it a lot easier to iterate

I built everything here using just 5 of the 8 cameras plus the vehicle speed, steering wheel position, and IMU readings

Early on I decided to focus on models that wouldn't require me to label thousands of hours of data but that are still critical to self driving.

What made the most sense was to try and recreate the 3D generalized static object network previously shown at:

Understanding depth and the 3D space around the car is critical for driving, and since there are self-supervised techniques for it, I can skip the data labeling

To start off, I needed a way to get depth from the camera footage

I started by training a model with monodepth2 as a base. Monodepth2 isn't the most cutting-edge monocular depth estimation model, but it's easy to train and fairly small while still producing reasonable results

It uses pairs of consecutive frames to learn depth

github.com/nianticlabs/mo…

Structure from motion learns depth by predicting (a) the depth of each pixel and (b) the motion of the camera between two consecutive video frames, then projecting one frame into the other and ensuring that they match

This works quite well for static objects and just requires the main camera feed from the vehicle
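For reference, here's a minimal sketch of that self-supervised objective in PyTorch. This is a simplified illustration, not monodepth2's actual code (the real thing adds SSIM, multi-scale losses, and a min-reprojection trick); the function names and the single-pair L1 loss are assumptions:

```python
import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift a depth map to 3D points in the camera frame. depth: (B, 1, H, W)."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype),
        torch.arange(W, dtype=depth.dtype),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).reshape(3, -1)
    rays = K_inv @ pix.to(depth.device)                 # per-pixel camera rays
    return rays.unsqueeze(0) * depth.reshape(B, 1, -1)  # (B, 3, H*W)

def photometric_loss(target, source, depth, pose, K, K_inv):
    """Warp `source` into `target`'s viewpoint via predicted depth + ego-motion.

    If the predicted depth and pose are right, the warped source should match
    the target, so the L1 difference is a training signal with no labels.
    """
    B, _, H, W = target.shape
    pts = backproject(depth, K_inv)                # 3D points in target camera
    pts = pose[:, :3, :3] @ pts + pose[:, :3, 3:]  # move into source camera
    pix = K @ pts                                  # project back to pixels
    pix = pix[:, :2] / pix[:, 2:].clamp(min=1e-6)  # perspective divide
    pix = pix.reshape(B, 2, H, W)
    grid = torch.stack(                            # normalize to [-1, 1]
        [2 * pix[:, 0] / (W - 1) - 1, 2 * pix[:, 1] / (H - 1) - 1], dim=-1
    )
    warped = F.grid_sample(source, grid, align_corners=True)
    return (warped - target).abs().mean()
```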

Since the training process assumes that everything is static, you get issues when dealing with dynamic objects like cars. For learning the static terrain, though, it's not a problem, since we can use multiple frames to filter out the vehicles

The Tesla monocular depth net I've shown before most likely uses stereoscopic training, which avoids this issue: it probably uses the main and fisheye cameras captured at exactly the same time, so everything is effectively "static"

See earlier tweets about that:

With the depth model, I was able to project each frame of the video out into 3D, using the vehicle speed to place the frames relative to each other

This gives me a full 3D reconstruction of the video clips!

There's a little bit of filtering to discard inaccurate points far from the car but not much
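Roughly, the stitching step could look like this in PyTorch (reusing the backproject helper from the sketch above; the straight-line ego-motion, 36 fps frame rate, and max_dist cutoff are simplifying assumptions, not the write-up's exact logic):

```python
import torch

def clip_to_point_cloud(frames, depths, speeds, K_inv, fps=36.0, max_dist=50.0):
    """Stitch per-frame depth into one point cloud along the direction of travel.

    frames: (T, 3, H, W), depths: (T, 1, H, W) metric depth, speeds: (T,) m/s.
    Assumes straight-line motion along the camera z-axis for simplicity.
    """
    points, colors, travelled = [], [], 0.0
    for frame, depth, speed in zip(frames, depths, speeds):
        pts = backproject(depth[None], K_inv)[0].T   # (H*W, 3) camera-space pts
        pts[:, 2] += travelled                       # place along the driven path
        keep = depth.reshape(-1) < max_dist          # drop far-away, noisy points
        points.append(pts[keep])
        colors.append(frame.reshape(3, -1).T[keep])
        travelled += float(speed) / fps              # metres driven this frame
    return torch.cat(points), torch.cat(colors)
```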

The projection is actually quite good with just the main camera. If I were to project all the cameras, there'd be more detail to the sides of the vehicle

@threejs is a champ and renders the 24M points on my laptop with no issue! @mrdoob

If you point the camera straight down from above, you can easily see the entire road surface, which makes it possible to label birds-eye-view maps such as the ones Tesla uses in their vehicles

It's much easier to label a birds-eye reconstruction like this than to label lines in each frame at 36 frames per second
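As a rough illustration, rasterizing the stitched points into a top-down image for labeling could be as simple as this (the grid extents and cell size are made-up values):

```python
import torch

def birds_eye_view(points, colors, cell=0.1,
                   x_range=(-20.0, 20.0), z_range=(0.0, 200.0)):
    """Rasterize points into a top-down RGB image; overlapping points overwrite."""
    W = int((x_range[1] - x_range[0]) / cell)
    H = int((z_range[1] - z_range[0]) / cell)
    u = ((points[:, 0] - x_range[0]) / cell).long()   # x -> image column
    v = ((points[:, 2] - z_range[0]) / cell).long()   # z (forward) -> image row
    keep = (u >= 0) & (u < W) & (v >= 0) & (v < H)    # points inside the map
    image = torch.zeros(H, W, 3)
    image[v[keep], u[keep]] = colors[keep]
    return image
```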

I didn't feel like labeling so I took this pixel data and bucketed it into a voxel representation around the vehicle

This was one of the more painful steps: I had to write the transformation from scratch, and it needs to handle millions of points per clip
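A minimal version of that bucketing step might look like this; the grid dimensions, cell size, and origin here are placeholders, not the actual values:

```python
import torch

def voxelize(points, grid=(256, 16, 256), cell=0.5, origin=(-64.0, -4.0, 0.0)):
    """Bucket a point cloud into a dense occupancy grid around the vehicle.

    Fully vectorized, so millions of points per clip stay cheap.
    """
    idx = torch.floor((points - torch.tensor(origin)) / cell).long()  # voxel coords
    keep = ((idx >= 0) & (idx < torch.tensor(grid))).all(dim=1)       # in-bounds only
    idx = idx[keep]
    vox = torch.zeros(grid, dtype=torch.bool)
    vox[idx[:, 0], idx[:, 1], idx[:, 2]] = True                       # occupied cells
    return vox
```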

I trained a model using this data to predict the 3D voxel grid around the vehicle from the main, left/right pillar and left/right bumper cameras

The training data is fairly rough but the model seems to capture the coarse detail. Though there's likely overfitting, since I only have ~15.2k frame/voxel training examples, which is only about 7 minutes of footage

Here's the architecture I ended up using. It's loosely modeled off of the architecture presented at Tesla AI Day.

Key bits:
* encodes the cameras using the depth encoder that generated the point clouds
* BiFPNs to encode the features
* a transformer for the largest two feature sizes

I'm sure there's a cleaner architecture (I'm far from a CV/transformer expert), but this one seems to work fairly well and gets 97.5% train accuracy
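A shape-only skeleton of the idea (every dimension is invented, and it collapses the BiFPN and the two transformer scales into simple stand-ins) looks something like:

```python
import torch
import torch.nn as nn

class VoxelNet(nn.Module):
    """Skeleton: per-camera encoder -> fuse -> transformer -> voxel logits."""

    def __init__(self, cams=5, dim=64, grid_y=16):
        super().__init__()
        # stand-in for reusing the depth model's encoder as the backbone
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # stand-in for the BiFPN: just a 1x1 conv fusing all camera features
        self.fuse = nn.Conv2d(cams * dim, dim, 1)
        # transformer over the fused feature map, flattened to a sequence
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # predict a column of voxel occupancy logits per spatial cell
        self.head = nn.Linear(dim, grid_y)

    def forward(self, images):                      # images: (B, cams, 3, H, W)
        B = images.shape[0]
        feats = self.encoder(images.flatten(0, 1))  # (B*cams, dim, h, w)
        feats = feats.reshape(B, -1, *feats.shape[2:])
        fused = self.fuse(feats)                    # (B, dim, h, w)
        seq = fused.flatten(2).transpose(1, 2)      # (B, h*w, dim)
        seq = self.transformer(seq)
        return self.head(seq)                       # (B, h*w, grid_y) logits
```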

Overall, I'm pretty happy for a two week side project 🙂

Thanks to everyone who helped! @greentheonly, Sherman and Sid
