Understanding depth and the 3D space around the car is critical for driving, and since there are self-supervised techniques for it, I can skip the data labeling
To start off I need a way to get depth from the camera footage
I started by training a model with monodepth2 as a base. Monodepth2 isn't the most cutting-edge monocular depth estimation method, but it's easy to train, fairly small, and still produces reasonable results
It uses pairs of consecutive frames to learn depth
Structure from motion learns depth by (a) predicting the motion of the camera between two consecutive video frames and (b) projecting the predicted depth from one frame into the other and ensuring that they match
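Roughly, the self-supervised objective looks like this. This is a condensed monodepth2-style sketch, not the actual training loop: the pose/intrinsics handling is simplified and the SSIM and min-reprojection terms are omitted.

```python
import torch
import torch.nn.functional as F

def reprojection_loss(frame_t, frame_t1, depth_t, T_t_to_t1, K, K_inv):
    """frames: (B,3,H,W); depth_t: (B,1,H,W); T_t_to_t1, K, K_inv: (B,4,4)."""
    b, _, h, w = frame_t.shape
    dev = frame_t.device
    # Homogeneous pixel grid.
    u, v = torch.meshgrid(torch.arange(w, device=dev),
                          torch.arange(h, device=dev), indexing="xy")
    pix = torch.stack([u, v, torch.ones_like(u)], 0).float()
    pix = pix.reshape(1, 3, -1).expand(b, -1, -1)
    # Back-project pixels with the predicted depth, move them by the predicted
    # camera motion, and re-project into the other frame.
    cam = depth_t.reshape(b, 1, -1) * (K_inv[:, :3, :3] @ pix)
    cam = torch.cat([cam, torch.ones(b, 1, h * w, device=dev)], 1)
    proj = (K @ T_t_to_t1)[:, :3, :] @ cam
    uv = proj[:, :2] / (proj[:, 2:3] + 1e-7)
    uv = uv.reshape(b, 2, h, w).permute(0, 2, 3, 1)
    grid = uv / torch.tensor([w - 1, h - 1], device=dev) * 2 - 1  # [-1, 1] for grid_sample
    warped = F.grid_sample(frame_t1, grid, align_corners=True)
    return (warped - frame_t).abs().mean()  # plain L1 photometric error here
```

If depth and camera motion are both right, the warped next frame lines up with the current one, so the photometric error is the training signal.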
This works quite well for static objects and just requires the main camera feed from the vehicle
Since the training process assumes that everything is static, you get issues when dealing with dynamic objects like cars. For learning the static terrain, though, it's not a problem, since we can use multiple frames to filter out the vehicles
Tesla's monocular depth that I've shown before most likely uses stereoscopic training, which avoids this issue since it probably pairs the main and fisheye cameras at exactly the same instant, so everything is "static"
With the depth model, I was able to project each frame of the video out into 3D using the vehicle speed
This gives me a full 3D reconstruction of the video clips!
There's a little bit of filtering to discard inaccurate points far from the car but not much
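In code, the per-frame projection (including that distance filter) looks roughly like this. It's a simplified sketch that assumes a pinhole intrinsics matrix K and straight-line, constant-speed motion:

```python
import numpy as np

def frame_to_points(depth, K, forward_offset, max_range=60.0):
    """depth: (H, W) metric depth; K: 3x3 intrinsics; returns (N, 3) points in meters."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T              # back-project each pixel to a ray
    pts = rays * depth.reshape(-1, 1)            # scale rays by predicted depth
    pts[:, 2] += forward_offset                  # shift by distance already traveled
    return pts[np.linalg.norm(pts, axis=1) < max_range]  # drop distant, noisy points

# Accumulating a clip: the offset advances by speed * dt each frame, e.g.
# cloud = np.vstack([frame_to_points(d, K, i * speed * dt) for i, d in enumerate(depths)])
```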
The projection is actually quite good with just the main camera. If I were to project all the cameras there'd be more detail to the sides of the vehicle
@threejs is a champ and renders the 24M points on my laptop with no issue! @mrdoob
If you point the camera from above you can easily see the entire road surface, which makes it possible to label bird's-eye-view maps such as Tesla uses in their vehicles
It's much easier to label a bird's-eye reconstruction like this than it is to label lines for each frame at 36 frames per second
I didn't feel like labeling so I took this pixel data and bucketed it into a voxel representation around the vehicle
This was one of the more painful steps: I had to write this transformation from scratch, and it needs to handle millions of points per clip
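A stripped-down version of the bucketing step looks something like this. My real version is more involved since it streams millions of points per clip; the grid extents and voxel size here are just illustrative:

```python
import numpy as np

def voxelize(points, voxel_size=0.33, extent=((-20, 60), (-25, 25), (-2, 3))):
    """points: (N, 3) in the vehicle frame -> boolean occupancy grid."""
    lo = np.array([e[0] for e in extent], dtype=np.float32)
    hi = np.array([e[1] for e in extent], dtype=np.float32)
    keep = np.all((points >= lo) & (points < hi), axis=1)   # clip to the grid extents
    idx = ((points[keep] - lo) / voxel_size).astype(int)    # bucket each point into a cell
    shape = np.ceil((hi - lo) / voxel_size).astype(int)
    grid = np.zeros(shape, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True            # mark occupied voxels
    return grid
```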
I trained a model using this data to predict the 3D voxel grid around the vehicle from the main, left/right pillar and left/right bumper cameras
The training data is fairly rough but the model seems to capture the coarse detail. Though there's likely overfitting, since I only have ~15.2k frame/voxel training examples, which is only about 7 minutes of footage
Here's the architecture I ended up using. It's loosely modeled off of the architecture presented at Tesla AI Day.
Key bits:
* per-camera encoding with the same depth encoder used to generate the point clouds
* BiFPNs to encode the features
* a transformer for the largest two feature sizes
I'm sure there's a cleaner architecture (I'm far from a CV/transformer expert) but it seems to work fairly well and gets 97.5% train accuracy
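Here's roughly how I think of the pieces fitting together, as a heavily simplified sketch rather than my actual code: a small conv stack stands in for the reused depth encoder, a single conv stands in for the BiFPN, and the camera count, feature sizes, and grid shape are placeholders.

```python
import torch
import torch.nn as nn

class VoxelNet(nn.Module):
    def __init__(self, num_cams=5, feat_dim=128, grid=(48, 48, 8)):
        super().__init__()
        self.grid = grid
        # Per-camera encoder (stand-in for the reused depth encoder).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Stand-in for the BiFPN feature fusion.
        self.fuse = nn.Conv2d(feat_dim, feat_dim, 3, padding=1)
        # Transformer over flattened image features, queried by learned
        # voxel-column embeddings (one query per x/y cell of the grid).
        self.queries = nn.Parameter(torch.randn(grid[0] * grid[1], feat_dim))
        layer = nn.TransformerDecoderLayer(feat_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        # Each query predicts occupancy for its vertical column of voxels.
        self.head = nn.Linear(feat_dim, grid[2])

    def forward(self, images):            # images: (B, num_cams, 3, H, W)
        b = images.shape[0]
        feats = self.fuse(self.encoder(images.flatten(0, 1)))   # (B*N, D, h', w')
        feats = feats.flatten(2).transpose(1, 2)                 # (B*N, h'*w', D)
        feats = feats.reshape(b, -1, feats.shape[-1])            # tokens from all cameras
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out = self.decoder(q, feats)                             # (B, X*Y, D)
        return self.head(out).reshape(b, *self.grid)             # (B, X, Y, Z) occupancy logits
```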
Overall, I'm pretty happy for a two week side project 🙂
Thanks to everyone who helped! @greentheonly, Sherman and Sid
@aelluswamy's talk at CVPR has a lot of very impressive improvements to Tesla's 3D voxel models. There are some subtle but very important things in the slides that I'm excited to incorporate into my own models. ⬇️
1) Image positional encoding: This adds in an x/y position encoding to each of the image space features. This should make it easier for the transformer to go from image space to 3D
It seems like a hybrid between a traditional CNN and ViT
ViT uses patches of the image encoded with a position before feeding them through a transformer. Using a position encoding with a traditional CNN seems like a nice balance of efficiency and likely makes the per-camera encoder simpler
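My reading of the trick, as a sketch: concatenate normalized x/y coordinate channels onto each camera's CNN feature maps before the transformer. Tesla may well use a sinusoidal encoding instead; this is just the simplest version.

```python
import torch

def add_xy_position(feats: torch.Tensor) -> torch.Tensor:
    """feats: (B, C, H, W) -> (B, C + 2, H, W) with normalized x/y channels appended."""
    b, _, h, w = feats.shape
    ys = torch.linspace(-1.0, 1.0, h, device=feats.device)
    xs = torch.linspace(-1.0, 1.0, w, device=feats.device)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")            # (H, W) coordinate grids
    pos = torch.stack([xx, yy]).unsqueeze(0).expand(b, -1, -1, -1)
    return torch.cat([feats, pos], dim=1)                     # features now know where they are
```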
Curious what I've been up to in the past 6 months? 😅
I've been working on a novel approach to depth and occupancy understanding for my FSD models!
It's much simpler than existing techniques and directly learns the 3D representation ⬇️
I posted the full write-up about a month ago and I've had a number of PhD students, companies and labs ask to collaborate on papers/projects, so I think it's state of the art 🙂
In my last post I was doing a multi-stage pipeline to train the models:
1) train an image-space depth model from the main camera
2) generate a point cloud from an entire video
3) convert the cloud to cubes
4) train a voxel model using multiple cameras
When looking at this data there are two main things to consider: the static world around the vehicle and the dynamic objects in the scene, such as cars or people
For static objects, information from the forward-facing cameras can compensate for the lack of info on the repeaters
Here's a static scene in low light. With the blinker off, the curb is too dark to see. The blinker actually helps since it provides light
The nearby signs and the farther-away barriers are mostly washed out, but since they're static they can be remembered
Curious what Tesla means by uprevving their static obstacle neural nets?
Let's see how the Tesla FSD Beta 10.5 3D Voxel nets compare to the nets from two months ago.
The new captures are from the same area as the old ones so we can directly compare the outputs
1/N
This first example is a small pedestrian crosswalk sign in the middle of the road. It's about 1 foot wide so it should show up as 1 pixel in the nets.
Under the old nets it shows up as a large blob with an incorrect depth. Under the new nets it's much better.
Under the old nets the post shows up as a huge blob and disappears when the car gets close to it. The probabilities seem fairly consistent no matter how far away the sign is, even though up close they should be more confident.
Most of the critical FSD bits are missing in the normal firmware. These outputs aren't normally running, but with some tricks we can enable them.
This seems to be the general solution to handling unpredictable scenarios such as the Seattle monorail pillars or overhanging shrubbery.
The nets predict the location of static objects in the space around them via a dense grid of probabilities.
The output is a 384x255x12 dense grid of probabilities. Each cube seems to be ~0.33 meters, and the grid currently covers predictions out to ~100 meters in front of the vehicle.
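For a sense of scale, decoding that grid into metric points is straightforward. This is a sketch that assumes the ~0.33 m cube size above and an arbitrary grid origin/ordering, which are my guesses for illustration:

```python
import numpy as np

VOXEL_SIZE = 0.33            # meters per cube (approximate)
GRID_SHAPE = (384, 255, 12)  # (forward, lateral, vertical) ordering is assumed

def occupied_points(probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """probs: dense grid with shape GRID_SHAPE -> (N, 3) coordinates in meters."""
    idx = np.argwhere(probs > threshold)          # voxel indices above the threshold
    return idx.astype(np.float32) * VOXEL_SIZE    # scale indices into meters
```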
We recently got some insight into how Tesla is going to replace radar in the recent firmware updates + some nifty ML model techniques
⬇️ Thread
From the binaries we can see that they've added velocity and acceleration outputs. These predictions, in addition to the existing xyz outputs, give much of the same information that radar traditionally provides (distance + velocity + acceleration).
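A toy sketch of why those outputs can stand in for radar: given a lead vehicle's distance, relative velocity, and relative acceleration from the vision nets, the planner can derive the same closing-gap / time-to-contact quantities a forward radar would give. The function name and the simple kinematics here are mine, not Tesla's.

```python
def time_to_contact(distance_m, rel_velocity_mps, rel_accel_mps2, horizon_s=5.0, dt=0.1):
    """Seconds until the gap to the lead car closes, or None if it doesn't within the horizon.
    Negative rel_velocity/rel_accel mean the gap is shrinking."""
    t, gap, v = 0.0, distance_m, rel_velocity_mps
    while t < horizon_s:
        gap += v * dt                 # integrate the gap with relative velocity...
        v += rel_accel_mps2 * dt      # ...and relative acceleration
        t += dt
        if gap <= 0.0:
            return t
    return None

# e.g. a lead car 30 m ahead, closing at 5 m/s while also decelerating relative to us:
# time_to_contact(30.0, -5.0, -1.0)
```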
For autosteer on city streets, you need to know the velocity and acceleration of cars in all directions, but radar only points forward. If the vision predictions are accurate enough to handle a left turn, radar is probably unnecessary for the most part.