Neural Volume Rendering for Dynamic Scenes

NeRF has shown incredible view synthesis results, but it requires multi-view captures for STATIC scenes.

How can we achieve view synthesis for DYNAMIC scenes from a single video? Here is what I learned from several recent efforts.
Instead of presenting Video-NeRF, Nerfie, NR-NeRF, D-NeRF, NeRFlow, NSFF (and many others!) as individual algorithms, here I try to view them from a unifying perspective and understand the pros/cons of various design choices.

Okay, here we go.
*Background*

NeRF represents the scene as a continuous 5D volumetric function that maps a 3D spatial position and a 2D viewing direction to color and density. It then composites these colors/densities along camera rays with volume rendering to form an image.
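To make the rendering step concrete, here is a minimal NumPy sketch of the volume rendering quadrature (the function and variable names are mine, not from any particular codebase):

```python
import numpy as np

def composite_ray(colors, sigmas, deltas):
    """Composite sampled colors/densities along one camera ray.

    colors: (N, 3) RGB at each sample; sigmas: (N,) densities;
    deltas: (N,) distances between adjacent samples.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                         # per-segment opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))  # transmittance T_i
    weights = trans * alphas                                        # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                  # final pixel color
```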

Volumetric + Implicit -> Awesome!
*Model*

Building on NeRF, one can extend it to handle dynamic scenes with two types of approaches.

A) 4D (or 6D with views) function.

One direct approach is to include TIME as an additional input to learn a DYNAMIC radiance field.

e.g., Video-NeRF, NSFF, NeRFlow
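A minimal PyTorch sketch of approach A (my own naming; real methods also apply positional encodings and more elaborate architectures): the NeRF MLP simply takes time as an extra input.

```python
import torch
import torch.nn as nn

class DynamicRadianceField(nn.Module):
    """(x, d, t) -> (RGB, density); positional encodings omitted for brevity."""
    def __init__(self, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 3 + 1, hidden), nn.ReLU(),   # 3D position + 3D view dir + time
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                      # RGB + density
        )

    def forward(self, x, d, t):
        return self.mlp(torch.cat([x, d, t], dim=-1))
```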
B) 3D Template with Deformation.

Inspired by non-rigid reconstruction methods, this type of approach learns a radiance field in a canonical frame (template) and predicts deformation for each frame to account for dynamics over time.

e.g., Nerfie, NR-NeRF, D-NeRF
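A corresponding sketch of approach B (again, hypothetical names): a deformation MLP warps each frame's points into the canonical frame, where a single static NeRF is queried.

```python
import torch
import torch.nn as nn

class DeformableRadianceField(nn.Module):
    """Canonical radiance field + per-frame deformation."""
    def __init__(self, canonical_nerf, deform_mlp):
        super().__init__()
        self.canonical = canonical_nerf  # static NeRF defined in the canonical frame
        self.deform = deform_mlp         # maps (x, t) -> 3D offset

    def forward(self, x, d, t):
        dx = self.deform(torch.cat([x, t], dim=-1))  # warp into the canonical frame
        return self.canonical(x + dx, d)
```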
*Deformation Model*

All the methods use an MLP to encode the deformation field. But, how do they differ?

A) INPUT: How to encode the additional time dimension as input?

B) OUTPUT: How to parametrize the deformation field?
A) Input conditioning

One can choose to use EXPLICIT conditioning by treating the frame index t as input.

Alternatively, one can use a learnable LATENT vector for each frame.
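The two conditioning options side by side (a sketch with made-up shapes; `points` stands for the 3D samples fed to the deformation MLP):

```python
import torch
import torch.nn as nn

points = torch.rand(1024, 3)                 # 3D samples fed to the deformation MLP
num_frames, latent_dim = 100, 32

# EXPLICIT: condition on (an encoding of) the frame index / normalized time t.
t = torch.full((1024, 1), 37 / num_frames)
deform_in_explicit = torch.cat([points, t], dim=-1)

# LATENT: learn one code per frame, optimized jointly with the MLP weights.
frame_codes = nn.Embedding(num_frames, latent_dim)
code = frame_codes(torch.tensor([37])).expand(1024, -1)
deform_in_latent = torch.cat([points, code], dim=-1)
```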
B) Output parametrization

We can either use the MLP to predict
- dense 3D translation vectors (aka scene flow), or
- a dense rigid motion field (per-point rotation + translation); see the sketch below.
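A simplified sketch of both output parametrizations (`flow_mlp`/`rigid_mlp` are hypothetical, and e.g. Nerfie actually parametrizes rigid motion as an SE(3) field via screw axes rather than this plain rotation + translation form):

```python
import torch

def warp_translation(x, flow_mlp, cond):
    """Scene-flow parametrization: predict a 3D offset per point."""
    return x + flow_mlp(cond)                     # (N, 3) offsets

def warp_rigid(x, rigid_mlp, cond):
    """Rigid-motion parametrization: predict a per-point rotation + translation."""
    out = rigid_mlp(cond)                         # (N, 6): axis-angle + translation
    rotvec, trans = out[:, :3], out[:, 3:]
    theta = rotvec.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = rotvec / theta
    cos, sin = torch.cos(theta), torch.sin(theta)
    # Rodrigues' rotation formula applied pointwise
    x_rot = x * cos + torch.cross(k, x, dim=-1) * sin \
            + k * (k * x).sum(-1, keepdim=True) * (1 - cos)
    return x_rot + trans
```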
With these design choices in mind, we can mix-n-match to synthesize all the methods.
*Regularization*

Adding the deformation field introduces ambiguities. So we need to make it "well-behaved", e.g., the deformation field should be spatially smooth, temporally smooth, sparse, and avoid contraction and expansion.
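These constraints typically enter training as extra loss terms. A sketch of two of them (the `deform` signature and loss weights are hypothetical; each paper has its own exact formulation):

```python
import torch

def deformation_regularizers(deform, x, cond, eps=1e-3):
    """Soft constraints on a deformation MLP `deform(x, cond) -> dx`."""
    dx = deform(x, cond)
    l_sparse = dx.abs().mean()                         # sparsity: most points stay still
    dx_nb = deform(x + eps * torch.randn_like(x), cond)
    l_smooth = (dx - dx_nb).pow(2).mean()              # spatial smoothness of the field
    return l_sparse, l_smooth  # temporal smoothness: perturb cond analogously
```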
*Depth supervision*

Unlike the other methods above, Video-NeRF (shameless plug here) does not require a separate deformation field (or the various regularization terms); instead, it uses direct depth supervision to constrain the time-varying geometry.
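Conceptually, the depth supervision penalizes the gap between the expected ray termination depth from volume rendering and a monocular depth estimate, roughly as below (a sketch; the actual loss also handles scale calibration and dynamic/static regions):

```python
import torch

def depth_loss(weights, z_vals, depth_est, valid):
    """weights: (R, N) compositing weights; z_vals: (R, N) sample depths;
    depth_est: (R,) single-view depth estimates; valid: (R,) mask."""
    depth_render = (weights * z_vals).sum(dim=-1)   # expected termination depth per ray
    return ((depth_render - depth_est).abs() * valid).mean()
```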

With further improvements in single-video depth estimation (another shameless plug 🤩), I am very excited to see dynamic view synthesis on videos in the wild soon!

So...what are the best methods/practices 🤔? I don't know.

3D template-based methods can achieve really good visual quality but may be limited in the amount of dynamics they can handle.

A 4D function is more general, but it may require designing effective constraints for regularization.
This is by no means a complete list. Please let me know if you know of other relevant works in this domain.

For a broader view of the "NeRF explosion" and background, check out @fdellaert's blog dellaert.github.io/NeRF/

and Frank and @yen_chen_lin's report
arxiv.org/abs/2101.05204
To learn more about these, come and chat with us in the reading group at 13:00-14:00 EST on Jan 20.

yenchenlin.me/3D-representat…
