2020 was the year in which *neural volume rendering* exploded onto the scene, triggered by the impressive NeRF paper by Mildenhall et al. I wrote a post as a way of getting up to speed in a fascinating and very young field, and to share my journey with you: dellaert.github.io/NeRF/
The precursors to NeRF are approaches that use an *implicit* surface representation. At CVPR 2019, 3 papers introduced the use of neural nets as *scalar function approximators* to define occupancy and/or signed distance functions.
Occupancy networks (avg.is.tuebingen.mpg.de/publications/o…) introduce implicit, coordinate-based learning of occupancy. A network consisting of 5 ResNet blocks takes a feature vector and a 3D point and predicts binary occupancy.
IM-NET (github.com/czq142857/impl…) uses a 6-layer MLP decoder that predicts binary occupancy given a feature vector and a 3D coordinate. It can be used for auto-encoding, shape generation (GAN-style), and single-view reconstruction.
DeepSDF (github.com/facebookresear…) directly regresses a **signed distance function** from a 3D coordinate and optionally a latent code. It uses an 8-layer MLP with skip-connections to layer 4, setting a trend!
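To make the shared pattern in these papers concrete, here is a minimal sketch of a DeepSDF-flavored coordinate MLP in PyTorch: a latent code and a 3D point go in, a signed distance comes out, and the input is re-injected at layer 4. Layer widths and names are illustrative; the official implementation additionally uses weight normalization, dropout, and a tanh output.

```python
import torch
import torch.nn as nn

class DeepSDFLikeMLP(nn.Module):
    """8 fully-connected layers; the input is re-injected (skip) at layer 4."""
    def __init__(self, latent_dim=256, hidden=512):
        super().__init__()
        in_dim = latent_dim + 3  # latent shape code + xyz
        self.first = nn.ModuleList([nn.Linear(in_dim, hidden)] +
                                   [nn.Linear(hidden, hidden) for _ in range(3)])
        # after the skip, the original input is concatenated back onto the features
        self.second = nn.ModuleList([nn.Linear(hidden + in_dim, hidden)] +
                                    [nn.Linear(hidden, hidden) for _ in range(3)])
        self.out = nn.Linear(hidden, 1)  # signed distance
        self.act = nn.ReLU()

    def forward(self, latent, xyz):
        x = torch.cat([latent, xyz], dim=-1)
        h = x
        for fc in self.first:
            h = self.act(fc(h))
        h = torch.cat([h, x], dim=-1)    # skip connection to layer 4
        for fc in self.second:
            h = self.act(fc(h))
        return self.out(h)
```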
PIFu (shunsukesaito.github.io/PIFu/) showed that it was possible to learn highly detailed implicit models by re-projecting 3D points into a pixel-aligned feature representation. This idea will later be reprised, with great effect, in PixelNeRF.
Several other approaches build on top of this and generalize to training from 2D images. Of note are Structured Implicit Functions, CvxNet, Deep Local Shapes, Scene Representation Networks, Differentiable Volumetric Rendering, and the Implicit Differentiable Renderer.
As far as I know, two papers introduced **volume rendering** into the view synthesis field, with NeRF being the simplest and ultimately the most influential.
Neural Volumes (research.fb.com/publications/n…) introduced, AFAIK, true **volume rendering** for view synthesis. This paper from Facebook Reality Labs regresses a 3D volume of density and color, albeit still in a voxel-based representation.
NeRF (matthewtancik.com/nerf) is the paper that got everyone talking. They take the DeepSDF architecture but regress density and color, as in NV. They then use a numerical integration method to approximate a volumetric rendering step.
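That rendering step is simple enough to show in a few lines. Below is a sketch in PyTorch of the quadrature from the paper, computing per-sample opacities and transmittances and alpha-compositing the colors; the function name and tensor shapes are mine, not the paper's.

```python
import torch

def volume_render(sigmas, colors, t_vals):
    """Numerical quadrature of the volume rendering integral along each ray.

    sigmas: (R, S)    densities at S samples on R rays
    colors: (R, S, 3) RGB at the same samples
    t_vals: (R, S)    sample depths along each ray
    """
    deltas = t_vals[..., 1:] - t_vals[..., :-1]                       # (R, S-1)
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[..., :1])], dim=-1)
    alphas = 1.0 - torch.exp(-sigmas * deltas)                        # opacity per segment
    # transmittance: probability the ray reaches sample i unoccluded
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)
    weights = alphas * trans                                          # (R, S)
    rgb = (weights.unsqueeze(-1) * colors).sum(dim=-2)                # (R, 3)
    return rgb, weights
```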
Arguably, the impact of the NeRF paper lies in its brutal simplicity: many researchers were taken aback (I think) that such a simple architecture could yield such impressive results.
That being said, vanilla NeRF left many opportunities to improve upon: it is slow, works only for static scenes, bakes in lighting, and does not generalize.
Several projects/papers aim at improving the rather slow training and rendering time of the original NeRF paper.
Neural Sparse Voxel Fields (github.com/facebookresear…) organizes the scene into a sparse voxel octree to speed up rendering by a factor of 10.
NeRF++ (github.com/Kai-46/nerfplu…) proposes modeling the background with a separate NeRF to handle unbounded scenes.
DeRF (ubc-vision.github.io/derf/) decomposes the scene into "soft Voronoi diagrams" to take advantage of accelerator memory architectures.
AutoInt (computationalimaging.org/publications/a…) greatly speeds up rendering by learning the volume integral directly.
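The core trick: train a network Φ so that its derivative matches the integrand, after which any definite integral is just Φ(b) − Φ(a), i.e., two network evaluations instead of many samples along the ray. The snippet below is a hedged toy sketch of that idea using autograd; the actual AutoInt pipeline instead instantiates explicit "grad networks" that share parameters with Φ and splits rays into piecewise sections.

```python
import torch
import torch.nn as nn

# Hypothetical 1D toy: phi learns the antiderivative of an integrand f(t).
phi = nn.Sequential(nn.Linear(1, 64), nn.SiLU(),
                    nn.Linear(64, 64), nn.SiLU(),
                    nn.Linear(64, 1))
opt = torch.optim.Adam(phi.parameters(), lr=1e-3)
f = lambda t: torch.exp(-t) * torch.sin(4 * t)    # stand-in integrand

for _ in range(2000):
    t = (4.0 * torch.rand(256, 1)).requires_grad_(True)
    dphi_dt = torch.autograd.grad(phi(t).sum(), t, create_graph=True)[0]
    loss = ((dphi_dt - f(t)) ** 2).mean()         # fit phi's *derivative* to f
    opt.zero_grad()
    loss.backward()
    opt.step()

# A definite integral over [a, b] now costs two forward passes:
a, b = torch.tensor([[0.5]]), torch.tensor([[3.0]])
with torch.no_grad():
    integral = phi(b) - phi(a)
```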
Learned Initializations (arxiv.org/abs/2012.02189) uses meta-learning to find a good weight initialization for faster training.
JaxNeRF (github.com/google-researc…) uses JAX (github.com/google/jax) to dramatically speed up training, from days to hours.
At least four efforts focus on dynamic scenes, using a variety of schemes.
Nerfies (nerfies.github.io) and its underlying deformable NeRF (D-NeRF) model handle deformable videos using a second MLP that applies a per-frame deformation (see the sketch after this group of papers).
Space-Time Neural Irradiance Fields (video-nerf.github.io) simply use time as an additional input. Carefully selected losses are needed to successfully train this method to render free-viewpoint videos (from RGBD data!).
Neural Scene Flow Fields (cs.cornell.edu/~zl548/NSFF/) instead trains from RGB only, using monocular depth predictions as a prior, and regularizes by also outputting scene flow, which is used in the loss.
D-NeRF (albertpumarola.com/research/D-NeR…) is quite similar to the Nerfies paper and even uses the same acronym, but seems to limit deformations to translations.
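A minimal sketch of the shared idea behind this group of papers: a deformation MLP warps each observation-space point into a canonical frame (here a simple translation, D-NeRF style; Nerfies actually predicts a richer SE(3) warp with elastic regularization), and a static NeRF is then queried at the warped point. Class and argument names are illustrative, and `canonical_nerf` stands for any static NeRF-style MLP.

```python
import torch
import torch.nn as nn

class DeformableRadianceField(nn.Module):
    """Warp each observation-frame point into a canonical frame, then query a static NeRF."""
    def __init__(self, canonical_nerf, latent_dim=64, hidden=128):
        super().__init__()
        self.canonical_nerf = canonical_nerf        # any static NeRF-style MLP
        self.deform = nn.Sequential(                # second MLP: (x, per-frame code) -> offset
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, xyz, view_dir, frame_code):
        # frame_code: (N, latent_dim) learned latent code for the frame each point came from
        offset = self.deform(torch.cat([xyz, frame_code], dim=-1))
        xyz_canonical = xyz + offset                # translation-only warp
        return self.canonical_nerf(xyz_canonical, view_dir)
```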
Besides Nerfies, two other papers focus on avatars/portraits of people.
DNRF (gafniguy.github.io/4D-Facial-Avat…) is focused on 4D avatars and hence imposes a strong inductive bias by including a deformable face model in the pipeline.
Portrait NeRF (portrait-nerf.github.io) creates static NeRF-style avatars, but does so from a single RGB headshot. To make this work, light-stage training data is required.
Another dimension in which NeRF-style methods have been augmented is in how to deal with lighting, typically through latent codes that can be used to re-light a scene.
NeRF-W (nerf-w.github.io) was one of the first follow-up works on NeRF, and optimizes a latent appearance code to enable learning a neural scene representation from less controlled multi-view collections.
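The appearance-code mechanism can be sketched as below: each training image gets a learned embedding that conditions only the color head, so geometry stays consistent while lighting and exposure can vary per photo. This is a simplification (NeRF-W also adds a transient head with per-image uncertainty, omitted here), and all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class AppearanceConditionedColorHead(nn.Module):
    """Color head conditioned on a per-image appearance embedding (NeRF-W style)."""
    def __init__(self, num_images, feat_dim=256, appearance_dim=48):
        super().__init__()
        # one learned appearance code per training image, optimized jointly with the MLP
        self.appearance = nn.Embedding(num_images, appearance_dim)
        self.rgb = nn.Sequential(
            nn.Linear(feat_dim + 3 + appearance_dim, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Sigmoid(),
        )

    def forward(self, features, view_dir, image_ids):
        # features: (N, feat_dim) from the shared density trunk; density itself is
        # not conditioned on appearance, so geometry stays consistent across images
        code = self.appearance(image_ids)           # (N, appearance_dim)
        return self.rgb(torch.cat([features, view_dir, code], dim=-1))
```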
Neural Reflectance Fields (cseweb.ucsd.edu/~bisai/) improve on NeRF by adding a local reflection model in addition to density. It yields impressive relighting results, albeit from single point light sources.
NeRV (people.eecs.berkeley.edu/~pratul/nerv/) uses a second "visibility" MLP to support arbitrary environment lighting and "one-bounce" indirect illumination.
NeRD (markboss.me/publication/20…) or “Neural Reflectance Decomposition” is another effort in which a local reflectance model is used, and additionally a low-res spherical harmonics illumination is removed for a given scene.
Latent codes can also be used to encode shape priors.
GRAF (autonomousvision.github.io/graf/), i.e., a "Generative model for RAdiance Fields", is a conditional variant of NeRF, adding both appearance and shape latent codes, while viewpoint invariance is obtained through GAN-style training.
pi-GAN (marcoamonteiro.github.io/pi-GAN-website/) is similar to GRAF but uses a SIREN-style implementation of NeRF, where each layer is modulated by the output of a different MLP that takes in a latent code.
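Roughly, each NeRF layer becomes a sine layer whose frequencies and phases come from a separate mapping network run on the latent code (FiLM-style conditioning). A sketch of one such layer, with illustrative names:

```python
import torch
import torch.nn as nn

class FiLMSirenLayer(nn.Module):
    """One sine layer; frequency/phase are supplied per layer by a mapping network."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, freq, phase):
        # freq, phase: (B, out_dim) modulation computed from the latent code z
        return torch.sin(freq * self.linear(x) + phase)

# A plain ReLU MLP on z produces one (freq, phase) pair per layer; stacking
# FiLMSirenLayers with those modulations replaces the vanilla NeRF MLP.
```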
pixelNeRF (github.com/sxyu/pixel-nerf) is closer to image-based rendering, where N images are used at test time. It is based on PIFu, creating pixel-aligned features that are then interpolated when evaluating a NeRF-style renderer.
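The pixel-aligned step amounts to projecting each 3D query point into a source view with its camera and bilinearly sampling the CNN feature map at that pixel. The function below is a hedged sketch under common PyTorch conventions (world-to-camera [R|t], pinhole intrinsics, `grid_sample` interpolation); names and conventions are mine rather than the official repo's.

```python
import torch
import torch.nn.functional as F

def sample_pixel_aligned_features(feature_map, xyz_world, K, cam_pose):
    """Project 3D query points into a source view and bilinearly sample its feature map.

    feature_map: (1, C, H, W) CNN features of one source image
    xyz_world:   (N, 3) query points in world coordinates
    K:           (3, 3) camera intrinsics; cam_pose: (3, 4) world-to-camera [R|t]
    """
    # world -> camera -> pixel coordinates (points behind the camera are ignored here)
    xyz_cam = xyz_world @ cam_pose[:, :3].T + cam_pose[:, 3]
    uv = xyz_cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)
    # normalize pixel coordinates to [-1, 1] for grid_sample
    H, W = feature_map.shape[-2:]
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    feats = F.grid_sample(feature_map, grid.view(1, 1, -1, 2), align_corners=True)
    return feats.view(feature_map.shape[1], -1).T   # (N, C), one feature per query point
```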
Clearly (?) none of this will scale to large scenes composed of many objects, so an exciting new area of interest is how to compose objects into volume-rendered scenes.
Neural Scene Graphs (arxiv.org/abs/2011.10379) supports several object-centric NeRF models in a scene graph.
GIRAFFE (arxiv.org/abs/2011.12100) supports composition by having object-centric NeRF models output feature vectors rather than color, composing them via averaging, and rendering at low resolution to 2D feature maps that are then upsampled in 2D.
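The composition operator itself is tiny: my understanding is that densities are summed and features are averaged with density weights across the K object fields evaluated at the same query points. A sketch, with illustrative tensor shapes:

```python
import torch

def compose_objects(sigmas, feats):
    """Density-weighted composition of per-object fields at the same query points.

    sigmas: (K, N)    densities from K object-centric fields
    feats:  (K, N, C) feature vectors from the same fields
    """
    sigma = sigmas.sum(dim=0)                              # (N,) composite density
    weights = sigmas / sigma.clamp(min=1e-10)              # (K, N)
    feat = (weights.unsqueeze(-1) * feats).sum(dim=0)      # (N, C) composite feature
    return sigma, feat
```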
Object-Centric Neural Scene Rendering (shellguo.com/osf/) learns "Object Scattering Functions" in object-centric coordinate frames, allowing for composing scenes and realistically lighting them, using Monte Carlo rendering.
Finally, at least one paper has used NeRF rendering in the context of (known) object pose estimation.
iNeRF (yenchenlin.me/inerf/) uses a NeRF MLP in a pose estimation framework, and is even able to improve view synthesis on standard datasets by fine-tuning the poses. However, it does not yet handle illumination.
Neural volume rendering and NeRF-style papers exploded onto the scene in 2020, and the last word has not been said. My post definitely does not rise to the level of a thorough review, but I hope it is useful for people working in this area or thinking of joining the fray.
