TL;DR
- Introduced by Mildenhall et al. in 'NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis' (arXiv:2003.08934; ECCV 2020, best paper honorable mention).
- Represents a scene as a continuous 5D function — position (x, y, z) and viewing direction (θ, φ) — mapped by an MLP to colour and volume density.
- Renders novel views by ray marching: sample points along each camera ray, query the MLP, composite using volumetric rendering.
- Largely superseded for real-time rendering by 3D Gaussian Splatting (2023), but remains the conceptual reference for neural scene representation and inspires many extensions (Instant-NGP, Nerfacto, NeRF-W).
The Idea
NeRF takes a small set of input images of a scene — typically 20 to 100 photos with known camera poses — and learns a continuous representation of the scene's geometry and appearance. The representation is a multi-layer perceptron: input the 3D coordinate of a point and the viewing direction; output the RGB colour and the volume density at that point. To render a novel view, march rays from the camera through every output pixel, sample points along each ray, query the MLP at each sample, and composite using the classical volumetric rendering equation.
The result is photorealistic novel-view synthesis from sparse input, at far higher fidelity than the multi-view stereo pipelines that preceded it. NeRF received a best paper honorable mention at ECCV 2020 and spawned an enormous body of follow-up research.
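The compositing step described above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard volume-rendering quadrature for a single ray, not the authors' code; the function name `composite_ray` and the choice to treat the last interval as very long (1e10) are assumptions made for the example.

```python
import numpy as np

def composite_ray(sigma, rgb, t_vals):
    """Composite samples along one ray with the usual quadrature rule.

    sigma:  (N,)   non-negative volume densities at the samples
    rgb:    (N, 3) colours at the samples, in [0, 1]
    t_vals: (N,)   increasing sample distances along the ray
    """
    # Distance between adjacent samples; the last interval is treated as very long.
    deltas = np.append(np.diff(t_vals), 1e10)
    # Opacity contributed by each interval.
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: fraction of light that reaches sample i unabsorbed.
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1]))
    weights = trans * alpha
    colour = (weights[:, None] * rgb).sum(axis=0)
    return colour, weights
```

The returned `weights` are reused by hierarchical sampling, and the per-pixel loss is the mean squared difference between `colour` and the ground-truth pixel.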
Architecture and Training
- MLP with roughly 8 hidden layers of 256 units. Input is the 5D coordinate; outputs are density σ (depending only on position) and view-dependent colour c.
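- Positional encoding: each input coordinate is mapped through a Fourier feature basis (sines and cosines at exponentially increasing frequencies) before entering the MLP, allowing the network to fit high-frequency detail.
- Hierarchical sampling: a 'coarse' network is evaluated at stratified samples along each ray, and its compositing weights define a distribution from which additional samples are drawn. A 'fine' network is then evaluated at the combined set, concentrating computation in high-density regions.
- Loss: per-pixel MSE between rendered and ground-truth colours across all training rays, applied to both the coarse and the fine renderings.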
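Two of the components above, positional encoding and the resampling step of hierarchical sampling, can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the function names are hypothetical, the encoding groups all sines before all cosines (a permutation of the interleaved order in the paper), and `bin_edges` is assumed to be supplied by the caller. The default of 10 frequencies matches what the paper uses for position; it uses 4 for viewing direction.

```python
import numpy as np

def positional_encoding(x, num_freqs=10):
    """Map each coordinate p to sin(2^k * pi * p) and cos(2^k * pi * p), k < num_freqs."""
    x = np.asarray(x, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi            # (L,)
    angles = x[..., None] * freqs                             # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)                     # (..., D * 2L)

def sample_fine(bin_edges, weights, num_fine, rng):
    """Draw extra sample distances by inverse-transform sampling of coarse weights.

    bin_edges: (N + 1,) increasing distances bounding N coarse intervals
    weights:   (N,)     non-negative compositing weights from the coarse pass
    """
    # A small floor keeps every interval's probability strictly positive.
    pdf = (weights + 1e-5) / np.sum(weights + 1e-5)
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])             # (N + 1,)
    u = rng.uniform(size=num_fine)
    # Find the interval containing each u, then interpolate linearly inside it.
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(weights) - 1)
    frac = (u - cdf[idx]) / (cdf[idx + 1] - cdf[idx])
    t = bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])
    return np.sort(t)
```

A 3D position encoded with 10 frequencies becomes a 60-dimensional vector, and `sample_fine` places most of its samples in the intervals where the coarse pass found surface.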
Why It Was Slow
The original NeRF took roughly 1-2 days to train per scene on a single GPU and rendered a 1080p frame in tens of seconds. Both were major obstacles to deployment. Several follow-ups changed the picture:
- Instant-NGP (Müller et al., 2022) — replaced the dense MLP with a multi-resolution hash grid feature lookup plus a tiny MLP. Training dropped to minutes; rendering to interactive frame rates.
- Mip-NeRF (Barron et al., 2021) and Mip-NeRF 360 — handled anti-aliasing and unbounded scenes properly.
- Nerfacto (Nerfstudio) — engineering-grade NeRF combining many of the above improvements into a single trainable model.
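The hash grid lookup behind Instant-NGP can be illustrated with a single resolution level. The sketch below is an assumption-laden simplification rather than the library's implementation: it handles one point and one level, the function names are hypothetical, and the table contents are whatever the caller supplies. A full encoder would concatenate the features from several levels of increasing resolution and pass them to the small MLP.

```python
import numpy as np

# Per-axis multipliers for the spatial hash (the first is 1, the others are large primes).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_corners(ijk, table_size):
    """Hash integer grid coordinates of shape (..., 3) into [0, table_size)."""
    h = np.asarray(ijk).astype(np.uint64) * PRIMES
    idx = (h[..., 0] ^ h[..., 1] ^ h[..., 2]) % np.uint64(table_size)
    return idx.astype(np.int64)

def lookup_level(x, table, resolution):
    """Trilinearly interpolate hashed features for one point x in [0, 1]^3.

    table: (T, F) array of learnable feature vectors for this level
    """
    pos = np.asarray(x, dtype=np.float64) * resolution
    base = np.floor(pos).astype(np.int64)
    frac = pos - base
    feat = np.zeros(table.shape[1])
    for corner in np.ndindex(2, 2, 2):                  # the 8 corners of the cell
        offset = np.array(corner)
        # Weight is frac on axes where this is the far corner, 1 - frac otherwise.
        w = np.prod(np.where(offset == 1, frac, 1.0 - frac))
        feat += w * table[hash_corners(base + offset, len(table))]
    return feat
```

Because the lookup is a handful of memory reads and multiply-adds, most of the representational capacity sits in the table rather than in network weights, which is why the accompanying MLP can be tiny.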
As of 2026, 3D Gaussian Splatting has replaced NeRF as the dominant approach for high-quality real-time scene rendering. NeRF remains preferred for some use cases — unbounded outdoor scenes, view-dependent reflective surfaces — and as a research baseline.
Where NeRF Still Wins
- Strongly view-dependent surfaces — wet roads, polished metal, glass — where Gaussian Splatting's explicit colour storage struggles.
- Sparse-input regimes, where the regularisation inherent in an implicit representation helps.
- Tasks where the scene representation needs to support downstream queries (semantic NeRFs, editable NeRFs, NeRF-RPN) — implicit representations are easier to compose with learned heads.
Deployment Reality
Most production neural reconstruction pipelines in 2026 default to 3D Gaussian Splatting for the renderer and use NeRF-style methods either as a baseline for evaluation or as a back-end for difficult scenes. Nerfstudio is the canonical open framework — it provides a unified trainer, viewer, and exporter across Nerfacto, Instant-NGP, Splatfacto (Gaussian Splatting), and a wide method library.
On Yobitel compute, NeRF training fits comfortably on a single L40S or H100 per scene. Long-running batch jobs over scene archives benefit from the L40S's NVENC hardware encoder, which can generate preview videos for monitoring training progress.