Scene Representation Transformer

Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations

Model Overview

SRT takes a few posed or unposed images of novel real-world scenes as input and produces a set-latent scene representation that is decoded into 3D videos and semantics, entirely in real time. The model is fully geometry-free and is instead powered by Transformers and attention mechanisms.


A classical problem in computer vision is to infer a 3D scene representation from a few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g. textured meshes, or implicit representations, e.g. radiance fields, and often requires input images with precise camera poses and long processing times for each novel scene.

In this work, we propose the Scene Representation Transformer (SRT), a method which processes posed or unposed RGB images of a new area, infers a “set-latent scene representation”, and synthesises novel views, all in a single feed-forward pass. To calculate the scene representation, we propose a generalization of the Vision Transformer to sets of images, enabling global information integration, and hence 3D reasoning. An efficient decoder transformer parameterizes the light field by attending into the scene representation to render novel views. Learning is supervised end-to-end by minimizing a novel-view reconstruction error.
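The decoding step described above, where ray queries attend into the set-latent scene representation, can be illustrated with a minimal NumPy sketch. This is not the SRT implementation: it assumes a single attention head, random untrained weights, and raw 6-D rays (origin plus direction) as queries. The names `decode_rays`, `w_q`, `w_k`, `w_v`, and `w_rgb` are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def decode_rays(rays, scene_tokens, params):
    """Single-head cross-attention from ray queries to scene tokens.

    rays:         [R, 6] ray origins and directions (assumed query encoding).
    scene_tokens: [N, D] set-latent scene representation.
    Returns RGB values of shape [R, 3] and attention weights of shape [R, N].
    """
    q = rays @ params["w_q"]                         # [R, D]
    k = scene_tokens @ params["w_k"]                 # [N, D]
    v = scene_tokens @ params["w_v"]                 # [N, D]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # [R, N]
    feats = attn @ v                                 # [R, D]
    logits = feats @ params["w_rgb"]                 # [R, 3]
    rgb = 1.0 / (1.0 + np.exp(-logits))              # sigmoid keeps RGB in [0, 1]
    return rgb, attn


rng = np.random.default_rng(0)
D, N, R = 32, 50, 8  # token width, number of scene tokens, number of rays
params = {
    "w_q": 0.1 * rng.normal(size=(6, D)),
    "w_k": 0.1 * rng.normal(size=(D, D)),
    "w_v": 0.1 * rng.normal(size=(D, D)),
    "w_rgb": 0.1 * rng.normal(size=(D, 3)),
}
scene_tokens = rng.normal(size=(N, D))  # stand-in for the encoder output
rays = rng.normal(size=(R, 6))          # stand-in for target-view rays

rgb, attn = decode_rays(rays, scene_tokens, params)

# Novel-view reconstruction error against (here random) target colors.
target_rgb = rng.uniform(size=(R, 3))
loss = float(np.mean((rgb - target_rgb) ** 2))
```

Because every ray is decoded independently from the same set of tokens, any number of target pixels can be rendered from a single encoded scene, which is what makes a single feed-forward pass sufficient.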

We show that this method outperforms recent baselines in terms of PSNR and speed on synthetic datasets, including a new dataset created for the paper. Further, we demonstrate that SRT scales to support interactive visualization and semantic segmentation of real-world outdoor environments using Street View imagery.
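For reference, PSNR is derived from the mean squared error between a rendered image and the ground truth. The helper below is a minimal sketch assuming images scaled to [0, 1]; the function name `psnr` and the `max_val` parameter are illustrative and not taken from the paper's code.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in the range [0, max_val]."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))


# A uniform error of 0.1 gives an MSE of 0.01, i.e. roughly 20 dB.
example_psnr = psnr(np.full((4, 4, 3), 0.1), np.zeros((4, 4, 3)))
```

Higher values indicate a closer match to the ground-truth view.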


An independent third-party implementation of SRT exists. We have checked the implementation, and it appears to match our results. Please feel free to reach out with questions on implementation details.


Our MultiShapeNet (MSN) dataset is publicly available.

You can read the data directly from data_dir='gs://kubric-public/tfds'. However, for best performance, it's recommended to copy the data locally with gsutil:

mkdir -p "$DATA_DIR"
gsutil -m cp -r gs://kubric-public/tfds/multi_shapenet_frames/ "$DATA_DIR"

Once downloaded, you can plug the data directly into your model using SunDs:

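import sunds

# Read directly from the public bucket, or point data_dir at your local copy.
builder = sunds.builder('multi_shapenet', data_dir='gs://kubric-public/tfds')
print(builder.frame_builder.info)  # Print structure of the dataset

# The split and task arguments were truncated in this copy; the values below
# are assumptions, so check the SunDs documentation before relying on them.
ds = builder.as_dataset(
    split='train',
    task=sunds.tasks.Nerf(yield_mode='stacked'),
)
for ex in ds.as_numpy_iterator():  # Convert TF -> numpy
    # Each example is a scene containing 10 images
    ray_origins = ex['ray_origins']  # f32[10 128 128 3]
    ray_directions = ex['ray_directions']  # f32[10 128 128 3]
    color_images = ex['color_images']  # ui8[10 128 128 3]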
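As an illustration of how one such scene might be prepared for training, the sketch below splits the 10 views into input views for the encoder and per-pixel target rays for the decoder. The helper `split_scene` and the choice of 5 input views are assumptions made for this example, and the scene here is synthetic data with the shapes listed above rather than real dataset content.

```python
import numpy as np

KEYS = ("ray_origins", "ray_directions", "color_images")


def split_scene(ex, num_input_views=5):
    """Split one scene into encoder inputs and flattened decoder targets."""
    # Input views keep their image layout: [V, H, W, 3].
    inputs = {k: ex[k][:num_input_views] for k in KEYS}
    # Target views are flattened to one row per pixel/ray: [V * H * W, 3].
    targets = {k: ex[k][num_input_views:].reshape(-1, 3) for k in KEYS}
    # Convert uint8 colors to float32 in [0, 1] for the reconstruction loss.
    targets["color_images"] = targets["color_images"].astype(np.float32) / 255.0
    return inputs, targets


# Synthetic stand-in for one dataset example with the documented shapes.
rng = np.random.default_rng(0)
ex = {
    "ray_origins": rng.normal(size=(10, 128, 128, 3)).astype(np.float32),
    "ray_directions": rng.normal(size=(10, 128, 128, 3)).astype(np.float32),
    "color_images": rng.integers(0, 256, size=(10, 128, 128, 3), dtype=np.uint8),
}

inputs, targets = split_scene(ex)
```

In practice one would typically sample a random subset of the target rays per training step rather than use all of them.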

Inspect the dataset in our interactive Colab.


@article{srt22,
  title={{Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations}},
  author={Mehdi S. M. Sajjadi and Henning Meyer and Etienne Pot and Urs Bergmann and Klaus Greff and Noha Radwan and Suhani Vora and Mario Lucic and Daniel Duckworth and Alexey Dosovitskiy and Jakob Uszkoreit and Thomas Funkhouser and Andrea Tagliasacchi},
  journal={{CVPR}},
  year={2022},
  url={},
}