Dramatic and rapid changes to the global economy are required in order to limit climate-related risks for natural and human systems (IPCC, 2018). Governmental interventions are needed to fight climate change and they need strong public support. However, it is difficult to mentally simulate the complex effects of climate change (O’Neill & Hulme, 2009) and people often discount the impact that their actions will have on the future, especially if the consequences are long-term, abstract, and at odds with current behaviors and identities (Marshall, 2015).
Currently we are focusing on simulating images of one specific extreme climate event: floods. We aim to create a flood simulator which, given a user-entered address, is able to extract a street view image of the surroundings and to alter it to generate a plausible image projecting a flood where it is likely to occur.
Recent research has explored the potential of translating numerical climate models into representations that are intuitive and easy to understand, for instance via climate-analog mapping (Fitzpatrick et al., 2019) and by leveraging relevant social group norms (van der Linden, 2015). Other approaches have focused on selecting relevant images to best represent climate change impacts (Sheppard, 2012; Corner & Clarke, 2016) as well as using artistic renderings of possible future landscapes (Giannachi, 2012) and even video games (Angel et al., 2015). However, to our knowledge, our project is the first application of generative models to generate images of future climate change scenarios.
We propose to use Style Transfer, and more specifically Unsupervised Image-to-Image Translation techniques, to learn a transformation from a natural image of a house to its flooded version. These techniques can leverage large quantities of cheap-to-acquire, unannotated images.
Image-to-Image Translation: A class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs.
Style Transfer: Aims to modify the style of an image while preserving its content; a representative approach is CycleGAN (Zhu et al., 2017).
Let $x \in X$ and $y \in Y$ be images from two different image domains. $X$ represents the non-flooded domain, which gathers several types of street-level imagery defined later in the data section, and $Y$ is the flooded domain, composed of images where part of a single house or building is visible and the street is partially or fully covered by water.
In the unpaired image-to-image translation setting, we are given samples drawn from two marginal distributions $p(x)$ and $p(y)$: samples $x$ of (non-flooded) houses and samples $y$ of flooded houses, without access to the joint distribution $p(x, y)$.
From a probability theory viewpoint, the key challenge is to learn the joint distribution $p(x, y)$ while only observing the marginals $p(x)$ and $p(y)$. Unfortunately, there is an infinite set of joint distributions that correspond to the given marginal distributions (cf. coupling theory). Inferring the joint distribution from the marginals is a highly ill-defined problem. Assumptions are required to constrain the structure of the joint distribution, such as those introduced by the authors of CycleGAN.
In our case, we estimate the complex conditional distribution $p(y \mid x)$ with different image-to-image translation models $G: X \rightarrow Y$, where $\hat{y} = G(x)$ is a sample produced by translating $x$ to the flooded domain $Y$.
CycleGAN is one of the research papers that revolutionized image-to-image translation in an unpaired setting. It has been used as the first proof of concept for this project.
It aims to capture the style of one image collection and to learn how to apply it to the other image collection. There are two main constraints that ensure the conversion yields a coherent transformation:
The indistinguishable constraint: The produced output has to be indistinguishable from the samples of the new domain. This is enforced using the GAN loss (Goodfellow et al., 2014) and is applied at the distribution level. In our case, the mapping of the non-flooded domain $X$ to the flooded domain $Y$ should create images that are indistinguishable from the training images of floods, and vice-versa (Galanti et al., 2017). But this constraint alone is not enough to map an input image in domain $X$ to an output image in domain $Y$ with the same semantics: the network could learn to generate realistic images from domain $Y$ without preserving the content of the input image. This latter point is tackled by the cycle-consistency constraint.
The Cycle-Consistency Constraint aims to regularize the mapping between the two domains in a meaningful way. It can be described as imposing a structural constraint which states that if we translate from one domain to the other and back again, we should arrive where we started. Formally, if we have a translator $G: X \rightarrow Y$ and another translator $F: Y \rightarrow X$, then $G$ and $F$ should be bijections and inverses of each other.
(a) The CycleGAN model contains two mapping functions $G: X \rightarrow Y$ and $F: Y \rightarrow X$, and associated adversarial discriminators $D_Y$ and $D_X$. $D_Y$ encourages $G$ to translate $X$ into outputs indistinguishable from domain $Y$, and vice versa for $D_X$ and $F$.
(b) Forward cycle-consistency loss: $x \rightarrow G(x) \rightarrow F(G(x)) \approx x$
(c) Backward cycle-consistency loss: $y \rightarrow F(y) \rightarrow G(F(y)) \approx y$
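As a minimal sketch (not the official CycleGAN implementation), the two constraints can be combined into a generator objective as follows. `G` and `F_gen` are assumed to be PyTorch generators for the $X \rightarrow Y$ and $Y \rightarrow X$ directions, `D_X` and `D_Y` the corresponding discriminators; a binary cross-entropy adversarial term stands in for the least-squares loss used in the original paper:

```python
import torch
import torch.nn.functional as F

def cyclegan_generator_loss(G, F_gen, D_X, D_Y, x, y, lambda_cyc=10.0):
    """Generator-side objective for one batch: adversarial
    ("indistinguishable") terms + cycle-consistency terms.
    x: batch of non-flooded images, y: batch of flooded images."""
    fake_y = G(x)          # non-flooded translated to flooded
    fake_x = F_gen(y)      # flooded translated to non-flooded

    # Indistinguishable constraint: fool the discriminator of the target domain.
    pred_fake_y = D_Y(fake_y)
    pred_fake_x = D_X(fake_x)
    adv = F.binary_cross_entropy_with_logits(pred_fake_y, torch.ones_like(pred_fake_y)) \
        + F.binary_cross_entropy_with_logits(pred_fake_x, torch.ones_like(pred_fake_x))

    # Cycle-consistency constraint: F(G(x)) should recover x, G(F(y)) should recover y.
    cyc = F.l1_loss(F_gen(fake_y), x) + F.l1_loss(G(fake_x), y)

    return adv + lambda_cyc * cyc
```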
Pros and cons: The most advantageous part of this approach is its total lack of supervision, which means that data is cheap to acquire (about 1K images of non-flooded and flooded houses). The major problem is that the style transfer is applied to the entire image.
Initial Results: When the ground is not concrete but grass and vegetation, CycleGAN generates a brown flood of low quality, with blur on the edges between houses and grass. The color of the sky changes from blue to grey (probably because of a bias in the training set of flood images).
The InstaGAN architecture is built on the foundations of CycleGAN. The main idea of their approach is to incorporate instance attributes $a$ (and $b$) into the source domain $X$ (and the target domain $Y$) to improve the image-to-image translation. They describe their approach as learning joint mappings between attribute-augmented spaces $X \times A$ and $Y \times B$.
In our setting, the set of instance attributes is reduced to a single attribute: for the non-flooded domain, a segmentation mask of where to flood, and for the flooded domain, a segmentation mask covering the flood. Each network is designed to encode both an image and a set of masks (in our case a single mask).
The authors explicitly state that any useful information could be incorporated as an attribute and claim that their approach disentangles the different instances within the image, allowing the generator to perform an accurate and detailed translation from one instance to the other.
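A simplified sketch of the attribute-augmented input, assuming the mask is simply stacked as an extra channel before encoding (the actual InstaGAN architecture uses separate image and instance encoders whose features are fused; the class below is only illustrative):

```python
import torch
import torch.nn as nn

class ImageMaskEncoder(nn.Module):
    """Illustrative encoder that consumes an image together with its single
    where-to-flood mask by stacking the mask as a fourth input channel."""

    def __init__(self, out_channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3 + 1, out_channels, kernel_size=7, padding=3),
            nn.InstanceNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, image, mask):
        # image: (B, 3, H, W), mask: (B, 1, H, W) binary segmentation
        return self.conv(torch.cat([image, mask], dim=1))
```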
Pros and cons: Like CycleGAN, InstaGAN doesn't need paired images, but it requires knowledge of some attributes, here the masks. Sometimes the model is able to render water in a realistic manner, including reflections and texture. However, a major drawback is that, although this is penalized during training, it keeps modifying the rest of the image (the unmasked region): colors change, artifacts appear, textures differ, and fine details are blurred.
Results: Empirically, we find that the model works well with grass but not with concrete. Transparency is a big issue with InstaGAN's results on our task, since most of the time we can see the road lanes through the flood. Even in synthetic settings with aligned images, InstaGAN generates relatively realistic water texture that remains transparent. It seems to learn to reflect the sky on the water (whatever the color of the sky), with the result that it sometimes paints blue on the concrete itself without the accompanying water texture. In our case, the quality of the results worsens dramatically outside the training set.
Note: The instances used in the paper are either segmentation masks of animals (e.g., translating sheep to giraffes) or segmentation masks of clothes (e.g., translating pants to skirts). In both cases, we found that these instances are less diverse than the instances in our non-flooded-to-flooded translation, in the sense that sheep color, shape, and texture vary less than the examples of floods or streets in our dataset.
Previous approaches based on modifications of CycleGAN do not give us fine control over the region that should be flooded. Assuming we are able to identify such a region in the image, we would only need to learn how to render water realistically. There are many promising image editing techniques in the GAN literature demonstrating how to edit specific attributes, morph images, or manipulate semantics. These transformations are often performed in the small latent space of generated (fake) images. However, editing natural images is much harder, and there is no easy way of manipulating the semantics of natural images.
Image Inpainting is the technique of modifying and restoring a damaged image in a visually plausible way. Given recent advances in the field, it is now possible to guess lost information and replace it with plausible content at real-time speed.
For example, DeepFill, a recent deep generative model exploiting contextual attention, is able to reconstruct high-definition altered images of faces and landscapes at real-time speed. We believe there is a way to leverage the network's generation capacity and apply its mechanisms to our case. Our experiment consists in biasing DeepFill to reconstruct only regions where there is water (without surrounding water as context). We trained the network with several hundred images of floods where the water was replaced by a grey mask. At inference time, we replaced what we defined as the ground with a grey mask (see results below).
Results: The quality of the results is poor when given large masks. This could be explained by the fact that the architecture is designed to extract information from the context in the image: in the experiment above, the network had to draw from a context where water is nonexistent. To pursue research in this direction, one may want to give the network better context, for example by placing samples of water texture on the side of the image, or by increasing the number of training images containing water.
Our current approach is built on MUNIT. In the paper, a partially shared latent space assumption is made: images can be disentangled into a content code (domain-invariant) and a style code (domain-dependent). Under this assumption, each image is generated from a content latent code that is shared by both domains, and a style latent code that is specific to the individual domain.
In other words, a pair of corresponding images $(x_1, x_2)$ from the joint distribution is assumed to be generated by $x_1 = G^*_1(c, s_1)$ and $x_2 = G^*_2(c, s_2)$, where $c$, $s_1$, $s_2$ are from some prior distributions and $G^*_1$, $G^*_2$ are the underlying generators. Given this hypothesis, the goal is to learn the underlying generator and encoder functions with neural networks.
Image-to-image translation is performed by swapping encoder-decoder pairs. For example, to translate a house $x_1$ to a flooded house $x_{1 \rightarrow 2}$, one may use MUNIT to first extract the content latent code $c_1$ of the house image that we want to flood, randomly draw a style latent code $s_2$ from the prior distribution of flooded houses, and then use the flooded-domain decoder $G_2$ to produce the final output image $x_{1 \rightarrow 2} = G_2(c_1, s_2)$ (content from $x_1$ and style from $s_2$).
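A sketch of that swap, with placeholder names for the content encoder, style encoder and decoder (they do not match the identifiers of the official MUNIT code base):

```python
import torch

def translate_house_to_flooded(x_house, content_encoder_house, style_encoder_flood,
                               decoder_flood, y_flood_example=None, style_dim=8):
    """Sketch of MUNIT-style translation by swapping encoder/decoder pairs."""
    c = content_encoder_house(x_house)                 # domain-invariant content code
    if y_flood_example is not None:
        # extract the style from a real flooded image (our deterministic variant)
        s = style_encoder_flood(y_flood_example)
    else:
        # or sample a style code from the Gaussian prior (original MUNIT behaviour)
        s = torch.randn(x_house.size(0), style_dim, 1, 1, device=x_house.device)
    return decoder_flood(c, s)                         # content from x, style from s
```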
Huang et al. demonstrated that Instance Normalization is deeply linked to style normalization. MUNIT transfers the style by modifying the feature statistics, de-normalizing them in a specific way. Given an input batch $x$, Instance Normalization layers are used in MUNIT's encoders to normalize feature statistics:

$$\mathrm{IN}(x) = \gamma \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \beta$$

where $\mu(x)$ and $\sigma(x)$ are computed as the mean and standard deviation across spatial dimensions independently for each channel and each sample. Adaptive Instance Normalization layers are then used in the decoder to de-normalize the feature statistics:

$$\mathrm{AdaIN}(z, \gamma, \beta) = \gamma \left( \frac{z - \mu(z)}{\sigma(z)} \right) + \beta$$

with $\gamma$ and $\beta$ produced by a multi-layer perceptron (MLP), i.e., $[\gamma; \beta] = \mathrm{MLP}(s)$, with $s$ the style code. The fact that the de-normalization parameters are inferred with an MLP allows users to generate multiple outputs from one image.
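A minimal sketch of these two operations, assuming feature maps of shape (batch, channels, height, width); the MLP layer sizes are illustrative, not the ones used in MUNIT:

```python
import torch
import torch.nn as nn

def adain(z, gamma, beta, eps=1e-5):
    """Adaptive Instance Normalization: normalize the content features z
    per channel and per sample, then de-normalize with (gamma, beta)
    predicted from the style code."""
    mu = z.mean(dim=(2, 3), keepdim=True)
    sigma = z.std(dim=(2, 3), keepdim=True)
    return gamma * (z - mu) / (sigma + eps) + beta

class StyleMLP(nn.Module):
    """Maps a style code s to the AdaIN parameters [gamma; beta]."""
    def __init__(self, style_dim=8, num_features=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(style_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 2 * num_features),
        )

    def forward(self, s):
        gamma, beta = self.net(s).chunk(2, dim=1)
        # reshape to broadcast over the spatial dimensions of the feature map
        return gamma[..., None, None], beta[..., None, None]
```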
We questioned and transformed the official MUNIT architecture to fit our purpose.
Because we wanted control over the translation, we removed randomness from the style: the network is then trained to perform style transfer with the style extracted from one image and not sampled from a normal distribution.
After analyzing the feature space of the style with a t-SNE plot, we decided that sharing the weights between the style encoders could help the network extract informative features. Since the results were not affected by this change, we kept it. (See Experiment)
We shrank the architecture to use a single auto-encoder and concluded that it either took longer to converge or that the transformation was harder to learn, since the results were negatively affected. (See Experiment)
Because the flooding process is destructive and there is no reason the network could reconstruct the road from its flooded version, we implemented a weaker version of the cycle-consistency loss in which the loss is only computed on a specific region of the image. This region is defined by a binary mask of where we think the image should be altered. For example, a flooded image mapped to a non-flooded house should only be altered in an area close to the one delimited by the water. (In practice there are biases intrinsic to the dataset, such as the sky often being gray in flood images.) A sketch of this masked loss is given below. (See Experiment)
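A minimal sketch of this masked loss, under the reading that the reconstruction penalty is simply zeroed out inside the region expected to change (the function and argument names are illustrative):

```python
import torch.nn.functional as F

def weak_cycle_loss(x, x_cycled, alter_mask):
    """Relaxed cycle-consistency: the reconstruction F(G(x)) ~ x is only
    enforced outside the region expected to be altered by the flood
    (alter_mask is 1 where the image is allowed to change, 0 elsewhere),
    since the content hidden under the water cannot be recovered."""
    keep = 1.0 - alter_mask
    return F.l1_loss(x_cycled * keep, x * keep)
```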
We trained a classifier to distinguish between flooded and non-flooded images (binary output), then used it when training MUNIT with an additional loss on the generator indicating that fake flooded (resp. non-flooded) images should be identified as flooded (resp. non-flooded) by the classifier. It did not improve our results, as if the flood classifier were a very weak discriminator that the generator could trick easily. (See Experiment)
To push the style encoder towards learning meaningful information, we investigated how to anonymize the representation of the content features learned by the MUNIT encoder. The underlying idea is that if the content feature doesn't contain information about the domain it was encoded from, then the style must encode this information. We therefore minimized the mutual information between the content feature and the domain of origin of the content. To do so, we used a domain classifier, as in Learning Anonymized Representations with Adversarial Neural Networks.
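One common way to implement such an adversarial domain classifier is with a gradient reversal layer, sketched below; this is an assumption about the implementation, not a description of our exact code:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, flips the gradient sign in the backward
    pass, so the content encoder is trained to fool the domain classifier."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DomainClassifier(nn.Module):
    """Predicts whether a content code comes from the flooded or the
    non-flooded domain; trained with cross-entropy while the encoder
    receives reversed gradients."""
    def __init__(self, content_channels=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(content_channels, 2),
        )

    def forward(self, content_code, lambd=1.0):
        return self.head(GradReverse.apply(content_code, lambd))
```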
We experimented with the training ratio of the discriminator and the generator, and empirically found that a factor of 5 slightly improves convergence speed.
Major changes: we introduced a Semantic Consistency Loss. We use DeepLab v2 trained on Cityscapes to infer semantic labels and implemented an additional loss indicating to the generator that every fake image should keep the same semantics as the source image everywhere except a defined region where we think there should be an alteration. This modification dramatically improved our results; a sketch is given below. (See Experiment)
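A sketch of one possible formulation of this semantic consistency term, assuming a frozen segmentation network that returns per-pixel class logits at the input resolution and a binary mask of the region allowed to change:

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(seg_net, x_real, x_fake, alter_mask):
    """Penalize changes in the predicted semantic map everywhere except
    the region where an alteration is expected.
    seg_net: frozen segmentation network (e.g. DeepLab v2).
    alter_mask: (B, 1, H, W) binary mask, 1 where alteration is allowed."""
    with torch.no_grad():
        target = seg_net(x_real).argmax(dim=1)          # pseudo ground-truth labels (B, H, W)
    logits = seg_net(x_fake)                            # (B, C, H, W), gradients flow to the generator
    per_pixel = F.cross_entropy(logits, target, reduction="none")   # (B, H, W)
    keep = (1.0 - alter_mask).squeeze(1)                # only count pixels outside the altered region
    return (per_pixel * keep).sum() / keep.sum().clamp(min=1.0)
```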
We also experimented with DeepLab v2 trained on COCO-Stuff. We thought this version would better suit our problem because it is able to identify water on the road, but it turned out that (maybe because of the large number of classes) it did not constrain the network as much as the previous version. We also tried to merge the COCO-Stuff classes to keep only meta-classes similar to the Cityscapes ones, which would allow us to keep a small number of classes while retaining the ability to identify water (impossible with the Cityscapes classes). (See Results)
We plan to use a simulated world built by Vahe Vardanyan with the Unity graphics engine to simulate different types of houses and streets under flood conditions, to help our GAN understand where it should flood.
One main advantage of using synthetic data is that, theoretically, we would have access to an unlimited number of pairs. The principal difficulty lies in leveraging those pairs despite the existing discrepancy between the distributions of synthetic and real data.
We can visualize the discrepancies between the different domains with a t-SNE plot. Learning to flood natural images is equivalent to adapting samples from the real non-flooded domain to the real flooded domain, and we would like to help the network learn this translation with an easier task: translating from the synthetic non-flooded domain to the synthetic flooded domain. Indeed, probably because of its paired nature, the gap separating the synthetic domains is smaller than the one separating the real domains. We also notice that some of the real data are mixed with the synthetic cluster, which suggests that the synthetic world imitates the real world well.
We mix simulated data with their natural equivalents (synthetic flooded images with real flooded images) at training time, with an additional pixelwise reconstruction loss computed on the pixels that should not be altered:

$$\mathcal{L}_{\text{pair}} = \left\| \text{mask} \odot \left( G(x_{\text{sim}}) - y_{\text{sim}} \right) \right\|$$

where mask corresponds to the region of pixels where $x_{\text{sim}}$ and $y_{\text{sim}}$ are paired, in our case, where there is no water.
We summarized our attempt to establish an automated evaluation metric to quantify fake image realism in the following paper. Our work consisted in adapting several existing metrics (IS, FID, KID, etc.) and assessing them against the gold-standard human evaluation, HYPE. While insufficient alone to establish a human-correlated automatic evaluation metric, we believe this work begins to bridge the gap between human and automated generative evaluation procedures.
We set a goal of collecting about 1,000 images in each domain, meeting a number of criteria.
Flooded houses: images should show part of a single house or building, with the street partially or fully covered by water.
These images have been gathered using the results of different Google Image queries focusing on North American suburban-style houses.
Non-flooded houses are a mix of several types of images:
Motivated by the idea that it would be easier to perform image-to-image translation if our GAN had a notion of the concepts of ground and water, we enriched the dataset by annotating the pixels corresponding to water in the flooded-house images and those corresponding to the ground in the non-flooded-house images.
To create our simulated world, we used the Unity 3D game engine (version 2018.2.21f1). We created different types of buildings (skyscrapers, individual houses, industrial buildings) in the virtual world, combined with attributes of urban and rural environments: roads, trees, cars, mountains, vegetation, etc.
For each shot captured in the simulated world, we extract:
See a more detailed description of each of these
We currently have 11k such captures. Note that each scene has been captured multiple times, slightly varying the camera height, pitch, and position.
One of the challenges is to match the level of the generated flood to climate predictions, or at least to a plausible water level. In the aforementioned approaches, no notion of the geometry of the scene is included. We propose to introduce height information in our model in order to generate and respect the geometry of the scene.
Images obtained using masks based on semantic segmentation (ground -> water) in the weak cycle-consistency constraint present realistic water texture, but lack physical realism, with water covering only the pixels corresponding to the ground and going around the wheels of cars, for example.
There are two main ways we can think of to include height information in our model and condition the GAN on water height:
Here we present the two types of approaches that we have tried so far to estimate height in street-scene images: "geometrical approaches", which consist in recovering the 3D metric geometry of the scene, and "end-to-end approaches", which consist in predicting height directly from an image input. The problem of estimating height from a single-view image has not been much explored in the literature, and there is no dataset of street scenes with associated height maps directly available.
There are three steps in the geometric approaches to recover height:
1) Recover the 3D geometry of the scene (not metric)
2) Match relative coordinates to metric system in some way
3) Create binary masks of the areas to be flooded using a metric threshold
The pinhole camera model can be applied to recover 3D coordinates of each pixel of the image using depth maps and camera parameters.
Camera parameters considered in our case are the field of view and the pitch (roll and yaw are assumed to be ~0), and we assume that pixels are square and that the optical center of the camera is in the middle of the image.
Pseudo relative-depth information is obtained using MegaDepth.
We can then derive the 3D geometry of the scene, and assuming that the water level of the floods corresponds to horizontal planes, we can extract the masks by taking all pixels with height values below a certain fixed threshold.
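A sketch of this back-projection and thresholding step, assuming a simple pinhole model with square pixels, a centered principal point, and heights expressed relative to the camera (function name and sign conventions are illustrative):

```python
import numpy as np

def flood_mask_from_depth(depth, fov_h_deg, pitch_deg, water_level_m, scale=1.0):
    """Back-project every pixel with the pinhole model, compensate the camera
    pitch, and mark for flooding each point whose height is below `water_level_m`.
    depth: (H, W) relative depth map (e.g. from MegaDepth), scaled to meters
    by `scale`; pitch_deg > 0 means the camera looks down; water_level_m is
    the water surface height relative to the camera (flood depth minus
    camera height, usually negative)."""
    h, w = depth.shape
    f = (w / 2.0) / np.tan(np.deg2rad(fov_h_deg) / 2.0)   # focal length in pixels
    v = np.arange(h).reshape(-1, 1) - h / 2.0             # vertical pixel offset from the centre
    z = depth * scale                                     # metric depth
    y = v * z / f                                         # camera-frame "down" coordinate
    p = np.deg2rad(pitch_deg)
    height = -(y * np.cos(p) + z * np.sin(p))             # vertical height w.r.t. the camera
    return height < water_level_m                         # binary where-to-flood mask
```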
Limitations:
One way of recovering metric scaling is to find reference objects that are often present in the dataset and whose dimensions are known: vehicles and pedestrians. We used the 3D bounding box detection model of Mousavian et al. to detect cars, trucks, and pedestrians. There are two steps in this method: first, perform 2D bounding box detection, and then use CNNs to regress the 3D parameters and dimensions. We used a model pretrained on the KITTI dataset, hence the choice of the classes we kept as reference objects. When applying this model to our street scene images, we noticed that objects were sometimes misclassified, especially if only part of a vehicle was visible (this 3D bounding box model does not extend outside the frame of the picture). Hence, we decided to perform semantic segmentation on the whole input image and checked that the labels from the segmentation matched the detections. We used this DeepLab segmentation model, which was trained on Cityscapes. Because there were discrepancies between KITTI labels and Cityscapes labels, we only kept the classes truck, car, and pedestrian.
We match the relative coordinates of the endpoints of the vertical edges of the bounding box to the metric height of the detected objects. We assume the objects rest on the ground (which is almost always the case) and match the bottom of the vertical edges to the zero level (ground).
If no reference object is found, we output a mask that corresponds to the ground class in the semantic segmentation.
Limitations
Reference objects are not always present in the images, and when they are, if the object is far away, the scaling from relative height to metric height might be very unreliable. This method seems to work well when there are multiple objects and when they are more or less in a plane facing the camera (at similar depths).
In this method, we do not rely on reference objects but rather on geometry. We assume that camera height is known (this is reasonable for street-level images extracted from known platforms) and that the ground is flat.
We scale the relative coordinates to metric units using the ground as a reference. To do so, we take the middle column of the image (right in the camera axis) and pick two pixels that were segmented as belonging to the ground. We consider that both should be at height 0 (flat-ground assumption).
In this graphic, $h_{\text{cam}}$ is the height of the camera, and $\theta$ is half of the vertical field of view.
The matching to the metric scale is done as follows (referring to the graphic above):
take 2 points segmented as ground in the image (red dots on the projection plane)
the angles $\alpha_1$ and $\alpha_2$ can be obtained using the y (vertical axis) pixel location in the image. Let $H$ be the height of the image in pixels and $Y$ the distance in pixels from the middle of the image to the red pixel corresponding to the angle $\alpha$; then $\tan(\alpha) = \frac{Y}{H/2}\,\tan(\theta)$.
From there we can compute the yellow distance in meters:

$$d = h_{\text{cam}} \left( \frac{1}{\tan(\alpha_1)} - \frac{1}{\tan(\alpha_2)} \right)$$
We match this distance $d$ to the difference of the relative depths of the two points to get the scaling from relative units to meters (see the sketch below).
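The corresponding computation, sketched with illustrative names and assuming a pitch of roughly zero, a flat ground, and two ground pixels below the image centre:

```python
import numpy as np

def metric_scale_from_ground(h_cam, fov_v_deg, H, y1, y2, rel_depth1, rel_depth2):
    """Estimate the relative-to-metric scale factor from two ground pixels
    on the central column of the image.
    y1, y2: pixel distances below the image centre (both > 0);
    rel_depth1, rel_depth2: MegaDepth relative depths at those pixels;
    h_cam: camera height in meters; H: image height in pixels."""
    f = (H / 2.0) / np.tan(np.deg2rad(fov_v_deg) / 2.0)       # focal length in pixels
    alpha1, alpha2 = np.arctan(y1 / f), np.arctan(y2 / f)     # angles below the optical axis
    d1, d2 = h_cam / np.tan(alpha1), h_cam / np.tan(alpha2)   # metric distances along the ground
    return abs(d1 - d2) / abs(rel_depth1 - rel_depth2)        # meters per relative-depth unit
```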
Limitations
The assumption that the ground is flat does not always hold. While the images we tested our approach on always contained pixels labelled as ground in the central vertical column, it could happen that, because of vehicles, vegetation, or other objects, there are none. The method also relies on only two pixels, and additional work would be needed to determine which ground pixels to choose (they shouldn't be too close to the front of the image because the ground is sometimes warped and "stretched" in front of the camera, etc.).
Once we have a correspondence between relative and metric coordinates, we can take all the pixels with height less than a threshold specified in meters and generate binary masks of these.
We added the option of merging the mask corresponding to the ground segmentation with the mask obtained with the geometrical approach. The underlying idea is that the depth maps obtained using MegaDepth can sometimes be unreliable for elements that are far away, but we would still like to flood the ground there. This holds only if the difference in ground heights is not too big (no big slope), to satisfy the condition that the flood surface should be a horizontal plane.
One of the challenges in height estimation is that there is no ground truth dataset of height maps for street-level images. However, ground truth height maps can be obtained from simulated data. Beyond building a ground truth dataset, we propose to leverage data from our simulator and train a height estimation model from single-view images. Indeed, images of houses flooded to any chosen height can be generated in the simulator. While methods for single-image depth estimation have been investigated for many years, we have found no work on height estimation from street-scene images. However, these two problems have some similarities, so we took inspiration from depth estimators to build our height estimator.
Following the success of MegaDepth on the task of single-view depth estimation, we chose to train an hourglass network to predict metric height maps from street-level input images.
Architecture of the hourglass network (image modified from here). Each block is a modified inception module, except the H blocks, which are 3x3 convolutions.
For our first model, we chose an L2 loss with a mask on the sky.
We tried this approach separately on data from our simulator and on data obtained with the CARLA simulator. One of the main next steps would be to try this approach on real images. You can find more details about the model here.