Dramatic and rapid changes to the global economy are required in order to limit climate-related risks for natural and human systems (IPCC, 2018). Governmental interventions are needed to fight climate change and they need strong public support. However, it is difficult to mentally simulate the complex effects of climate change (O’Neill & Hulme, 2009) and people often discount the impact that their actions will have on the future, especially if the consequences are long-term, abstract, and at odds with current behaviors and identities (Marshall, 2015).
Currently we are focusing on simulating images of one specific extreme climate event: floods. We aim to create a flood simulator which, given a user-entered address, is able to extract a street view image of the surroundings and to alter it to generate a plausible image projecting a flood where it is likely to occur.
Recent research has explored the potential of translating numerical climate models into representations that are intuitive and easy to understand, for instance via climate-analog mapping (Fitzpatrick et al., 2019) and by leveraging relevant social group norms (van der Linden, 2015). Other approaches have focused on selecting relevant images to best represent climate change impacts (Sheppard, 2012; Corner & Clarke, 2016) as well as using artistic renderings of possible future landscapes (Giannachi, 2012) and even video games (Angel et al., 2015). However, to our knowledge, our project is the first application of generative models to generate images of future climate change scenarios.
We propose to use Style Transfer, and more specifically Unsupervised Image-to-Image Translation techniques, to learn a transformation from a natural image of a house to its flooded version. These techniques can leverage large quantities of cheap-to-acquire, unannotated images.
Image-to-Image Translation: A class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs.
Style Transfer: Aims to modify the style of an image while preserving its content; a representative approach is CycleGAN (Zhu et al., 2017).
Let $x \in X$ and $y \in Y$ be images from two different image domains. $X$ represents the non-flooded domain, which gathers several types of street-level imagery defined later in the data section, and $Y$ is the flooded domain, composed of images where part of a single house or building is visible and the street is partially or fully covered by water.
In the unpaired image-to-image translation setting, we are given samples drawn from two marginal distributions $p(x)$ and $p(y)$: samples $x$ of (non-flooded) houses and samples $y$ of flooded houses, without access to the joint distribution $p(x, y)$.
From a probability theory viewpoint, the key challenge is to learn the joint distribution $p(x, y)$ while only observing the marginals $p(x)$ and $p(y)$. Unfortunately, there is an infinite set of joint distributions that correspond to the given marginal distributions (cf. coupling theory). Inferring the joint distribution from the marginals is a highly ill-defined problem. Assumptions are required to constrain the structure of the joint distribution, such as those introduced by the authors of CycleGAN.
In our case, we estimate the complex conditional distribution $p(y \mid x)$ with different image-to-image translation models $G: X \rightarrow Y$, where $\hat{y} = G(x)$ is a sample produced by translating $x$ to the flooded domain $Y$.
CycleGAN is one of the research papers that revolutionized image-to-image translation in an unpaired setting. It has been used as the first proof of concept for this project.
It aims to capture the style of one image collection and to learn how to apply it to the other image collection. There are two main constraints that ensure the conversion yields a coherent transformation:
The indistinguishable constraint: The produced output has to be indistinguishable from the samples of the new domain. This is enforced using the GAN loss (Goodfellow et al., 2014) and is applied at the distribution level. In our case, the mapping of the non-flooded domain $X$ to the flooded domain $Y$ should create images that are indistinguishable from the training images of floods, and vice-versa (Galanti et al., 2017). But this constraint alone is not enough to map an input image in domain $X$ to an output image in domain $Y$ with the same semantics: the network could learn to generate realistic images from domain $Y$ without preserving the content of the input image. This latter point is tackled by the cycle-consistency constraint.
The Cycle-Consistency Constraint aims to regularize the mapping between the two domains in a meaningful way. It can be described as imposing a structural constraint which states that if we translate from one domain to the other and back again, we should arrive where we started. Formally, if we have a translator $G: X \rightarrow Y$ and another translator $F: Y \rightarrow X$, then $G$ and $F$ should be bijections and inverses of each other.
(a) The CycleGAN model contains two mapping functions $G: X \rightarrow Y$ and $F: Y \rightarrow X$, and associated adversarial discriminators $D_Y$ and $D_X$. $D_Y$ encourages $G$ to translate $X$ into outputs indistinguishable from domain $Y$, and vice versa for $D_X$ and $F$.
(b) Forward cycle-consistency loss: $x \rightarrow G(x) \rightarrow F(G(x)) \approx x$
(c) Backward cycle-consistency loss: $y \rightarrow F(y) \rightarrow G(F(y)) \approx y$
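As a minimal sketch (not the official CycleGAN implementation), the two constraints can be combined into a generator objective as follows. `G` and `F_gen` are assumed to be PyTorch generators for the $X \rightarrow Y$ and $Y \rightarrow X$ directions, `D_X` and `D_Y` the corresponding discriminators; a binary cross-entropy adversarial term stands in for the least-squares loss used in the original paper:

```python
import torch
import torch.nn.functional as F

def cyclegan_generator_loss(G, F_gen, D_X, D_Y, x, y, lambda_cyc=10.0):
    """Generator-side objective for one batch: adversarial
    ("indistinguishable") terms + cycle-consistency terms.
    x: batch of non-flooded images, y: batch of flooded images."""
    fake_y = G(x)          # non-flooded translated to flooded
    fake_x = F_gen(y)      # flooded translated to non-flooded

    # Indistinguishable constraint: fool the discriminator of the target domain.
    pred_fake_y = D_Y(fake_y)
    pred_fake_x = D_X(fake_x)
    adv = F.binary_cross_entropy_with_logits(pred_fake_y, torch.ones_like(pred_fake_y)) \
        + F.binary_cross_entropy_with_logits(pred_fake_x, torch.ones_like(pred_fake_x))

    # Cycle-consistency constraint: F(G(x)) should recover x, G(F(y)) should recover y.
    cyc = F.l1_loss(F_gen(fake_y), x) + F.l1_loss(G(fake_x), y)

    return adv + lambda_cyc * cyc
```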
Pros and cons: The most advantageous part of this approach is its total lack of supervision, which means that data is cheap to acquire (about 1K images of non-flooded and flooded houses). The major problem is that the style transfer is applied to the entire image.
Initial Results: When the ground is not concrete but grass and vegetation, CycleGAN generates a brown flood of low quality, with blur on the edges between houses and grass. The color of the sky changes from blue to grey (probably because of a bias in the training set of flood images).
The InstaGAN architecture is built on the foundations of CycleGAN. The main idea of their approach is to incorporate instance attributes $a$ (and $b$) into the source domain $X$ (and the target domain $Y$) to improve the image-to-image translation. They describe their approach as learning joint mappings between attribute-augmented spaces $X \times A$ and $Y \times B$.
In our setting, the set of instance attributes is reduced to a single attribute: for the non-flooded domain, a segmentation mask of where to flood, and for the flooded domain, a segmentation mask covering the flood. Each network is designed to encode both an image and a set of masks (in our case a single mask).
The authors explicitly state that any useful information could be incorporated as an attribute and claim that their approach disentangles the different instances within the image, allowing the generator to perform an accurate and detailed translation from one instance to the other.
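A simplified sketch of the attribute-augmented input, assuming the mask is simply stacked as an extra channel before encoding (the actual InstaGAN architecture uses separate image and instance encoders whose features are fused; the class below is only illustrative):

```python
import torch
import torch.nn as nn

class ImageMaskEncoder(nn.Module):
    """Illustrative encoder that consumes an image together with its single
    where-to-flood mask by stacking the mask as a fourth input channel."""

    def __init__(self, out_channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3 + 1, out_channels, kernel_size=7, padding=3),
            nn.InstanceNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, image, mask):
        # image: (B, 3, H, W), mask: (B, 1, H, W) binary segmentation
        return self.conv(torch.cat([image, mask], dim=1))
```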
Pros and cons: Like CycleGAN, InstaGAN doesn't need paired images, but it requires knowledge of some attributes, here the masks. Sometimes the model is able to render water in a realistic manner, including reflections and texture. However, a major drawback is that, although this is penalized during training, it keeps modifying the rest of the image (the unmasked region): colors change, artifacts appear, textures differ, and fine details are blurred.
Results: Empirically, we find that the model works well with grass but not with concrete. Transparency is a big issue with InstaGAN's results on our task, since most of the time we can see the road lanes through the flood. Even in synthetic settings with aligned images, InstaGAN generates relatively realistic water texture that remains transparent. It seems to learn to reflect the sky on the water (whatever the color of the sky), with the result that it sometimes paints blue on the concrete itself without the accompanying water texture. In our case, the quality of the results worsens dramatically outside the training set.
Note: The instances used in the paper are either segmentation masks of animals (e.g., translating sheep to giraffes) or segmentation masks of clothes (e.g., translating pants to skirts). In both cases, we found that these instances are less diverse than the instances in our non-flooded-to-flooded translation, in the sense that sheep color, shape, and texture vary less than the examples of floods or streets in our dataset.
Previous approaches based on modifications of CycleGAN do not give us fine control over the region that should be flooded. Assuming we are able to identify such a region in the image, we would only need to learn how to render water realistically. There are many promising image editing techniques in the GAN literature demonstrating how to edit specific attributes, morph images, or manipulate semantics. These transformations are often performed in the small latent space of generated (fake) images. However, editing natural images is much harder, and there is no easy way of manipulating the semantics of natural images.
Image Inpainting is the technique of modifying and restoring a damaged image in a visually plausible way. Given recent advances in the field, it is now possible to guess lost information and replace it with plausible content at real-time speed.
For example, DeepFill, a recent deep generative model exploiting contextual attention, is able to reconstruct high-definition altered images of faces and landscapes at real-time speed. We believe there is a way to leverage the network's generation capacity and apply its mechanisms to our case. Our experiment consists in biasing DeepFill to reconstruct only regions where there is water (without surrounding water as context). We trained the network with several hundred images of floods where the water was replaced by a grey mask. At inference time, we replaced what we defined as the ground with a grey mask (see results below).
Results: The quality of the results is poor when given large masks. This could be explained by the fact that the architecture is designed to extract information from the context in the image: in the experiment above, the network had to draw from a context where water is nonexistent. To pursue research in this direction, one may want to give the network better context, for example by placing samples of water texture on the side of the image, or by increasing the number of training images containing water.
Our current approach is built on MUNIT. In the paper, a partially shared latent space assumption is made: images can be disentangled into a content code (domain-invariant) and a style code (domain-dependent). Under this assumption, each image is generated from a content latent code that is shared by both domains, and a style latent code that is specific to the individual domain.
In other words, a pair of corresponding images $(x_1, x_2)$ from the joint distribution is assumed to be generated by $x_1 = G^*_1(c, s_1)$ and $x_2 = G^*_2(c, s_2)$, where $c$, $s_1$, $s_2$ are from some prior distributions and $G^*_1$, $G^*_2$ are the underlying generators. Given this hypothesis, the goal is to learn the underlying generator and encoder functions with neural networks.
Image-to-image translation is performed by swapping encoder-decoder pairs. For example, to translate a house $x_1$ to a flooded house $x_{1 \rightarrow 2}$, one may use MUNIT to first extract the content latent code $c_1$ of the house image that we want to flood, randomly draw a style latent code $s_2$ from the prior distribution of flooded houses, and then use the flooded-domain decoder $G_2$ to produce the final output image $x_{1 \rightarrow 2} = G_2(c_1, s_2)$ (content from $x_1$ and style from $s_2$).
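A sketch of that swap, with placeholder names for the content encoder, style encoder and decoder (they do not match the identifiers of the official MUNIT code base):

```python
import torch

def translate_house_to_flooded(x_house, content_encoder_house, style_encoder_flood,
                               decoder_flood, y_flood_example=None, style_dim=8):
    """Sketch of MUNIT-style translation by swapping encoder/decoder pairs."""
    c = content_encoder_house(x_house)                 # domain-invariant content code
    if y_flood_example is not None:
        # extract the style from a real flooded image (our deterministic variant)
        s = style_encoder_flood(y_flood_example)
    else:
        # or sample a style code from the Gaussian prior (original MUNIT behaviour)
        s = torch.randn(x_house.size(0), style_dim, 1, 1, device=x_house.device)
    return decoder_flood(c, s)                         # content from x, style from s
```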
Huang et al. demonstrated that Instance Normalization is deeply linked to style normalization. MUNIT transfers the style by modifying the feature statistics, de-normalizing them in a specific way. Given an input batch $x$, Instance Normalization layers are used in MUNIT's encoders to normalize feature statistics:

$$\mathrm{IN}(x) = \gamma \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \beta$$

where $\mu(x)$ and $\sigma(x)$ are computed as the mean and standard deviation across spatial dimensions independently for each channel and each sample. Adaptive Instance Normalization layers are then used in the decoder to de-normalize the feature statistics:

$$\mathrm{AdaIN}(z, \gamma, \beta) = \gamma \left( \frac{z - \mu(z)}{\sigma(z)} \right) + \beta$$

with $\gamma$ and $\beta$ produced by a multi-layer perceptron (MLP), i.e., $[\gamma; \beta] = \mathrm{MLP}(s)$, with $s$ the style code. The fact that the de-normalization parameters are inferred with an MLP allows users to generate multiple outputs from one image.
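A minimal sketch of these two operations, assuming feature maps of shape (batch, channels, height, width); the MLP layer sizes are illustrative, not the ones used in MUNIT:

```python
import torch
import torch.nn as nn

def adain(z, gamma, beta, eps=1e-5):
    """Adaptive Instance Normalization: normalize the content features z
    per channel and per sample, then de-normalize with (gamma, beta)
    predicted from the style code."""
    mu = z.mean(dim=(2, 3), keepdim=True)
    sigma = z.std(dim=(2, 3), keepdim=True)
    return gamma * (z - mu) / (sigma + eps) + beta

class StyleMLP(nn.Module):
    """Maps a style code s to the AdaIN parameters [gamma; beta]."""
    def __init__(self, style_dim=8, num_features=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(style_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 2 * num_features),
        )

    def forward(self, s):
        gamma, beta = self.net(s).chunk(2, dim=1)
        # reshape to broadcast over the spatial dimensions of the feature map
        return gamma[..., None, None], beta[..., None, None]
```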
We questioned and transformed the official MUNIT architecture to fit our purpose.
Because we wanted control over the translation, we removed randomness from the style: the network is then trained to perform style transfer with the style extracted from one image and not sampled from a normal distribution.
After analyzing the feature space of the style with a t-SNE plot, we decided that sharing the weights between the style encoders could help the network extract informative features. Since the results were not affected by this change, we kept it. (See Experiment)
We shrank the architecture to use a single auto-encoder and concluded that it either took longer to converge or that the transformation was harder to learn, since the results were negatively affected. (See Experiment)
Because the flooding process is destructive and there is no reason the network could reconstruct the road from its flooded version, we implemented a weaker version of the cycle-consistency loss in which the loss is only computed on a specific region of the image. This region is defined by a binary mask of where we think the image should be altered. For example, a flooded image mapped to a non-flooded house should only be altered in an area close to the one delimited by the water. (In practice there are biases intrinsic to the dataset, such as the sky often being gray in flood images.) A sketch of this masked loss is given below. (See Experiment)
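A minimal sketch of this masked loss, under the reading that the reconstruction penalty is simply zeroed out inside the region expected to change (the function and argument names are illustrative):

```python
import torch.nn.functional as F

def weak_cycle_loss(x, x_cycled, alter_mask):
    """Relaxed cycle-consistency: the reconstruction F(G(x)) ~ x is only
    enforced outside the region expected to be altered by the flood
    (alter_mask is 1 where the image is allowed to change, 0 elsewhere),
    since the content hidden under the water cannot be recovered."""
    keep = 1.0 - alter_mask
    return F.l1_loss(x_cycled * keep, x * keep)
```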
We trained a classifier to distinguish between flooded and non-flooded images (binary output), then used it when training MUNIT with an additional loss on the generator indicating that fake flooded (resp. non-flooded) images should be identified as flooded (resp. non-flooded) by the classifier. It did not improve our results, as if the flood classifier were a very weak discriminator that the generator could trick easily. (See Experiment)
To push the style encoder towards learning meaningful information, we investigated how to anonymize the representation of the content features learned by the MUNIT encoder. The underlying idea is that if the content feature doesn't contain information about the domain it was encoded from, then the style must encode this information. We therefore minimized the mutual information between the content feature and the domain of origin of the content. To do so, we used a domain classifier, as in Learning Anonymized Representations with Adversarial Neural Networks.
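One common way to implement such an adversarial domain classifier is with a gradient reversal layer, sketched below; this is an assumption about the implementation, not a description of our exact code:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, flips the gradient sign in the backward
    pass, so the content encoder is trained to fool the domain classifier."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DomainClassifier(nn.Module):
    """Predicts whether a content code comes from the flooded or the
    non-flooded domain; trained with cross-entropy while the encoder
    receives reversed gradients."""
    def __init__(self, content_channels=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(content_channels, 2),
        )

    def forward(self, content_code, lambd=1.0):
        return self.head(GradReverse.apply(content_code, lambd))
```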
We experimented with the training ratio of the discriminator and the generator, and empirically found that a factor of 5 slightly improves convergence speed.
Major changes: we introduced a Semantic Consistency Loss. We use DeepLab v2 trained on Cityscapes to infer semantic labels and implemented an additional loss indicating to the generator that every fake image should keep the same semantics as the source image everywhere except a defined region where we think there should be an alteration. This modification dramatically improved our results; a sketch is given below. (See Experiment)
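A sketch of one possible formulation of this semantic consistency term, assuming a frozen segmentation network that returns per-pixel class logits at the input resolution and a binary mask of the region allowed to change:

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(seg_net, x_real, x_fake, alter_mask):
    """Penalize changes in the predicted semantic map everywhere except
    the region where an alteration is expected.
    seg_net: frozen segmentation network (e.g. DeepLab v2).
    alter_mask: (B, 1, H, W) binary mask, 1 where alteration is allowed."""
    with torch.no_grad():
        target = seg_net(x_real).argmax(dim=1)          # pseudo ground-truth labels (B, H, W)
    logits = seg_net(x_fake)                            # (B, C, H, W), gradients flow to the generator
    per_pixel = F.cross_entropy(logits, target, reduction="none")   # (B, H, W)
    keep = (1.0 - alter_mask).squeeze(1)                # only count pixels outside the altered region
    return (per_pixel * keep).sum() / keep.sum().clamp(min=1.0)
```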
We also experimented with DeepLab v2 trained on COCO-Stuff. We thought this version would better suit our problem because it is able to identify water on the road, but it turned out that (maybe because of the large number of classes) it did not constrain the network as much as the previous version. We also tried to merge the COCO-Stuff classes to keep only meta-classes similar to the Cityscapes ones, which would allow us to keep a small number of classes while retaining the ability to identify water (impossible with the Cityscapes classes). (See Results)
We plan to use a simulated world built by Vahe Vardanyan with the Unity graphics engine to simulate different types of houses and streets under flood conditions, to help our GAN understand where it should flood.
One main advantage of using synthetic data is that, theoretically, we would have access to an unlimited number of pairs. The principal difficulty lies in leveraging those pairs despite the existing discrepancy between the distributions of synthetic and real data.
We can visualize the discrepancies between the different domains with a t-SNE plot. Learning to flood natural images is equivalent to adapting samples from the real non-flooded domain to the real flooded domain, and we would like to help the network learn this translation with an easier task: translating from the synthetic non-flooded domain to the synthetic flooded domain. Indeed, probably because of its paired nature, the gap separating the synthetic domains is smaller than the one separating the real domains. We also notice that some of the real data are mixed with the synthetic cluster, which suggests that the synthetic world imitates the real world well.
We mix simulated data with their natural equivalents (synthetic flooded images with real flooded images) at training time, with an additional pixelwise reconstruction loss computed on the pixels that should not be altered:

$$\mathcal{L}_{\text{pair}} = \left\| \text{mask} \odot \left( G(x_{\text{sim}}) - y_{\text{sim}} \right) \right\|$$

where mask corresponds to the region of pixels where $x_{\text{sim}}$ and $y_{\text{sim}}$ are paired, in our case, where there is no water.
We summarized our attempt to establish an automated evaluation metric to quantify fake image realism in the following paper. Our work consisted in adapting several existing metrics (IS, FID, KID, etc.) and assessing them against the gold-standard human evaluation, HYPE. While insufficient alone to establish a human-correlated automatic evaluation metric, we believe this work begins to bridge the gap between human and automated generative evaluation procedures.
We set a goal of collecting about 1,000 images in each domain, meeting a number of criteria.
Flooded houses: images should show part of a single house or building, with the street partially or fully covered by water.
These images have been gathered using the results of different Google Image queries focusing on North American suburban-style houses.
Non-flooded houses are a mix of several types of images:
Motivated by the idea that it would be easier to perform image-to-image translation if our GAN had a notion of the concepts of ground and water, we enriched the dataset by annotating the pixels corresponding to water in the flooded-house images and those corresponding to the ground in the non-flooded-house images.
To create our simulated world, we used the Unity 3D game engine (version 2018.2.21f1). We created different types of buildings (skyscrapers, individual houses, industrial buildings) in the virtual world, combined with attributes of urban and rural environments: roads, trees, cars, mountains, vegetation, etc.
For each shot captured in the simulated world, we extract:
See a more detailed description of each of these
We currently have 11k such captures. Note that each scene has been captured multiple times, slightly varying the camera height, pitch, and position.
One of the challenges is to match the level of the generated flood to climate predictions, or at least to a plausible water level. In the aforementioned approaches, no notion of the geometry of the scene is included. We propose to introduce height information in our model in order to generate and respect the geometry of the scene.
Images obtained using masks based on semantic segmentation (ground -> water) in the weak cycle-consistency constraint present realistic water texture, but lack physical realism, with water covering only the pixels corresponding to the ground and going around the wheels of cars, for example.
There are two main ways we can think of to include height information in our model and condition the GAN on water height:
Here we present the two types of approaches that we have tried so far to estimate height in street-scene images: "geometrical approaches", which consist in recovering the 3D metric geometry of the scene, and "end-to-end approaches", which consist in predicting height directly from an image input. The problem of estimating height from a single-view image has not been much explored in the literature, and there is no dataset of street scenes with associated height maps directly available.
There are three steps in the geometric approaches to recover height:
1) Recover the 3D geometry of the scene (not metric)
2) Match relative coordinates to metric system in some way
3) Create binary masks of the areas to be flooded using a metric threshold
The pinhole camera model can be applied to recover 3D coordinates of each pixel of the image using depth maps and camera parameters.
Camera parameters considered in our case are the field of view and the pitch (roll and yaw are assumed to be ~0), and we assume that pixels are square and that the optical center of the camera is in the middle of the image.
Pseudo relative-depth information is obtained using MegaDepth.
We can then derive the 3D geometry of the scene, and assuming that the water level of the floods corresponds to horizontal planes, we can extract the masks by taking all pixels with height values below a certain fixed threshold.
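A sketch of this back-projection and thresholding step, assuming a simple pinhole model with square pixels, a centered principal point, and heights expressed relative to the camera (function name and sign conventions are illustrative):

```python
import numpy as np

def flood_mask_from_depth(depth, fov_h_deg, pitch_deg, water_level_m, scale=1.0):
    """Back-project every pixel with the pinhole model, compensate the camera
    pitch, and mark for flooding each point whose height is below `water_level_m`.
    depth: (H, W) relative depth map (e.g. from MegaDepth), scaled to meters
    by `scale`; pitch_deg > 0 means the camera looks down; water_level_m is
    the water surface height relative to the camera (flood depth minus
    camera height, usually negative)."""
    h, w = depth.shape
    f = (w / 2.0) / np.tan(np.deg2rad(fov_h_deg) / 2.0)   # focal length in pixels
    v = np.arange(h).reshape(-1, 1) - h / 2.0             # vertical pixel offset from the centre
    z = depth * scale                                     # metric depth
    y = v * z / f                                         # camera-frame "down" coordinate
    p = np.deg2rad(pitch_deg)
    height = -(y * np.cos(p) + z * np.sin(p))             # vertical height w.r.t. the camera
    return height < water_level_m                         # binary where-to-flood mask
```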
Limitations:
One way of recovering metric scaling is to find reference objects that are often present in the dataset and whose dimensions are known: vehicles and pedestrians. We used the 3D bounding box detection model of Mousavian et al. to detect cars, trucks, and pedestrians. There are two steps in this method: first, perform 2D bounding box detection, and then use CNNs to regress the 3D parameters and dimensions. We used a model pretrained on the KITTI dataset, hence the choice of the classes we kept as reference objects. When applying this model to our street scene images, we noticed that objects were sometimes misclassified, especially if only part of a vehicle was visible (this 3D bounding box model does not extend outside the frame of the picture). Hence, we decided to perform semantic segmentation on the whole input image and checked that the labels from the segmentation matched the detections. We used this DeepLab segmentation model, which was trained on Cityscapes. Because there were discrepancies between KITTI labels and Cityscapes labels, we only kept the classes truck, car, and pedestrian.
We match the relative coordinates of the endpoints of the vertical edges of the bounding box to the metric height of the detected objects. We assume the objects rest on the ground (which is almost always the case) and match the bottom of the vertical edges to the zero level (ground).
If no reference object is found, we output a mask that corresponds to the ground class in the semantic segmentation.
Limitations
Reference objects are not always present in the images, and when they are, if the object is far away, the scaling from relative height to metric height might be very unreliable. This method seems to work well when there are multiple objects and when they are more or less in a plane facing the camera (at similar depths).
In this method, we do not rely on reference objects but rather on geometry. We assume that camera height is known (this is reasonable for street-level images extracted from known platforms) and that the ground is flat.
We scale the relative coordinates to metric units using the ground as a reference. To do so, we take the middle column of the image (right in the camera axis) and pick two pixels that were segmented as belonging to the ground. We consider that both should be at height 0 (flat-ground assumption).
In this graphic, $h_{\text{cam}}$ is the height of the camera, and $\theta$ is half of the vertical field of view.
The matching to the metric scale is done as follows (referring to the graphic above):
take 2 points segmented as ground in the image (red dots on the projection plane)
the angles $\alpha_1$ and $\alpha_2$ can be obtained using the y (vertical axis) pixel location in the image. Let $H$ be the height of the image in pixels and $Y$ the distance in pixels from the middle of the image to the red pixel corresponding to the angle $\alpha$; then $\tan(\alpha) = \frac{Y}{H/2}\,\tan(\theta)$.
From there we can compute the yellow distance in meters:

$$d = h_{\text{cam}} \left( \frac{1}{\tan(\alpha_1)} - \frac{1}{\tan(\alpha_2)} \right)$$
We match this distance $d$ to the difference of the relative depths of the two points to get the scaling from relative units to meters (see the sketch below).
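The corresponding computation, sketched with illustrative names and assuming a pitch of roughly zero, a flat ground, and two ground pixels below the image centre:

```python
import numpy as np

def metric_scale_from_ground(h_cam, fov_v_deg, H, y1, y2, rel_depth1, rel_depth2):
    """Estimate the relative-to-metric scale factor from two ground pixels
    on the central column of the image.
    y1, y2: pixel distances below the image centre (both > 0);
    rel_depth1, rel_depth2: MegaDepth relative depths at those pixels;
    h_cam: camera height in meters; H: image height in pixels."""
    f = (H / 2.0) / np.tan(np.deg2rad(fov_v_deg) / 2.0)       # focal length in pixels
    alpha1, alpha2 = np.arctan(y1 / f), np.arctan(y2 / f)     # angles below the optical axis
    d1, d2 = h_cam / np.tan(alpha1), h_cam / np.tan(alpha2)   # metric distances along the ground
    return abs(d1 - d2) / abs(rel_depth1 - rel_depth2)        # meters per relative-depth unit
```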
Limitations
The assumption that the ground is flat does not always hold. While the images we tested our approach on always contained pixels labelled as ground in the central vertical column, it could happen that, because of vehicles, vegetation, or other objects, there are none. The method also relies on only two pixels, and additional work would be needed to determine which ground pixels to choose (they shouldn't be too close to the front of the image because the ground is sometimes warped and "stretched" in front of the camera, etc.).
Once we have a correspondence between relative and metric coordinates, we can take all the pixels with height less than a threshold specified in meters and generate binary masks of these.
We added the option of merging the mask corresponding to the ground segmentation with the mask obtained with the geometrical approach. The underlying idea is that the depth maps obtained using MegaDepth can sometimes be unreliable for elements that are far away, but we would still like to flood the ground there. This holds only if the difference in ground heights is not too big (no big slope), to satisfy the condition that the flood surface should be a horizontal plane.
One of the challenges in height estimation is that there is no ground truth dataset of height maps for street-level images. However, ground truth height maps can be obtained from simulated data. Beyond building a ground truth dataset, we propose to leverage data from our simulator and train a height estimation model from single-view images. Indeed, images of houses flooded to any chosen height can be generated in the simulator. While methods for single-image depth estimation have been investigated for many years, we have found no work on height estimation from street-scene images. However, these two problems have some similarities, so we took inspiration from depth estimators to build our height estimator.
Following the success of MegaDepth on the task of single-view depth estimation, we chose to train an hourglass network to predict metric height maps from street-level input images.
Architecture of the hourglass network (image modified from here). Each block is a modified inception module, except the H blocks, which are 3x3 convolutions.
For our first model, we chose an L2 loss with a mask on the sky.
We tried this approach separately on data from our simulator and on data obtained with the CARLA simulator. One of the main next steps would be to try this approach on real images. You can find more details about the model here.