Inference

As stated, we use convolutions and transposed convolutions that preserve the size of the output features (padding="same"): zero padding is applied in the convolutions so that the output has the same spatial size as the input (or that size divided by the stride). This means that we can't directly use the features computed in the last layer if we want to run this model in fully convolutional mode: for each processed block, only a subset of the output would be exact, that is, independent of where the block borders fall in the input image. Using the whole output would produce "blocking artifacts" caused by the zero padding in the convolutions.

Blocking artifacts

We will generate an image using the trained model, to observe these so-called blocking artifacts in the softmax output.

part_3_inference_artifacts.py

import pyotb
import argparse

parser = argparse.ArgumentParser(description="Apply the model")
parser.add_argument("--savedmodel", required=True, help="savedmodel directory")
params = parser.parse_args()

# Generate the classification map
infer = pyotb.TensorflowModelServe(
    n_sources=2,
    source1_il="/data/pan.tif",
    source1_rfieldx=128,
    source1_rfieldy=128,
    source1_placeholder="input_p",
    source2_il="/data/xs.tif",
    source2_rfieldx=32,
    source2_rfieldy=32,
    source2_placeholder="input_xs",
    model_dir=params.savedmodel,
    model_fullyconv=True,
    output_efieldx=128,
    output_efieldy=128,
    output_names="estimated",
)

infer.write("/data/map_artifacts.tif", ext_fname="box=2000:2000:1000:1000")

Question

Generate the softmax output using the model located in the /data/models/model3 directory.
Open the image in QGIS.

As you can notice, each 64x64 patch border is clearly visible, and the softmax image is not continuous between adjacent processed patches.

Then, what is causing these blocking artifacts?

To ease the model build, we have specified the padding="same" option. This allows us to conveniently construct skip-connections between the model encoder (i.e. the "downsampling" part) and the decoder (the "upsampling" part). The drawback is that, after each convolution, the feature maps are contaminated by the zero padding. Say we apply a convolution to an NxN input with a 3x3 kernel, stride 1 and padding="same": the result is another NxN feature map. However, the borders of this new feature map are polluted by the zero values that were added around the NxN input so that the output has the same size as the input. We call valid part the central part (in the spatial dimensions) of the output, where the resulting values are not polluted by any upstream zero value from padding. You can read more on the subject in this book, section "Semantic segmentation".
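The border pollution can be observed on a toy case. The following is a minimal pure-Python sketch (not part of the tutorial scripts; the conv2d_same helper is a hypothetical name introduced for illustration) that applies a 3x3 averaging kernel with zero padding to a constant image and marks the pixels whose value differs from the expected constant:

```python
def conv2d_same(image, kernel):
    """Naive 2D convolution with zero padding ("same") and stride 1.

    Hypothetical helper for illustration, not an otbtf or TensorFlow function.
    """
    n = len(image)
    k = len(kernel)
    r = k // 2
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - r, j + dj - r
                    # Pixels outside the image count as zeros (the padding)
                    if 0 <= y < n and 0 <= x < n:
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


n = 6
image = [[1.0] * n for _ in range(n)]          # constant input
kernel = [[1.0 / 9.0] * 3 for _ in range(3)]   # 3x3 averaging kernel
out = conv2d_same(image, kernel)

# A pixel is polluted when the padding zeros lowered the average below 1
polluted = [[abs(v - 1.0) > 1e-9 for v in row] for row in out]
for row in polluted:
    print("".join("x" if p else "." for p in row))

# Size of the valid part along the middle row: n - (k - 1)
valid_size = sum(1 for p in polluted[n // 2] if not p)
print("valid size:", valid_size)
```

With a single 3x3 convolution only a one-pixel border is polluted; each additional layer widens that border, which is why the valid part shrinks with the depth of the network.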

Valid part

The otbtf.ModelBase class provides what is needed to apply fully convolutional models over large images while avoiding the blocking artifacts caused by convolutions at the borders of tensors. ModelBase comes with a postprocess_outputs() method, which processes the output tensors returned by get_outputs(). This creates new outputs intended to be used at inference time. The default implementation of ModelBase.postprocess_outputs() avoids blocking artifacts by keeping only the values of the central part of the tensors in the spatial dimensions.

If you take a look at otbtf.ModelBase.__init__(), you will notice the inference_cropping parameter, whose default value is [16, 32, 64, 96, 128]. Now if you take another look at otbtf.ModelBase.postprocess_outputs(), you can see how these values are used: the model creates an array of outputs, each one cropped by one value of inference_cropping. These cropped outputs make it possible to avoid, or at least reduce the magnitude of, the blocking artifacts in convolutional models. The new output tensors are named by the cropped_tensor_name() function, which returns a new name built as:

f"{tensor_name}_crop{crop}"

For instance, the new output tensor created for estimated that removes 32 pixels from the borders in the spatial dimensions is named estimated_crop32.
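To make the naming and the cropping concrete, here is a small pure-Python sketch. It is not the otbtf implementation: cropped_name and crop_borders are hypothetical stand-ins that only mirror the naming pattern quoted above and the idea of keeping the central part of a patch.

```python
def cropped_name(tensor_name, crop):
    """Hypothetical stand-in mirroring the naming pattern quoted above."""
    return f"{tensor_name}_crop{crop}"


def crop_borders(patch, crop):
    """Remove `crop` pixels from each border of a 2D patch (list of rows).

    Illustration only; otbtf works on tensors, not nested lists.
    """
    return [row[crop:len(row) - crop] for row in patch[crop:len(patch) - crop]]


# Default cropping values mentioned in the text
default_cropping = [16, 32, 64, 96, 128]
names = [cropped_name("estimated", c) for c in default_cropping]
print(names)

# A 128x128 patch whose pixels store their own (row, col) position
patch = [[(r, c) for c in range(128)] for r in range(128)]
cropped = crop_borders(patch, 32)
print(len(cropped), len(cropped[0]), cropped[0][0])
```

Removing 32 pixels on each side of a 128x128 patch leaves its central 64x64 part, whose first pixel sits at position (32, 32) of the original patch.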

How to choose the right cropping value?

Theoretically, we can determine the part of the output image that is not polluted by the convolutional padding.

For a 2D convolution of stride $s$ and kernel size $k$, we can deduce the valid output size $y$ from the input size $x$ using this expression: $y = \left\lfloor \frac{x - k}{s} \right\rfloor + 1$

For a 2D transposed convolution of stride $s$ and kernel size $k$, we can deduce the valid output size $y$ from the input size $x$ using this expression: $y = x \times s - k + 2$

Let's consider a chunk of input images with the following size:

input_p: 128x128
input_xs: 32x32

We chose an input patch size large enough for the feature maps in the model bottleneck to keep a non-empty valid part, so that the right cropping value for the output can be determined. In the following table, we summarize the valid part size after each operation in the model, from the inputs to the output:

Name	Operation	Kernel	Stride	Output size	Valid output size
pan branch
input_p	/	/	/	128	128
conv1	Conv2D	3	2	64	63
conv2	Conv2D	3	2	32	31
xs branch
input_xs	/	/	/	32	32
conv_xs	Conv2D	3	1	32	30
xs+pan branches merging
conv_xs+conv2	Add	/	/	32	30
conv3	Conv2D	3	2	16	14
conv4	Conv2D	3	2	8	6
conv1t	Transposed Conv2D	3	2	16	11
conv2t	Transposed Conv2D	3	2	32	21
conv3t	Transposed Conv2D	3	2	64	41
estimated	Transposed Conv2D	3	2	128	81
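The last column of the table can be reproduced with a few lines of Python. This is a sketch that simply chains the two expressions given above; the function names are hypothetical, and the merge of the two branches is assumed to keep the smaller of the two valid sizes, as the table does.

```python
def conv_valid_size(x, k, s):
    """Valid output size of a convolution: floor((x - k) / s) + 1."""
    return (x - k) // s + 1


def tconv_valid_size(x, k, s):
    """Valid output size of a transposed convolution, as given in the text."""
    return x * s - k + 2


valid = {}
# pan branch
valid["conv1"] = conv_valid_size(128, k=3, s=2)
valid["conv2"] = conv_valid_size(valid["conv1"], k=3, s=2)
# xs branch
valid["conv_xs"] = conv_valid_size(32, k=3, s=1)
# merging: the sum is only valid where both operands are valid (assumption)
valid["conv_xs+conv2"] = min(valid["conv_xs"], valid["conv2"])
# encoder
valid["conv3"] = conv_valid_size(valid["conv_xs+conv2"], k=3, s=2)
valid["conv4"] = conv_valid_size(valid["conv3"], k=3, s=2)
# decoder
valid["conv1t"] = tconv_valid_size(valid["conv4"], k=3, s=2)
valid["conv2t"] = tconv_valid_size(valid["conv1t"], k=3, s=2)
valid["conv3t"] = tconv_valid_size(valid["conv2t"], k=3, s=2)
valid["estimated"] = tconv_valid_size(valid["conv3t"], k=3, s=2)

for name, size in valid.items():
    print(f"{name:>14}: {size}")
```

The same two functions can be reused to estimate the valid part of a deeper model by listing its layers in order.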

This shows that our model can be applied in a fully convolutional fashion without generating blocking artifacts, provided that we use only the central part of the output, of size 81. This is equivalent to removing $(128 - 81) / 2 = 23.5$ pixels, rounded up to 24, from each border of the output. We round this up to the next power of 2 to keep the convolutions consistent between two adjacent image chunks, hence we remove 32 pixels from the borders. We can therefore use the output cropped by 32 pixels, named estimated_crop32 in the model outputs. By default, cropped outputs in otbtf.ModelBase are generated for the following values: [16, 32, 64, 96, 128], but this can be changed by setting inference_cropping in the model __init__() (see the reference API documentation for details).
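The choice of the cropping value can be written down as a short computation. The sketch below uses hypothetical helper names and assumes, as in the reasoning above, that the margin is rounded up to the next power of 2 and that the resulting value must be one of the available inference_cropping values.

```python
import math


def border_to_remove(output_size, valid_size):
    """Pixels to drop on each side so that only the valid part remains."""
    return math.ceil((output_size - valid_size) / 2)


def next_power_of_two(n):
    """Smallest power of 2 greater than or equal to n."""
    p = 1
    while p < n:
        p *= 2
    return p


margin = border_to_remove(output_size=128, valid_size=81)
crop = next_power_of_two(margin)

# Default cropping values mentioned in the text
available = [16, 32, 64, 96, 128]
output_name = f"estimated_crop{crop}" if crop in available else None

print(margin, crop, output_name)
```

If the computed value is not in the list, it has to be added through inference_cropping when the model is built, since only the listed cropped outputs are created.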

Info

Very deep networks lead to very large cropping values. In these cases, there is a tradeoff between numerical exactness and computational cost. In practice, the expression field can be enlarged considerably, since most networks learn to diminish the convolutional distortion at the borders of the training patches.

Expression field

We call expression field the spatial extent of the output that the model produces for the tensors specified in output_names. With estimated_crop32, the model transforms an elementary input image of size 128x128 into an elementary output predicted label image of size 64x64 (128 minus 32 pixels on each side). Hence, below we use a receptive field of 128x128 (the input volume that the network "sees") and an expression field of 64x64 (the output volume that the network "fills").
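The relation between receptive field, cropping and expression field can be checked with a short sketch. It assumes that the uncropped output has the same size as the pan input patch, as is the case for this model; expression_field is a hypothetical helper, and the dictionary only gathers the values of some parameters passed to the application, without calling it.

```python
def expression_field(receptive_field, crop):
    """Size of the output block left after cropping both borders.

    Assumes the uncropped output is as large as the receptive field.
    """
    return receptive_field - 2 * crop


rfield_pan = 128
crop = 32
efield = expression_field(rfield_pan, crop)

# Parameter values for the inference application (a subset, for illustration)
params = {
    "source1_rfieldx": rfield_pan,
    "source1_rfieldy": rfield_pan,
    "output_efieldx": efield,
    "output_efieldy": efield,
    "output_names": f"estimated_crop{crop}",
}
print(params)
```

A larger cropping value shrinks the expression field, so more blocks, each with the same receptive field, are needed to cover the same image.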

part_3_inference_valid.py

import pyotb
import argparse

parser = argparse.ArgumentParser(description="Apply the model")
parser.add_argument("--savedmodel", required=True, help="savedmodel directory")
params = parser.parse_args()

# Generate the classification map
infer = pyotb.TensorflowModelServe(
    n_sources=2,
    source1_il="/data/pan.tif",
    source1_rfieldx=128,
    source1_rfieldy=128,
    source1_placeholder="input_p",
    source2_il="/data/xs.tif",
    source2_rfieldx=32,
    source2_rfieldy=32,
    source2_placeholder="input_xs",
    model_dir=params.savedmodel,
    model_fullyconv=True,
    output_efieldx=64,
    output_efieldy=64,
    output_names="estimated_crop32",
)

infer.write("/data/map_valid.tif", ext_fname="box=2000:2000:1000:1000")

Question

Run the generation of the valid part of the softmax output using the model located in the /data/models/model3 directory.
Open the image in QGIS.

Comparison with the vanilla output

The following animation alternates between the softmax of class 1 (buildings) computed from the original output (estimated) and from the cropped output containing only the valid part (estimated_crop32). The softmax values are stretched between [0, 1] to be displayed as a gray-level image.

Comparison between the output used to train the network and the output with the valid part

Dig deeper 🚀

We propose to measure the convolutional distortion of the network, i.e. the difference (e.g. the mean squared error) between the valid part and the original output, for a number of cropping values. The idea is to show the compromise between numerical exactness and computational footprint.

Question

Create a deeper model with a larger theoretical cropping value, and specify additional cropping values with the inference_cropping argument in the model's __init__() (e.g. inference_cropping=[8, 16, 24, 32, 48, 64, 80, 96, 112, 128, 192, 256])
Create a Python script that measures the convolutional distortion for the possible cropping values. You can use the ExtractROI OTB application to extract a portion of a geospatial image that fits a reference image, BandMathX with a scalar product to compute the squared error between two images, and ComputeImagesStatistics to compute the mean value of an image.

You can use the following function as a helper:

import pyotb

...
def compute_rmse_value(ref, img):

  # Extract an ROI fitting the reference image
  roi = pyotb.ExtractROI(
    img,
    mode="fit",
    mode_fit_im=ref
  )

  # Compute the squared error between the ref and img
  se = pyotb.BandMathX(
    il=[roi, ref],
    exp="(im1-im2)*(im1-im2)'"
  )

  # Compute and return the root mean squared error
  stats = pyotb.ComputeImagesStatistics(se)
  return stats["out.mean"][0] ** 0.5