Inference
As stated, we use convolutions and transposed convolutions with
padding="same", meaning that zero padding is used so that the output has
the same spatial size as the input (or the input size divided by the
stride for strided convolutions, multiplied by it for transposed ones).
As a consequence, we can't directly use the features computed in the
last layer if we want to run this model in fully convolutional mode:
for each processed block, only a subset of the output would be exact,
meaning independent of the input region. Using the whole output would
produce "blocking artifacts" caused by the zero padding in the convolutions.
Blocking artifacts¶
We will generate an image using the trained model, to observe these so-called blocking artifacts in the softmax output.
import pyotb
import argparse

parser = argparse.ArgumentParser(description="Apply the model")
parser.add_argument("--savedmodel", required=True, help="savedmodel directory")
params = parser.parse_args()

# Generate the classification map
infer = pyotb.TensorflowModelServe(
    n_sources=2,
    source1_il="/data/pan.tif",
    source1_rfieldx=128,
    source1_rfieldy=128,
    source1_placeholder="input_p",
    source2_il="/data/xs.tif",
    source2_rfieldx=32,
    source2_rfieldy=32,
    source2_placeholder="input_xs",
    model_dir=params.savedmodel,
    model_fullyconv=True,
    output_efieldx=128,
    output_efieldy=128,
    output_names="estimated",
)
infer.write("/data/map_artifacts.tif", ext_fname="box=2000:2000:1000:1000")
Question
- Run the generation of the softmax output using the model located in the /data/models/model3 directory.
- Open the image in QGIS.
As you can notice, each 64x64 patch border is clearly visible, and the softmax image is not continuous between adjacent processed patches.
Then, what is causing these blocking artifacts?
To ease the model build, we have specified the padding="same" option. This
allows us to conveniently construct skip-connections between the model encoder
(i.e. the "downsampling" part) and the decoder (the "upsampling" part). The
drawback is that, after each convolution, the feature maps are contaminated by
the zero padding. Say we apply a convolution to an NxN input with a 3x3 kernel,
stride 1 and padding="same": the result is another NxN feature map.
However, the borders of this new feature map are polluted by the zero values
that were added around the NxN input to give the output the same size
as the input. We call valid part the central part (in spatial
dimensions) of the output, where the values are not polluted by any
upstream zero value from padding. You can read more on the subject in this
book, section "Semantic segmentation".
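The effect described above can be reproduced with a few lines of NumPy. The following is a minimal sketch, not the tutorial model: `conv2d_same` is a hypothetical helper written for this illustration (a naive stride-1 convolution with zero padding), and the image and kernel are random. It compares one 8x8 block taken from the convolution of the whole image with the convolution of that same block processed alone.

```python
import numpy as np


def conv2d_same(image, kernel):
    """Naive 2D cross-correlation with zero padding and stride 1 ("same")."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out


rng = np.random.default_rng(0)
image = rng.random((16, 16))
kernel = rng.random((3, 3))

# Reference: convolve the whole image, then look at an 8x8 block of the result
full = conv2d_same(image, kernel)[4:12, 4:12]
# Block-wise processing: extract the same 8x8 block first, then convolve it
block = conv2d_same(image[4:12, 4:12], kernel)

# The 1-pixel border of the block is polluted by the zero padding...
border_differs = not np.allclose(full, block)
# ...while the central 6x6 valid part is identical to the reference
center_matches = np.allclose(full[1:-1, 1:-1], block[1:-1, 1:-1])
print(border_differs, center_matches)  # prints: True True
```

With a single 3x3 convolution only one pixel is lost on each border. Stacking convolutions, strides and transposed convolutions makes the polluted border grow, which is why the valid part of a deep model can be much smaller than its output.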
Valid part¶
The class otbtf.ModelBase provides what is needed to apply fully
convolutional models over large images while avoiding the blocking
artifacts caused by convolutions at the borders of tensors. ModelBase comes
with a postprocess_outputs() method that processes the output tensors returned by
get_outputs(). This creates new outputs, intended to be used at inference time.
The default implementation of ModelBase.postprocess_outputs() avoids blocking
artifacts by keeping only the values of the central part of the tensors in
spatial dimensions.
If you take a look at otbtf.ModelBase.__init__(),
you can see the inference_cropping parameter, with the default values set
to [16, 32, 64, 96, 128]. Now if you take another look at
otbtf.ModelBase.postprocess_outputs(),
you can see how these values are used: the model creates an array of
outputs, each one cropped by one value of inference_cropping. These cropped
outputs make it possible to avoid, or at least reduce, the blocking artifacts in
convolutional models. The new output tensors are named by the
cropped_tensor_name()
function, which returns a new name built from the original tensor name and the cropping value.
For instance, the new output tensor created for estimated, which removes 32 pixels from the borders in the spatial dimensions, is named estimated_crop32.
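To make the cropping and naming more concrete, here is a minimal NumPy sketch of the idea. It is not the otbtf implementation: `cropped_name` and `crop_outputs` are hypothetical helpers, the arrays are assumed to be in (batch, height, width, channels) layout, and skipping the cropping values that would leave an empty array is a choice made for this sketch only.

```python
import numpy as np


def cropped_name(tensor_name, crop):
    """Hypothetical naming helper following the estimated -> estimated_crop32 pattern."""
    return f"{tensor_name}_crop{crop}"


def crop_outputs(outputs, croppings):
    """Return the original outputs plus center-cropped copies of each of them."""
    extra = {}
    for name, array in outputs.items():
        for crop in croppings:
            # Assumption of this sketch: skip croppings that would leave
            # nothing in the spatial dimensions (croppings must be > 0)
            if 2 * crop >= min(array.shape[1:3]):
                continue
            extra[cropped_name(name, crop)] = array[:, crop:-crop, crop:-crop, :]
    return {**outputs, **extra}


# A dummy 128x128 softmax output with 2 classes
outputs = {"estimated": np.zeros((1, 128, 128, 2))}
all_outputs = crop_outputs(outputs, croppings=[16, 32, 64, 96, 128])
for name, array in all_outputs.items():
    print(name, array.shape)
```

With a 128x128 output, only the 16 and 32 croppings leave something to keep in this sketch; the larger values only become useful with larger receptive fields.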
How to choose the right cropping value?¶
Theoretically, we can determine the part of the output image that is not polluted by the convolutional padding. For a 2D convolution of stride $s$ and kernel size $k$, we can deduce the valid output size $y$ from the input size $x$ using this expression:

$$y = \left\lfloor \frac{x - k}{s} \right\rfloor + 1$$

For a 2D transposed convolution of stride $s$ and kernel size $k$, we can deduce the valid output size $y$ from the input size $x$ using this expression:

$$y = (x + 1) \cdot s - k$$
Let's consider a chunk of input images with the following size:
- input_p: 128x128
- input_xs: 32x32
We chose an input patch size such that the feature maps in the model bottleneck are large enough to determine the right cropping value for the output. The following table summarizes the valid part size after each operation in the model, from the inputs to the output:
| Name | Operation | Kernel | Stride | Out. size | Valid out. size |
|---|---|---|---|---|---|
| pan branch | |||||
| input_p | / | / | / | 128 | 128 |
| conv1 | Conv2D | 3 | 2 | 64 | 63 |
| conv2 | Conv2D | 3 | 2 | 32 | 31 |
| xs branch | |||||
| input_xs | / | / | / | 32 | 32 |
| conv_xs | Conv2D | 3 | 1 | 32 | 30 |
| xs+pan branches merging | |||||
| conv_xs+conv2 | Add | / | / | 32 | 30 |
| conv3 | Conv2D | 3 | 2 | 16 | 14 |
| conv4 | Conv2D | 3 | 2 | 8 | 6 |
| conv1t | Transposed Conv2D | 3 | 2 | 16 | 11 |
| conv2t | Transposed Conv2D | 3 | 2 | 32 | 21 |
| conv3t | Transposed Conv2D | 3 | 2 | 64 | 41 |
| estimated | Transposed Conv2D | 3 | 2 | 128 | 81 |
This shows that our model can be applied in a fully convolutional fashion
without generating blocking artifacts, using the central part of the output of
size 81. This is equivalent to removing (128 - 81) / 2 = 23.5 pixels from
the borders of the output. We round this up to the next power of 2 to keep the
convolutions consistent between two adjacent image chunks, hence we remove 32
pixels from the borders. We can therefore use the output cropped by 32 pixels,
named estimated_crop32 in the model outputs.
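The valid sizes of the table and the resulting cropping value can be checked with a short script. This is a sketch under stated assumptions: `conv_valid` and `conv_t_valid` are hypothetical helpers, the transposed convolution expression `(x + 1) * s - k` is one that reproduces the table values for a kernel of 3 and a stride of 2, and the merge of the two branches is assumed to keep the smallest of the two valid sizes.

```python
def conv_valid(x, k, s):
    """Valid output size of a Conv2D with kernel size k and stride s."""
    return (x - k) // s + 1


def conv_t_valid(x, k, s):
    """Valid output size of a transposed Conv2D (assumed expression, see above)."""
    return (x + 1) * s - k


# pan branch
conv1 = conv_valid(128, k=3, s=2)
conv2 = conv_valid(conv1, k=3, s=2)
# xs branch
conv_xs = conv_valid(32, k=3, s=1)
# Merging: the sum is only valid where both operands are valid
merged = min(conv_xs, conv2)
conv3 = conv_valid(merged, k=3, s=2)
conv4 = conv_valid(conv3, k=3, s=2)

# Decoder: conv1t, conv2t, conv3t, estimated
size = conv4
decoder = []
for _ in range(4):
    size = conv_t_valid(size, k=3, s=2)
    decoder.append(size)

# Theoretical number of pixels to remove from each border of the 128 output
border = (128 - decoder[-1]) / 2
# Smallest power of 2 that is at least as large as the theoretical border
crop = 1
while crop < border:
    crop *= 2

print(conv1, conv2, conv_xs, merged, conv3, conv4, decoder, border, crop)
```

Changing the kernel sizes, strides or number of layers in this script gives a quick estimate of the cropping value needed by another architecture.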
By default, cropped outputs in otbtf.ModelBase are generated for the following
values: [16, 32, 64, 96, 128], but this can be changed by setting inference_cropping
in the model __init__() (see the reference API documentation for details).
Info
Very deep networks lead to very large cropping values. In these cases, there is a trade-off between numerical exactness and computational cost. In practice, the expression field can be enlarged well beyond its theoretical size, since most networks learn to diminish the convolutional distortion at the borders of the training patches.
Expression field¶
We call expression field the spatial extent that the model outputs for the tensors specified in output_names. As explained, the model transforms an elementary input image of size 128x128 into an elementary output predicted label image of size 64x64 (the 128x128 output with 32 pixels removed from each border). Hence, we use a receptive field of 128x128 (the input volume that the network "sees") and an expression field of 64x64 (the output volume that the network "fills").
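The relation between the uncropped output size, the cropping value and the expression field is simple arithmetic. The helper below is a hypothetical illustration, not part of otbtf or pyotb, and assumes the same cropping is applied on every border.

```python
def expression_field(output_size, crop):
    """Size of the output region kept after removing `crop` pixels per border."""
    efield = output_size - 2 * crop
    if efield <= 0:
        raise ValueError(f"cropping {crop} leaves nothing of a {output_size} output")
    return efield


for crop in [16, 32]:
    print(f"estimated_crop{crop}: {expression_field(128, crop)}")
```

For estimated_crop32 this gives 64, the value to pass as the expression field size of the corresponding output.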
import pyotb
import argparse

parser = argparse.ArgumentParser(description="Apply the model")
parser.add_argument("--savedmodel", required=True, help="savedmodel directory")
params = parser.parse_args()

# Generate the classification map
infer = pyotb.TensorflowModelServe(
    n_sources=2,
    source1_il="/data/pan.tif",
    source1_rfieldx=128,
    source1_rfieldy=128,
    source1_placeholder="input_p",
    source2_il="/data/xs.tif",
    source2_rfieldx=32,
    source2_rfieldy=32,
    source2_placeholder="input_xs",
    model_dir=params.savedmodel,
    model_fullyconv=True,
    output_efieldx=64,
    output_efieldy=64,
    output_names="estimated_crop32",
)
infer.write("/data/map_valid.tif", ext_fname="box=2000:2000:1000:1000")
Question
- Run the generation of the valid part of the softmax output using the model located in the /data/models/model3 directory.
- Open the image in QGIS.
Comparison with the vanilla output¶
The following animation alternates between the softmax of the class 1 (buildings) computed from the original output (estimated) and the cropped output with the valid part (estimated_crop32). The softmax values are stretched between [0, 1] to be displayed in a gray-level image.

Dig deeper 🚀¶
We propose to measure the convolutional distortion of the network, i.e. the difference (e.g. mean squared error) between the valid part and the original output, for a number of cropping values. The idea is to show the trade-off between numerical exactness and computational footprint.
Question
- Create a deeper model with a larger theoretical cropping value, and specify additional cropping values with the inference_cropping argument in the model's __init__() (e.g. inference_cropping=[8, 16, 24, 32, 48, 64, 80, 96, 112, 128, 192, 256]).
- Create a python script that measures the convolutional distortion for the possible cropping values. You can use the ExtractROI OTB application to extract a portion of a geospatial image that fits a reference image, the BandMathX application with a scalar product to compute the squared error between two images, and the ComputeImagesStatistics application to compute the mean value of an image.
You can use the following function as a helper:
import pyotb
...
def compute_rmse_value(ref, img):
    # Extract an ROI fitting the reference image
    roi = pyotb.ExtractROI(
        img,
        mode="fit",
        mode_fit_im=ref
    )
    # Compute the squared error between the ref and img
    se = pyotb.BandMathX(
        il=[roi, ref],
        exp="(im1-im2)*(im1-im2)'"
    )
    # Compute and return the root mean squared error
    stats = pyotb.ComputeImagesStatistics(se)
    return stats["out.mean"][0] ** 0.5