How mixed-precision training prevented me from winning a competition
Introduction
I usually start each new solution by creating a solid baseline from scratch, without looking at the provided baselines. I believe this helps to build a deep understanding of the problem. Although my experience in competitive machine learning is quite extensive, monocular geopose estimation from satellite imagery was new to me. After some initial EDA, it became clear that the data presents a few challenges to participants:
 Speckle noise in the AGL map (presumably due to the data acquisition method)
 Temporal inconsistency between measured AGL values and RGB images (for example, a fence that is visible in AGL but missing in the RGB domain, or construction sites that differ between the two)
 Possible data leak due to overlapping or nearby tiles from the same location belonging to both train and test
I’ve done my experiments on my self-built machines: one with a 2080Ti and another PC with two 3090s. I used the single-GPU PC for quick experiments and the second one for long-running training sessions in distributed mode. Training a single model could easily take up to 3 days even on two 3090 cards.
Train/Validation split nuances
To my understanding, the safest way of splitting satellite imagery into train/test parts is to ensure that no city/region belongs to both train and test simultaneously. A checkerboard style of assigning image tiles to train and test is likely to cause a severe data leak and make the validation untrustworthy.
In contrast, when using a location-based group split, we can guarantee that validation is leak-free, which lets us evaluate model performance on unseen regions.
Unfortunately, I noticed a certain degree of overlapping / nearby tiles between train and test. So instead of building a location-based group split (which is more correct from a business perspective), I took advantage of the data leak and did a location-stratified split instead. I used the data leak intentionally, since I knew beforehand it would be an advantage on the leaderboard compared to a leak-free split.
I’ve used a 10-fold CV split, but I only managed to train 3 folds in the remaining time. I expect a full 10-fold ensemble would perform substantially better.
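To make the difference between the two splitting schemes concrete, here is a minimal sketch in plain Python. The tile ids, location labels and function names are hypothetical, and this is an illustration of the idea rather than the code used in the solution.

```python
from collections import defaultdict


def group_split(tiles, n_folds):
    """Leak-free split: all tiles of one location land in the same fold."""
    locations = sorted({loc for _, loc in tiles})
    fold_of_location = {loc: i % n_folds for i, loc in enumerate(locations)}
    return [fold_of_location[loc] for _, loc in tiles]


def stratified_split(tiles, n_folds):
    """Location-stratified split: every location is spread over all folds."""
    seen = defaultdict(int)
    folds = []
    for _, loc in tiles:
        folds.append(seen[loc] % n_folds)  # round-robin inside each location
        seen[loc] += 1
    return folds


# Hypothetical tiles given as (tile_id, location) pairs.
tiles = [(f"{loc}_{i}", loc) for loc in ("A", "B", "C", "D") for i in range(4)]
group_folds = group_split(tiles, n_folds=2)
strat_folds = stratified_split(tiles, n_folds=2)
```

With the group split, a validation fold contains only locations the model has never seen. With the stratified split, every location contributes tiles to every fold, which is exactly what lets the model benefit from nearby tiles.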
Modelling approach
I’ve used a UNet-like model with two heads (AGL, MAG) attached to the output of the decoder (to predict AGL in meters and magnitude in centimeters) and a third head (ANGLE) attached to the output of the encoder to predict the orientation. The overall architecture closely mimics the official baseline model, but the structure of the prediction heads differs significantly, which I will explain later. Unfortunately, I started this competition just two weeks before the deadline and didn’t have much time to do a proper ablation study. Comparing the performance of HRNet and FPN architectures to UNet would be a good idea for a future paper.
In my final solution I’ve used two model architectures:
 EfficientNet B7 encoder with vanilla UNet decoder
 EfficientNet B6 encoder with custom-made UNet-like decoder with residual deconvolutions.
Empirical evidence shows that diverse models play well in an ensemble. Therefore, I intentionally trained different models on different folds to make the ensemble diverse. But the two best-performing models were the heaviest ones listed above. Training each took almost 3 days on a 2x3090 setup.
Apart from the optimized model architecture, I believe other key know-how helped me to perform better in this challenge:
 Good data augmentation to increase model robustness and generalization. It also allowed using test-time augmentation at the inference stage.
 Tuning model architecture to have output stride 1 for predicted AGL
 Proper choice of the loss function
 Deep ensemble technique for blending predictions of separate models
 Full D4 test-time augmentation
Height map predictions
Dense prediction heads (for AGL and magnitude) were designed with gradual upsampling from stride 4 to stride 1. I found that simple bilinear upsampling to full resolution does not give enough accuracy. So I ended up using a bilinear upsampling operator with a factor of two, followed by convolution and activation. This block was repeated twice, with a 1x1 convolution at the end to make the final predictions.


The AGL prediction head also used the ground sample distance (GSD) value from the metadata file as an additional cue. Essentially it serves as a scaling coefficient to go from arbitrary units (inside the CNN) to real units (meters).
The MAG head did not have this input, since it predicts values in pixel units and can deduce the scale on its own using only image data.
For optimizing the AGL & MAG objectives I tested many loss functions but ended up with Huber loss. According to the R2 score on validation, models trained with Huber loss performed better than the ones trained with MSE, L1, or their combination. My explanation for this phenomenon is that Huber loss is less sensitive to outliers than MSE and converges faster than “vanilla” L1 loss. I tuned the gamma parameter to 10 for AGL and 5 for MAG to ensure there is still enough of a “sweet spot” for the MSE component of that loss.
It is also worth noting that I used meter units as the objective for AGL optimization. Using centimeter units leads to instabilities during training in fp16 mode due to large values in the target height maps. Scaling target values to meter units solved this problem. I also experimented with log-scale height regression and the log-cosh loss function, yet without noticeable improvement in R2 score.
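The behavior described above can be seen in a few lines of plain Python. This is a sketch of the textbook Huber loss, assuming the gamma parameter mentioned above plays the role of the threshold between the quadratic and the linear regime (often called delta); the function names are illustrative.

```python
def huber(residual, threshold):
    """Huber loss for one residual: quadratic near zero, linear in the tails."""
    r = abs(residual)
    if r <= threshold:
        return 0.5 * r * r
    return threshold * (r - 0.5 * threshold)


def half_mse(residual):
    """Squared error with the same 1/2 factor, for a like-for-like comparison."""
    return 0.5 * residual * residual


small = huber(1.0, threshold=10.0)     # inside the quadratic "sweet spot"
outlier = huber(50.0, threshold=10.0)  # grows only linearly
outlier_mse = half_mse(50.0)           # grows quadratically
```

For a 1 m residual the two losses coincide, while for a 50 m outlier the Huber loss is several times smaller than the squared error, so a single speckle-noise pixel pulls the gradients much less.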
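One plausible mechanism behind this instability (an assumption, not something verified in the experiments above) is overflow of intermediate values: the largest finite half-precision number is 65504, and squared residuals expressed in centimeters exceed it very quickly. The sketch below uses the standard library `struct` module, whose `"e"` format packs IEEE 754 half-precision floats; the helper name is hypothetical.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_fp16(value):
    """Return True if `value` can be packed as half precision without overflow.

    Note: `struct` raises OverflowError, whereas GPU arithmetic would
    silently produce `inf` instead.
    """
    try:
        struct.pack("e", value)
        return True
    except OverflowError:
        return False


residual_m = 5.0                  # hypothetical 5 m prediction error
residual_cm = residual_m * 100.0  # the same error expressed in centimeters

ok_in_meters = fits_fp16(residual_m ** 2)        # 25 fits easily
ok_in_centimeters = fits_fp16(residual_cm ** 2)  # 250000 is above FP16_MAX
```

Switching units does not change the relative precision of the stored heights, but it shrinks every squared term by a factor of 10^4, which keeps the loss and its gradients inside the representable range.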
Orientation prediction
To predict orientation, I used an approach similar to the baseline solution, with a global pooling classifier attached to the last feature map of the encoder.


The only deviation from the baseline head is an additional nonlinearity for orientation regression, which gives the head itself more capacity to represent the angle instead of relying only on the final layers of the encoder. Orientation was parametrized as a vector of two components, (cos(theta), sin(theta)).
I’ve used a combination of Cosine and MSE losses for optimizing this objective. For the Cosine component, a weight of 1.0 was used, while the MSE loss had a weight of 0.1 and acted as a regularization loss to keep predictions around the unit circle.
It is also worth noting that image areas with a small (<1 m) height above ground are unlikely to contain enough signal to compute the orientation. Presumably, excluding or down-weighting those areas during orientation prediction could increase its accuracy. One heuristic that came to my mind is the following:
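A minimal sketch of this parametrization and loss in plain Python follows. It assumes the cosine component is one minus the cosine similarity and that the MSE component compares the predicted and target vectors directly; the function names and the exact reduction are illustrative, not taken from the solution code.

```python
import math


def encode_angle(theta):
    """Represent an angle (radians) as a point on the unit circle."""
    return (math.cos(theta), math.sin(theta))


def decode_angle(vec):
    """Recover the angle in [0, 2*pi) from a (cos, sin)-like vector."""
    return math.atan2(vec[1], vec[0]) % (2 * math.pi)


def orientation_loss(pred, target, cosine_weight=1.0, mse_weight=0.1):
    """Weighted sum of a cosine term (direction) and an MSE term (length)."""
    dot = pred[0] * target[0] + pred[1] * target[1]
    norm = math.hypot(*pred) * math.hypot(*target)
    cosine_term = 1.0 - dot / max(norm, 1e-8)
    mse_term = ((pred[0] - target[0]) ** 2 + (pred[1] - target[1]) ** 2) / 2
    return cosine_weight * cosine_term + mse_weight * mse_term


target = encode_angle(math.radians(30))
too_long = (2 * target[0], 2 * target[1])  # right direction, wrong length
loss_perfect = orientation_loss(target, target)
loss_too_long = orientation_loss(too_long, target)
```

The cosine term alone cannot tell `too_long` from a perfect prediction, since both point in the same direction. The small MSE term is what penalizes the drift away from the unit circle.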
 Compute AGL predictions as usual
 Compute the absolute magnitude of the spatial gradients of the predicted AGL.
 Compute 2D softmax on abs. magnitude map to get an attention mask.
 Compute orientation predictions in a dense fashion, similar to AGL head.
 Multiply dense orientation predictions with an attention mask.
The line of thought is as follows: this approach gives more weight to areas where the height changes (the edges of buildings) and disfavors flat regions.
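Since this heuristic was never implemented, the steps above can only be sketched. The version below uses plain Python lists instead of tensors, forward differences for the gradient, and a softmax without temperature; all of these choices and all function names are assumptions made for illustration.

```python
import math


def gradient_magnitude(agl):
    """Absolute spatial gradient of a 2D height map (forward differences)."""
    h, w = len(agl), len(agl[0])
    grad = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = agl[y][min(x + 1, w - 1)] - agl[y][x]
            dy = agl[min(y + 1, h - 1)][x] - agl[y][x]
            grad[y][x] = math.hypot(dx, dy)
    return grad


def softmax_2d(values):
    """Softmax over all pixels, so the attention mask sums to one."""
    peak = max(max(row) for row in values)
    exp = [[math.exp(v - peak) for v in row] for row in values]
    total = sum(sum(row) for row in exp)
    return [[e / total for e in row] for row in exp]


def pooled_orientation(dense_vectors, attention):
    """Attention-weighted sum of per-pixel (cos, sin) predictions."""
    cx = cy = 0.0
    for vec_row, att_row in zip(dense_vectors, attention):
        for (vx, vy), a in zip(vec_row, att_row):
            cx += a * vx
            cy += a * vy
    return cx, cy


# Toy example: a 20 m building occupying the left half of a 4x4 tile.
agl = [[20.0, 20.0, 0.0, 0.0] for _ in range(4)]
attention = softmax_2d(gradient_magnitude(agl))
# Hypothetical dense predictions: edge pixels say (1, 0), the rest say (0, 1).
dense = [[(1.0, 0.0) if x == 1 else (0.0, 1.0) for x in range(4)] for _ in range(4)]
cx, cy = pooled_orientation(dense, attention)
```

In this toy case almost all of the attention mass sits on the building edge, so the pooled vector follows the edge pixels and ignores the flat roof and ground.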
Data Augmentation
Data augmentation has proved itself to be a key ingredient in winning ML challenges, and this one is no exception. The well-known Albumentations library is now a de facto standard for image augmentation in Python and offers almost a hundred image/mask/box augmentations of all sorts. However, this competition contained data from several domains:
 Spatial, pixels
 Spatial, meters
 Scalar, meters/pixel
 Scalar, degrees
It was necessary to have augmentations that were consistent across all domains. So I implemented a custom affine augmentation class to augment height maps consistently with the orientation angle and scale parameters. It helped to increase the diversity of image orientation and scale range during training.
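To illustrate what consistency across domains means, here is a small sketch of how the scalar labels could be updated when an image is rotated and resized. The function name, the argument names and the sign convention for the rotation are assumptions; the actual augmentation class in the solution may differ.

```python
import math


def augment_labels(angle_rad, gsd, rotation_rad, zoom):
    """Update labels after rotating the image by `rotation_rad` and resizing by `zoom`.

    Returns (new_angle, new_gsd, mag_factor, agl_factor), where the two
    factors are multipliers for the magnitude map (pixels) and the AGL
    map (meters).
    """
    new_angle = (angle_rad + rotation_rad) % (2 * math.pi)  # assumed sign convention
    new_gsd = gsd / zoom  # with zoom > 1 each pixel covers less ground
    mag_factor = zoom     # displacements measured in pixels grow with the image
    agl_factor = 1.0      # physical heights in meters do not change
    return new_angle, new_gsd, mag_factor, agl_factor


new_angle, new_gsd, mag_factor, agl_factor = augment_labels(
    angle_rad=math.radians(350), gsd=0.5, rotation_rad=math.radians(20), zoom=2.0
)
```

A useful sanity check is that the displacement in meters (magnitude in pixels times GSD) stays the same before and after the augmentation, because `mag_factor * new_gsd` equals the original GSD.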


I’ve also implemented augmentations for 90-degree rotation and transposition. Training with these augmentations increased model stability and enabled me to use D4 test-time augmentation at the inference stage.
Apart from spatial augmentations, I’ve also used photometric augmentations. In particular:
 Random brightness & contrast change
 Random tone curve adjustment
 Random gamma correction
 Applying different noise models: Gaussian, ISO, Multiplicative
 Random RGB channels shift
 Random shifts in HSV colorspace
 Fancy PCA Augmentation
 Random fog augmentation
 Coarse dropout of square blocks up to 128x128
The full augmentation pipeline looked as follows:


This set of augmentations isn’t scientifically picked, but rather battle-tested in many similar competitions (Inria Aerial Labeling, Kaggle, SpaceNet challenges). Usually, as a rule of thumb, the heavier the model, the more augmentations you may apply to prevent overfitting.
According to the training logs, I was very far from overfitting. However, this could also be an artifact of the leaky validation scheme that I deliberately chose. Anyway, photometric augmentation usually helps a lot. Spatial augmentation should be applied with care, but is also generally safe.
Inference tricks
During the training session, I saved the best 3 checkpoints based on the R2 score on validation. At inference, I used all of them and averaged their predictions to get the final result. Averaging of height map predictions and scale was done naively using the arithmetic mean. Angle averaging was done a bit differently: I normalized the angle predictions to unit length and then averaged them.
The second inference trick was test-time augmentation. When properly trained, the model is invariant to flips/D2/D4 augmentations. However, averaging the predictions for 8 variants of the input image greatly reduces the variance of the predictions and improves the final R2 at the price of increased inference time. According to my measurements, D4 TTA gave a boost of roughly 0.005 to 0.01 R2, depending on the model.
All ensembling was done entirely on the GPU, without involving temporary storage. For this purpose I wrote a special wrapper to perform augmentation/deaugmentation on the fly during inference.
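The reason for averaging angles as vectors rather than as numbers is easiest to see near the 0/360 degree wrap-around. The sketch below is a plain Python illustration with a hypothetical function name; it follows the normalize-then-average recipe described above.

```python
import math


def average_angles(vectors):
    """Average orientation predictions given as (cos, sin)-like vectors."""
    sx = sy = 0.0
    for x, y in vectors:
        norm = math.hypot(x, y) or 1.0  # guard against a zero vector
        sx += x / norm
        sy += y / norm
    return math.atan2(sy, sx) % (2 * math.pi)


# Two checkpoints that almost agree: 350 and 10 degrees.
preds = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in (350, 10)]
naive = math.radians((350 + 10) / 2)  # 180 degrees, the opposite direction
robust = average_angles(preds)        # lands next to 0 degrees (mod 360)
```

Normalizing before averaging also prevents a single checkpoint with an unusually long output vector from dominating the ensemble.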


I’ve found TTA to be an extremely useful way to get a “cheap” increase in model accuracy. It comes at the price of an N-fold increase in inference time (8x for D4, 4x for D2, and 2x for horizontal flips). Still, in uncapped competitions, participants are very tempted to go with huge ensembles and heavy TTA techniques as long as the rules do not restrict inference time.
What didn’t work or what I didn’t have time to try
Next time I will start a bit earlier. Two weeks was indeed a “last call” to hop on. The challenge was complicated and hardware-demanding. Also, there were many strong participants on the leaderboard. A few ideas were left untouched this time:
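The augment, predict, deaugment, average loop behind D4 TTA can be sketched without any framework. The version below works on 2D lists and a `model` callable that maps an image to a per-pixel map of the same shape; both are stand-ins for GPU tensors and a real network, and the function names are hypothetical.

```python
def rot90(img):
    """Rotate a 2D list-of-lists by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def rot270(img):
    """Rotate by 90 degrees clockwise (the inverse of rot90)."""
    return [list(row) for row in zip(*img[::-1])]


def transpose(img):
    return [list(row) for row in zip(*img)]


def d4_transforms():
    """Return the 8 (forward, inverse) pairs of the D4 symmetry group."""
    pairs = []
    for flip in (False, True):
        for k in range(4):
            def forward(img, k=k, flip=flip):
                out = transpose(img) if flip else img
                for _ in range(k):
                    out = rot90(out)
                return out

            def inverse(img, k=k, flip=flip):
                out = img
                for _ in range(k):
                    out = rot270(out)
                return transpose(out) if flip else out

            pairs.append((forward, inverse))
    return pairs


def d4_tta(model, img):
    """Average de-augmented predictions over all 8 D4 variants of `img`."""
    preds = [inverse(model(forward(img))) for forward, inverse in d4_transforms()]
    h, w = len(img), len(img[0])
    return [[sum(p[y][x] for p in preds) / len(preds) for x in range(w)] for y in range(h)]


image = [[1.0, 2.0], [3.0, 4.0]]
averaged = d4_tta(lambda x: x, image)  # an identity "model" returns the input
```

This sketch covers scalar per-pixel outputs such as the height map. Direction-valued outputs (the orientation vector) additionally need to be rotated or flipped back themselves before averaging, which is omitted here.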
 Transformer-based models. It has been shown recently that ViT, Swin, and other approaches can beat classical CNN models in segmentation tasks. Therefore, it would be interesting to compare their performance with my solution.
 Trimmed losses. Monocular depth estimation benefits greatly from using trimmed losses to exclude the influence of strong outliers during the training. I wonder whether this can be applied to geopose estimation as well.
 HRNet, FPN architectures, NFNets
 GAN-based data augmentation. I feel this is a largely underestimated technique. It requires significant effort to master but can lead to a substantial boost in model generalization.
 Direct optimization of R2 score. Could be interesting to look at how better/worse it will be compared to MSE/Huber losses.
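For the last idea, a direct R2 objective could look like the sketch below. It uses the standard definition of the coefficient of determination; the function names are hypothetical, and since this idea was not tried, nothing here says how it would compare to Huber loss in practice.

```python
def r2_score(targets, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, preds))
    ss_tot = sum((t - mean) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot  # undefined when all targets are equal


def r2_loss(targets, preds):
    """Quantity to minimize: zero for a perfect fit, one for predicting the mean."""
    return 1.0 - r2_score(targets, preds)


heights = [1.0, 2.0, 3.0]
perfect = r2_score(heights, heights)
mean_only = r2_score(heights, [2.0, 2.0, 2.0])
```

One caveat is that `SS_tot` computed on a mini-batch differs from the one computed on the whole dataset, so a per-batch R2 loss is only an approximation of the leaderboard metric.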
I’d like to say thanks to the organizers of this challenge for such an interesting task! Hope to see more similar competitions in the future.
References
 U-Net: Convolutional Networks for Biomedical Image Segmentation (https://arxiv.org/abs/1505.04597)
 Albumentations (https://github.com/albumentations-team/albumentations)
 Catalyst (https://github.com/catalyst-team/catalyst)
 pytorch-toolbelt (https://github.com/BloodAxe/pytorch-toolbelt)
 XView2 Solution (https://github.com/BloodAxe/xView2-Solution)
 GeoPose Solution (https://github.com/BloodAxe/DrivenData-2021-Geopose-Solution)
 PyTorch Image Models (https://github.com/rwightman/pytorch-image-models)
 The Devil is in the Decoder: Classification, Regression and GANs (https://arxiv.org/abs/1707.05847)
 Learning Geocentric Object Pose in Oblique Monocular Images (https://openaccess.thecvf.com/content_CVPR_2020/papers/Christie_Learning_Geocentric_Object_Pose_in_Oblique_Monocular_Images_CVPR_2020_paper.pdf)