
Deep Learning for Earth Observation: From CNNs to Semantic Segmentation

Kazushi Motomura · January 5, 2026 · 7 min read

Quick Answer: Deep learning has moved satellite image analysis beyond pixel-level indices to scene understanding. CNNs classify entire image patches, object detection models (YOLO, Faster R-CNN) locate specific features like buildings or ships, and semantic segmentation networks (U-Net, DeepLab) classify every pixel into land cover categories. Key challenges in Earth observation include limited labeled training data, large image sizes (often 10,000+ pixels per side), class imbalance (rare features in vast landscapes), and multi-spectral inputs that standard pretrained models don't handle natively. Transfer learning from ImageNet helps but requires adaptation for satellite-specific spectral bands and spatial resolutions.

In 2023, a colleague asked me to help map informal settlements across a 50,000 km² region in Southeast Asia. Traditional classification — supervised maximum likelihood on spectral bands — produced results that were, charitably, unusable. The settlements were spectrally identical to surrounding bare soil. It took a U-Net trained on 200 hand-labeled patches to achieve 87% accuracy. That project crystallized something I'd been sensing for years: for certain problems, deep learning isn't just better than classical approaches — it's the only approach that works.

But deep learning in remote sensing isn't the same as deep learning in computer vision. The data is different, the scale is different, and the failure modes are different.

Why Classical Methods Hit a Ceiling

Traditional satellite image classification relies on spectral signatures — the reflectance values in different wavelength bands. This works beautifully for problems where the target has a distinct spectral profile: vegetation (high NIR reflectance), water (low NIR), snow (high visible, low SWIR).

It fails when the problem requires spatial context. A pixel's spectral values alone can't tell you whether it belongs to a building, a road, or a parking lot — all three may have similar reflectance. But a convolutional neural network can learn that buildings have rectangular shapes, roads are linear, and parking lots are large flat areas adjacent to buildings.

This is the fundamental advantage: deep learning can learn spatial patterns, not just spectral ones.
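For contrast, the classical spectral route is often a one-line index. A minimal NumPy sketch (the arrays and the 0.3 threshold are illustrative, not a standard):

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + 1e-10)  # epsilon avoids divide-by-zero

# Toy 2x2 reflectance patch: vegetation reflects strongly in NIR.
nir = np.array([[0.60, 0.50], [0.10, 0.05]])
red = np.array([[0.10, 0.10], [0.08, 0.04]])
mask = ndvi(nir, red) > 0.3  # illustrative vegetation threshold
```

Per-pixel, purely spectral: no amount of thresholding here can tell a building from a parking lot.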

The Three Levels of Deep Learning in Earth Observation

Scene Classification

The simplest application: classify an entire image tile. "Is this tile urban, agricultural, forest, or water?" ResNet and EfficientNet architectures work well here. You typically work with patches of 64×64 to 256×256 pixels.

When it's useful: Large-area land use mapping where you need broad categories, not precise boundaries. The EuroSAT dataset (Sentinel-2 patches labeled into 10 land use classes) is the standard benchmark — current models exceed 98% accuracy.

Object Detection

Locate and classify specific objects: buildings, ships, aircraft, solar panels, swimming pools. YOLO and Faster R-CNN are the dominant architectures, adapted for satellite imagery.

The challenge with satellite data: Objects are small. A ship in a 10m Sentinel-2 image might be 3-5 pixels. Even in sub-meter commercial imagery, a car is 4-6 pixels across. This is fundamentally different from natural images where objects typically occupy a significant portion of the frame.

Practical tip: Anchor box sizes must be recalibrated for satellite imagery. Standard pretrained detection models use anchor boxes designed for objects in photographs — far too large for overhead imagery.

Semantic Segmentation

Classify every pixel: this pixel is building, that pixel is road, this one is vegetation. U-Net remains the workhorse architecture for remote sensing segmentation, largely because its encoder-decoder structure with skip connections preserves both high-level features and fine spatial detail.

DeepLab v3+ is the other common choice, using atrous (dilated) convolutions to capture multi-scale context without losing resolution. For very high-resolution imagery (<1m), combining both approaches often outperforms either alone.
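The encoder-decoder-with-skips idea fits in a small sketch. This toy network has one downsampling stage and one skip connection; real remote-sensing U-Nets stack four or five such stages, but the structure is the same:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net-style sketch: one downsampling stage, one skip connection."""
    def __init__(self, in_ch: int = 3, n_classes: int = 2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        # decoder sees upsampled features concatenated with the skip connection
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, n_classes, 1)  # per-pixel class logits

    def forward(self, x):
        skip = self.enc(x)                          # fine spatial detail
        x = self.bottleneck(self.down(skip))        # coarse, high-level features
        x = self.up(x)
        x = self.dec(torch.cat([x, skip], dim=1))   # merge via skip connection
        return self.head(x)

seg = TinyUNet(in_ch=3, n_classes=2)
logits = seg(torch.randn(1, 3, 64, 64))  # per-pixel logits, same H x W as input
```

The skip connection is what preserves fine boundaries: without it, everything the decoder sees has passed through the downsampling bottleneck.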

The Satellite-Specific Challenges

More Than Three Channels

Standard deep learning models expect 3-channel RGB images. Sentinel-2 has 13 bands. Landsat has 11. SAR has complex-valued data with amplitude and phase.

Common approaches:

  • Band selection: Choose 3-4 bands relevant to your task (e.g., NIR-Red-Green for vegetation)
  • Modified input layers: Replace the first convolutional layer to accept N channels, randomly initialize it, and fine-tune
  • Late fusion: Process subsets of bands through separate encoder branches, merge at a deeper layer

In my experience, the modified input layer approach offers the best balance of simplicity and performance. You lose ImageNet pretraining for the first layer but retain it for all subsequent layers.

Image Size and Tiling

A single Sentinel-2 granule is 10,980 × 10,980 pixels at 10m resolution. You can't feed this into a neural network directly. The standard approach is tiling with overlap:

  1. Cut the image into overlapping tiles (e.g., 256×256 with 50% overlap)
  2. Run inference on each tile
  3. Merge predictions, averaging overlapping regions

The edge effect problem: Predictions near tile edges are often less accurate because the network lacks context beyond the boundary. Overlap and averaging mitigate this, but don't eliminate it entirely.
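The tile-and-merge loop is straightforward to sketch in NumPy. This toy version assumes a single-band image whose dimensions align with the tile grid; `model_fn` stands in for a real inference call:

```python
import numpy as np

def predict_tiled(image, model_fn, tile=256, overlap=128):
    """Run model_fn on overlapping tiles and average the predictions.
    model_fn maps a (tile, tile) array to a (tile, tile) prediction."""
    h, w = image.shape
    pred = np.zeros((h, w))
    count = np.zeros((h, w))  # how many tiles cover each pixel
    step = tile - overlap
    for y in range(0, max(h - tile, 0) + 1, step):
        for x in range(0, max(w - tile, 0) + 1, step):
            pred[y:y + tile, x:x + tile] += model_fn(image[y:y + tile, x:x + tile])
            count[y:y + tile, x:x + tile] += 1
    return pred / np.maximum(count, 1)  # average overlapping predictions

# Sanity check with an identity "model": merging must reproduce the input.
img = np.random.rand(512, 512)
merged = predict_tiled(img, lambda p: p, tile=256, overlap=128)
```

A production version would also pad the image so the last row and column of tiles align, and could weight tile centers more heavily than edges when averaging.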

Limited Labeled Data

This is the biggest practical challenge. Labeling satellite imagery is expensive and requires domain expertise. You can't crowdsource it the way you can for cat-vs-dog classification.

Strategies that work:

  • Transfer learning from ImageNet: Even though satellite images look nothing like photographs, the low-level features (edges, textures) transfer surprisingly well
  • Self-supervised pretraining on satellite data: SatMAE, SSL4EO-S12, and similar models pretrained on large unlabeled satellite datasets provide better starting points than ImageNet for remote sensing tasks
  • Data augmentation: Random rotations (any angle — overhead images have no canonical orientation), flips, color jittering, and random cropping
  • Semi-supervised learning: Label a small subset, train a model, use it to generate pseudo-labels for unlabeled data, retrain
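The orientation augmentations are cheap to implement directly. A NumPy sketch for a `(H, W, bands)` patch, restricted to 90-degree rotations and flips (arbitrary-angle rotation needs interpolation, e.g. `scipy.ndimage.rotate`, and is omitted here):

```python
import numpy as np

def augment(patch: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Orientation augmentation for overhead imagery (no canonical 'up'):
    random 90-degree rotation plus random vertical/horizontal flips."""
    patch = np.rot90(patch, k=int(rng.integers(4)), axes=(0, 1))
    if rng.random() < 0.5:
        patch = patch[::-1, :, :]   # vertical flip
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]   # horizontal flip
    return np.ascontiguousarray(patch)

rng = np.random.default_rng(0)
aug = augment(np.random.rand(64, 64, 13), rng)  # shape is preserved
```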

Class Imbalance

If you're mapping buildings in a rural area, 95% of pixels might be "not building." The network learns to predict "not building" everywhere and achieves 95% accuracy while being completely useless.

Solutions: Weighted loss functions (increase the penalty for misclassifying rare classes), focal loss, or oversampling rare-class patches during training.
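Both fixes are a few lines in PyTorch. Focal loss down-weights easy, well-classified pixels by a factor of (1 − p)^γ so the rare class dominates the gradient; the class weights below are illustrative, not tuned values:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma: float = 2.0):
    """Focal loss for segmentation.
    logits: (N, C, H, W), targets: (N, H, W) integer class indices."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-pixel CE
    p_t = torch.exp(-ce)          # model's probability for the true class
    return ((1 - p_t) ** gamma * ce).mean()

# Weighted cross-entropy alternative: penalize the rare class 20x more.
weights = torch.tensor([1.0, 20.0])   # [background, building] - illustrative
logits = torch.randn(2, 2, 8, 8)
targets = torch.randint(0, 2, (2, 8, 8))
fl = focal_loss(logits, targets)
wce = F.cross_entropy(logits, targets, weight=weights)
```

By construction, focal loss never exceeds plain cross-entropy on the same batch; it only redistributes emphasis toward hard pixels.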

Change Detection with Deep Learning

One of the most impactful applications: detecting what changed between two dates. Classical approaches compare spectral values directly (image differencing, CVA). Deep learning approaches can learn complex change patterns.

Siamese networks are the standard architecture: two identical encoder branches process the pre-change and post-change images, and a decoder produces a change/no-change map. The network learns to distinguish meaningful changes (new construction, deforestation) from noise (seasonal vegetation changes, atmospheric differences).
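The shared-encoder idea can be sketched in a few lines. This toy version differences the two feature maps; real change detectors are deeper and often concatenate features instead, but the weight sharing is the essential part:

```python
import torch
import torch.nn as nn

class SiameseChange(nn.Module):
    """Sketch of a Siamese change detector: one shared encoder processes
    both dates; the decoder maps the feature difference to a per-pixel
    change/no-change logit."""
    def __init__(self, in_ch: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(          # shared weights for both dates
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.decoder = nn.Conv2d(32, 1, 1)     # 1-channel change logit

    def forward(self, before, after):
        f1, f2 = self.encoder(before), self.encoder(after)
        return self.decoder(torch.abs(f1 - f2))  # symmetric in the two dates

net = SiameseChange()
change = net(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```

Using the absolute difference makes the output invariant to swapping the two dates, which is sometimes desirable and sometimes not (e.g. when construction and demolition must be distinguished).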

Practical results: In post-disaster assessment, a well-trained Siamese U-Net can map building damage from pre/post-event satellite imagery in minutes — a task that traditionally requires days of manual interpretation.

What Deep Learning Cannot (Yet) Do Well

  • Generalize across geographies without retraining: A building detection model trained on European cities performs poorly in African informal settlements
  • Work with very few examples: If you have fewer than ~50 labeled examples of your target class, classical methods often outperform deep learning
  • Explain its decisions: A spectral threshold is interpretable; a 50-layer neural network is not. In regulatory or scientific contexts, this matters
  • Handle temporal irregularity: Satellite time series have varying acquisition dates, cloud contamination, and different viewing angles — challenges that standard architectures don't handle natively

Getting Started: A Practical Path

  1. Start with transfer learning — don't train from scratch. Use a ResNet or EfficientNet pretrained on ImageNet, modify the input channels, and fine-tune on your labeled data
  2. Begin with classification, not segmentation — scene-level labels are far cheaper to produce than pixel-level masks
  3. Use existing datasets — EuroSAT, BigEarthNet, SpaceNet, xView, DOTA provide labeled satellite data for various tasks
  4. Consider torchgeo — the PyTorch library specifically designed for geospatial deep learning, with built-in support for common satellite datasets, samplers, and transforms
  5. Benchmark against classical methods — if random forest on spectral features achieves 90% accuracy, spending weeks training a deep learning model to reach 92% may not be worth it
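The classical baseline in step 5 is itself only a few lines with scikit-learn. A sketch on synthetic per-pixel spectral features (the two-class reflectance values are made up for illustration; substitute your real labeled pixels):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 4-band "reflectance" samples standing in for labeled pixels.
rng = np.random.default_rng(42)
water = rng.normal([0.05, 0.04, 0.03, 0.02], 0.01, size=(500, 4))  # low NIR
veg = rng.normal([0.04, 0.08, 0.05, 0.45], 0.03, size=(500, 4))    # high NIR
X = np.vstack([water, veg])
y = np.repeat([0, 1], 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # if this is already near your target accuracy,
                             # a deep model may not be worth the effort
```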

The field is moving fast. Foundation models trained on massive satellite archives (like the IBM-NASA Prithvi model and ESA's PhilEO) are making it possible to achieve good results with far less labeled data than before. But the fundamentals — understanding your data, defining your problem clearly, and validating rigorously — matter more than the architecture you choose.

Kazushi Motomura


Remote sensing specialist with 10+ years in satellite data processing. Founder of Off-Nadir Lab. Master's in Satellite Oceanography (Kyushu University).