
Deep Learning for Earth Observation: From CNNs to Semantic Segmentation

Kazushi Motomura · January 5, 2026 · 7 min read

Quick Answer: Deep learning has moved satellite image analysis beyond pixel-level indices to scene understanding. CNNs classify entire image patches, object detection models (YOLO, Faster R-CNN) locate specific features like buildings or ships, and semantic segmentation networks (U-Net, DeepLab) classify every pixel into land cover categories. Key challenges in Earth observation include limited labeled training data, large image sizes (often 10,000+ pixels per side), class imbalance (rare features in vast landscapes), and multi-spectral inputs that standard pretrained models don't handle natively. Transfer learning from ImageNet helps but requires adaptation for satellite-specific spectral bands and spatial resolutions.

In 2023, a colleague asked me to help map informal settlements across a 50,000 km² region in Southeast Asia. Traditional classification — supervised maximum likelihood on spectral bands — produced results that were, charitably, unusable. The settlements were spectrally identical to surrounding bare soil. It took a U-Net trained on 200 hand-labeled patches to achieve 87% accuracy. That project crystallized something I'd been sensing for years: for certain problems, deep learning isn't just better than classical approaches — it's the only approach that works.

But deep learning in remote sensing isn't the same as deep learning in computer vision. The data is different, the scale is different, and the failure modes are different.

Why Classical Methods Hit a Ceiling

Traditional satellite image classification relies on spectral signatures — the reflectance values in different wavelength bands. This works beautifully for problems where the target has a distinct spectral profile: vegetation (high NIR reflectance), water (low NIR), snow (high visible, low SWIR).

It fails when the problem requires spatial context. A pixel's spectral values alone can't tell you whether it belongs to a building, a road, or a parking lot — all three may have similar reflectance. But a convolutional neural network can learn that buildings have rectangular shapes, roads are linear, and parking lots are large flat areas adjacent to buildings.

This is the fundamental advantage: deep learning can learn spatial patterns, not just spectral ones.
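For contrast, the classical spectral route is often a one-line index. A minimal NumPy sketch (the arrays and the 0.3 threshold are illustrative, not a standard):

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + 1e-10)  # epsilon avoids divide-by-zero

# Toy 2x2 reflectance patch: vegetation reflects strongly in NIR.
nir = np.array([[0.60, 0.50], [0.10, 0.05]])
red = np.array([[0.10, 0.10], [0.08, 0.04]])
mask = ndvi(nir, red) > 0.3  # illustrative vegetation threshold
```

Per-pixel, purely spectral: no amount of thresholding here can tell a building from a parking lot.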

The Three Levels of Deep Learning in Earth Observation

Scene Classification

The simplest application: classify an entire image tile. "Is this tile urban, agricultural, forest, or water?" ResNet and EfficientNet architectures work well here. You typically work with patches of 64×64 to 256×256 pixels.

When it's useful: Large-area land use mapping where you need broad categories, not precise boundaries. The EuroSAT dataset (Sentinel-2 patches labeled into 10 land use classes) is the standard benchmark — current models exceed 98% accuracy.

Object Detection

Locate and classify specific objects: buildings, ships, aircraft, solar panels, swimming pools. YOLO and Faster R-CNN are the dominant architectures, adapted for satellite imagery.

The challenge with satellite data: Objects are small. A ship in a 10m Sentinel-2 image might be 3-5 pixels. Even in sub-meter commercial imagery, a car is 4-6 pixels across. This is fundamentally different from natural images where objects typically occupy a significant portion of the frame.

Practical tip: Anchor box sizes must be recalibrated for satellite imagery. Standard pretrained detection models use anchor boxes designed for objects in photographs — far too large for overhead imagery.

Semantic Segmentation

Classify every pixel: this pixel is building, that pixel is road, this one is vegetation. U-Net remains the workhorse architecture for remote sensing segmentation, largely because its encoder-decoder structure with skip connections preserves both high-level features and fine spatial detail.

DeepLab v3+ is the other common choice, using atrous (dilated) convolutions to capture multi-scale context without losing resolution. For very high-resolution imagery (<1m), combining both approaches often outperforms either alone.
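The encoder-decoder-with-skips idea fits in a small sketch. This toy network has one downsampling stage and one skip connection; real remote-sensing U-Nets stack four or five such stages, but the structure is the same:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net-style sketch: one downsampling stage, one skip connection."""
    def __init__(self, in_ch: int = 3, n_classes: int = 2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        # decoder sees upsampled features concatenated with the skip connection
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, n_classes, 1)  # per-pixel class logits

    def forward(self, x):
        skip = self.enc(x)                          # fine spatial detail
        x = self.bottleneck(self.down(skip))        # coarse, high-level features
        x = self.up(x)
        x = self.dec(torch.cat([x, skip], dim=1))   # merge via skip connection
        return self.head(x)

seg = TinyUNet(in_ch=3, n_classes=2)
logits = seg(torch.randn(1, 3, 64, 64))  # per-pixel logits, same H x W as input
```

The skip connection is what preserves fine boundaries: without it, everything the decoder sees has passed through the downsampling bottleneck.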

The Satellite-Specific Challenges

More Than Three Channels

Standard deep learning models expect 3-channel RGB images. Sentinel-2 has 13 bands. Landsat has 11. SAR has complex-valued data with amplitude and phase.

Common approaches:

  • Band selection: Choose 3-4 bands relevant to your task (e.g., NIR-Red-Green for vegetation)
  • Modified input layers: Replace the first convolutional layer to accept N channels, randomly initialize it, and fine-tune
  • Late fusion: Process subsets of bands through separate encoder branches, merge at a deeper layer

In my experience, the modified input layer approach offers the best balance of simplicity and performance. You lose ImageNet pretraining for the first layer but retain it for all subsequent layers.

Image Size and Tiling

A single Sentinel-2 granule is 10,980 × 10,980 pixels at 10m resolution. You can't feed this into a neural network directly. The standard approach is tiling with overlap:

  1. Cut the image into overlapping tiles (e.g., 256×256 with 50% overlap)
  2. Run inference on each tile
  3. Merge predictions, averaging overlapping regions

The edge effect problem: Predictions near tile edges are often less accurate because the network lacks context beyond the boundary. Overlap and averaging mitigate this, but don't eliminate it entirely.
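The tile-and-merge loop is straightforward to sketch in NumPy. This toy version assumes a single-band image whose dimensions align with the tile grid; `model_fn` stands in for a real inference call:

```python
import numpy as np

def predict_tiled(image, model_fn, tile=256, overlap=128):
    """Run model_fn on overlapping tiles and average the predictions.
    model_fn maps a (tile, tile) array to a (tile, tile) prediction."""
    h, w = image.shape
    pred = np.zeros((h, w))
    count = np.zeros((h, w))  # how many tiles cover each pixel
    step = tile - overlap
    for y in range(0, max(h - tile, 0) + 1, step):
        for x in range(0, max(w - tile, 0) + 1, step):
            pred[y:y + tile, x:x + tile] += model_fn(image[y:y + tile, x:x + tile])
            count[y:y + tile, x:x + tile] += 1
    return pred / np.maximum(count, 1)  # average overlapping predictions

# Sanity check with an identity "model": merging must reproduce the input.
img = np.random.rand(512, 512)
merged = predict_tiled(img, lambda p: p, tile=256, overlap=128)
```

A production version would also pad the image so the last row and column of tiles align, and could weight tile centers more heavily than edges when averaging.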

Limited Labeled Data

This is the biggest practical challenge. Labeling satellite imagery is expensive and requires domain expertise. You can't crowdsource it the way you can for cat-vs-dog classification.

Strategies that work:

  • Transfer learning from ImageNet: Even though satellite images look nothing like photographs, the low-level features (edges, textures) transfer surprisingly well
  • Self-supervised pretraining on satellite data: SatMAE, SSL4EO-S12, and similar models pretrained on large unlabeled satellite datasets provide better starting points than ImageNet for remote sensing tasks
  • Data augmentation: Random rotations (any angle — overhead images have no canonical orientation), flips, color jittering, and random cropping
  • Semi-supervised learning: Label a small subset, train a model, use it to generate pseudo-labels for unlabeled data, retrain
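The orientation augmentations are cheap to implement directly. A NumPy sketch for a `(H, W, bands)` patch, restricted to 90-degree rotations and flips (arbitrary-angle rotation needs interpolation, e.g. `scipy.ndimage.rotate`, and is omitted here):

```python
import numpy as np

def augment(patch: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Orientation augmentation for overhead imagery (no canonical 'up'):
    random 90-degree rotation plus random vertical/horizontal flips."""
    patch = np.rot90(patch, k=int(rng.integers(4)), axes=(0, 1))
    if rng.random() < 0.5:
        patch = patch[::-1, :, :]   # vertical flip
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]   # horizontal flip
    return np.ascontiguousarray(patch)

rng = np.random.default_rng(0)
aug = augment(np.random.rand(64, 64, 13), rng)  # shape is preserved
```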

Class Imbalance

If you're mapping buildings in a rural area, 95% of pixels might be "not building." The network learns to predict "not building" everywhere and achieves 95% accuracy while being completely useless.

Solutions: Weighted loss functions (increase the penalty for misclassifying rare classes), focal loss, or oversampling rare-class patches during training.
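Both fixes are a few lines in PyTorch. Focal loss down-weights easy, well-classified pixels by a factor of (1 − p)^γ so the rare class dominates the gradient; the class weights below are illustrative, not tuned values:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma: float = 2.0):
    """Focal loss for segmentation.
    logits: (N, C, H, W), targets: (N, H, W) integer class indices."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-pixel CE
    p_t = torch.exp(-ce)          # model's probability for the true class
    return ((1 - p_t) ** gamma * ce).mean()

# Weighted cross-entropy alternative: penalize the rare class 20x more.
weights = torch.tensor([1.0, 20.0])   # [background, building] - illustrative
logits = torch.randn(2, 2, 8, 8)
targets = torch.randint(0, 2, (2, 8, 8))
fl = focal_loss(logits, targets)
wce = F.cross_entropy(logits, targets, weight=weights)
```

By construction, focal loss never exceeds plain cross-entropy on the same batch; it only redistributes emphasis toward hard pixels.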

Change Detection with Deep Learning

One of the most impactful applications: detecting what changed between two dates. Classical approaches compare spectral values directly (image differencing, CVA). Deep learning approaches can learn complex change patterns.

Siamese networks are the standard architecture: two identical encoder branches process the pre-change and post-change images, and a decoder produces a change/no-change map. The network learns to distinguish meaningful changes (new construction, deforestation) from noise (seasonal vegetation changes, atmospheric differences).
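The shared-encoder idea can be sketched in a few lines. This toy version differences the two feature maps; real change detectors are deeper and often concatenate features instead, but the weight sharing is the essential part:

```python
import torch
import torch.nn as nn

class SiameseChange(nn.Module):
    """Sketch of a Siamese change detector: one shared encoder processes
    both dates; the decoder maps the feature difference to a per-pixel
    change/no-change logit."""
    def __init__(self, in_ch: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(          # shared weights for both dates
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.decoder = nn.Conv2d(32, 1, 1)     # 1-channel change logit

    def forward(self, before, after):
        f1, f2 = self.encoder(before), self.encoder(after)
        return self.decoder(torch.abs(f1 - f2))  # symmetric in the two dates

net = SiameseChange()
change = net(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```

Using the absolute difference makes the output invariant to swapping the two dates, which is sometimes desirable and sometimes not (e.g. when construction and demolition must be distinguished).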

Practical results: In post-disaster assessment, a well-trained Siamese U-Net can map building damage from pre/post-event satellite imagery in minutes — a task that traditionally requires days of manual interpretation.

What Deep Learning Cannot (Yet) Do Well

  • Generalize across geographies without retraining: A building detection model trained on European cities performs poorly in African informal settlements
  • Work with very few examples: If you have fewer than ~50 labeled examples of your target class, classical methods often outperform deep learning
  • Explain its decisions: A spectral threshold is interpretable; a 50-layer neural network is not. In regulatory or scientific contexts, this matters
  • Handle temporal irregularity: Satellite time series have varying acquisition dates, cloud contamination, and different viewing angles — challenges that standard architectures don't handle natively

Getting Started: A Practical Path

  1. Start with transfer learning — don't train from scratch. Use a ResNet or EfficientNet pretrained on ImageNet, modify the input channels, and fine-tune on your labeled data
  2. Begin with classification, not segmentation — scene-level labels are far cheaper to produce than pixel-level masks
  3. Use existing datasets — EuroSAT, BigEarthNet, SpaceNet, xView, DOTA provide labeled satellite data for various tasks
  4. Consider torchgeo — the PyTorch library specifically designed for geospatial deep learning, with built-in support for common satellite datasets, samplers, and transforms
  5. Benchmark against classical methods — if random forest on spectral features achieves 90% accuracy, spending weeks training a deep learning model to reach 92% may not be worth it
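The classical baseline in step 5 is itself only a few lines with scikit-learn. A sketch on synthetic per-pixel spectral features (the two-class reflectance values are made up for illustration; substitute your real labeled pixels):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 4-band "reflectance" samples standing in for labeled pixels.
rng = np.random.default_rng(42)
water = rng.normal([0.05, 0.04, 0.03, 0.02], 0.01, size=(500, 4))  # low NIR
veg = rng.normal([0.04, 0.08, 0.05, 0.45], 0.03, size=(500, 4))    # high NIR
X = np.vstack([water, veg])
y = np.repeat([0, 1], 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # if this is already near your target accuracy,
                             # a deep model may not be worth the effort
```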

The field is moving fast. Foundation models trained on massive satellite archives (like the IBM-NASA Prithvi model and ESA's PhilEO) are making it possible to achieve good results with far less labeled data than before. But the fundamentals — understanding your data, defining your problem clearly, and validating rigorously — matter more than the architecture you choose.

Kazushi Motomura


Remote sensing specialist with 10+ years in satellite data processing. Founder of Off-Nadir Lab. Master's in Satellite Oceanography (Kyushu University).