
Machine Learning for Satellite Image Classification: From Random Forest to Deep Learning

Kazushi Motomura · November 19, 2025 · 7 min read

Quick Answer: Machine learning classifies satellite pixels into land cover categories by learning statistical patterns from training examples. Random Forest remains the workhorse — fast to train, resistant to overfitting, handles mixed feature types, and achieves 80-90% accuracy for most land cover classification tasks. Gradient boosting (XGBoost, LightGBM) often achieves slightly higher accuracy. Deep learning (CNNs like U-Net) excels when spatial context matters (building detection, road extraction) but requires more training data and computation. The most impactful factor is training data quality, not algorithm choice — switching from Random Forest to deep learning typically improves accuracy by 2-5%, while improving training data quality can improve accuracy by 10-20%.

I've seen countless satellite classification projects where teams spent weeks optimizing neural network architectures to squeeze out an extra 1% accuracy — while their training data contained obvious mislabeling errors that were costing them 10%. The most common failure mode in satellite image classification isn't the algorithm. It's the training data.

That said, algorithm choice does matter, and understanding when to use which approach saves time and produces better results.

The Classification Problem

Satellite image classification assigns each pixel (or object) to a category:

  • Land cover classes: forest, cropland, urban, water, bare soil
  • Crop types: wheat, rice, corn, soybean
  • Damage levels: undamaged, moderate, severe
  • Any categorical distinction visible in satellite data

The process:

  1. Training data: Labeled examples of each class (pixels or polygons with known class labels)
  2. Feature extraction: Spectral bands, indices, texture, temporal features for each training sample
  3. Model training: Algorithm learns the relationship between features and classes
  4. Prediction: Apply the trained model to classify every pixel in the image
  5. Accuracy assessment: Compare classified map against independent reference data

Algorithm Options

Random Forest

An ensemble of decision trees, each trained on a random subset of training data and features. Classification is by majority vote across all trees.

Why it's the default choice:

  • Fast to train (minutes for millions of pixels)
  • Resistant to overfitting (the ensemble averaging smooths out individual tree errors)
  • Handles mixed feature types (spectral, indices, texture, elevation, categorical)
  • Provides feature importance rankings (which bands/indices matter most)
  • Minimal hyperparameter tuning needed (number of trees = 500 works for most cases)
  • Works well with relatively small training datasets (hundreds to thousands of samples)

Typical accuracy: 80-90% for 5-10 class land cover classification.
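
The defaults above can be sketched with scikit-learn. This assumes you already have a feature matrix `X` of shape (n_pixels, n_features) and integer class labels `y`; the synthetic arrays here are stand-ins for real training data:

```python
# Minimal Random Forest classification sketch with scikit-learn.
# X and y are synthetic stand-ins for real per-pixel features and labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6)) + np.repeat(np.arange(3), 200)[:, None]  # 3 classes, 6 "bands"
y = np.repeat(np.arange(3), 200)

# 500 trees covers most cases; n_jobs=-1 uses all CPU cores
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
clf.fit(X, y)

# Feature importance rankings: which bands/indices matter most
importances = clf.feature_importances_
print(importances.round(3))

# Prediction step: apply the trained model to new pixels
pred = clf.predict(rng.normal(size=(10, 6)) + 1.0)
```

In practice `X` comes from stacking band, index, and texture rasters and sampling them at the training locations.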

Gradient Boosting (XGBoost, LightGBM, CatBoost)

Builds trees sequentially, each new tree correcting the residual errors of the ensemble built so far. This often produces stronger predictions than Random Forest's independent-tree averaging.

When to choose over Random Forest:

  • Often achieves 1-3% higher accuracy
  • Better at capturing complex feature interactions
  • More hyperparameter tuning required (learning rate, max depth, regularization)
  • Slightly slower to train but still fast enough for operational use

Support Vector Machine (SVM)

Finds the optimal hyperplane separating classes in feature space. With kernel functions (RBF), handles non-linear class boundaries.

When to choose:

  • Small training datasets (<1000 samples) where SVMs can outperform ensemble methods
  • High-dimensional feature spaces (many bands/indices)
  • Caveat: SVMs are becoming less popular, as Random Forest and gradient boosting are easier to use and scale better
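
A small-dataset SVM sketch follows. Unlike tree ensembles, SVMs are sensitive to feature scale, so standardization belongs in the pipeline (data here is synthetic):

```python
# RBF-kernel SVM for a small training set (<1000 samples).
# StandardScaler matters: SVMs assume comparably scaled features.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10)) + np.repeat(np.arange(3), 100)[:, None] * 2.0
y = np.repeat(np.arange(3), 100)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
svm.fit(X, y)
```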

Convolutional Neural Networks (CNNs)

Deep learning models that process image patches rather than individual pixels, learning spatial patterns (edges, textures, shapes) in addition to spectral features.

Architectures for satellite classification:

  • U-Net: Encoder-decoder architecture for semantic segmentation. Standard choice for pixel-wise classification with spatial context.
  • DeepLab: Atrous convolution for multi-scale feature extraction. Good for objects of varying sizes.
  • ResNet/EfficientNet: For scene classification (classifying entire image patches rather than individual pixels).

When to choose deep learning:

  • Spatial context is important (building footprint extraction, road detection)
  • Large training datasets available (tens of thousands of labeled patches)
  • GPU resources available
  • Task involves pattern recognition beyond spectral signatures (object shape, arrangement)

When NOT to choose deep learning:

  • Small training datasets (deep learning overfits severely with <1000 samples)
  • Pixel-level spectral classification where spatial context doesn't help (mineral mapping)
  • Computational resources are limited
  • Interpretability is important (deep learning is harder to explain than Random Forest)

Feature Engineering

For pixel-based classifiers (Random Forest, XGBoost, SVM), the features you provide matter enormously:

Spectral Features

  • All available bands (don't pre-select — let the algorithm decide importance)
  • Band ratios (NDVI, NDWI, NDBI, NBR)
  • Red-edge indices (for vegetation type discrimination)
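
The band-ratio indices above all follow the same normalized-difference pattern, which is a few lines of numpy. The tiny reflectance arrays here are illustrative values, not real imagery:

```python
# Normalized-difference indices from reflectance bands.
import numpy as np

def normalized_difference(a, b, eps=1e-10):
    """Generic (a - b) / (a + b), guarded against division by zero."""
    return (a - b) / (a + b + eps)

red   = np.array([[0.10, 0.08], [0.30, 0.25]])
nir   = np.array([[0.40, 0.45], [0.32, 0.28]])
green = np.array([[0.12, 0.11], [0.20, 0.22]])
swir  = np.array([[0.15, 0.14], [0.35, 0.30]])

ndvi = normalized_difference(nir, red)    # vegetation
ndwi = normalized_difference(green, nir)  # water
ndbi = normalized_difference(swir, nir)   # built-up
```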

Temporal Features

  • Multi-date observations (spring, summer, autumn composites)
  • Phenological metrics (green-up date, peak NDVI, growing season length)
  • Time series statistics (mean, median, standard deviation, min, max per band)
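
Time-series statistics reduce a stack of dates to a fixed-length feature vector per pixel. A sketch, using a random NDVI stack as a stand-in and the argmax date as a crude peak-NDVI phenological metric:

```python
# Per-pixel temporal features from a (dates, pixels) NDVI stack.
import numpy as np

rng = np.random.default_rng(3)
n_dates, n_pixels = 12, 5
ndvi_stack = rng.uniform(0.1, 0.9, size=(n_dates, n_pixels))

features = np.column_stack([
    ndvi_stack.mean(axis=0),
    np.median(ndvi_stack, axis=0),
    ndvi_stack.std(axis=0),
    ndvi_stack.min(axis=0),
    ndvi_stack.max(axis=0),
    ndvi_stack.argmax(axis=0),  # date index of peak NDVI
])
print(features.shape)  # six temporal features per pixel
```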

Texture Features

  • GLCM metrics (contrast, homogeneity, entropy, correlation) computed over neighborhood windows
  • Edge detection filters
  • Most valuable for distinguishing classes with similar spectral signatures but different spatial patterns
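
To make GLCM concrete, here is a numpy-only contrast computation for a single quantized window (in practice a library such as scikit-image's `graycomatrix`/`graycoprops` handles this, with more offsets and metrics):

```python
# GLCM contrast for one window, horizontal offset (dx=1) only.
import numpy as np

def glcm_contrast(patch, levels=8):
    """Contrast of the horizontal co-occurrence matrix of a quantized patch."""
    glcm = np.zeros((levels, levels))
    for i, j in zip(patch[:, :-1].ravel(), patch[:, 1:].ravel()):
        glcm[i, j] += 1                  # count gray-level co-occurrences
    glcm /= glcm.sum()                   # normalize to probabilities
    r, c = np.indices(glcm.shape)
    return float(((r - c) ** 2 * glcm).sum())

flat   = np.zeros((8, 8), dtype=int)     # homogeneous patch -> contrast 0
stripe = np.tile([0, 7], (8, 4))         # alternating extremes -> high contrast
print(glcm_contrast(flat), glcm_contrast(stripe))
```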

Ancillary Features

  • Elevation and slope from DEM
  • Distance to water, roads, settlements
  • Climate variables (precipitation, temperature)
  • Soil type

The feature engineering effect: Adding temporal and texture features to a spectral-only Random Forest typically improves accuracy by 5-15%. This improvement often exceeds what you'd gain by switching algorithms.

Training Data: The Most Critical Factor

Quality Over Quantity

1000 carefully labeled, representative training samples outperform 10,000 sloppy labels every time. Common training data problems:

Mislabeled samples: A "forest" training polygon that includes a road or clearing introduces confusion. Verify training data against high-resolution imagery.

Class imbalance: If 90% of training data is "forest" and 2% is "wetland," the classifier will rarely predict wetland. Balance classes through stratified sampling or class weighting.
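
Both remedies are one-liners-plus-a-loop in scikit-learn. A sketch on a synthetic 90/8/2 class split:

```python
# Two ways to handle class imbalance: class weighting, or subsampling
# the majority classes down to the minority count.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
# 90% "forest" (0), 8% "cropland" (1), 2% "wetland" (2)
y = rng.choice([0, 1, 2], size=2000, p=[0.90, 0.08, 0.02])
X = rng.normal(size=(2000, 4)) + y[:, None]

# Option 1: inverse-frequency class weights at training time
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0).fit(X, y)

# Option 2: stratified subsampling to equal class counts
counts = np.bincount(y)
idx = np.concatenate([
    rng.choice(np.where(y == c)[0], size=counts.min(), replace=False)
    for c in range(3)
])
X_bal, y_bal = X[idx], y[idx]
```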

Non-representative samples: Training data collected only in flat terrain won't generalize to mountainous areas. Ensure geographic and environmental diversity.

Temporal mismatch: Training labels from 2020 applied to 2024 imagery — if land cover has changed, the labels are wrong.

How Much Training Data

  Method              Minimum Samples per Class   Recommended
  Random Forest       50-100                      500-2000
  Gradient Boosting   100-200                     500-2000
  SVM                 20-50                       200-1000
  CNN (U-Net)         1000-5000 patches           10,000+ patches

Sources of Training Labels

  • Field survey (most reliable but expensive and limited)
  • Interpretation of VHR imagery (Google Earth, Planet, aerial photos)
  • Existing land cover maps (may be outdated)
  • OpenStreetMap (variable quality)
  • Active learning (classifier identifies uncertain pixels for human labeling)

Accuracy Assessment

The Golden Rule

Never assess accuracy on training data. Always use independent reference data that wasn't used for training.

Metrics

Overall Accuracy (OA): Percentage of correctly classified reference samples. Simple but can be misleading with class imbalance.

Kappa Coefficient: Accounts for chance agreement. Values 0.6-0.8 = substantial agreement; >0.8 = excellent.

Producer's Accuracy: Per-class metric — what fraction of the reference data for class X was correctly classified? (Measures omission error.)

User's Accuracy: Per-class metric — what fraction of pixels classified as class X actually are class X? (Measures commission error.)

F1 Score: Harmonic mean of precision and recall. Useful for imbalanced classes.

Confusion Matrix: The complete picture — shows exactly which classes are confused with which.
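
All of these metrics fall out of the confusion matrix. A sketch with scikit-learn on a tiny made-up reference set (rows are reference classes, columns are predictions):

```python
# Overall, producer's, and user's accuracy plus kappa from a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

y_ref  = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])  # reference labels
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 0, 2])  # classified map

cm = confusion_matrix(y_ref, y_pred)
overall   = np.trace(cm) / cm.sum()
producers = np.diag(cm) / cm.sum(axis=1)  # 1 - omission error, per class
users     = np.diag(cm) / cm.sum(axis=0)  # 1 - commission error, per class
kappa     = cohen_kappa_score(y_ref, y_pred)
print(cm)
print(overall, producers, users, kappa)
```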

Sample Size for Validation

Minimum ~50 reference samples per class (more for rare classes or where high confidence is needed). Total validation sample size of 500-1000 points is typical for regional studies.

Practical Workflow

  1. Define classes clearly — unambiguous, mutually exclusive, detectable by satellite
  2. Collect training data — diverse, representative, correctly labeled
  3. Extract features — spectral + temporal + texture + ancillary
  4. Split data — 70% training, 30% validation (spatially stratified)
  5. Train initial model — Random Forest with default parameters
  6. Evaluate — confusion matrix, identify problem classes
  7. Iterate — improve training data for confused classes, add features
  8. Compare algorithms — try gradient boosting, deep learning if justified
  9. Final accuracy assessment — on fully independent validation data
  10. Document — training data sources, features used, parameters, accuracy

The most important steps are 2 (training data quality) and 6-7 (iterative improvement). The algorithm selection (step 8) is often the least impactful decision in the entire workflow.
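
Step 4's spatially stratified split deserves a concrete sketch, since a naive random split leaks spatially autocorrelated pixels into validation and inflates accuracy. One approach (illustrative coordinates and block size) is to group samples by spatial block and split by group:

```python
# Spatially stratified train/validation split: no block straddles the split.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(5)
n = 1000
x_coord = rng.uniform(0, 100, n)
y_coord = rng.uniform(0, 100, n)
# Assign each sample to a 20x20 spatial block (25 blocks total)
blocks = (x_coord // 20).astype(int) * 5 + (y_coord // 20).astype(int)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, val_idx = next(splitter.split(np.zeros(n), groups=blocks))

# Verify no spatial block appears on both sides of the split
assert not set(blocks[train_idx]) & set(blocks[val_idx])
```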

Kazushi Motomura

Remote sensing specialist with 10+ years in satellite data processing. Founder of Off-Nadir Lab. Master's in Satellite Oceanography (Kyushu University).