
Machine Learning for Satellite Image Classification: From Random Forest to Deep Learning

Kazushi Motomura · November 19, 2025 · 7 min read

Quick Answer: Machine learning classifies satellite pixels into land cover categories by learning statistical patterns from training examples. Random Forest remains the workhorse — fast to train, resistant to overfitting, handles mixed feature types, and achieves 80-90% accuracy for most land cover classification tasks. Gradient boosting (XGBoost, LightGBM) often achieves slightly higher accuracy. Deep learning (CNNs like U-Net) excels when spatial context matters (building detection, road extraction) but requires more training data and computation. The most impactful factor is training data quality, not algorithm choice — switching from Random Forest to deep learning typically improves accuracy by 2-5%, while improving training data quality can improve accuracy by 10-20%.

I've seen countless satellite classification projects where teams spent weeks optimizing neural network architectures to squeeze out an extra 1% accuracy — while their training data contained obvious mislabeling errors that were costing them 10%. The most common failure mode in satellite image classification isn't the algorithm. It's the training data.

That said, algorithm choice does matter, and understanding when to use which approach saves time and produces better results.

The Classification Problem

Satellite image classification assigns each pixel (or object) to a category:

  • Land cover classes: forest, cropland, urban, water, bare soil
  • Crop types: wheat, rice, corn, soybean
  • Damage levels: undamaged, moderate, severe
  • Any categorical distinction visible in satellite data

The process:

  1. Training data: Labeled examples of each class (pixels or polygons with known class labels)
  2. Feature extraction: Spectral bands, indices, texture, temporal features for each training sample
  3. Model training: Algorithm learns the relationship between features and classes
  4. Prediction: Apply the trained model to classify every pixel in the image
  5. Accuracy assessment: Compare classified map against independent reference data

Algorithm Options

Random Forest

An ensemble of decision trees, each trained on a random subset of training data and features. Classification is by majority vote across all trees.

Why it's the default choice:

  • Fast to train (minutes for millions of pixels)
  • Resistant to overfitting (the ensemble averaging smooths out individual tree errors)
  • Handles mixed feature types (spectral, indices, texture, elevation, categorical)
  • Provides feature importance rankings (which bands/indices matter most)
  • Minimal hyperparameter tuning needed (number of trees = 500 works for most cases)
  • Works well with relatively small training datasets (hundreds to thousands of samples)

Typical accuracy: 80-90% for 5-10 class land cover classification.
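
The defaults above can be sketched with scikit-learn. This assumes you already have a feature matrix `X` of shape (n_pixels, n_features) and integer class labels `y`; the synthetic arrays here are stand-ins for real training data:

```python
# Minimal Random Forest classification sketch with scikit-learn.
# X and y are synthetic stand-ins for real per-pixel features and labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6)) + np.repeat(np.arange(3), 200)[:, None]  # 3 classes, 6 "bands"
y = np.repeat(np.arange(3), 200)

# 500 trees covers most cases; n_jobs=-1 uses all CPU cores
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
clf.fit(X, y)

# Feature importance rankings: which bands/indices matter most
importances = clf.feature_importances_
print(importances.round(3))

# Prediction step: apply the trained model to new pixels
pred = clf.predict(rng.normal(size=(10, 6)) + 1.0)
```

In practice `X` comes from stacking band, index, and texture rasters and sampling them at the training locations.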

Gradient Boosting (XGBoost, LightGBM, CatBoost)

Builds trees sequentially, each new tree correcting the residual errors of the ensemble built so far. This often produces stronger predictions than Random Forest's independent-tree averaging.

When to choose over Random Forest:

  • Often achieves 1-3% higher accuracy
  • Better at capturing complex feature interactions
  • More hyperparameter tuning required (learning rate, max depth, regularization)
  • Slightly slower to train but still fast enough for operational use

Support Vector Machine (SVM)

Finds the optimal hyperplane separating classes in feature space. With kernel functions (RBF), handles non-linear class boundaries.

When to choose:

  • Small training datasets (<1000 samples) where SVMs can outperform ensemble methods
  • High-dimensional feature spaces (many bands/indices)
  • Caveat: SVMs are becoming less popular, as Random Forest and gradient boosting are easier to use and scale better
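
A small-dataset SVM sketch follows. Unlike tree ensembles, SVMs are sensitive to feature scale, so standardization belongs in the pipeline (data here is synthetic):

```python
# RBF-kernel SVM for a small training set (<1000 samples).
# StandardScaler matters: SVMs assume comparably scaled features.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10)) + np.repeat(np.arange(3), 100)[:, None] * 2.0
y = np.repeat(np.arange(3), 100)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
svm.fit(X, y)
```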

Convolutional Neural Networks (CNNs)

Deep learning models that process image patches rather than individual pixels, learning spatial patterns (edges, textures, shapes) in addition to spectral features.

Architectures for satellite classification:

  • U-Net: Encoder-decoder architecture for semantic segmentation. Standard choice for pixel-wise classification with spatial context.
  • DeepLab: Atrous convolution for multi-scale feature extraction. Good for objects of varying sizes.
  • ResNet/EfficientNet: For scene classification (classifying entire image patches rather than individual pixels).

When to choose deep learning:

  • Spatial context is important (building footprint extraction, road detection)
  • Large training datasets available (tens of thousands of labeled patches)
  • GPU resources available
  • Task involves pattern recognition beyond spectral signatures (object shape, arrangement)

When NOT to choose deep learning:

  • Small training datasets (deep learning overfits severely with <1000 samples)
  • Pixel-level spectral classification where spatial context doesn't help (mineral mapping)
  • Computational resources are limited
  • Interpretability is important (deep learning is harder to explain than Random Forest)

Feature Engineering

For pixel-based classifiers (Random Forest, XGBoost, SVM), the features you provide matter enormously:

Spectral Features

  • All available bands (don't pre-select — let the algorithm decide importance)
  • Band ratios (NDVI, NDWI, NDBI, NBR)
  • Red-edge indices (for vegetation type discrimination)
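
The band-ratio indices above all follow the same normalized-difference pattern, which is a few lines of numpy. The tiny reflectance arrays here are illustrative values, not real imagery:

```python
# Normalized-difference indices from reflectance bands.
import numpy as np

def normalized_difference(a, b, eps=1e-10):
    """Generic (a - b) / (a + b), guarded against division by zero."""
    return (a - b) / (a + b + eps)

red   = np.array([[0.10, 0.08], [0.30, 0.25]])
nir   = np.array([[0.40, 0.45], [0.32, 0.28]])
green = np.array([[0.12, 0.11], [0.20, 0.22]])
swir  = np.array([[0.15, 0.14], [0.35, 0.30]])

ndvi = normalized_difference(nir, red)    # vegetation
ndwi = normalized_difference(green, nir)  # water
ndbi = normalized_difference(swir, nir)   # built-up
```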

Temporal Features

  • Multi-date observations (spring, summer, autumn composites)
  • Phenological metrics (green-up date, peak NDVI, growing season length)
  • Time series statistics (mean, median, standard deviation, min, max per band)
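
Time-series statistics reduce a stack of dates to a fixed-length feature vector per pixel. A sketch, using a random NDVI stack as a stand-in and the argmax date as a crude peak-NDVI phenological metric:

```python
# Per-pixel temporal features from a (dates, pixels) NDVI stack.
import numpy as np

rng = np.random.default_rng(3)
n_dates, n_pixels = 12, 5
ndvi_stack = rng.uniform(0.1, 0.9, size=(n_dates, n_pixels))

features = np.column_stack([
    ndvi_stack.mean(axis=0),
    np.median(ndvi_stack, axis=0),
    ndvi_stack.std(axis=0),
    ndvi_stack.min(axis=0),
    ndvi_stack.max(axis=0),
    ndvi_stack.argmax(axis=0),  # date index of peak NDVI
])
print(features.shape)  # six temporal features per pixel
```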

Texture Features

  • GLCM metrics (contrast, homogeneity, entropy, correlation) computed over neighborhood windows
  • Edge detection filters
  • Most valuable for distinguishing classes with similar spectral signatures but different spatial patterns
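
To make GLCM concrete, here is a numpy-only contrast computation for a single quantized window (in practice a library such as scikit-image's `graycomatrix`/`graycoprops` handles this, with more offsets and metrics):

```python
# GLCM contrast for one window, horizontal offset (dx=1) only.
import numpy as np

def glcm_contrast(patch, levels=8):
    """Contrast of the horizontal co-occurrence matrix of a quantized patch."""
    glcm = np.zeros((levels, levels))
    for i, j in zip(patch[:, :-1].ravel(), patch[:, 1:].ravel()):
        glcm[i, j] += 1                  # count gray-level co-occurrences
    glcm /= glcm.sum()                   # normalize to probabilities
    r, c = np.indices(glcm.shape)
    return float(((r - c) ** 2 * glcm).sum())

flat   = np.zeros((8, 8), dtype=int)     # homogeneous patch -> contrast 0
stripe = np.tile([0, 7], (8, 4))         # alternating extremes -> high contrast
print(glcm_contrast(flat), glcm_contrast(stripe))
```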

Ancillary Features

  • Elevation and slope from DEM
  • Distance to water, roads, settlements
  • Climate variables (precipitation, temperature)
  • Soil type

The feature engineering effect: Adding temporal and texture features to a spectral-only Random Forest typically improves accuracy by 5-15%. This improvement often exceeds what you'd gain by switching algorithms.

Training Data: The Most Critical Factor

Quality Over Quantity

1000 carefully labeled, representative training samples outperform 10,000 sloppy labels every time. Common training data problems:

Mislabeled samples: A "forest" training polygon that includes a road or clearing introduces confusion. Verify training data against high-resolution imagery.

Class imbalance: If 90% of training data is "forest" and 2% is "wetland," the classifier will rarely predict wetland. Balance classes through stratified sampling or class weighting.
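
Both remedies are one-liners-plus-a-loop in scikit-learn. A sketch on a synthetic 90/8/2 class split:

```python
# Two ways to handle class imbalance: class weighting, or subsampling
# the majority classes down to the minority count.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
# 90% "forest" (0), 8% "cropland" (1), 2% "wetland" (2)
y = rng.choice([0, 1, 2], size=2000, p=[0.90, 0.08, 0.02])
X = rng.normal(size=(2000, 4)) + y[:, None]

# Option 1: inverse-frequency class weights at training time
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0).fit(X, y)

# Option 2: stratified subsampling to equal class counts
counts = np.bincount(y)
idx = np.concatenate([
    rng.choice(np.where(y == c)[0], size=counts.min(), replace=False)
    for c in range(3)
])
X_bal, y_bal = X[idx], y[idx]
```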

Non-representative samples: Training data collected only in flat terrain won't generalize to mountainous areas. Ensure geographic and environmental diversity.

Temporal mismatch: Training labels from 2020 applied to 2024 imagery — if land cover has changed, the labels are wrong.

How Much Training Data

  Method              Minimum Samples per Class   Recommended
  Random Forest       50-100                      500-2000
  Gradient Boosting   100-200                     500-2000
  SVM                 20-50                       200-1000
  CNN (U-Net)         1000-5000 patches           10,000+ patches

Sources of Training Labels

  • Field survey (most reliable but expensive and limited)
  • Interpretation of VHR imagery (Google Earth, Planet, aerial photos)
  • Existing land cover maps (may be outdated)
  • OpenStreetMap (variable quality)
  • Active learning (classifier identifies uncertain pixels for human labeling)

Accuracy Assessment

The Golden Rule

Never assess accuracy on training data. Always use independent reference data that wasn't used for training.

Metrics

Overall Accuracy (OA): Percentage of correctly classified reference samples. Simple but can be misleading with class imbalance.

Kappa Coefficient: Accounts for chance agreement. Values 0.6-0.8 = substantial agreement; >0.8 = excellent.

Producer's Accuracy: Per-class metric — what fraction of the reference data for class X was correctly classified? (Measures omission error.)

User's Accuracy: Per-class metric — what fraction of pixels classified as class X actually are class X? (Measures commission error.)

F1 Score: Harmonic mean of precision and recall. Useful for imbalanced classes.

Confusion Matrix: The complete picture — shows exactly which classes are confused with which.
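
All of these metrics fall out of the confusion matrix. A sketch with scikit-learn on a tiny made-up reference set (rows are reference classes, columns are predictions):

```python
# Overall, producer's, and user's accuracy plus kappa from a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

y_ref  = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])  # reference labels
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 0, 2])  # classified map

cm = confusion_matrix(y_ref, y_pred)
overall   = np.trace(cm) / cm.sum()
producers = np.diag(cm) / cm.sum(axis=1)  # 1 - omission error, per class
users     = np.diag(cm) / cm.sum(axis=0)  # 1 - commission error, per class
kappa     = cohen_kappa_score(y_ref, y_pred)
print(cm)
print(overall, producers, users, kappa)
```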

Sample Size for Validation

Minimum ~50 reference samples per class (more for rare classes or where high confidence is needed). Total validation sample size of 500-1000 points is typical for regional studies.

Practical Workflow

  1. Define classes clearly — unambiguous, mutually exclusive, detectable by satellite
  2. Collect training data — diverse, representative, correctly labeled
  3. Extract features — spectral + temporal + texture + ancillary
  4. Split data — 70% training, 30% validation (spatially stratified)
  5. Train initial model — Random Forest with default parameters
  6. Evaluate — confusion matrix, identify problem classes
  7. Iterate — improve training data for confused classes, add features
  8. Compare algorithms — try gradient boosting, deep learning if justified
  9. Final accuracy assessment — on fully independent validation data
  10. Document — training data sources, features used, parameters, accuracy

The most important steps are 2 (training data quality) and 6-7 (iterative improvement). The algorithm selection (step 8) is often the least impactful decision in the entire workflow.
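
Step 4's spatially stratified split deserves a concrete sketch, since a naive random split leaks spatially autocorrelated pixels into validation and inflates accuracy. One approach (illustrative coordinates and block size) is to group samples by spatial block and split by group:

```python
# Spatially stratified train/validation split: no block straddles the split.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(5)
n = 1000
x_coord = rng.uniform(0, 100, n)
y_coord = rng.uniform(0, 100, n)
# Assign each sample to a 20x20 spatial block (25 blocks total)
blocks = (x_coord // 20).astype(int) * 5 + (y_coord // 20).astype(int)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, val_idx = next(splitter.split(np.zeros(n), groups=blocks))

# Verify no spatial block appears on both sides of the split
assert not set(blocks[train_idx]) & set(blocks[val_idx])
```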

Kazushi Motomura

Remote sensing specialist with 10+ years in satellite data processing. Founder of Off-Nadir Lab. Master's in Satellite Oceanography (Kyushu University).