Why Traditional Weather Forecasting Models Still Beat AI for Extreme Events: A Hands-On Guide

Overview

Extreme weather events—such as heatwaves, cold snaps, and violent storms—cause hundreds of billions of dollars in damage annually. Accurate forecasting of these rare, record-breaking events is critical for early warning systems that save lives and protect infrastructure. In recent years, artificial intelligence (AI) models have surpassed traditional physics-based models in many routine weather forecasts. However, a 2024 study published in Science Advances reveals a crucial limitation: AI models significantly underperform traditional models when predicting record-breaking extreme weather. This guide explains the mechanics behind both modeling approaches, walks through the study's methodology, and provides actionable insights for anyone working with weather forecasts.


Prerequisites

Before diving into this tutorial, you should be familiar with:

Basic weather forecasting concepts (e.g., numerical weather prediction, ensemble models).
Fundamentals of machine learning (training data, neural networks, overfitting).
Statistical concepts (frequency, intensity, percentile thresholds).
Optional: Basic Python for code examples (but no programming required to understand the guide).

Step-by-Step: Analyzing Model Performance for Extreme Events

1. Understand the Two Modeling Paradigms

Physics-based models (also called numerical weather prediction or NWP) solve complex equations representing atmospheric and oceanic physics. Each model run is deterministic, but forecast centers typically run an ensemble of slightly perturbed runs to capture uncertainty. These models rely on decades of research and require massive computational power. They can simulate entirely new weather patterns because the physics equations are universal.
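As a rough illustration of what "solving equations forward in time" looks like, the sketch below integrates the Lorenz-63 system, a three-variable toy model often used as a stand-in for atmospheric dynamics. It is not a real NWP model and is not taken from the study; the function names (`lorenz63_step`, `run_ensemble`) and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def lorenz63_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Advance the Lorenz-63 equations by one step using 4th-order Runge-Kutta."""
    def tendency(s):
        x, y, z = s[..., 0], s[..., 1], s[..., 2]
        return np.stack(
            [sigma * (y - x), x * (rho - z) - y, x * y - beta * z], axis=-1
        )

    k1 = tendency(state)
    k2 = tendency(state + 0.5 * dt * k1)
    k3 = tendency(state + 0.5 * dt * k2)
    k4 = tendency(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def run_ensemble(initial_state, n_members=20, n_steps=1500, perturbation=1e-3, seed=0):
    """Integrate several slightly perturbed copies of the same initial state."""
    rng = np.random.default_rng(seed)
    members = initial_state + perturbation * rng.standard_normal((n_members, 3))
    for _ in range(n_steps):
        members = lorenz63_step(members)
    return members

final = run_ensemble(np.array([1.0, 1.0, 1.0]))
print("ensemble spread in x:", final[:, 0].std())
```

The equations, not past observations, determine where each member ends up, and tiny differences in the starting state grow into a visible spread. That spread is the toy counterpart of the ensemble spread discussed later in this guide.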

AI models learn patterns from historical data. They are trained on large datasets (e.g., ERA5 reanalysis), so their predictions are constrained by the range of that training data. For example, if a heatwave reaches 50°C but the training data only includes temperatures up to 45°C, the AI model tends to predict a value close to the 45°C it has already seen: it "hedges" toward the mean.

2. Reproduce the Study’s Design

The researchers selected record-breaking hot, cold, and windy events from 2018 and 2020. They then ran both AI and traditional models to forecast those days. A simple Python snippet below illustrates the core idea (conceptual code):

import numpy as np

# Toy stand-in for a training set: 10,000 historical temperatures (°C)
rng = np.random.default_rng(0)  # fixed seed so the result is reproducible
historical_temps = rng.normal(loc=20, scale=10, size=10_000)  # mean 20°C, std 10°C

# Record-breaking event (true value 55°C)
true_extreme = 55.0

def ai_predict(historical, true_value):
    """Caricature of a data-driven model: pull the truth back toward familiar values."""
    mean = np.mean(historical)
    std = np.std(historical)
    # Clip to mean ± 2 standard deviations, a stand-in for a model
    # that will not stray far from what it saw during training
    return np.clip(true_value, mean - 2 * std, mean + 2 * std)

print(ai_predict(historical_temps, true_extreme))  # about 40 (clipped from 55)

This toy example illustrates how an AI model can underestimate extremes when it "plays it safe" within the range of its training data. It is a caricature of the effect, not a trained model.

3. Examine the Key Findings

The study tested models on thousands of extreme events. Results showed:

Frequency: AI models predicted record-breaking events far less often than they actually occurred.
Intensity: AI models underestimated the magnitude—e.g., a 50°C heatwave was forecast as 45°C.
Spread: Traditional models had higher spread in their ensemble forecasts, allowing them to better capture rare extremes.
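The frequency and intensity findings above can be expressed as two simple metrics. The sketch below is one possible way to compute them and is not the study's own code; the function name `record_metrics` and the sample numbers are hypothetical.

```python
import numpy as np

def record_metrics(observed, forecast, previous_record):
    """Return (frequency ratio, intensity bias) for record-breaking events.

    frequency ratio: forecast record count / observed record count
    intensity bias:  mean(forecast - observed) over observed records
    """
    observed = np.asarray(observed, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    obs_records = observed > previous_record
    fc_records = forecast > previous_record
    if obs_records.sum() == 0:
        raise ValueError("no observed record-breaking events to evaluate")
    frequency_ratio = fc_records.sum() / obs_records.sum()
    intensity_bias = (forecast[obs_records] - observed[obs_records]).mean()
    return frequency_ratio, intensity_bias

# Hypothetical values for illustration only, not data from the study
previous_record = np.array([44.0, 44.0, 44.0, 44.0])
observed = np.array([50.0, 46.0, 45.0, 43.0])
forecast = np.array([45.0, 43.5, 44.5, 42.0])

ratio, bias = record_metrics(observed, forecast, previous_record)
print(f"frequency ratio: {ratio:.2f}, intensity bias: {bias:.2f} °C")
```

A frequency ratio below 1 means records are forecast less often than they occur, and a negative intensity bias means their magnitude is underestimated.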

Lead author Prof. Sebastian Engelke (University of Geneva) calls this a "warning shot" against prematurely replacing physics-based models with AI.

4. Implement a Simple Test on Your Own Data

To see this effect, you can download a historical weather dataset (e.g., from NOAA) and train a simple neural network for temperature prediction. Then evaluate its performance on the top 1% hottest days. Expect MAE (mean absolute error) to be noticeably higher on those extremes than on typical days. Where a physics-based forecast is available for the same dates, compare against it as a baseline.
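A minimal sketch of that evaluation follows, under two loud assumptions: synthetic data stands in for a real NOAA download, and a least-squares linear fit stands in for the neural network. Both are trained to minimize squared error, which is the property that matters here. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_days(n):
    """Synthetic (predictor, next-day temperature) pairs; a stand-in for station data."""
    predictor = rng.normal(20.0, 5.0, size=n)          # e.g. today's temperature (°C)
    target = predictor + rng.normal(0.0, 3.0, size=n)  # tomorrow's temperature (°C)
    return predictor, target

x_train, y_train = make_days(20_000)
x_test, y_test = make_days(20_000)

# Fit y ≈ a*x + b by least squares (stand-in for an MSE-trained network)
design = np.column_stack([x_train, np.ones_like(x_train)])
(a, b), *_ = np.linalg.lstsq(design, y_train, rcond=None)
pred = a * x_test + b

# Compare the error on all test days with the error on the hottest 1% of observed days
threshold = np.percentile(y_test, 99)
hot = y_test >= threshold
mae_all = np.abs(pred - y_test).mean()
mae_hot = np.abs(pred[hot] - y_test[hot]).mean()
bias_hot = (pred[hot] - y_test[hot]).mean()

print(f"MAE, all days:    {mae_all:.2f} °C")
print(f"MAE, hottest 1%:  {mae_hot:.2f} °C")
print(f"bias, hottest 1%: {bias_hot:.2f} °C")
```

In this setup the error on the hottest days is larger and the bias is negative because a model trained on squared error predicts the conditional mean, which sits below the most extreme outcomes. Real data and a real network may behave differently in detail, so treat this as a template for the evaluation rather than a result.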

5. Apply the Lessons to Your Forecasting Workflow

For operational forecasts, use a hybrid approach:

Rely on AI models for routine, high-resolution forecasts (they're faster and often more accurate for typical conditions).
Always cross-check with physics-based ensemble models when extreme events are possible.
Monitor the training data range of your AI model—if it doesn't include the extremes you expect, the model will fail.
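Monitoring the training range can be as simple as flagging values that fall outside, or close to the edge of, what the model saw during training. The helper below is a hypothetical sketch; `outside_training_range` and its `margin` parameter are illustrative names, not part of any forecasting library.

```python
import numpy as np

def outside_training_range(training_values, new_values, margin=0.0):
    """Flag values beyond the training range, or within `margin` of its edges."""
    training_values = np.asarray(training_values, dtype=float)
    new_values = np.asarray(new_values, dtype=float)
    low, high = training_values.min(), training_values.max()
    return (new_values < low + margin) | (new_values > high - margin)

# Hypothetical temperatures (°C) for illustration
training_temps = np.array([-5.0, 3.0, 18.0, 31.0, 45.0])
incoming = np.array([20.0, 44.0, 47.5, -8.0])

flags = outside_training_range(training_temps, incoming)
print(flags)  # [False False  True  True]

# With a 2°C margin, 44.0 is also flagged because it is close to the 45.0 maximum
near_edge = outside_training_range(training_temps, incoming, margin=2.0)
```

Flagged cases are the ones where a cross-check against a physics-based ensemble is most worthwhile.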

Common Mistakes

Assuming AI models can extrapolate. Data-driven models largely interpolate within their training data. Record-breaking events lie outside the historical record by definition, so the model tends to underpredict them.
Ignoring uncertainty quantification. Many AI models output a single deterministic value. Traditional models provide ensemble spread, which helps quantify uncertainty for rare events.
Overfitting to past extremes. If you include too many similar extreme events in training, the model may still fail on unprecedented ones—this is the "black swan" problem.
Using AI for long-range extreme warnings. The study focused on short-term forecasts (days ahead). Forecast errors generally grow with lead time, so longer-range AI warnings of extremes deserve even more caution.
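To make the point about uncertainty quantification concrete, the sketch below shows what an ensemble offers that a single deterministic value does not: the fraction of members exceeding a record. The member values and the name `exceedance_probability` are hypothetical.

```python
import numpy as np

def exceedance_probability(ensemble, threshold):
    """Fraction of ensemble members that exceed the threshold."""
    ensemble = np.asarray(ensemble, dtype=float)
    return float((ensemble > threshold).mean())

# Hypothetical 10-member ensemble of forecast maximum temperatures (°C)
members = np.array([41.2, 42.8, 43.1, 43.9, 44.4, 44.6, 45.3, 46.0, 47.2, 48.5])
record = 45.0

prob = exceedance_probability(members, record)
print(f"ensemble mean:   {members.mean():.1f} °C")
print(f"ensemble spread: {members.std(ddof=1):.1f} °C")
print(f"P(exceed {record} °C): {prob:.0%}")
```

Here the ensemble mean sits below the record, so a single-value forecast would suggest no record, yet four of the ten members exceed it. That is the kind of signal an early warning system needs.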

Summary

While AI weather models offer speed and skill for typical forecasts, they consistently underestimate the frequency and intensity of record-breaking extreme events. Traditional physics-based models remain essential for reliable early warnings. The key takeaway: never rely solely on AI for extreme weather forecasting—always complement with physics-based ensembles. This guide walked through the study's findings, provided a simple code example to illustrate the limitation, and outlined best practices for integrating both approaches.