Collecting Historical Weather Data

Posted by Nick Paul on Thursday, January 13, 2022

TOC

  1. Introduction
  2. Pull the Data
  3. Transform the Data
  4. Find Something of Interest
  5. Model the Data
  6. Visualize the Results
  7. Conclusion

Introduction

I was recently exploring different options to collect historical weather data for an NFL model I am working on and found there is no shortage of weather API options, including many paid services that want to sell you their product.

Given that the US Government collects much of the data behind these commercial weather products, I was hoping to find a free option. After a little digging, I found the meteostat Python library, which is easy to use and wraps a few different NOAA data sets.

I did find the raw data from NOAA, which you can easily grab as a json file (Newark weather data: https://www.ncei.noaa.gov/access/past-weather/USW00014734/data.json), but the meteostat package removes a lot of the boilerplate data transformation work required for your own bespoke solution: creating a process to look up weather stations, parsing the returned data, converting the units, etc.

Using meteostat, you should be up and running with the data you need in under 5 minutes. Sounds like a win to me!

Pull the Data

Let’s start by pulling a sample data set.

The easiest way to pull a data set is to pick a point on the map you are interested in, and meteostat will find the closest weather station to this point. This is super helpful and removes the need to look up weather station codes and locations.

Below I will pull data for D.C. using the lat and long, which is easy to grab from Google maps - just right click on the point you want.

from datetime import datetime
from meteostat import Point, Daily, units

# Set time period
start = datetime(1900, 1, 1)
end = datetime(2022, 1, 10)

# Create Point for DC
location = Point(38.72168084908155, -77.04777737365207)

# Get daily data
data = Daily(location, start, end)
data = data.convert(units.imperial)
data = data.fetch()

Note that we converted the data to imperial units (e.g., Fahrenheit and inches), a nice meteostat feature that saves us time.

Transform the Data

Next we will transform the data to facilitate exploring a quick question: given the discussion of global warming, does D.C. have more 90+ degree days than it used to?

import pandas as pd

df = data.reset_index()
df['time'] = pd.to_datetime(df['time'])
df['month'] = df['time'].dt.month
df['year'] = df['time'].dt.year
df['decade'] = (df['year'] // 10) * 10  # e.g., 1987 -> 1980

Find Something of Interest

Now that our data is ready to work with, let’s run a quick aggregation to see whether the number of 90+ degree days in the D.C. area has increased.

temp = 90
df_summary = (
    df
    .query('tmax >= @temp')
    .groupby(['year'], as_index = False)
    .size()
    )

df_summary
##     year  size
## 0   1937    14
## 1   1938    31
## 2   1939    41
## 3   1940    21
## 4   1941    37
## ..   ...   ...
## 80  2017    43
## 81  2018    45
## 82  2019    62
## 83  2020    46
## 84  2021    48
## 
## [85 rows x 2 columns]
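The query/groupby pattern above is easy to sanity-check on a few rows of made-up data (the values below are hypothetical, not real D.C. temperatures):

```python
import pandas as pd

# hypothetical daily highs across two years
toy = pd.DataFrame({
    'year': [1950, 1950, 1950, 2020, 2020],
    'tmax': [91.0, 88.5, 95.2, 97.1, 90.0],
})

temp = 90
counts = (
    toy
    .query('tmax >= @temp')             # keep only 90+ degree days
    .groupby(['year'], as_index=False)  # one row per year
    .size()                             # count of qualifying days
)
print(counts)
#    year  size
# 0  1950     2
# 1  2020     2
```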

Model the Data

Weather data is highly variable from year to year, so let’s create two models to help us identify a pattern: a 5-year moving average and a simple linear regression over the entire data set (year will be our independent variable and the number of 90+ days will be our dependent variable).

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# rolling average
df_summary = (df_summary
                .assign(rolling = lambda df: df['size'].rolling(5, min_periods = 5).mean())
                .query('~rolling.isnull()')
                )
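If the rolling-window arguments are unfamiliar, here is the same idea on a tiny made-up series: with min_periods = 5, any window with fewer than five observations comes back as NaN, which is why those rows get filtered out above.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60])

# 5-wide moving average; windows with fewer than 5 points yield NaN
roll = s.rolling(5, min_periods=5).mean()
print(roll.tolist())
# [nan, nan, nan, nan, 30.0, 40.0]
```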

# linear regression
X = df_summary['year'].values.reshape(-1, 1)  
Y = df_summary['size'].values  
linear_regressor = LinearRegression()  

linear_regressor.fit(X, Y)  # perform linear regression
## LinearRegression()
Y_pred = linear_regressor.predict(X)  # make predictions

df_summary['pred'] = Y_pred
df_summary
##     year  size  rolling       pred
## 4   1941    37     28.8  30.903643
## 5   1942    36     33.2  31.054818
## 6   1943    60     39.0  31.205992
## 7   1944    47     40.2  31.357167
## 8   1945    23     40.6  31.508341
## ..   ...   ...      ...        ...
## 80  2017    43     42.4  42.392894
## 81  2018    45     44.4  42.544068
## 82  2019    62     52.0  42.695242
## 83  2020    46     50.8  42.846417
## 84  2021    48     48.8  42.997591
## 
## [81 rows x 4 columns]
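To put a number on the trend, the fitted slope (coef_[0]) is the estimated change in 90+ degree days per year. A minimal sketch on synthetic data with a known slope of 0.15 days per year, roughly what the pred column above implies:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic yearly counts rising at a known rate of 0.15 days/year
years = np.arange(1941, 2022)
counts = 0.15 * (years - 1941) + 30

model = LinearRegression()
model.fit(years.reshape(-1, 1), counts)

slope = model.coef_[0]       # days gained per year
print(round(slope, 2))       # 0.15
print(round(slope * 80, 1))  # ~12 extra days over 80 years
```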

Visualize the Results

Now that we have our models, let’s visualize the results.

from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, output_file, show
from bokeh.io import save


source = ColumnDataSource(df_summary)

TOOLTIPS = [("Year", "@year"),
          ("Count of days", "@size")]

p = figure(tooltips = TOOLTIPS,
          title = "Washington D.C. Is Having More 90+ Days")

p.circle(x = "year", y = 'size', source = source, color = "lightgrey")
p.line(x ='year', y = 'rolling', source = source, color = "orange")
p.line(x = 'year', y = 'pred', source = source, color = "red")
p.yaxis.axis_label = 'Number of Days Exceeding 90 Degrees'
p.yaxis.minor_tick_line_color = None
p.xaxis.minor_tick_line_color = None

output_file("dc_weather.html")

save(p)
(Interactive Bokeh plot: yearly counts of 90+ degree days in grey, 5-year rolling average in orange, linear trend in red.)

From the regression line, D.C. appears to average roughly 10 more 90-plus days per year than it did 100 or so years ago.

Conclusion

As you can see, meteostat makes it fairly easy to pull historical weather data and get straight to your analysis. In the future, I plan to use this data to augment an over/under model for NFL games.