Imputing the Time-Series Using Python

Dr Mohammad El-Nesr
Dec 31, 2018

Time series are an important form of indexed data found in stock prices, climate datasets, and many other time-dependent data forms. Because of this time dependency, time series often have missing points caused by problems in reading or recording the data.

To apply machine learning models effectively, the time series has to be continuous, as most ML models are not designed to handle missing values. Hence, rows with missing data must be either dropped or filled with appropriate values.

In time-independent data (non-time-series), a common practice is to fill the gaps with the mean or median value of the field. However, this is not applicable to time series. To understand why, consider a temperature dataset: the temperature in February is very far from its value in July. The same applies to a sales dataset with some high-sales seasons and others with low or regular sales. So the imputation method should depend on time.
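A tiny synthetic illustration of this point (the sinusoidal temperatures below are made up for demonstration, not taken from the article's dataset):

```python
import numpy as np

# Synthetic monthly temperatures (°C): low in winter, high in summer
months = np.arange(12)
temps = 15 - 10 * np.cos(2 * np.pi * months / 12)

annual_mean = temps.mean()   # ~15 °C
february = temps[1]          # ~6.3 °C

# Filling a missing February with the annual mean would be off by ~9 °C
print(round(annual_mean, 1), round(february, 1))
```

The gap between the annual mean and the true February value is exactly the error a mean-imputation would introduce.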

To prove this assumption, let’s take an example and solve it in Python.

Loading and preparing the dataset

Import the required libraries and read the data:

# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Import a scoring metric to compare methods
from sklearn.metrics import r2_score
%matplotlib inline
# The data path
# To view the data, go here:
# https://github.com/drnesr/WaterConsumption/blob/master/data/SampleData.csv
# To load the raw data:
remote_path = "https://raw.githubusercontent.com/drnesr/WaterConsumption/master/data/SampleData.csv"

df = pd.read_csv(remote_path)
df.head()
The head of the dataset, notice the date is a normal column.

The dataset contains three columns: `Date`, the date in dd-mm-yyyy format; `reference`, a temperature column with no missing data, kept for reference; and `target`, the same temperature column with random missing points.

The first step is to set the `Date` column as the index of the dataframe:

# Converting the column to DateTime format
df.Date = pd.to_datetime(df.Date, format='%d-%m-%Y')
df = df.set_index('Date')
df.head()
The `Date` column is now an index

For charting purposes, we will add a column that contains the missing values only.

df = df.assign(missing=np.nan)
# Use .loc to avoid a chained-assignment warning
df.loc[df.target.isna(), 'missing'] = df.reference
df.info()
DatetimeIndex: 96 entries, 2010-01-15 to 2017-12-15
Data columns (total 3 columns):
reference 96 non-null float64
target 75 non-null float64
missing 21 non-null float64
dtypes: float64(3)

Notice that we have 21 missing points out of 96 total points.

Have a look at the data:

df.plot(style=['k--', 'bo-', 'r*'], figsize=(20, 10));

The temperature data with the available points in blue and missing points in red

Trying to impute using the mean/median values

I will create a column for each tested method to compare the values later.

# Filling using mean or median
# Creating a column in the dataframe:
# instead of df['NewCol'] = 0, we use
# df = df.assign(NewCol=default_value)
# to avoid a pandas warning.
df = df.assign(FillMean=df.target.fillna(df.target.mean()))
df = df.assign(FillMedian=df.target.fillna(df.target.median()))

Trying to impute using the rolling average

# Imputing using the rolling mean
df = df.assign(RollingMean=df.target.fillna(
    df.target.rolling(24, min_periods=1).mean()))
# Imputing using the rolling median
df = df.assign(RollingMedian=df.target.fillna(
    df.target.rolling(24, min_periods=1).median()))

Imputing using interpolation with different methods

df = df.assign(InterpolateLinear=df.target.interpolate(method='linear'))
df = df.assign(InterpolateTime=df.target.interpolate(method='time'))
df = df.assign(InterpolateQuadratic=df.target.interpolate(method='quadratic'))
df = df.assign(InterpolateCubic=df.target.interpolate(method='cubic'))
df = df.assign(InterpolateSLinear=df.target.interpolate(method='slinear'))
df = df.assign(InterpolateAkima=df.target.interpolate(method='akima'))
df = df.assign(InterpolatePoly5=df.target.interpolate(method='polynomial', order=5))
df = df.assign(InterpolatePoly7=df.target.interpolate(method='polynomial', order=7))
df = df.assign(InterpolateSpline3=df.target.interpolate(method='spline', order=3))
df = df.assign(InterpolateSpline4=df.target.interpolate(method='spline', order=4))
df = df.assign(InterpolateSpline5=df.target.interpolate(method='spline', order=5))

Scoring the results to see which method is best

results = [(method, r2_score(df.reference, df[method])) for method in list(df)[3:]]
# Build the DataFrame directly from the list of tuples; passing it
# through np.array would cast the scores to strings, so sorting
# by R_squared would be lexicographic rather than numeric
results_df = pd.DataFrame(results, columns=['Method', 'R_squared'])
results_df.sort_values(by='R_squared', ascending=False)

The results show that the 'time' method, as well as the 'slinear' method, produces values closest to the original, while the rolling mean and median produce very low R² values.

To plot the data after imputation:

final_df = df[['reference', 'target', 'missing', 'InterpolateTime']]
final_df.plot(style=['b-.', 'ko', 'r.', 'rx-'], figsize=(20, 10));
plt.ylabel('Temperature');
plt.legend(loc='upper center', bbox_to_anchor=(0.5, 1.05),
           fancybox=True, shadow=True, ncol=5, prop={'size': 14});
The `time` interpolation is the best method for time series.

Some limitations

  1. For the time interpolation to succeed, the dataframe must have a date-formatted index with intervals of one day or more (daily, monthly, …); it will not work on intraday data, such as hourly readings.
  2. If it is important to use a different index for the dataframe, use `reset_index().set_index('Date')`, do the interpolation, and then apply `reset_index().set_index('DesiredIndex')`.
  3. If the data contain another dividing column, such as the type of merchandise when imputing sales, then the imputation should be done for each merchandise type separately.
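For the last point, a per-group imputation can be sketched with pandas `groupby` and `transform`. The `product` column and the sales figures below are hypothetical, made up purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly sales for two merchandise types, with gaps
idx = pd.date_range('2018-01-01', periods=6, freq='MS')
df = pd.DataFrame({
    'product': ['A'] * 6 + ['B'] * 6,
    'sales': [10.0, np.nan, 30.0, 40.0, np.nan, 60.0,
              100.0, np.nan, 120.0, 130.0, np.nan, 150.0],
}, index=idx.append(idx))

# Interpolate within each product separately, so values from one
# product never leak into the gaps of another
df['sales'] = df.groupby('product')['sales'].transform(
    lambda s: s.interpolate(method='time'))
```

Each group keeps its own datetime index inside the lambda, so the time-aware interpolation still applies per product.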

Conclusions

It is important to keep the date in mind while imputing time series: make the date the dataset index, then use pandas interpolation with the `time` method.
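The whole recipe boils down to three steps: parse the dates, set them as the index, and interpolate with `method='time'`. A minimal sketch on made-up data shaped like the article's CSV:

```python
import numpy as np
import pandas as pd

# Made-up data in the same dd-mm-yyyy shape as the article's CSV
df = pd.DataFrame({
    'Date': ['15-01-2018', '15-02-2018', '15-03-2018', '15-04-2018'],
    'temp': [10.0, np.nan, 20.0, 25.0],
})

# 1. Parse the date strings, 2. set them as the index,
# 3. interpolate with the time-aware method
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df = df.set_index('Date')
df['temp'] = df['temp'].interpolate(method='time')
```

Note that the February value lands closer to January's than a naive midpoint would suggest, because the Jan-Feb interval (31 days) is longer than the Feb-Mar one (28 days).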

Application on a real project

This time series imputation method was used to analyze real data in the study described in this post.

Note: This is my first story at Medium. I appreciate your valuable feedback and encouragement.
