Creating Lag and Lead Variables

Introduction

In this tutorial, we examine how to create lagged and lead variables: essential tools for time series and panel data analysis. Whether you’re modeling financial trends or running regressions, such transformations are often essential to compute new variables, such as Return on Assets or Revenue Growth. We’ll also explore how the Pychemist library simplifies this task.

We use pandas and the pychemist library to demonstrate this. For this tutorial we rely on a dataset of financial information for 20 large US companies, scraped from EDGAR. Some missing values in the dataset have been imputed for demonstration purposes.

While pandas lets you create lagged values using .shift(), it doesn’t always behave as expected for grouped data or irregular time series. Pychemist provides a simpler and more reliable alternative through its .chemaccessor.

Step 1: Load the Data and Libraries

We start by importing the necessary libraries and loading the dataset. pandas and numpy are used for working with panel datasets. Additionally, we import pychemist to download the financial dataset and to enable the custom pandas extensions (accessors) used in this example.

Note: If you haven’t installed pychemist yet, you can do so with the following command:

Import the required libraries and load the dataset into a DataFrame (df):

Let’s inspect the first few rows:

indextickeryearnet_incometotal_assetsrevenue
011AAPL20205.953100e+103.385160e+11NaN
112AAPL20215.525600e+103.238880e+11NaN
213AAPL20225.741100e+103.510020e+11NaN
314AAPL2023NaN3.527550e+11NaN
415AAPL20249.980300e+103.525830e+11NaN

5 rows × 6 columns

Step 2: Create a Lag Manually

First, we sort the dataset by company (ticker) and time (year). Second, we create a lagged variable total_assets_lag using Pandas’ built-in .shift() method.

Next, we inspect the DataFrame:

indextickeryearnet_incometotal_assetsrevenuetotal_assets_lag
011AAPL20205.953100e+103.385160e+11NaNNaN
112AAPL20215.525600e+103.238880e+11NaN3.385160e+11
213AAPL20225.741100e+103.510020e+11NaN3.238880e+11
314AAPL2023NaN3.527550e+11NaN3.510020e+11
415AAPL20249.980300e+103.525830e+11NaN3.527550e+11
527ADBE20202.591000e+092.076200e+109.030000e+093.525830e+11
628ADBE20212.951000e+092.428400e+101.117100e+102.076200e+10
729ADBE20225.260000e+092.724100e+101.286800e+102.428400e+10
830ADBE20234.822000e+092.716500e+101.578500e+102.724100e+10
931ADBE20244.756000e+092.977900e+101.760600e+102.716500e+10

10 rows × 7 columns

Problem: We see that indeed a lagged version of total assets is added to the DataFrame. However, this method does not take into account whether the previous row belongs to the same company. It simply shifts row-wise, even across companies. We see for example that total_assets_lag for Adobe in 2020 equals Apple’s Total Assets for 2024. Obviously, this is not what we want.

Step 3: Fix Grouping Issues Manually

To address this, we can conditionally shift values only if the previous row belongs to the same ticker:

Inspection of the first 10 rows of the DataFrame indicates that this indeed resolved the issue.

indextickeryearnet_incometotal_assetsrevenuetotal_assets_lag
011AAPL20205.953100e+103.385160e+11NaNNaN
112AAPL20215.525600e+103.238880e+11NaN3.385160e+11
213AAPL20225.741100e+103.510020e+11NaN3.238880e+11
314AAPL2023NaN3.527550e+11NaN3.510020e+11
415AAPL20249.980300e+103.525830e+11NaN3.527550e+11
527ADBE20202.591000e+092.076200e+109.030000e+09NaN
628ADBE20212.951000e+092.428400e+101.117100e+102.076200e+10
729ADBE20225.260000e+092.724100e+101.286800e+102.428400e+10
830ADBE20234.822000e+092.716500e+101.578500e+102.724100e+10
931ADBE20244.756000e+092.977900e+101.760600e+102.716500e+10

10 rows × 7 columns

However, there are still cases where the calculation did not happen as expected. For example, take a look at the observations for Nvidia and Tesla:

indextickeryearnet_incometotal_assetsrevenuetotal_assets_lag
71209NVDA20203.047000e+091.329200e+10NaNNaN
72210NVDA20214.141000e+091.731500e+10NaN1.329200e+10
73212NVDA20234.332000e+094.418700e+101.667500e+101.731500e+10
74213NVDA20249.752000e+094.118200e+102.691400e+104.418700e+10
75214NVDA20254.368000e+096.572800e+102.697400e+104.118200e+10
87256TSLA2020-9.760000e+083.430900e+102.146100e+10NaN
88258TSLA20227.210000e+086.213100e+103.153600e+103.430900e+10
89259TSLA20235.519000e+098.233800e+105.382300e+106.213100e+10
90260TSLA20241.255600e+101.066180e+118.146200e+108.233800e+10

9 rows × 7 columns

Even when years are missing, the current logic carries forward the last available value, which may be from two or more years ago. As a result of data for NVIDIA for 2022 being missing, the lagged value total_assets_lag for 2023 is now incorrectly set equal to the total assets for 2021. Similarly, the lagged total assets for Tesla for 2022 are incorrectly set equal to the value for 2020.

Step 4: Handle Missing Years

Let’s refine the condition further. Our condition should verify that the previous row belongs to the same ticker and that the year increment equals one:

Let’s inspect the DataFrame again:

indextickeryearnet_incometotal_assetsrevenuetotal_assets_lag
71209NVDA20203.047000e+091.329200e+10NaNNaN
72210NVDA20214.141000e+091.731500e+10NaN1.329200e+10
73212NVDA20234.332000e+094.418700e+101.667500e+10NaN
74213NVDA20249.752000e+094.118200e+102.691400e+104.418700e+10
75214NVDA20254.368000e+096.572800e+102.697400e+104.118200e+10
87256TSLA2020-9.760000e+083.430900e+102.146100e+10NaN
88258TSLA20227.210000e+086.213100e+103.153600e+10NaN
89259TSLA20235.519000e+098.233800e+105.382300e+106.213100e+10
90260TSLA20241.255600e+101.066180e+118.146200e+108.233800e+10

9 rows × 7 columns

This approach works, but the code becomes verbose and difficult to scale. If we want a 2-year lag, the condition becomes even more complex:

Step 5: Let Pychemist Do the Work

To simplify and generalize this process, we can use Pychemist’s built-in lag function.

Note: this requires installing and importing the Pychemist library as done at the start of this tutorial.

We can now generate a lagged variable as follows:

This automatically handles company grouping and time consistency. The chem.lag method also handles important edge cases automatically. For instance, it only assigns lag values when both the grouping variable (e.g., ticker) matches and the year variable increments by exactly the lag interval. This means that if a year is missing in the dataset, the function will not incorrectly carry over data from a non-consecutive year, and will instead return NaN as expected. If the lagged column already exists, it will not be overwritten, and a warning will be issued to prevent accidental data loss.

Step 6: Lag Multiple Columns

To created multiple lagged variables at once, we can set a list of variables for which lagged variables should be computed. We set replace=True because the DataFrame already contains the lagged variable for total_assets. Without this argument the function would raise a warning that this variable already exists in the DataFrame. Alternatively, we could drop this column manually.

We can see that the results are as expected:

indextickeryearnet_incometotal_assetsrevenuetotal_assets_lagrevenue_lag
71209NVDA20203.047000e+091.329200e+10NaNNaNNaN
72210NVDA20214.141000e+091.731500e+10NaN1.329200e+10NaN
73212NVDA20234.332000e+094.418700e+101.667500e+10NaNNaN
74213NVDA20249.752000e+094.118200e+102.691400e+104.418700e+101.667500e+10
75214NVDA20254.368000e+096.572800e+102.697400e+104.118200e+102.691400e+10
87256TSLA2020-9.760000e+083.430900e+102.146100e+10NaNNaN
88258TSLA20227.210000e+086.213100e+103.153600e+10NaNNaN
89259TSLA20235.519000e+098.233800e+105.382300e+106.213100e+103.153600e+10
90260TSLA20241.255600e+101.066180e+118.146200e+108.233800e+105.382300e+10

9 rows × 8 columns

To generate 2-year lags (or more), simply pass a ‘shift’ parameter:

This results in the following DataFrame:

indextickeryearnet_incometotal_assetsrevenuetotal_assets_lagrevenue_lagtotal_assets_lag2revenue_lag2
011AAPL20205.953100e+103.385160e+11NaNNaNNaNNaNNaN
112AAPL20215.525600e+103.238880e+11NaN3.385160e+11NaNNaNNaN
213AAPL20225.741100e+103.510020e+11NaN3.238880e+11NaN3.385160e+11NaN
314AAPL2023NaN3.527550e+11NaN3.510020e+11NaN3.238880e+11NaN
415AAPL20249.980300e+103.525830e+11NaN3.527550e+11NaN3.510020e+11NaN

5 rows × 10 columns

Step 7: Create Lead Variables

Creating lead variables (i.e., future values) is just as easy. Simply use .chem.lead instead:

The DataFrame would look as follows:

indextickeryearnet_incometotal_assetsrevenuetotal_assets_lagrevenue_lagtotal_assets_lag2revenue_lag2total_assets_leadrevenue_lead
011AAPL20205.953100e+103.385160e+11NaNNaNNaNNaNNaN3.238880e+11NaN
112AAPL20215.525600e+103.238880e+11NaN3.385160e+11NaNNaNNaN3.510020e+11NaN
213AAPL20225.741100e+103.510020e+11NaN3.238880e+11NaN3.385160e+11NaN3.527550e+11NaN
314AAPL2023NaN3.527550e+11NaN3.510020e+11NaN3.238880e+11NaN3.525830e+11NaN
415AAPL20249.980300e+103.525830e+11NaN3.527550e+11NaN3.510020e+11NaNNaNNaN

5 rows × 12 columns

Step 8: Compute Derived Metrics

Now that we have lags, we can compute variables such as Return on Assets (ROA) or Revenue Growth:

Inspect the results:

indextickeryearnet_incometotal_assetsrevenuetotal_assets_lagrevenue_lagtotal_assets_lag2revenue_lag2total_assets_leadrevenue_leadroagrowth
71209NVDA20203.047000e+091.329200e+10NaNNaNNaNNaNNaN1.731500e+10NaNNaNNaN
72210NVDA20214.141000e+091.731500e+10NaN1.329200e+10NaNNaNNaNNaNNaN0.270592NaN
73212NVDA20234.332000e+094.418700e+101.667500e+10NaNNaN1.731500e+10NaN4.118200e+102.691400e+10NaNNaN
74213NVDA20249.752000e+094.118200e+102.691400e+104.418700e+101.667500e+10NaNNaN6.572800e+102.697400e+100.2284670.614033
75214NVDA20254.368000e+096.572800e+102.697400e+104.118200e+102.691400e+104.418700e+101.667500e+10NaNNaN0.0817140.002229
87256TSLA2020-9.760000e+083.430900e+102.146100e+10NaNNaNNaNNaNNaNNaNNaNNaN
88258TSLA20227.210000e+086.213100e+103.153600e+10NaNNaN3.430900e+102.146100e+108.233800e+105.382300e+10NaNNaN
89259TSLA20235.519000e+098.233800e+105.382300e+106.213100e+103.153600e+10NaNNaN1.066180e+118.146200e+100.0764040.706716
90260TSLA20241.255600e+101.066180e+118.146200e+108.233800e+105.382300e+106.213100e+103.153600e+10NaNNaN0.1328990.513517

9 rows × 14 columns

Conclusion

In this tutorial, we learned how to create lagged and lead variables manually using pandas, and why that approach often falls short, especially in grouped, irregular panel data.

Then, we saw how the Pychemist library solves these problems elegantly:

  • Accurate grouping
  • Handles gaps in time
  • Clean, concise API
  • Easy multi-period shifts
  • Works seamlessly with pandas

Whether you’re building financial models or prepping data for regression analysis, Pychemist streamlines your workflow and eliminates common mistakes.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top