Creating Lag and Lead Variables

Introduction

In this tutorial, we examine how to create lagged and lead variables: essential tools for time series and panel data analysis. Whether you’re modeling financial trends or running regressions, such transformations are often essential to compute new variables, such as Return on Assets or Revenue Growth. We’ll also explore how the Pychemist library simplifies this task.

We use pandas and the pychemist library to demonstrate this. For this tutorial we rely on a dataset of financial information for 20 large US companies, scraped from EDGAR. Some missing values in the dataset have been imputed for demonstration purposes.

While pandas lets you create lagged values using .shift(), it doesn’t always behave as expected for grouped data or irregular time series. Pychemist provides a simpler and more reliable alternative through its .chemaccessor.

Step 1: Load the Data and Libraries

We start by importing the necessary libraries and loading the dataset. pandas and numpy are used for working with panel datasets. Additionally, we import pychemist to download the financial dataset and to enable the custom pandas extensions (accessors) used in this example.

Note: If you haven’t installed pychemist yet, you can do so with the following command:

1	pip install pychemist

Import the required libraries and load the dataset into a DataFrame (df):

import pandas as pd

import numpy as np

import pychemist

df = pychemist.load('financials')

Let’s inspect the first few rows:

df.head()

	index	ticker	year	net_income	total_assets	revenue
0	11	AAPL	2020	5.953100e+10	3.385160e+11	NaN
1	12	AAPL	2021	5.525600e+10	3.238880e+11	NaN
2	13	AAPL	2022	5.741100e+10	3.510020e+11	NaN
3	14	AAPL	2023	NaN	3.527550e+11	NaN
4	15	AAPL	2024	9.980300e+10	3.525830e+11	NaN

5 rows × 6 columns

Step 2: Create a Lag Manually

First, we sort the dataset by company (ticker) and time (year). Second, we create a lagged variable total_assets_lag using Pandas’ built-in .shift() method.

1 2	df = df.sort_values(['ticker', 'year']) df['total_assets_lag'] = df['total_assets'].shift()

Next, we inspect the DataFrame:

1	df.head(10)

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag
0	11	AAPL	2020	5.953100e+10	3.385160e+11	NaN	NaN
1	12	AAPL	2021	5.525600e+10	3.238880e+11	NaN	3.385160e+11
2	13	AAPL	2022	5.741100e+10	3.510020e+11	NaN	3.238880e+11
3	14	AAPL	2023	NaN	3.527550e+11	NaN	3.510020e+11
4	15	AAPL	2024	9.980300e+10	3.525830e+11	NaN	3.527550e+11
5	27	ADBE	2020	2.591000e+09	2.076200e+10	9.030000e+09	3.525830e+11
6	28	ADBE	2021	2.951000e+09	2.428400e+10	1.117100e+10	2.076200e+10
7	29	ADBE	2022	5.260000e+09	2.724100e+10	1.286800e+10	2.428400e+10
8	30	ADBE	2023	4.822000e+09	2.716500e+10	1.578500e+10	2.724100e+10
9	31	ADBE	2024	4.756000e+09	2.977900e+10	1.760600e+10	2.716500e+10

10 rows × 7 columns

Problem: We see that indeed a lagged version of total assets is added to the DataFrame. However, this method does not take into account whether the previous row belongs to the same company. It simply shifts row-wise, even across companies. We see for example that total_assets_lag for Adobe in 2020 equals Apple’s Total Assets for 2024. Obviously, this is not what we want.

Step 3: Fix Grouping Issues Manually

To address this, we can conditionally shift values only if the previous row belongs to the same ticker:

df = df.sort_values(['ticker', 'year'])

df['total_assets_lag'] = np.where(

df['ticker'] == df['ticker'].shift(),

df['total_assets'].shift(),

np.nan

)

Inspection of the first 10 rows of the DataFrame indicates that this indeed resolved the issue.

1	df.head(10)

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag
0	11	AAPL	2020	5.953100e+10	3.385160e+11	NaN	NaN
1	12	AAPL	2021	5.525600e+10	3.238880e+11	NaN	3.385160e+11
2	13	AAPL	2022	5.741100e+10	3.510020e+11	NaN	3.238880e+11
3	14	AAPL	2023	NaN	3.527550e+11	NaN	3.510020e+11
4	15	AAPL	2024	9.980300e+10	3.525830e+11	NaN	3.527550e+11
5	27	ADBE	2020	2.591000e+09	2.076200e+10	9.030000e+09	NaN
6	28	ADBE	2021	2.951000e+09	2.428400e+10	1.117100e+10	2.076200e+10
7	29	ADBE	2022	5.260000e+09	2.724100e+10	1.286800e+10	2.428400e+10
8	30	ADBE	2023	4.822000e+09	2.716500e+10	1.578500e+10	2.724100e+10
9	31	ADBE	2024	4.756000e+09	2.977900e+10	1.760600e+10	2.716500e+10

10 rows × 7 columns

However, there are still cases where the calculation did not happen as expected. For example, take a look at the observations for Nvidia and Tesla:

1	df.query('ticker=="NVDA" or ticker=="TSLA"')

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag
71	209	NVDA	2020	3.047000e+09	1.329200e+10	NaN	NaN
72	210	NVDA	2021	4.141000e+09	1.731500e+10	NaN	1.329200e+10
73	212	NVDA	2023	4.332000e+09	4.418700e+10	1.667500e+10	1.731500e+10
74	213	NVDA	2024	9.752000e+09	4.118200e+10	2.691400e+10	4.418700e+10
75	214	NVDA	2025	4.368000e+09	6.572800e+10	2.697400e+10	4.118200e+10
87	256	TSLA	2020	-9.760000e+08	3.430900e+10	2.146100e+10	NaN
88	258	TSLA	2022	7.210000e+08	6.213100e+10	3.153600e+10	3.430900e+10
89	259	TSLA	2023	5.519000e+09	8.233800e+10	5.382300e+10	6.213100e+10
90	260	TSLA	2024	1.255600e+10	1.066180e+11	8.146200e+10	8.233800e+10

9 rows × 7 columns

Even when years are missing, the current logic carries forward the last available value, which may be from two or more years ago. As a result of data for NVIDIA for 2022 being missing, the lagged value total_assets_lag for 2023 is now incorrectly set equal to the total assets for 2021. Similarly, the lagged total assets for Tesla for 2022 are incorrectly set equal to the value for 2020.

Step 4: Handle Missing Years

Let’s refine the condition further. Our condition should verify that the previous row belongs to the same ticker and that the year increment equals one:

df = df.sort_values(['ticker', 'year'])

df['total_assets_lag'] = np.where(

(df['ticker'] == df['ticker'].shift()) &

(df['year'] - df['year'].shift() == 1),

df['total_assets'].shift(),

np.nan

)

Let’s inspect the DataFrame again:

1	df.query('ticker=="NVDA" or ticker=="TSLA"')

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag
71	209	NVDA	2020	3.047000e+09	1.329200e+10	NaN	NaN
72	210	NVDA	2021	4.141000e+09	1.731500e+10	NaN	1.329200e+10
73	212	NVDA	2023	4.332000e+09	4.418700e+10	1.667500e+10	NaN
74	213	NVDA	2024	9.752000e+09	4.118200e+10	2.691400e+10	4.418700e+10
75	214	NVDA	2025	4.368000e+09	6.572800e+10	2.697400e+10	4.118200e+10
87	256	TSLA	2020	-9.760000e+08	3.430900e+10	2.146100e+10	NaN
88	258	TSLA	2022	7.210000e+08	6.213100e+10	3.153600e+10	NaN
89	259	TSLA	2023	5.519000e+09	8.233800e+10	5.382300e+10	6.213100e+10
90	260	TSLA	2024	1.255600e+10	1.066180e+11	8.146200e+10	8.233800e+10

9 rows × 7 columns

This approach works, but the code becomes verbose and difficult to scale. If we want a 2-year lag, the condition becomes even more complex:

df = df.sort_values(['ticker', 'year'])

df['total_assets_lag2'] = np.where(

(df['ticker'] == df['ticker'].shift(2)) &

(df['year'] - df['year'].shift(2) == 2),

df['total_assets'].shift(2),

np.nan

)

Step 5: Let Pychemist Do the Work

To simplify and generalize this process, we can use Pychemist’s built-in lag function.

Note: this requires installing and importing the Pychemist library as done at the start of this tutorial.

We can now generate a lagged variable as follows:

1	df = df.chem.lag('total_assets', 'ticker', 'year')

This automatically handles company grouping and time consistency. The chem.lag method also handles important edge cases automatically. For instance, it only assigns lag values when both the grouping variable (e.g., ticker) matches and the year variable increments by exactly the lag interval. This means that if a year is missing in the dataset, the function will not incorrectly carry over data from a non-consecutive year, and will instead return NaN as expected. If the lagged column already exists, it will not be overwritten, and a warning will be issued to prevent accidental data loss.

Step 6: Lag Multiple Columns

To created multiple lagged variables at once, we can set a list of variables for which lagged variables should be computed. We set replace=True because the DataFrame already contains the lagged variable for total_assets. Without this argument the function would raise a warning that this variable already exists in the DataFrame. Alternatively, we could drop this column manually.

1	df = df.chem.lag(['total_assets', 'revenue'], 'ticker', 'year', replace=True)

We can see that the results are as expected:

1	df.query('ticker=="NVDA" or ticker=="TSLA"')

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag	revenue_lag
71	209	NVDA	2020	3.047000e+09	1.329200e+10	NaN	NaN	NaN
72	210	NVDA	2021	4.141000e+09	1.731500e+10	NaN	1.329200e+10	NaN
73	212	NVDA	2023	4.332000e+09	4.418700e+10	1.667500e+10	NaN	NaN
74	213	NVDA	2024	9.752000e+09	4.118200e+10	2.691400e+10	4.418700e+10	1.667500e+10
75	214	NVDA	2025	4.368000e+09	6.572800e+10	2.697400e+10	4.118200e+10	2.691400e+10
87	256	TSLA	2020	-9.760000e+08	3.430900e+10	2.146100e+10	NaN	NaN
88	258	TSLA	2022	7.210000e+08	6.213100e+10	3.153600e+10	NaN	NaN
89	259	TSLA	2023	5.519000e+09	8.233800e+10	5.382300e+10	6.213100e+10	3.153600e+10
90	260	TSLA	2024	1.255600e+10	1.066180e+11	8.146200e+10	8.233800e+10	5.382300e+10

9 rows × 8 columns

To generate 2-year lags (or more), simply pass a ‘shift’ parameter:

1	df = df.chem.lag(['total_assets', 'revenue'], 'ticker', 'year', 2)

This results in the following DataFrame:

df.head()

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag	revenue_lag	total_assets_lag2	revenue_lag2
0	11	AAPL	2020	5.953100e+10	3.385160e+11	NaN	NaN	NaN	NaN	NaN
1	12	AAPL	2021	5.525600e+10	3.238880e+11	NaN	3.385160e+11	NaN	NaN	NaN
2	13	AAPL	2022	5.741100e+10	3.510020e+11	NaN	3.238880e+11	NaN	3.385160e+11	NaN
3	14	AAPL	2023	NaN	3.527550e+11	NaN	3.510020e+11	NaN	3.238880e+11	NaN
4	15	AAPL	2024	9.980300e+10	3.525830e+11	NaN	3.527550e+11	NaN	3.510020e+11	NaN

5 rows × 10 columns

Step 7: Create Lead Variables

Creating lead variables (i.e., future values) is just as easy. Simply use .chem.lead instead:

1	df = df.chem.lead(['total_assets', 'revenue'], 'ticker', 'year')

The DataFrame would look as follows:

df.head()

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag	revenue_lag	total_assets_lag2	revenue_lag2	total_assets_lead	revenue_lead
0	11	AAPL	2020	5.953100e+10	3.385160e+11	NaN	NaN	NaN	NaN	NaN	3.238880e+11	NaN
1	12	AAPL	2021	5.525600e+10	3.238880e+11	NaN	3.385160e+11	NaN	NaN	NaN	3.510020e+11	NaN
2	13	AAPL	2022	5.741100e+10	3.510020e+11	NaN	3.238880e+11	NaN	3.385160e+11	NaN	3.527550e+11	NaN
3	14	AAPL	2023	NaN	3.527550e+11	NaN	3.510020e+11	NaN	3.238880e+11	NaN	3.525830e+11	NaN
4	15	AAPL	2024	9.980300e+10	3.525830e+11	NaN	3.527550e+11	NaN	3.510020e+11	NaN	NaN	NaN

5 rows × 12 columns

Step 8: Compute Derived Metrics

Now that we have lags, we can compute variables such as Return on Assets (ROA) or Revenue Growth:

df=df.eval("""

roa=net_income / ((total_assets+total_assets_lag)/2)

growth = (revenue-revenue_lag) / revenue_lag

""")

Inspect the results:

1	df.query('ticker=="NVDA" or ticker=="TSLA"')

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag	revenue_lag	total_assets_lag2	revenue_lag2	total_assets_lead	revenue_lead	roa	growth
71	209	NVDA	2020	3.047000e+09	1.329200e+10	NaN	NaN	NaN	NaN	NaN	1.731500e+10	NaN	NaN	NaN
72	210	NVDA	2021	4.141000e+09	1.731500e+10	NaN	1.329200e+10	NaN	NaN	NaN	NaN	NaN	0.270592	NaN
73	212	NVDA	2023	4.332000e+09	4.418700e+10	1.667500e+10	NaN	NaN	1.731500e+10	NaN	4.118200e+10	2.691400e+10	NaN	NaN
74	213	NVDA	2024	9.752000e+09	4.118200e+10	2.691400e+10	4.418700e+10	1.667500e+10	NaN	NaN	6.572800e+10	2.697400e+10	0.228467	0.614033
75	214	NVDA	2025	4.368000e+09	6.572800e+10	2.697400e+10	4.118200e+10	2.691400e+10	4.418700e+10	1.667500e+10	NaN	NaN	0.081714	0.002229
87	256	TSLA	2020	-9.760000e+08	3.430900e+10	2.146100e+10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
88	258	TSLA	2022	7.210000e+08	6.213100e+10	3.153600e+10	NaN	NaN	3.430900e+10	2.146100e+10	8.233800e+10	5.382300e+10	NaN	NaN
89	259	TSLA	2023	5.519000e+09	8.233800e+10	5.382300e+10	6.213100e+10	3.153600e+10	NaN	NaN	1.066180e+11	8.146200e+10	0.076404	0.706716
90	260	TSLA	2024	1.255600e+10	1.066180e+11	8.146200e+10	8.233800e+10	5.382300e+10	6.213100e+10	3.153600e+10	NaN	NaN	0.132899	0.513517

9 rows × 14 columns

Conclusion

In this tutorial, we learned how to create lagged and lead variables manually using pandas, and why that approach often falls short, especially in grouped, irregular panel data.

Then, we saw how the Pychemist library solves these problems elegantly:

Accurate grouping
Handles gaps in time
Clean, concise API
Easy multi-period shifts
Works seamlessly with pandas

Whether you’re building financial models or prepping data for regression analysis, Pychemist streamlines your workflow and eliminates common mistakes.

Introduction

Step 1: Load the Data and Libraries

Step 2: Create a Lag Manually

Step 3: Fix Grouping Issues Manually

Step 4: Handle Missing Years

Step 5: Let Pychemist Do the Work

Step 6: Lag Multiple Columns

Step 7: Create Lead Variables

Step 8: Compute Derived Metrics

Conclusion

Leave a Comment Cancel Reply