Introduction
In this tutorial, we examine how to create lagged and lead variables: essential tools for time series and panel data analysis. Whether you’re modeling financial trends or running regressions, such transformations are often essential to compute new variables, such as Return on Assets or Revenue Growth. We’ll also explore how the Pychemist
library simplifies this task.
We use pandas
and the pychemist
library to demonstrate this. For this tutorial we rely on a dataset of financial information for 20 large US companies, scraped from EDGAR. Some missing values in the dataset have been imputed for demonstration purposes.
While pandas lets you create lagged values using .shift()
, it doesn’t always behave as expected for grouped data or irregular time series. Pychemist provides a simpler and more reliable alternative through its .chem
accessor.
Step 1: Load the Data and Libraries
We start by importing the necessary libraries and loading the dataset. pandas
and numpy
are used for working with panel datasets. Additionally, we import pychemist
to download the financial dataset and to enable the custom pandas extensions (accessors) used in this example.
Note: If you haven’t installed pychemist yet, you can do so with the following command:
1 |
pip install pychemist |
Import the required libraries and load the dataset into a DataFrame (df):
1 2 3 4 5 |
import pandas as pd import numpy as np import pychemist df = pychemist.load('financials') |
Let’s inspect the first few rows:
1 |
df.head() |
index | ticker | year | net_income | total_assets | revenue | |
---|---|---|---|---|---|---|
0 | 11 | AAPL | 2020 | 5.953100e+10 | 3.385160e+11 | NaN |
1 | 12 | AAPL | 2021 | 5.525600e+10 | 3.238880e+11 | NaN |
2 | 13 | AAPL | 2022 | 5.741100e+10 | 3.510020e+11 | NaN |
3 | 14 | AAPL | 2023 | NaN | 3.527550e+11 | NaN |
4 | 15 | AAPL | 2024 | 9.980300e+10 | 3.525830e+11 | NaN |
5 rows × 6 columns
Step 2: Create a Lag Manually
First, we sort the dataset by company (ticker
) and time (year
). Second, we create a lagged variable total_assets_lag
using Pandas’ built-in .shift()
method.
1 2 |
df = df.sort_values(['ticker', 'year']) df['total_assets_lag'] = df['total_assets'].shift() |
Next, we inspect the DataFrame:
1 |
df.head(10) |
index | ticker | year | net_income | total_assets | revenue | total_assets_lag | |
---|---|---|---|---|---|---|---|
0 | 11 | AAPL | 2020 | 5.953100e+10 | 3.385160e+11 | NaN | NaN |
1 | 12 | AAPL | 2021 | 5.525600e+10 | 3.238880e+11 | NaN | 3.385160e+11 |
2 | 13 | AAPL | 2022 | 5.741100e+10 | 3.510020e+11 | NaN | 3.238880e+11 |
3 | 14 | AAPL | 2023 | NaN | 3.527550e+11 | NaN | 3.510020e+11 |
4 | 15 | AAPL | 2024 | 9.980300e+10 | 3.525830e+11 | NaN | 3.527550e+11 |
5 | 27 | ADBE | 2020 | 2.591000e+09 | 2.076200e+10 | 9.030000e+09 | 3.525830e+11 |
6 | 28 | ADBE | 2021 | 2.951000e+09 | 2.428400e+10 | 1.117100e+10 | 2.076200e+10 |
7 | 29 | ADBE | 2022 | 5.260000e+09 | 2.724100e+10 | 1.286800e+10 | 2.428400e+10 |
8 | 30 | ADBE | 2023 | 4.822000e+09 | 2.716500e+10 | 1.578500e+10 | 2.724100e+10 |
9 | 31 | ADBE | 2024 | 4.756000e+09 | 2.977900e+10 | 1.760600e+10 | 2.716500e+10 |
10 rows × 7 columns
Problem: We see that indeed a lagged version of total assets
is added to the DataFrame. However, this method does not take into account whether the previous row belongs to the same company. It simply shifts row-wise, even across companies. We see for example that total_assets_lag
for Adobe in 2020 equals Apple’s Total Assets for 2024. Obviously, this is not what we want.
Step 3: Fix Grouping Issues Manually
To address this, we can conditionally shift values only if the previous row belongs to the same ticker
:
1 2 3 4 5 6 |
df = df.sort_values(['ticker', 'year']) df['total_assets_lag'] = np.where( df['ticker'] == df['ticker'].shift(), df['total_assets'].shift(), np.nan ) |
Inspection of the first 10 rows of the DataFrame indicates that this indeed resolved the issue.
1 |
df.head(10) |
index | ticker | year | net_income | total_assets | revenue | total_assets_lag | |
---|---|---|---|---|---|---|---|
0 | 11 | AAPL | 2020 | 5.953100e+10 | 3.385160e+11 | NaN | NaN |
1 | 12 | AAPL | 2021 | 5.525600e+10 | 3.238880e+11 | NaN | 3.385160e+11 |
2 | 13 | AAPL | 2022 | 5.741100e+10 | 3.510020e+11 | NaN | 3.238880e+11 |
3 | 14 | AAPL | 2023 | NaN | 3.527550e+11 | NaN | 3.510020e+11 |
4 | 15 | AAPL | 2024 | 9.980300e+10 | 3.525830e+11 | NaN | 3.527550e+11 |
5 | 27 | ADBE | 2020 | 2.591000e+09 | 2.076200e+10 | 9.030000e+09 | NaN |
6 | 28 | ADBE | 2021 | 2.951000e+09 | 2.428400e+10 | 1.117100e+10 | 2.076200e+10 |
7 | 29 | ADBE | 2022 | 5.260000e+09 | 2.724100e+10 | 1.286800e+10 | 2.428400e+10 |
8 | 30 | ADBE | 2023 | 4.822000e+09 | 2.716500e+10 | 1.578500e+10 | 2.724100e+10 |
9 | 31 | ADBE | 2024 | 4.756000e+09 | 2.977900e+10 | 1.760600e+10 | 2.716500e+10 |
10 rows × 7 columns
However, there are still cases where the calculation did not happen as expected. For example, take a look at the observations for Nvidia and Tesla:
1 |
df.query('ticker=="NVDA" or ticker=="TSLA"') |
index | ticker | year | net_income | total_assets | revenue | total_assets_lag | |
---|---|---|---|---|---|---|---|
71 | 209 | NVDA | 2020 | 3.047000e+09 | 1.329200e+10 | NaN | NaN |
72 | 210 | NVDA | 2021 | 4.141000e+09 | 1.731500e+10 | NaN | 1.329200e+10 |
73 | 212 | NVDA | 2023 | 4.332000e+09 | 4.418700e+10 | 1.667500e+10 | 1.731500e+10 |
74 | 213 | NVDA | 2024 | 9.752000e+09 | 4.118200e+10 | 2.691400e+10 | 4.418700e+10 |
75 | 214 | NVDA | 2025 | 4.368000e+09 | 6.572800e+10 | 2.697400e+10 | 4.118200e+10 |
87 | 256 | TSLA | 2020 | -9.760000e+08 | 3.430900e+10 | 2.146100e+10 | NaN |
88 | 258 | TSLA | 2022 | 7.210000e+08 | 6.213100e+10 | 3.153600e+10 | 3.430900e+10 |
89 | 259 | TSLA | 2023 | 5.519000e+09 | 8.233800e+10 | 5.382300e+10 | 6.213100e+10 |
90 | 260 | TSLA | 2024 | 1.255600e+10 | 1.066180e+11 | 8.146200e+10 | 8.233800e+10 |
9 rows × 7 columns
Even when years are missing, the current logic carries forward the last available value, which may be from two or more years ago. As a result of data for NVIDIA for 2022 being missing, the lagged value total_assets_lag
for 2023 is now incorrectly set equal to the total assets for 2021. Similarly, the lagged total assets for Tesla for 2022 are incorrectly set equal to the value for 2020.
Step 4: Handle Missing Years
Let’s refine the condition further. Our condition should verify that the previous row belongs to the same ticker
and that the year
increment equals one:
1 2 3 4 5 6 7 |
df = df.sort_values(['ticker', 'year']) df['total_assets_lag'] = np.where( (df['ticker'] == df['ticker'].shift()) & (df['year'] - df['year'].shift() == 1), df['total_assets'].shift(), np.nan ) |
Let’s inspect the DataFrame again:
1 |
df.query('ticker=="NVDA" or ticker=="TSLA"') |
index | ticker | year | net_income | total_assets | revenue | total_assets_lag | |
---|---|---|---|---|---|---|---|
71 | 209 | NVDA | 2020 | 3.047000e+09 | 1.329200e+10 | NaN | NaN |
72 | 210 | NVDA | 2021 | 4.141000e+09 | 1.731500e+10 | NaN | 1.329200e+10 |
73 | 212 | NVDA | 2023 | 4.332000e+09 | 4.418700e+10 | 1.667500e+10 | NaN |
74 | 213 | NVDA | 2024 | 9.752000e+09 | 4.118200e+10 | 2.691400e+10 | 4.418700e+10 |
75 | 214 | NVDA | 2025 | 4.368000e+09 | 6.572800e+10 | 2.697400e+10 | 4.118200e+10 |
87 | 256 | TSLA | 2020 | -9.760000e+08 | 3.430900e+10 | 2.146100e+10 | NaN |
88 | 258 | TSLA | 2022 | 7.210000e+08 | 6.213100e+10 | 3.153600e+10 | NaN |
89 | 259 | TSLA | 2023 | 5.519000e+09 | 8.233800e+10 | 5.382300e+10 | 6.213100e+10 |
90 | 260 | TSLA | 2024 | 1.255600e+10 | 1.066180e+11 | 8.146200e+10 | 8.233800e+10 |
9 rows × 7 columns
This approach works, but the code becomes verbose and difficult to scale. If we want a 2-year lag, the condition becomes even more complex:
1 2 3 4 5 6 7 |
df = df.sort_values(['ticker', 'year']) df['total_assets_lag2'] = np.where( (df['ticker'] == df['ticker'].shift(2)) & (df['year'] - df['year'].shift(2) == 2), df['total_assets'].shift(2), np.nan ) |
Step 5: Let Pychemist Do the Work
To simplify and generalize this process, we can use Pychemist
’s built-in lag function.
Note: this requires installing and importing the Pychemist library as done at the start of this tutorial.
We can now generate a lagged variable as follows:
1 |
df = df.chem.lag('total_assets', 'ticker', 'year') |
This automatically handles company grouping and time consistency. The chem.lag
method also handles important edge cases automatically. For instance, it only assigns lag values when both the grouping variable (e.g., ticker
) matches and the year
variable increments by exactly the lag interval. This means that if a year is missing in the dataset, the function will not incorrectly carry over data from a non-consecutive year, and will instead return NaN
as expected. If the lagged column already exists, it will not be overwritten, and a warning will be issued to prevent accidental data loss.
Step 6: Lag Multiple Columns
To created multiple lagged variables at once, we can set a list of variables for which lagged variables should be computed. We set replace=True
because the DataFrame already contains the lagged variable for total_assets. Without this argument the function would raise a warning that this variable already exists in the DataFrame. Alternatively, we could drop this column manually.
1 |
df = df.chem.lag(['total_assets', 'revenue'], 'ticker', 'year', replace=True) |
We can see that the results are as expected:
1 |
df.query('ticker=="NVDA" or ticker=="TSLA"') |
index | ticker | year | net_income | total_assets | revenue | total_assets_lag | revenue_lag | |
---|---|---|---|---|---|---|---|---|
71 | 209 | NVDA | 2020 | 3.047000e+09 | 1.329200e+10 | NaN | NaN | NaN |
72 | 210 | NVDA | 2021 | 4.141000e+09 | 1.731500e+10 | NaN | 1.329200e+10 | NaN |
73 | 212 | NVDA | 2023 | 4.332000e+09 | 4.418700e+10 | 1.667500e+10 | NaN | NaN |
74 | 213 | NVDA | 2024 | 9.752000e+09 | 4.118200e+10 | 2.691400e+10 | 4.418700e+10 | 1.667500e+10 |
75 | 214 | NVDA | 2025 | 4.368000e+09 | 6.572800e+10 | 2.697400e+10 | 4.118200e+10 | 2.691400e+10 |
87 | 256 | TSLA | 2020 | -9.760000e+08 | 3.430900e+10 | 2.146100e+10 | NaN | NaN |
88 | 258 | TSLA | 2022 | 7.210000e+08 | 6.213100e+10 | 3.153600e+10 | NaN | NaN |
89 | 259 | TSLA | 2023 | 5.519000e+09 | 8.233800e+10 | 5.382300e+10 | 6.213100e+10 | 3.153600e+10 |
90 | 260 | TSLA | 2024 | 1.255600e+10 | 1.066180e+11 | 8.146200e+10 | 8.233800e+10 | 5.382300e+10 |
9 rows × 8 columns
To generate 2-year lags (or more), simply pass a ‘shift’ parameter:
1 |
df = df.chem.lag(['total_assets', 'revenue'], 'ticker', 'year', 2) |
This results in the following DataFrame:
1 |
df.head() |
index | ticker | year | net_income | total_assets | revenue | total_assets_lag | revenue_lag | total_assets_lag2 | revenue_lag2 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 11 | AAPL | 2020 | 5.953100e+10 | 3.385160e+11 | NaN | NaN | NaN | NaN | NaN |
1 | 12 | AAPL | 2021 | 5.525600e+10 | 3.238880e+11 | NaN | 3.385160e+11 | NaN | NaN | NaN |
2 | 13 | AAPL | 2022 | 5.741100e+10 | 3.510020e+11 | NaN | 3.238880e+11 | NaN | 3.385160e+11 | NaN |
3 | 14 | AAPL | 2023 | NaN | 3.527550e+11 | NaN | 3.510020e+11 | NaN | 3.238880e+11 | NaN |
4 | 15 | AAPL | 2024 | 9.980300e+10 | 3.525830e+11 | NaN | 3.527550e+11 | NaN | 3.510020e+11 | NaN |
5 rows × 10 columns
Step 7: Create Lead Variables
Creating lead variables (i.e., future values) is just as easy. Simply use .chem.lead
instead:
1 |
df = df.chem.lead(['total_assets', 'revenue'], 'ticker', 'year') |
The DataFrame would look as follows:
1 |
df.head() |
index | ticker | year | net_income | total_assets | revenue | total_assets_lag | revenue_lag | total_assets_lag2 | revenue_lag2 | total_assets_lead | revenue_lead | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 11 | AAPL | 2020 | 5.953100e+10 | 3.385160e+11 | NaN | NaN | NaN | NaN | NaN | 3.238880e+11 | NaN |
1 | 12 | AAPL | 2021 | 5.525600e+10 | 3.238880e+11 | NaN | 3.385160e+11 | NaN | NaN | NaN | 3.510020e+11 | NaN |
2 | 13 | AAPL | 2022 | 5.741100e+10 | 3.510020e+11 | NaN | 3.238880e+11 | NaN | 3.385160e+11 | NaN | 3.527550e+11 | NaN |
3 | 14 | AAPL | 2023 | NaN | 3.527550e+11 | NaN | 3.510020e+11 | NaN | 3.238880e+11 | NaN | 3.525830e+11 | NaN |
4 | 15 | AAPL | 2024 | 9.980300e+10 | 3.525830e+11 | NaN | 3.527550e+11 | NaN | 3.510020e+11 | NaN | NaN | NaN |
5 rows × 12 columns
Step 8: Compute Derived Metrics
Now that we have lags, we can compute variables such as Return on Assets (ROA) or Revenue Growth:
1 2 3 4 |
df=df.eval(""" roa=net_income / ((total_assets+total_assets_lag)/2) growth = (revenue-revenue_lag) / revenue_lag """) |
Inspect the results:
1 |
df.query('ticker=="NVDA" or ticker=="TSLA"') |
index | ticker | year | net_income | total_assets | revenue | total_assets_lag | revenue_lag | total_assets_lag2 | revenue_lag2 | total_assets_lead | revenue_lead | roa | growth | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
71 | 209 | NVDA | 2020 | 3.047000e+09 | 1.329200e+10 | NaN | NaN | NaN | NaN | NaN | 1.731500e+10 | NaN | NaN | NaN |
72 | 210 | NVDA | 2021 | 4.141000e+09 | 1.731500e+10 | NaN | 1.329200e+10 | NaN | NaN | NaN | NaN | NaN | 0.270592 | NaN |
73 | 212 | NVDA | 2023 | 4.332000e+09 | 4.418700e+10 | 1.667500e+10 | NaN | NaN | 1.731500e+10 | NaN | 4.118200e+10 | 2.691400e+10 | NaN | NaN |
74 | 213 | NVDA | 2024 | 9.752000e+09 | 4.118200e+10 | 2.691400e+10 | 4.418700e+10 | 1.667500e+10 | NaN | NaN | 6.572800e+10 | 2.697400e+10 | 0.228467 | 0.614033 |
75 | 214 | NVDA | 2025 | 4.368000e+09 | 6.572800e+10 | 2.697400e+10 | 4.118200e+10 | 2.691400e+10 | 4.418700e+10 | 1.667500e+10 | NaN | NaN | 0.081714 | 0.002229 |
87 | 256 | TSLA | 2020 | -9.760000e+08 | 3.430900e+10 | 2.146100e+10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
88 | 258 | TSLA | 2022 | 7.210000e+08 | 6.213100e+10 | 3.153600e+10 | NaN | NaN | 3.430900e+10 | 2.146100e+10 | 8.233800e+10 | 5.382300e+10 | NaN | NaN |
89 | 259 | TSLA | 2023 | 5.519000e+09 | 8.233800e+10 | 5.382300e+10 | 6.213100e+10 | 3.153600e+10 | NaN | NaN | 1.066180e+11 | 8.146200e+10 | 0.076404 | 0.706716 |
90 | 260 | TSLA | 2024 | 1.255600e+10 | 1.066180e+11 | 8.146200e+10 | 8.233800e+10 | 5.382300e+10 | 6.213100e+10 | 3.153600e+10 | NaN | NaN | 0.132899 | 0.513517 |
9 rows × 14 columns
Conclusion
In this tutorial, we learned how to create lagged and lead variables manually using pandas, and why that approach often falls short, especially in grouped, irregular panel data.
Then, we saw how the Pychemist library solves these problems elegantly:
- Accurate grouping
- Handles gaps in time
- Clean, concise API
- Easy multi-period shifts
- Works seamlessly with pandas
Whether you’re building financial models or prepping data for regression analysis, Pychemist
streamlines your workflow and eliminates common mistakes.