Guides

Creating Lag and Lead Variables

Pychemist — Mon, 28 Jul 2025 18:35:57 +0000

Introduction

In this tutorial, we examine how to create lagged and lead variables, which are essential tools for time series and panel data analysis. Whether you’re modeling financial trends or running regressions, you often need these transformations to compute new variables such as Return on Assets or Revenue Growth. We’ll also explore how the Pychemist library simplifies this task.

We use pandas and the pychemist library to demonstrate this. For this tutorial we rely on a dataset of financial information for 20 large US companies, scraped from EDGAR. Some missing values in the dataset have been imputed for demonstration purposes.

While pandas lets you create lagged values using .shift(), a plain shift moves values by row position only, so it ignores group boundaries and gaps in irregular time series. Pychemist provides a simpler and more reliable alternative through its .chem accessor.

Step 1: Load the Data and Libraries

We start by importing the necessary libraries and loading the dataset. pandas and numpy are used for working with panel datasets. Additionally, we import pychemist to download the financial dataset and to enable the custom pandas extensions (accessors) used in this example.

Note: If you haven’t installed pychemist yet, you can do so with the following command:

pip install pychemist

Import the required libraries and load the dataset into a DataFrame (df):

import pandas as pd
import numpy as np
import pychemist

df = pychemist.load('financials')

Let’s inspect the first few rows:

df.head()

	index	ticker	year	net_income	total_assets	revenue
0	11	AAPL	2020	5.953100e+10	3.385160e+11	NaN
1	12	AAPL	2021	5.525600e+10	3.238880e+11	NaN
2	13	AAPL	2022	5.741100e+10	3.510020e+11	NaN
3	14	AAPL	2023	NaN	3.527550e+11	NaN
4	15	AAPL	2024	9.980300e+10	3.525830e+11	NaN

5 rows × 6 columns

Step 2: Create a Lag Manually

First, we sort the dataset by company (ticker) and time (year). Then we create a lagged variable total_assets_lag using pandas’ built-in .shift() method.

df = df.sort_values(['ticker', 'year'])
df['total_assets_lag'] = df['total_assets'].shift()

Next, we inspect the DataFrame:

df.head(10)

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag
0	11	AAPL	2020	5.953100e+10	3.385160e+11	NaN	NaN
1	12	AAPL	2021	5.525600e+10	3.238880e+11	NaN	3.385160e+11
2	13	AAPL	2022	5.741100e+10	3.510020e+11	NaN	3.238880e+11
3	14	AAPL	2023	NaN	3.527550e+11	NaN	3.510020e+11
4	15	AAPL	2024	9.980300e+10	3.525830e+11	NaN	3.527550e+11
5	27	ADBE	2020	2.591000e+09	2.076200e+10	9.030000e+09	3.525830e+11
6	28	ADBE	2021	2.951000e+09	2.428400e+10	1.117100e+10	2.076200e+10
7	29	ADBE	2022	5.260000e+09	2.724100e+10	1.286800e+10	2.428400e+10
8	30	ADBE	2023	4.822000e+09	2.716500e+10	1.578500e+10	2.724100e+10
9	31	ADBE	2024	4.756000e+09	2.977900e+10	1.760600e+10	2.716500e+10

10 rows × 7 columns

Problem: A lagged version of total assets has indeed been added to the DataFrame. However, this method does not check whether the previous row belongs to the same company. It simply shifts row-wise, even across companies. For example, total_assets_lag for Adobe in 2020 equals Apple’s total assets for 2024, which is clearly not what we want.

Step 3: Fix Grouping Issues Manually

To address this, we can conditionally shift values only if the previous row belongs to the same ticker:

df = df.sort_values(['ticker', 'year'])
df['total_assets_lag'] = np.where(
    df['ticker'] == df['ticker'].shift(),
    df['total_assets'].shift(),
    np.nan
)

Inspection of the first 10 rows of the DataFrame indicates that this indeed resolved the issue.

df.head(10)

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag
0	11	AAPL	2020	5.953100e+10	3.385160e+11	NaN	NaN
1	12	AAPL	2021	5.525600e+10	3.238880e+11	NaN	3.385160e+11
2	13	AAPL	2022	5.741100e+10	3.510020e+11	NaN	3.238880e+11
3	14	AAPL	2023	NaN	3.527550e+11	NaN	3.510020e+11
4	15	AAPL	2024	9.980300e+10	3.525830e+11	NaN	3.527550e+11
5	27	ADBE	2020	2.591000e+09	2.076200e+10	9.030000e+09	NaN
6	28	ADBE	2021	2.951000e+09	2.428400e+10	1.117100e+10	2.076200e+10
7	29	ADBE	2022	5.260000e+09	2.724100e+10	1.286800e+10	2.428400e+10
8	30	ADBE	2023	4.822000e+09	2.716500e+10	1.578500e+10	2.724100e+10
9	31	ADBE	2024	4.756000e+09	2.977900e+10	1.760600e+10	2.716500e+10

10 rows × 7 columns
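For comparison, pandas can also shift within groups directly. The following is a minimal sketch on made-up data (the toy frame and its values are assumptions for illustration, not rows from the tutorial dataset) that should give the same kind of result as the np.where condition above:

```python
import pandas as pd

# Toy panel with two companies (illustrative values only).
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "year": [2020, 2021, 2022, 2020, 2021],
    "total_assets": [10.0, 11.0, 12.0, 50.0, 55.0],
})

toy = toy.sort_values(["ticker", "year"])
# Shift within each ticker, so the first row of every company gets NaN.
toy["total_assets_lag"] = toy.groupby("ticker")["total_assets"].shift()
print(toy)
```

Like the np.where version, this relies only on the row order within each group and does not look at the year values themselves.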

However, there are still cases where the calculation did not happen as expected. For example, take a look at the observations for Nvidia and Tesla:

df.query('ticker=="NVDA" or ticker=="TSLA"')

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag
71	209	NVDA	2020	3.047000e+09	1.329200e+10	NaN	NaN
72	210	NVDA	2021	4.141000e+09	1.731500e+10	NaN	1.329200e+10
73	212	NVDA	2023	4.332000e+09	4.418700e+10	1.667500e+10	1.731500e+10
74	213	NVDA	2024	9.752000e+09	4.118200e+10	2.691400e+10	4.418700e+10
75	214	NVDA	2025	4.368000e+09	6.572800e+10	2.697400e+10	4.118200e+10
87	256	TSLA	2020	-9.760000e+08	3.430900e+10	2.146100e+10	NaN
88	258	TSLA	2022	7.210000e+08	6.213100e+10	3.153600e+10	3.430900e+10
89	259	TSLA	2023	5.519000e+09	8.233800e+10	5.382300e+10	6.213100e+10
90	260	TSLA	2024	1.255600e+10	1.066180e+11	8.146200e+10	8.233800e+10

9 rows × 7 columns

When a year is missing, the current logic still takes the value from the previous row, which may be two or more years old. Because the 2022 data for NVIDIA are missing, total_assets_lag for 2023 is incorrectly set to the total assets for 2021. Similarly, the lagged total assets for Tesla in 2022 are incorrectly set to the value for 2020.
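Gaps like these can be located before any lag is computed. The following is a minimal sketch on made-up data (the toy frame and the year_gap column name are assumptions, not part of the dataset) that uses the grouped diff() method in pandas to flag rows whose previous observation is more than one year back:

```python
import pandas as pd

# Toy panel with one missing year per company (illustrative values only).
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "year": [2020, 2021, 2023, 2020, 2022],
})

toy = toy.sort_values(["ticker", "year"])
# Difference in years to the previous row of the same ticker
# (NaN for the first row of each company).
toy["year_gap"] = toy.groupby("ticker")["year"].diff()
# Rows whose previous observation is more than one year back.
gaps = toy[toy["year_gap"] > 1]
print(gaps)
```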

Step 4: Handle Missing Years

Let’s refine the condition further. Our condition should verify that the previous row belongs to the same ticker and that the year increment equals one:

df = df.sort_values(['ticker', 'year'])
df['total_assets_lag'] = np.where(
    (df['ticker'] == df['ticker'].shift()) & 
    (df['year'] - df['year'].shift() == 1),
    df['total_assets'].shift(),
    np.nan
)

Let’s inspect the DataFrame again:

df.query('ticker=="NVDA" or ticker=="TSLA"')

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag
71	209	NVDA	2020	3.047000e+09	1.329200e+10	NaN	NaN
72	210	NVDA	2021	4.141000e+09	1.731500e+10	NaN	1.329200e+10
73	212	NVDA	2023	4.332000e+09	4.418700e+10	1.667500e+10	NaN
74	213	NVDA	2024	9.752000e+09	4.118200e+10	2.691400e+10	4.418700e+10
75	214	NVDA	2025	4.368000e+09	6.572800e+10	2.697400e+10	4.118200e+10
87	256	TSLA	2020	-9.760000e+08	3.430900e+10	2.146100e+10	NaN
88	258	TSLA	2022	7.210000e+08	6.213100e+10	3.153600e+10	NaN
89	259	TSLA	2023	5.519000e+09	8.233800e+10	5.382300e+10	6.213100e+10
90	260	TSLA	2024	1.255600e+10	1.066180e+11	8.146200e+10	8.233800e+10

9 rows × 7 columns

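This approach works for a one-year lag, but the code is verbose and difficult to scale. A 2-year lag needs a new condition, and the positional version below is also incomplete: it returns NaN whenever the row two positions back is not exactly two years earlier, even if a valid observation exists closer by (for NVDA in 2023, the 2021 value sits only one row back):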

df = df.sort_values(['ticker', 'year'])
df['total_assets_lag2'] = np.where(
    (df['ticker'] == df['ticker'].shift(2)) & 
    (df['year'] - df['year'].shift(2) == 2),
    df['total_assets'].shift(2),
    np.nan
)
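One way to generalize the manual logic without counting rows is to match on the time variable itself. The sketch below is an illustration under our own assumptions: lag_by_merge is a hypothetical helper (it is not part of pandas or pychemist), the toy data are made up, and each (group, time) pair is assumed to occur only once. It relabels every value with the period it should serve as a lag for, and then merges it back:

```python
import pandas as pd

def lag_by_merge(df, column, group, time, periods=1):
    """Hypothetical helper: lag `column` by `periods` units of `time` within `group`."""
    suffix = "_lag" if periods == 1 else f"_lag{periods}"
    prev = df[[group, time, column]].copy()
    # Relabel each value with the period for which it is the lag.
    prev[time] = prev[time] + periods
    prev = prev.rename(columns={column: column + suffix})
    # A left merge keeps every original row; periods without a match get NaN.
    return df.merge(prev, on=[group, time], how="left")

# Toy company with 2022 missing (illustrative values only).
toy = pd.DataFrame({
    "ticker": ["AAA"] * 4,
    "year": [2020, 2021, 2023, 2024],
    "total_assets": [1.0, 2.0, 4.0, 5.0],
})

lag1 = lag_by_merge(toy, "total_assets", "ticker", "year")
lag2 = lag_by_merge(toy, "total_assets", "ticker", "year", periods=2)
print(lag2)
```

Because the match is on the year value rather than on row position, the 2-year lag for 2023 picks up the 2021 value even though 2022 is absent.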

Step 5: Let Pychemist Do the Work

To simplify and generalize this process, we can use Pychemist’s built-in lag function.

Note: this requires installing and importing the Pychemist library as done at the start of this tutorial.

We can now generate a lagged variable as follows:

df = df.chem.lag('total_assets', 'ticker', 'year')

This automatically handles company grouping and time consistency. The chem.lag method also handles important edge cases automatically. For instance, it only assigns lag values when both the grouping variable (e.g., ticker) matches and the year variable increments by exactly the lag interval. This means that if a year is missing in the dataset, the function will not incorrectly carry over data from a non-consecutive year, and will instead return NaN as expected. If the lagged column already exists, it will not be overwritten, and a warning will be issued to prevent accidental data loss.
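The no-overwrite behavior described above can be pictured with a small guard. This is only an illustrative sketch under our own assumptions: assign_guarded is a hypothetical helper, not pychemist’s code, and the warning text is invented for the example.

```python
import warnings
import pandas as pd

def assign_guarded(df, name, values, replace=False):
    """Hypothetical helper: add a column, but never overwrite one silently."""
    if name in df.columns and not replace:
        warnings.warn(f"Column {name!r} already exists; pass replace=True to overwrite it.")
        return df
    out = df.copy()
    out[name] = values
    return out

toy = pd.DataFrame({"x": [1, 2], "x_lag": [0, 0]})

# Without replace=True the existing column is kept and a warning is raised.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = assign_guarded(toy, "x_lag", [9, 9])

# With replace=True the column is overwritten in a copy.
replaced = assign_guarded(toy, "x_lag", [9, 9], replace=True)
```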

Step 6: Lag Multiple Columns

To create multiple lagged variables at once, we can pass a list of the columns for which lagged variables should be computed. We set replace=True because the DataFrame already contains the lagged variable for total_assets. Without this argument, the function would issue a warning that this variable already exists in the DataFrame. Alternatively, we could drop this column manually.

df = df.chem.lag(['total_assets', 'revenue'], 'ticker', 'year', replace=True)

We can see that the results are as expected:

df.query('ticker=="NVDA" or ticker=="TSLA"')

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag	revenue_lag
71	209	NVDA	2020	3.047000e+09	1.329200e+10	NaN	NaN	NaN
72	210	NVDA	2021	4.141000e+09	1.731500e+10	NaN	1.329200e+10	NaN
73	212	NVDA	2023	4.332000e+09	4.418700e+10	1.667500e+10	NaN	NaN
74	213	NVDA	2024	9.752000e+09	4.118200e+10	2.691400e+10	4.418700e+10	1.667500e+10
75	214	NVDA	2025	4.368000e+09	6.572800e+10	2.697400e+10	4.118200e+10	2.691400e+10
87	256	TSLA	2020	-9.760000e+08	3.430900e+10	2.146100e+10	NaN	NaN
88	258	TSLA	2022	7.210000e+08	6.213100e+10	3.153600e+10	NaN	NaN
89	259	TSLA	2023	5.519000e+09	8.233800e+10	5.382300e+10	6.213100e+10	3.153600e+10
90	260	TSLA	2024	1.255600e+10	1.066180e+11	8.146200e+10	8.233800e+10	5.382300e+10

9 rows × 8 columns

To generate 2-year lags (or more), simply pass a ‘shift’ parameter:

df = df.chem.lag(['total_assets', 'revenue'], 'ticker', 'year', 2)

This results in the following DataFrame:

df.head()

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag	revenue_lag	total_assets_lag2	revenue_lag2
0	11	AAPL	2020	5.953100e+10	3.385160e+11	NaN	NaN	NaN	NaN	NaN
1	12	AAPL	2021	5.525600e+10	3.238880e+11	NaN	3.385160e+11	NaN	NaN	NaN
2	13	AAPL	2022	5.741100e+10	3.510020e+11	NaN	3.238880e+11	NaN	3.385160e+11	NaN
3	14	AAPL	2023	NaN	3.527550e+11	NaN	3.510020e+11	NaN	3.238880e+11	NaN
4	15	AAPL	2024	9.980300e+10	3.525830e+11	NaN	3.527550e+11	NaN	3.510020e+11	NaN

5 rows × 10 columns

Step 7: Create Lead Variables

Creating lead variables (i.e., future values) is just as easy. Simply use .chem.lead instead:

df = df.chem.lead(['total_assets', 'revenue'], 'ticker', 'year')

The DataFrame would look as follows:

df.head()

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag	revenue_lag	total_assets_lag2	revenue_lag2	total_assets_lead	revenue_lead
0	11	AAPL	2020	5.953100e+10	3.385160e+11	NaN	NaN	NaN	NaN	NaN	3.238880e+11	NaN
1	12	AAPL	2021	5.525600e+10	3.238880e+11	NaN	3.385160e+11	NaN	NaN	NaN	3.510020e+11	NaN
2	13	AAPL	2022	5.741100e+10	3.510020e+11	NaN	3.238880e+11	NaN	3.385160e+11	NaN	3.527550e+11	NaN
3	14	AAPL	2023	NaN	3.527550e+11	NaN	3.510020e+11	NaN	3.238880e+11	NaN	3.525830e+11	NaN
4	15	AAPL	2024	9.980300e+10	3.525830e+11	NaN	3.527550e+11	NaN	3.510020e+11	NaN	NaN	NaN

5 rows × 12 columns
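For reference, the manual pandas logic from Step 4 can be mirrored for a one-year lead by shifting in the opposite direction. This is a minimal sketch on made-up data (the toy frame is an assumption for illustration) and not a description of how pychemist computes leads internally:

```python
import numpy as np
import pandas as pd

# Toy panel; AAA has no observation for 2022 (illustrative values only).
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "year": [2020, 2021, 2023, 2020, 2021],
    "total_assets": [10.0, 11.0, 13.0, 50.0, 55.0],
})

toy = toy.sort_values(["ticker", "year"])
next_ticker = toy["ticker"].shift(-1)
next_year = toy["year"].shift(-1)
# Keep the next row's value only if it is the same company, exactly one year later.
toy["total_assets_lead"] = np.where(
    (toy["ticker"] == next_ticker) & (next_year - toy["year"] == 1),
    toy["total_assets"].shift(-1),
    np.nan,
)
print(toy)
```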

Step 8: Compute Derived Metrics

Now that we have lags, we can compute variables such as Return on Assets (ROA, here net income divided by the average of current and lagged total assets) and Revenue Growth:

df = df.eval("""
    roa = net_income / ((total_assets + total_assets_lag) / 2)
    growth = (revenue - revenue_lag) / revenue_lag
    """)

Inspect the results:

df.query('ticker=="NVDA" or ticker=="TSLA"')

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag	revenue_lag	total_assets_lag2	revenue_lag2	total_assets_lead	revenue_lead	roa	growth
71	209	NVDA	2020	3.047000e+09	1.329200e+10	NaN	NaN	NaN	NaN	NaN	1.731500e+10	NaN	NaN	NaN
72	210	NVDA	2021	4.141000e+09	1.731500e+10	NaN	1.329200e+10	NaN	NaN	NaN	NaN	NaN	0.270592	NaN
73	212	NVDA	2023	4.332000e+09	4.418700e+10	1.667500e+10	NaN	NaN	1.731500e+10	NaN	4.118200e+10	2.691400e+10	NaN	NaN
74	213	NVDA	2024	9.752000e+09	4.118200e+10	2.691400e+10	4.418700e+10	1.667500e+10	NaN	NaN	6.572800e+10	2.697400e+10	0.228467	0.614033
75	214	NVDA	2025	4.368000e+09	6.572800e+10	2.697400e+10	4.118200e+10	2.691400e+10	4.418700e+10	1.667500e+10	NaN	NaN	0.081714	0.002229
87	256	TSLA	2020	-9.760000e+08	3.430900e+10	2.146100e+10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
88	258	TSLA	2022	7.210000e+08	6.213100e+10	3.153600e+10	NaN	NaN	3.430900e+10	2.146100e+10	8.233800e+10	5.382300e+10	NaN	NaN
89	259	TSLA	2023	5.519000e+09	8.233800e+10	5.382300e+10	6.213100e+10	3.153600e+10	NaN	NaN	1.066180e+11	8.146200e+10	0.076404	0.706716
90	260	TSLA	2024	1.255600e+10	1.066180e+11	8.146200e+10	8.233800e+10	5.382300e+10	6.213100e+10	3.153600e+10	NaN	NaN	0.132899	0.513517

9 rows × 14 columns
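As a sanity check, two of the values in the table above can be recomputed by hand. The numbers below are copied from the NVDA rows, and the variable names are only illustrative:

```python
# ROA for NVDA in 2021: net income over average total assets (2021 and 2020).
net_income_2021 = 4.141e9
total_assets_2021 = 1.7315e10
total_assets_2020 = 1.3292e10
roa_2021 = net_income_2021 / ((total_assets_2021 + total_assets_2020) / 2)

# Revenue growth for NVDA in 2024 relative to 2023.
revenue_2024 = 2.6914e10
revenue_2023 = 1.6675e10
growth_2024 = (revenue_2024 - revenue_2023) / revenue_2023

print(round(roa_2021, 6), round(growth_2024, 6))
```

Both results should agree with the roa and growth columns for those rows up to rounding.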

Conclusion

In this tutorial, we learned how to create lagged and lead variables manually using pandas, and why that approach often falls short, especially in grouped, irregular panel data.

Then, we saw how the Pychemist library solves these problems elegantly:

Accurate grouping
Correct handling of gaps in time
Clean, concise API
Easy multi-period shifts
Works seamlessly with pandas

Whether you’re building financial models or prepping data for regression analysis, Pychemist streamlines your workflow and eliminates common mistakes.

Mutate

Pychemist — Sun, 20 Jul 2025 21:38:18 +0000

When preparing data for analysis, it is often necessary to create new variables or modify existing values, whether to fix data entry errors, derive variables from existing ones, or flag subsets of data. In pandas, this is typically done using df.loc or np.where, but these methods can lead to verbose, repetitive, and hard-to-read code.

In this tutorial, we introduce a more readable and expressive alternative using a custom pandas accessor provided by the pychemist library. By leveraging df.query under the hood, the .chem.mutate method allows you to perform chained, conditional assignments in a cleaner and more maintainable way.

To demonstrate how to modify variables, we’ll use the IBM HR Analytics Employee Attrition & Performance dataset, available on Kaggle:
https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset

We start by importing the required libraries and loading the dataset into a DataFrame:

import pandas as pd
import numpy as np

df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')

This creates a new DataFrame df.

We can examine the first five rows of the dataset to get a sense of the variables, their names, and the types of values they contain:

df.head()

	Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	EnvironmentSatisfaction	Gender	HourlyRate	JobInvolvement	JobLevel	JobRole	JobSatisfaction	MaritalStatus	MonthlyIncome	MonthlyRate	NumCompaniesWorked	Over18	OverTime	PercentSalaryHike	PerformanceRating	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
0	41	Yes	Travel_Rarely	1102	Sales	1	2	Life Sciences	1	1	2	Female	94	3	2	Sales Executive	4	Single	5993	19479	8	Y	Yes	11	3	1	80	0	8	0	1	6	4	0	5
1	49	No	Travel_Frequently	279	Research & Development	8	1	Life Sciences	1	2	3	Male	61	2	2	Research Scientist	2	Married	5130	24907	1	Y	No	23	4	4	80	1	10	3	3	10	7	1	7
2	37	Yes	Travel_Rarely	1373	Research & Development	2	2	Other	1	4	4	Male	92	2	1	Laboratory Technician	3	Single	2090	2396	6	Y	Yes	15	3	2	80	0	7	3	3	0	0	0	0
3	33	No	Travel_Frequently	1392	Research & Development	3	4	Life Sciences	1	5	4	Female	56	3	1	Research Scientist	3	Married	2909	23159	1	Y	Yes	11	3	3	80	0	8	3	3	8	7	3	0
4	27	No	Travel_Rarely	591	Research & Development	2	1	Medical	1	7	1	Male	40	3	1	Laboratory Technician	2	Married	3468	16632	9	Y	No	12	3	4	80	1	6	3	3	2	2	2	2

5 rows × 35 columns

Conditional Mutation with Base Pandas

Imagine we want to decide which employees are eligible for a promotion. As an example, we could use NumPy’s where function (np.where) to flag employees who haven’t been promoted in 3 or more years:

df['promotion'] = np.where(df['YearsSinceLastPromotion'] >= 3, 1, 0)

Alternatively, the same logic using df.loc:

df['promotion'] = 0
df.loc[df['YearsSinceLastPromotion'] >= 3, 'promotion'] = 1

Both approaches generate a new column promotion with a value of 1 for employees who haven’t been promoted in 3 or more years, and 0 otherwise.

These methods work well for simple logic. However, suppose we now want to promote employees who meet all the following conditions:

Haven’t been promoted in at least 3 years
Hold the position of Manager
Have a performance rating of 4 or higher

Using np.where, this looks like:

df['promotion'] = np.where(
    (df['YearsSinceLastPromotion'] >= 3) & 
    (df['JobRole'] == "Manager") & 
    (df['PerformanceRating'] >= 4),
    1, 0
    )

Or using df.loc:

df['promotion'] = 0
df.loc[
    (df['YearsSinceLastPromotion'] >= 3) & 
    (df['JobRole'] == "Manager") & 
    (df['PerformanceRating'] >= 4),
    'promotion'
    ] = 1

While both work, they’re increasingly verbose and harder to read and maintain, especially as the complexity of the conditions grows.

Using `pychemist` for Cleaner Mutation

Instead of repeating df[...] and writing complex boolean logic, we can use a custom accessor for pandas built on top of df.query and df.loc. This method enables readable, query-style conditional logic. We’ll use the pychemist package for this.

Install the package using pip:

pip install pychemist

Then, import the library to register the accessor and perform the conditional mutation:

import pychemist

df = df.chem.mutate('YearsSinceLastPromotion >= 3 & JobRole == "Manager" & PerformanceRating >= 4', 'Promotion', 1, 0)
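Note that the query string needs no parentheses around the individual comparisons. With its default parser, df.query gives & the precedence of and, so the expression is read as three comparisons joined together. The following small check on made-up rows (the toy frame is an assumption for illustration) compares the query string with the equivalent boolean mask:

```python
import pandas as pd

# Four illustrative employees; only the last one meets all three conditions.
toy = pd.DataFrame({
    "YearsSinceLastPromotion": [0, 3, 5, 4],
    "JobRole": ["Manager", "Manager", "Sales Executive", "Manager"],
    "PerformanceRating": [4, 3, 4, 4],
})

via_query = toy.query(
    'YearsSinceLastPromotion >= 3 & JobRole == "Manager" & PerformanceRating >= 4'
)
# In plain pandas, each comparison must be wrapped in parentheses.
via_mask = toy[
    (toy["YearsSinceLastPromotion"] >= 3)
    & (toy["JobRole"] == "Manager")
    & (toy["PerformanceRating"] >= 4)
]
print(via_query.equals(via_mask))
```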

How `chem.mutate` works

The method takes the following arguments:

Query string: A valid expression compatible with df.query
Column name: The name of the column to create or modify
Value if True: The value to assign to rows that match the query
Value if False (optional): The value to assign to rows that don’t match

This syntax is significantly more readable and easier to chain.
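To make the idea of combining df.query and df.loc concrete, here is a rough sketch of how such a method could be written. It is a hypothetical stand-in and not pychemist’s actual implementation: the name mutate_sketch, the choice to leave non-matching rows untouched when no fallback value is given, and the toy data are all assumptions. It also assumes the DataFrame has a unique index.

```python
import pandas as pd

def mutate_sketch(df, query, column, if_true, if_false=None):
    """Hypothetical stand-in for a query-based conditional assignment."""
    out = df.copy()
    # Index labels of the rows that satisfy the query string.
    matched = out.query(query).index
    if if_false is not None:
        # Start from the fallback value for every row.
        out[column] = if_false
    # Overwrite only the matching rows.
    out.loc[matched, column] = if_true
    return out

toy = pd.DataFrame({
    "YearsSinceLastPromotion": [0, 3, 5, 4],
    "JobRole": ["Manager", "Manager", "Sales Executive", "Manager"],
    "PerformanceRating": [4, 3, 4, 4],
})

flagged = mutate_sketch(
    toy,
    'YearsSinceLastPromotion >= 3 & JobRole == "Manager" & PerformanceRating >= 4',
    "Promotion", 1, 0,
)
# Without a fallback value, rows that do not match are left as missing.
partial = mutate_sketch(toy, "PerformanceRating >= 4", "HighRating", "Yes")
```

Returning a modified copy rather than changing the input in place is what makes calls like this easy to chain.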

Advanced Example: Chain Multiple Mutations

You can also chain multiple mutate calls together:

df = (df
    .chem.mutate('YearsSinceLastPromotion >= 3 & JobRole == "Manager" & PerformanceRating >= 4', 'Promotion', 1, 0)
    .chem.mutate('WorkLifeBalance<2 & OverTime=="Yes"', 'HighPressure', 'Yes', 'No')
    )

This generates two new columns:

Promotion: 1 for managers with a high performance rating (4 or higher) who haven’t received a promotion for at least 3 years, else 0
HighPressure: 'Yes' for employees who score low on work-life balance and work overtime, else 'No'

This chaining allows your transformations to remain tidy and declarative.
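For comparison, roughly the same two columns can be produced in plain pandas with assign and np.where. This is a minimal sketch on made-up rows (the toy frame is an assumption for illustration), shown only to contrast the amount of boilerplate:

```python
import numpy as np
import pandas as pd

# Four illustrative employees.
toy = pd.DataFrame({
    "YearsSinceLastPromotion": [0, 3, 5, 4],
    "JobRole": ["Manager", "Manager", "Sales Executive", "Manager"],
    "PerformanceRating": [4, 3, 4, 4],
    "WorkLifeBalance": [1, 3, 1, 2],
    "OverTime": ["Yes", "Yes", "No", "Yes"],
})

result = toy.assign(
    # 1 for long-unpromoted managers with a rating of 4 or higher, else 0.
    Promotion=lambda d: np.where(
        (d["YearsSinceLastPromotion"] >= 3)
        & (d["JobRole"] == "Manager")
        & (d["PerformanceRating"] >= 4),
        1, 0,
    ),
    # 'Yes' for low work-life balance combined with overtime, else 'No'.
    HighPressure=lambda d: np.where(
        (d["WorkLifeBalance"] < 2) & (d["OverTime"] == "Yes"), "Yes", "No"
    ),
)
print(result[["Promotion", "HighPressure"]])
```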

Summary

This tutorial demonstrated how to use the mutate method on pychemist’s .chem accessor to simplify data manipulation in pandas. It allows:

More concise and readable conditional logic
Easy chaining of multiple transformations
Reduced risk of typos and parentheses mismatches

By abstracting away the boilerplate of df.loc and np.where, pychemist helps make your data preparation code more expressive and maintainable.