Manual: .chem.lead() Method for Creating Lead Variables

Pychemist — Tue, 29 Jul 2025 07:41:24 +0000

Overview

The .chem.lead() method is a custom pandas accessor that creates lead versions of one or more variables in a DataFrame. It works by shifting the time column backward by the chosen number of periods, so that each row is matched with the value from a future period, and merging the result back onto the original DataFrame.

This is especially useful for panel or time-series data where each row represents an observation at a time point for a particular unit (e.g., experiment, company, individual).

For a concrete example of usage, refer to https://pychemist.com/creating-lag-and-lead-variables/.

Accessor Registration

This method is registered with pandas under the accessor name .chem. Access the method like this:

df.chem.lead(...)
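
For readers unfamiliar with pandas accessors, the sketch below shows how a DataFrame accessor can be registered in general, using the pandas function pd.api.extensions.register_dataframe_accessor. The accessor name "demo", the class DemoAccessor and the method n_rows are hypothetical and chosen only for illustration; this is not pychemist's source code.

```python
import pandas as pd


# Hypothetical accessor name "demo"; pychemist itself uses "chem".
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        # pandas passes in the DataFrame the accessor was called on.
        self._obj = pandas_obj

    def n_rows(self) -> int:
        """Toy method: return the number of rows of the DataFrame."""
        return len(self._obj)


df = pd.DataFrame({"mass": [10, 15, 20]})
row_count = df.demo.n_rows()
print(row_count)
```

Once the module defining the accessor has been imported, every DataFrame exposes it as an attribute, which is why importing pychemist is enough to make df.chem available.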

Method Signature

df.chem.lead(variables, identifier, time, shift=1, *, replace=False)

Parameters

Parameter	Type	Description
variables	`str` or `list of str`	Column(s) for which to create lead versions.
identifier	`str`	The name of the column identifying individual units (e.g., subject ID or group).
time	`str`	The name of the time column. Used to shift values within groups.
shift	`int`, default `1`	Number of time periods to shift. Must be a positive integer.
replace	`bool`, default `False`	If `True`, overwrites existing lead columns. If `False`, raises an error if there’s a naming conflict.

Returns

A modified copy of the original pd.DataFrame, with lead versions of the specified variables added as new columns.

Behavior

Validates parameter types and column existence.
Creates a shifted version of the selected variables by subtracting the shift from the time column.
Merges this lead DataFrame back into the original, using identifier and time as keys.
New columns are suffixed with:
- _lead for shift=1
- _leadN for shift=N (e.g., _lead3 for shift=3)
If replace=False and a target lead column already exists, a ValueError is raised.

Notes

The original DataFrame remains unchanged.
Supports multiple variables and vectorized group-wise operations.
Useful for forecasting models or previewing future values.
For backward-looking operations, see .chem.lag().

Example

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "mass": [10, 15, 20, 5, 10, 15]
})

df2 = df.chem.lead(variables="mass", identifier="id", time="time", shift=1)

This will produce a new column called mass_lead with the mass value from the next time step (grouped by id).

Or more concisely:

df2 = df.chem.lead("mass", "id", "time")

Common Use Cases

Creating future-looking predictors in modeling.
Forecast validation (comparing current and future state).
Detecting upcoming transitions or changes in sequence.

Error Handling

TypeError if variables is not a string or list of strings.
TypeError if replace is not a boolean.
TypeError if shift is not a positive integer.
ValueError if a specified column does not exist.
ValueError if a lead column already exists and replace=False.
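
The checks listed above could be sketched as follows. This is a hypothetical validator written for illustration (the function name validate_lead_args and its error messages are assumptions), not the library's actual implementation.

```python
import pandas as pd


def validate_lead_args(df, variables, identifier, time, shift=1, replace=False):
    """Hypothetical validator mirroring the documented error conditions."""
    if isinstance(variables, str):
        variables = [variables]
    if not (isinstance(variables, list) and all(isinstance(v, str) for v in variables)):
        raise TypeError("variables must be a str or a list of str")
    if not isinstance(replace, bool):
        raise TypeError("replace must be a bool")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(shift, bool) or not isinstance(shift, int) or shift < 1:
        raise TypeError("shift must be a positive integer")
    missing = [c for c in [*variables, identifier, time] if c not in df.columns]
    if missing:
        raise ValueError(f"columns not found: {missing}")
    # Suffix rule from the Behavior section: _lead for 1, _leadN otherwise.
    suffix = "_lead" if shift == 1 else f"_lead{shift}"
    clashes = [v + suffix for v in variables if v + suffix in df.columns]
    if clashes and not replace:
        raise ValueError(f"columns already exist: {clashes}")
    return variables, suffix


frame = pd.DataFrame({"id": [1], "time": [1], "mass": [10]})
checked = validate_lead_args(frame, "mass", "id", "time", shift=3)
print(checked)
```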

Internals

The method:

Copies the relevant subset of the DataFrame.
Shifts the time column backward (df[time] - shift) to align future values.
Applies suffixes such as _lead or _leadN.
Merges the lead values back into the original DataFrame using pd.merge(...).
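
The steps above can be sketched in plain pandas. The function lead_sketch below is an illustrative re-implementation under the assumptions stated in this section (integer time column, left merge on identifier and time); it is not pychemist's source code and omits validation and the replace option.

```python
import pandas as pd


def lead_sketch(df, variables, identifier, time, shift=1):
    """Illustrative lead via time shift and merge (not pychemist's code)."""
    variables = [variables] if isinstance(variables, str) else list(variables)
    suffix = "_lead" if shift == 1 else f"_lead{shift}"
    # Copy only the key columns and the variables to be led.
    future = df[[identifier, time, *variables]].copy()
    # Move each observation `shift` periods back so that it lines up with
    # the earlier row that should receive it as a future value.
    future[time] = future[time] - shift
    future = future.rename(columns={v: v + suffix for v in variables})
    # A left merge keeps every original row; rows without a match get NaN.
    return df.merge(future, on=[identifier, time], how="left")


df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "mass": [10, 15, 20, 5, 10, 15],
})
df2 = lead_sketch(df, "mass", "id", "time")
print(df2)
```

Because the match is made on the time value rather than on row position, the last period of each id has no counterpart and receives NaN, and the input DataFrame is left untouched.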

Manual: .chem.lag() Method for Creating Lagged Variables

Pychemist — Tue, 29 Jul 2025 07:35:39 +0000

Overview

The .chem.lag() method is a custom pandas accessor that creates lagged versions of one or more variables in a DataFrame. It operates by shifting the specified columns by a given number of time periods and merging the result back onto the original DataFrame.

This is especially useful for time-series panel data where each row belongs to a unique unit (e.g., company, experiment, patient) over time.

For a concrete example of usage, refer to https://pychemist.com/creating-lag-and-lead-variables/.

Accessor Registration

This method is registered with pandas under the accessor name .chem. Access the method like this:

df.chem.lag(...)

Method Signature

df.chem.lag(variables, identifier, time, shift=1, *, replace=False)

Parameters

Parameter	Type	Description
variables	`str` or `list of str`	Column(s) for which to create lagged versions.
identifier	`str`	The name of the column identifying individual units (e.g., subject ID or group).
time	`str`	The name of the time column. Used to shift values within groups.
shift	`int`, default `1`	Number of time periods to shift. Positive values create lags. Negative values are not allowed.
replace	`bool`, default `False`	If `True`, overwrites existing lagged columns. If `False`, raises an error if there’s a naming conflict.

Returns

A modified copy of the original pd.DataFrame, with lagged versions of the specified variables added as new columns.

Behavior

Verifies input types and column existence.
Constructs a lagged version of the selected variables by shifting the time column forward by the given shift amount.
Merges this lagged DataFrame back onto the original, based on the identifier and time.
New columns are suffixed with:
- _lag for shift=1
- _lagN for shift=N (e.g., _lag3 for shift=3)
If replace=False and any of the output columns already exist, the method raises a ValueError.

Notes

The original DataFrame remains unchanged.
Supports multiple variables and vectorized operations.
Designed for panel or longitudinal data.
For negative shifts (lead variables), refer to .chem.lead().

Example

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "mass": [10, 15, 20, 5, 10, 15]
})

df_lagged = df.chem.lag(variables="mass", identifier="id", time="time", shift=1)

This will produce a new column called mass_lag with the mass value from the previous time step (by id).

Or more concisely:

df_lagged = df.chem.lag("mass", "id", "time")

Common Use Cases

Creating lagged predictors in time-series regression.
Modeling delayed effects in experiments.
Comparing values across sequential periods.

Error Handling

TypeError if variables is not a string or list of strings.
TypeError if replace is not a boolean.
TypeError if shift is not a positive integer.
ValueError if any specified variable is missing from the DataFrame.
ValueError if a lagged column already exists and replace=False.

Internals

The method:

Creates a shifted copy of the target columns using df[time] + shift.
Applies suffixes like _lag, _lag2, etc.
Merges the lagged columns back onto the original DataFrame using pd.merge(...).
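
A minimal plain-pandas sketch of these steps is given below. The function lag_sketch is an illustrative re-implementation (assuming an integer time column and a left merge on identifier and time), not pychemist's source code; validation and the replace option are left out.

```python
import pandas as pd


def lag_sketch(df, variables, identifier, time, shift=1):
    """Illustrative lag via time shift and merge (not pychemist's code)."""
    variables = [variables] if isinstance(variables, str) else list(variables)
    suffix = "_lag" if shift == 1 else f"_lag{shift}"
    past = df[[identifier, time, *variables]].copy()
    # Move each observation `shift` periods forward so that it lines up
    # with the later row that should receive it as a past value.
    past[time] = past[time] + shift
    past = past.rename(columns={v: v + suffix for v in variables})
    return df.merge(past, on=[identifier, time], how="left")


df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "mass": [10, 15, 20, 5, 10, 15],
})
df_lagged = lag_sketch(df, "mass", "id", "time")
df_lag2 = lag_sketch(df_lagged, "mass", "id", "time", shift=2)
print(df_lag2)
```

With shift=2 only the third period of each id has a value two periods earlier, so mass_lag2 is NaN everywhere else.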

Manual: .chem.mutate() Method for Conditional DataFrame Updates

Pychemist — Mon, 28 Jul 2025 19:13:44 +0000

Overview

The .chem.mutate() method is a custom pandas accessor that enables conditional assignment of values to a column in a DataFrame using a query string. It allows for clean, chainable DataFrame transformations and always returns a modified copy of the DataFrame.

This is particularly useful when working with chemistry-related tabular data and transformations, but it can be applied more broadly.

For a concrete example of usage, refer to https://pychemist.com/mutate/.

Accessor Registration

This method is registered using the pandas API extensions system. This means you can access the method via:

df.chem.mutate(...)

Method Signature

df.chem.mutate(query_str, column, value, other=None)

Parameters

Parameter	Type	Description
query_str	`str`	A pandas query string used to select rows that satisfy the condition.
column	`str`	The column to update or create.
value	`scalar` or `array-like`	The value(s) assigned to rows that meet the query condition.
other	`scalar` or `array-like`, optional	The value(s) assigned to rows not meeting the condition. If `None`, rows not matching the condition are left unchanged.

Returns

A modified copy of the original pd.DataFrame.

This method does not modify the DataFrame in-place.

Behavior

The method evaluates the query_str on the DataFrame.
For rows matching the query, the specified column is set to value.
For rows not matching the query, the column is set to other only if other is provided.
The column is created if it does not already exist.

Notes

This method is part of a custom accessor named .chem.
The original DataFrame is left unchanged.
You can use it in method chains, e.g.:

df2 = df.chem.mutate(query_str="mass > 10", column="label", value="heavy", other="light")

Or simply:

df2 = df.chem.mutate("mass > 10", "label", "heavy", "light")

value and other can be scalars or array-like; array-like inputs must match the number of rows being assigned.

Example

See https://pychemist.com/mutate/ for a practical example using .chem.mutate().

Common Use Cases

Labeling chemical species by some threshold: df.chem.mutate("concentration > 1.0", "status", "high", "low")
Creating a new boolean flag: df.chem.mutate("pH < 7", "is_acidic", True, other=False)

Error Handling

Any syntax errors in query_str will raise a pandas exception at runtime.
If value or other are array-like but do not match the shape of the selected rows, a ValueError will be raised.

Internals

Internally, the method:

Copies the DataFrame (df = self._obj.copy()),
Applies the query string using df.query(query_str),
Locates matching and non-matching row indices,
Assigns value and optionally other to the specified column,
Returns the modified DataFrame.
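
These steps can be approximated in plain pandas as shown below. The function mutate_sketch is an illustrative re-implementation that assumes unique index labels; it is not pychemist's source code.

```python
import pandas as pd


def mutate_sketch(df, query_str, column, value, other=None):
    """Illustrative conditional assignment (not pychemist's code)."""
    out = df.copy()  # work on a copy so the input stays unchanged
    # Boolean mask of the rows selected by the query (assumes a unique index).
    mask = out.index.isin(out.query(query_str).index)
    if other is not None:
        out.loc[~mask, column] = other  # non-matching rows, only if requested
    out.loc[mask, column] = value       # matching rows
    return out


df = pd.DataFrame({"mass": [5, 12, 20]})
df2 = mutate_sketch(df, "mass > 10", "label", "heavy", "light")
# With other=None, rows that do not match keep their existing value.
df3 = mutate_sketch(df2, "mass > 15", "label", "very heavy")
print(df3)
```

The second call illustrates the documented behavior of other=None: only the row with mass 20 is relabeled, while the other two labels are left as they were.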

Creating Lag and Lead Variables

Pychemist — Mon, 28 Jul 2025 18:35:57 +0000

Introduction

In this tutorial, we examine how to create lagged and lead variables: essential tools for time series and panel data analysis. Whether you’re modeling financial trends or running regressions, such transformations are often needed to compute derived variables, such as Return on Assets or Revenue Growth. We’ll also explore how the Pychemist library simplifies this task.

We use pandas and the pychemist library to demonstrate this. For this tutorial we rely on a dataset of financial information for 20 large US companies, scraped from EDGAR. Some missing values in the dataset have been imputed for demonstration purposes.

While pandas lets you create lagged values using .shift(), it doesn’t always behave as expected for grouped data or irregular time series. Pychemist provides a simpler and more reliable alternative through its .chem accessor.

Step 1: Load the Data and Libraries

We start by importing the necessary libraries and loading the dataset. pandas and numpy are used for working with panel datasets. Additionally, we import pychemist to download the financial dataset and to enable the custom pandas extensions (accessors) used in this example.

Note: If you haven’t installed pychemist yet, you can do so with the following command:

pip install pychemist

Import the required libraries and load the dataset into a DataFrame (df):

import pandas as pd
import numpy as np
import pychemist

df = pychemist.load('financials')

Let’s inspect the first few rows:

df.head()

	index	ticker	year	net_income	total_assets	revenue
0	11	AAPL	2020	5.953100e+10	3.385160e+11	NaN
1	12	AAPL	2021	5.525600e+10	3.238880e+11	NaN
2	13	AAPL	2022	5.741100e+10	3.510020e+11	NaN
3	14	AAPL	2023	NaN	3.527550e+11	NaN
4	15	AAPL	2024	9.980300e+10	3.525830e+11	NaN

5 rows × 6 columns

Step 2: Create a Lag Manually

First, we sort the dataset by company (ticker) and time (year). Then, we create a lagged variable total_assets_lag using the built-in pandas .shift() method.

df = df.sort_values(['ticker', 'year'])
df['total_assets_lag'] = df['total_assets'].shift()

Next, we inspect the DataFrame:

df.head(10)

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag
0	11	AAPL	2020	5.953100e+10	3.385160e+11	NaN	NaN
1	12	AAPL	2021	5.525600e+10	3.238880e+11	NaN	3.385160e+11
2	13	AAPL	2022	5.741100e+10	3.510020e+11	NaN	3.238880e+11
3	14	AAPL	2023	NaN	3.527550e+11	NaN	3.510020e+11
4	15	AAPL	2024	9.980300e+10	3.525830e+11	NaN	3.527550e+11
5	27	ADBE	2020	2.591000e+09	2.076200e+10	9.030000e+09	3.525830e+11
6	28	ADBE	2021	2.951000e+09	2.428400e+10	1.117100e+10	2.076200e+10
7	29	ADBE	2022	5.260000e+09	2.724100e+10	1.286800e+10	2.428400e+10
8	30	ADBE	2023	4.822000e+09	2.716500e+10	1.578500e+10	2.724100e+10
9	31	ADBE	2024	4.756000e+09	2.977900e+10	1.760600e+10	2.716500e+10

10 rows × 7 columns

Problem: We see that indeed a lagged version of total assets is added to the DataFrame. However, this method does not take into account whether the previous row belongs to the same company. It simply shifts row-wise, even across companies. We see for example that total_assets_lag for Adobe in 2020 equals Apple’s Total Assets for 2024. Obviously, this is not what we want.

Step 3: Fix Grouping Issues Manually

To address this, we can conditionally shift values only if the previous row belongs to the same ticker:

df = df.sort_values(['ticker', 'year'])
df['total_assets_lag'] = np.where(
    df['ticker'] == df['ticker'].shift(),
    df['total_assets'].shift(),
    np.nan
)

Inspection of the first 10 rows of the DataFrame indicates that this indeed resolved the issue.

df.head(10)

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag
0	11	AAPL	2020	5.953100e+10	3.385160e+11	NaN	NaN
1	12	AAPL	2021	5.525600e+10	3.238880e+11	NaN	3.385160e+11
2	13	AAPL	2022	5.741100e+10	3.510020e+11	NaN	3.238880e+11
3	14	AAPL	2023	NaN	3.527550e+11	NaN	3.510020e+11
4	15	AAPL	2024	9.980300e+10	3.525830e+11	NaN	3.527550e+11
5	27	ADBE	2020	2.591000e+09	2.076200e+10	9.030000e+09	NaN
6	28	ADBE	2021	2.951000e+09	2.428400e+10	1.117100e+10	2.076200e+10
7	29	ADBE	2022	5.260000e+09	2.724100e+10	1.286800e+10	2.428400e+10
8	30	ADBE	2023	4.822000e+09	2.716500e+10	1.578500e+10	2.724100e+10
9	31	ADBE	2024	4.756000e+09	2.977900e+10	1.760600e+10	2.716500e+10

10 rows × 7 columns

However, there are still cases where the calculation did not happen as expected. For example, take a look at the observations for Nvidia and Tesla:

df.query('ticker=="NVDA" or ticker=="TSLA"')

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag
71	209	NVDA	2020	3.047000e+09	1.329200e+10	NaN	NaN
72	210	NVDA	2021	4.141000e+09	1.731500e+10	NaN	1.329200e+10
73	212	NVDA	2023	4.332000e+09	4.418700e+10	1.667500e+10	1.731500e+10
74	213	NVDA	2024	9.752000e+09	4.118200e+10	2.691400e+10	4.418700e+10
75	214	NVDA	2025	4.368000e+09	6.572800e+10	2.697400e+10	4.118200e+10
87	256	TSLA	2020	-9.760000e+08	3.430900e+10	2.146100e+10	NaN
88	258	TSLA	2022	7.210000e+08	6.213100e+10	3.153600e+10	3.430900e+10
89	259	TSLA	2023	5.519000e+09	8.233800e+10	5.382300e+10	6.213100e+10
90	260	TSLA	2024	1.255600e+10	1.066180e+11	8.146200e+10	8.233800e+10

9 rows × 7 columns

Even when years are missing, the current logic carries forward the last available value, which may be from two or more years ago. As a result of data for NVIDIA for 2022 being missing, the lagged value total_assets_lag for 2023 is now incorrectly set equal to the total assets for 2021. Similarly, the lagged total assets for Tesla for 2022 are incorrectly set equal to the value for 2020.

Step 4: Handle Missing Years

Let’s refine the condition further. Our condition should verify that the previous row belongs to the same ticker and that the year increment equals one:

df = df.sort_values(['ticker', 'year'])
df['total_assets_lag'] = np.where(
    (df['ticker'] == df['ticker'].shift()) & 
    (df['year'] - df['year'].shift() == 1),
    df['total_assets'].shift(),
    np.nan
)

Let’s inspect the DataFrame again:

df.query('ticker=="NVDA" or ticker=="TSLA"')

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag
71	209	NVDA	2020	3.047000e+09	1.329200e+10	NaN	NaN
72	210	NVDA	2021	4.141000e+09	1.731500e+10	NaN	1.329200e+10
73	212	NVDA	2023	4.332000e+09	4.418700e+10	1.667500e+10	NaN
74	213	NVDA	2024	9.752000e+09	4.118200e+10	2.691400e+10	4.418700e+10
75	214	NVDA	2025	4.368000e+09	6.572800e+10	2.697400e+10	4.118200e+10
87	256	TSLA	2020	-9.760000e+08	3.430900e+10	2.146100e+10	NaN
88	258	TSLA	2022	7.210000e+08	6.213100e+10	3.153600e+10	NaN
89	259	TSLA	2023	5.519000e+09	8.233800e+10	5.382300e+10	6.213100e+10
90	260	TSLA	2024	1.255600e+10	1.066180e+11	8.146200e+10	8.233800e+10

9 rows × 7 columns

This approach works, but the code becomes verbose and difficult to scale. If we want a 2-year lag, the condition becomes even more complex:

df = df.sort_values(['ticker', 'year'])
df['total_assets_lag2'] = np.where(
    (df['ticker'] == df['ticker'].shift(2)) & 
    (df['year'] - df['year'].shift(2) == 2),
    df['total_assets'].shift(2),
    np.nan
)

Step 5: Let Pychemist Do the Work

To simplify and generalize this process, we can use Pychemist’s built-in lag function.

Note: this requires installing and importing the Pychemist library as done at the start of this tutorial.

We can now generate a lagged variable as follows:

df = df.chem.lag('total_assets', 'ticker', 'year')

This automatically handles company grouping and time consistency: chem.lag only assigns a lag value when the grouping variable (e.g., ticker) matches and the year differs by exactly the lag interval. If a year is missing in the dataset, the function does not carry over data from a non-consecutive year and returns NaN instead. If the lagged column already exists, it is not overwritten; according to the method's manual, a ValueError is raised unless replace=True is passed, which prevents accidental data loss.
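
To make the difference between a row-based and a time-based lag concrete, here is a small sketch on made-up data (the ticker "X" and its values are invented for illustration). It contrasts pandas groupby(...).shift() with a merge on a shifted year, which is the approach the manual describes for chem.lag; it does not call pychemist itself.

```python
import pandas as pd

toy = pd.DataFrame({
    "ticker": ["X", "X", "X"],
    "year": [2020, 2021, 2023],          # 2022 is missing on purpose
    "total_assets": [100.0, 110.0, 130.0],
})

# Row-based lag within each ticker: takes the previous row, ignoring the gap.
row_based = toy.groupby("ticker")["total_assets"].shift(1)

# Time-based lag: add 1 to the year, then merge back on (ticker, year).
prev = toy.assign(year=toy["year"] + 1).rename(
    columns={"total_assets": "total_assets_lag"}
)
time_based = toy.merge(prev, on=["ticker", "year"], how="left")["total_assets_lag"]

print(pd.DataFrame({"year": toy["year"], "row_based": row_based, "time_based": time_based}))
```

For 2023 the row-based lag returns the 2021 value, while the time-based lag is NaN because no 2022 observation exists.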

Step 6: Lag Multiple Columns

To create multiple lagged variables at once, we can pass a list of the variables for which lagged versions should be computed. We set replace=True because the DataFrame already contains the lagged variable for total_assets. Without this argument the function would refuse to overwrite the existing column (the manual documents a ValueError in this case). Alternatively, we could drop this column manually.

df = df.chem.lag(['total_assets', 'revenue'], 'ticker', 'year', replace=True)

We can see that the results are as expected:

df.query('ticker=="NVDA" or ticker=="TSLA"')

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag	revenue_lag
71	209	NVDA	2020	3.047000e+09	1.329200e+10	NaN	NaN	NaN
72	210	NVDA	2021	4.141000e+09	1.731500e+10	NaN	1.329200e+10	NaN
73	212	NVDA	2023	4.332000e+09	4.418700e+10	1.667500e+10	NaN	NaN
74	213	NVDA	2024	9.752000e+09	4.118200e+10	2.691400e+10	4.418700e+10	1.667500e+10
75	214	NVDA	2025	4.368000e+09	6.572800e+10	2.697400e+10	4.118200e+10	2.691400e+10
87	256	TSLA	2020	-9.760000e+08	3.430900e+10	2.146100e+10	NaN	NaN
88	258	TSLA	2022	7.210000e+08	6.213100e+10	3.153600e+10	NaN	NaN
89	259	TSLA	2023	5.519000e+09	8.233800e+10	5.382300e+10	6.213100e+10	3.153600e+10
90	260	TSLA	2024	1.255600e+10	1.066180e+11	8.146200e+10	8.233800e+10	5.382300e+10

9 rows × 8 columns

To generate 2-year lags (or more), simply pass a ‘shift’ parameter:

df = df.chem.lag(['total_assets', 'revenue'], 'ticker', 'year', 2)

This results in the following DataFrame:

df.head()

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag	revenue_lag	total_assets_lag2	revenue_lag2
0	11	AAPL	2020	5.953100e+10	3.385160e+11	NaN	NaN	NaN	NaN	NaN
1	12	AAPL	2021	5.525600e+10	3.238880e+11	NaN	3.385160e+11	NaN	NaN	NaN
2	13	AAPL	2022	5.741100e+10	3.510020e+11	NaN	3.238880e+11	NaN	3.385160e+11	NaN
3	14	AAPL	2023	NaN	3.527550e+11	NaN	3.510020e+11	NaN	3.238880e+11	NaN
4	15	AAPL	2024	9.980300e+10	3.525830e+11	NaN	3.527550e+11	NaN	3.510020e+11	NaN

5 rows × 10 columns

Step 7: Create Lead Variables

Creating lead variables (i.e., future values) is just as easy. Simply use .chem.lead instead:

df = df.chem.lead(['total_assets', 'revenue'], 'ticker', 'year')

The DataFrame would look as follows:

df.head()

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag	revenue_lag	total_assets_lag2	revenue_lag2	total_assets_lead	revenue_lead
0	11	AAPL	2020	5.953100e+10	3.385160e+11	NaN	NaN	NaN	NaN	NaN	3.238880e+11	NaN
1	12	AAPL	2021	5.525600e+10	3.238880e+11	NaN	3.385160e+11	NaN	NaN	NaN	3.510020e+11	NaN
2	13	AAPL	2022	5.741100e+10	3.510020e+11	NaN	3.238880e+11	NaN	3.385160e+11	NaN	3.527550e+11	NaN
3	14	AAPL	2023	NaN	3.527550e+11	NaN	3.510020e+11	NaN	3.238880e+11	NaN	3.525830e+11	NaN
4	15	AAPL	2024	9.980300e+10	3.525830e+11	NaN	3.527550e+11	NaN	3.510020e+11	NaN	NaN	NaN

5 rows × 12 columns

Step 8: Compute Derived Metrics

Now that we have lags, we can compute variables such as Return on Assets (ROA) or Revenue Growth:

df=df.eval("""
    roa=net_income / ((total_assets+total_assets_lag)/2)
    growth = (revenue-revenue_lag) / revenue_lag
    """)

Inspect the results:

df.query('ticker=="NVDA" or ticker=="TSLA"')

	index	ticker	year	net_income	total_assets	revenue	total_assets_lag	revenue_lag	total_assets_lag2	revenue_lag2	total_assets_lead	revenue_lead	roa	growth
71	209	NVDA	2020	3.047000e+09	1.329200e+10	NaN	NaN	NaN	NaN	NaN	1.731500e+10	NaN	NaN	NaN
72	210	NVDA	2021	4.141000e+09	1.731500e+10	NaN	1.329200e+10	NaN	NaN	NaN	NaN	NaN	0.270592	NaN
73	212	NVDA	2023	4.332000e+09	4.418700e+10	1.667500e+10	NaN	NaN	1.731500e+10	NaN	4.118200e+10	2.691400e+10	NaN	NaN
74	213	NVDA	2024	9.752000e+09	4.118200e+10	2.691400e+10	4.418700e+10	1.667500e+10	NaN	NaN	6.572800e+10	2.697400e+10	0.228467	0.614033
75	214	NVDA	2025	4.368000e+09	6.572800e+10	2.697400e+10	4.118200e+10	2.691400e+10	4.418700e+10	1.667500e+10	NaN	NaN	0.081714	0.002229
87	256	TSLA	2020	-9.760000e+08	3.430900e+10	2.146100e+10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
88	258	TSLA	2022	7.210000e+08	6.213100e+10	3.153600e+10	NaN	NaN	3.430900e+10	2.146100e+10	8.233800e+10	5.382300e+10	NaN	NaN
89	259	TSLA	2023	5.519000e+09	8.233800e+10	5.382300e+10	6.213100e+10	3.153600e+10	NaN	NaN	1.066180e+11	8.146200e+10	0.076404	0.706716
90	260	TSLA	2024	1.255600e+10	1.066180e+11	8.146200e+10	8.233800e+10	5.382300e+10	6.213100e+10	3.153600e+10	NaN	NaN	0.132899	0.513517

9 rows × 14 columns
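
As a sanity check on the two formulas, the values for NVDA can be recomputed by hand from the figures printed in the table above (ROA for 2021, growth for 2024). The variable names below are chosen for this illustration only.

```python
# NVDA 2021: inputs copied from the table above.
net_income = 4.141e9
total_assets = 1.7315e10
total_assets_lag = 1.3292e10
# ROA = net income divided by average total assets.
roa = net_income / ((total_assets + total_assets_lag) / 2)

# NVDA 2024: inputs copied from the table above.
revenue = 2.6914e10
revenue_lag = 1.6675e10
# Growth = relative change in revenue versus the previous year.
growth = (revenue - revenue_lag) / revenue_lag

print(round(roa, 6), round(growth, 6))
```

Both results should agree with the roa and growth columns of the corresponding rows, up to the rounding used in the display.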

Conclusion

In this tutorial, we learned how to create lagged and lead variables manually using pandas, and why that approach often falls short, especially in grouped, irregular panel data.

Then, we saw how the Pychemist library solves these problems elegantly:

Accurate grouping
Handles gaps in time
Clean, concise API
Easy multi-period shifts
Works seamlessly with pandas

Whether you’re building financial models or prepping data for regression analysis, Pychemist streamlines your workflow and eliminates common mistakes.

Mutate

Pychemist — Sun, 20 Jul 2025 21:38:18 +0000

When preparing data for analysis, it is often necessary to create new variables or modify existing values: whether to fix data entry errors, derive variables based on existing ones, or flag subsets of data. In pandas, this is typically done using df.loc or np.where, but these methods can lead to verbose, repetitive, and hard-to-read code.

In this tutorial, we introduce a more readable and expressive alternative using a custom pandas accessor provided by the pychemist library. By leveraging df.query under the hood, the .chem.mutate accessor allows you to perform chained, conditional assignments in a cleaner and more maintainable way.

To demonstrate how to modify variables, we’ll use the IBM HR Analytics Employee Attrition & Performance dataset, available on Kaggle:
https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset

We start by importing the required libraries and loading the dataset into a DataFrame:

import pandas as pd
import numpy as np

df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')

This creates a new DataFrame df.

We can examine the first five rows of the dataset to get a sense of the variables, their names, and the types of values they contain:

df.head()

	Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	EnvironmentSatisfaction	Gender	HourlyRate	JobInvolvement	JobLevel	JobRole	JobSatisfaction	MaritalStatus	MonthlyIncome	MonthlyRate	NumCompaniesWorked	Over18	OverTime	PercentSalaryHike	PerformanceRating	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
0	41	Yes	Travel_Rarely	1102	Sales	1	2	Life Sciences	1	1	2	Female	94	3	2	Sales Executive	4	Single	5993	19479	8	Y	Yes	11	3	1	80	0	8	0	1	6	4	0	5
1	49	No	Travel_Frequently	279	Research & Development	8	1	Life Sciences	1	2	3	Male	61	2	2	Research Scientist	2	Married	5130	24907	1	Y	No	23	4	4	80	1	10	3	3	10	7	1	7
2	37	Yes	Travel_Rarely	1373	Research & Development	2	2	Other	1	4	4	Male	92	2	1	Laboratory Technician	3	Single	2090	2396	6	Y	Yes	15	3	2	80	0	7	3	3	0	0	0	0
3	33	No	Travel_Frequently	1392	Research & Development	3	4	Life Sciences	1	5	4	Female	56	3	1	Research Scientist	3	Married	2909	23159	1	Y	Yes	11	3	3	80	0	8	3	3	8	7	3	0
4	27	No	Travel_Rarely	591	Research & Development	2	1	Medical	1	7	1	Male	40	3	1	Laboratory Technician	2	Married	3468	16632	9	Y	No	12	3	4	80	1	6	3	3	2	2	2	2

5 rows × 35 columns

Conditional Mutation with Base Pandas

Imagine we want to decide which employees are eligible for a bonus. As an example, we could use NumPy’s where function (np.where) to flag employees who haven’t been promoted in 3 or more years:

df['promotion'] = np.where(df['YearsSinceLastPromotion'] >= 3, 1, 0)

Alternatively, the same logic using df.loc:

df['promotion'] = 0
df.loc[df['YearsSinceLastPromotion'] >= 3, 'promotion'] = 1

Both approaches generate a new column promotion with a value of 1 for employees who haven’t been promoted in 3 or more years, and 0 otherwise.

These methods work well for simple logic. However, suppose we now want to promote employees who meet all the following conditions:

Haven’t been promoted in at least 3 years
Hold the position of Manager
Have a performance rating of 4 or higher

Using np.where, this looks like:

df['promotion'] = np.where(
    (df['YearsSinceLastPromotion'] >= 3) & 
    (df['JobRole'] == "Manager") & 
    (df['PerformanceRating'] >= 4),
    1, 0
    )

Or using df.loc:

df['promotion'] = 0
df.loc[
    (df['YearsSinceLastPromotion'] >= 3) & 
    (df['JobRole'] == "Manager") & 
    (df['PerformanceRating'] >= 4),
    'promotion'
    ] = 1

While both work, they’re increasingly verbose and harder to read and maintain, especially as the complexity of the conditions grows.

Using `pychemist` for Cleaner Mutation

Instead of repeating df[...] and writing complex boolean logic, we can use a custom accessor for pandas built on top of df.query and df.loc. This method enables readable, query-style conditional logic. We’ll use the pychemist package for this.

Install the package using pip:

pip install pychemist

Then, import the library to register the accessor and perform the conditional mutation:

import pychemist

df = df.chem.mutate('YearsSinceLastPromotion >= 3 & JobRole == "Manager" & PerformanceRating >= 4', 'Promotion', 1, 0)

How `chem.mutate` works

The method requires:

Query string: A valid expression compatible with df.query
Column name: The name of the column to create or modify
Value if True: The value to assign to rows that match the query
Value if False (optional): The value to assign to rows that don’t match

This syntax is significantly more readable and easier to chain.
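
To see that the query string selects the same rows as the explicit boolean mask used earlier, the sketch below compares both on a small made-up table (the staff data are invented for illustration). It uses only pandas, relying on the fact that inside df.query the & operator binds more loosely than comparisons, so no parentheses are needed.

```python
import pandas as pd

staff = pd.DataFrame({
    "YearsSinceLastPromotion": [4, 1, 5, 3],
    "JobRole": ["Manager", "Manager", "Sales Executive", "Manager"],
    "PerformanceRating": [4, 4, 4, 3],
})

query_str = 'YearsSinceLastPromotion >= 3 & JobRole == "Manager" & PerformanceRating >= 4'
by_query = staff.query(query_str).index

# Equivalent boolean mask; here every comparison needs its own parentheses.
mask = (
    (staff["YearsSinceLastPromotion"] >= 3)
    & (staff["JobRole"] == "Manager")
    & (staff["PerformanceRating"] >= 4)
)
by_mask = staff[mask].index

print(list(by_query), list(by_mask))
```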

Advanced Example: Chain Multiple Mutations

You can also chain multiple mutate calls together:

df = (df
    .chem.mutate('YearsSinceLastPromotion >= 3 & JobRole == "Manager" & PerformanceRating >= 4', 'Promotion', 1, 0)
    .chem.mutate('WorkLifeBalance<2 & OverTime=="Yes"', 'HighPressure', 'Yes', 'No')
    )

This generates two new columns:

Promotion: 1 for managers with a high performance rating (4 or higher) who haven’t received a promotion for at least 3 years, else 0
HighPressure: 'Yes' for employees who score low on work-life balance and work overtime, else 'No'

This chaining allows your transformations to remain tidy and declarative.

Summary

This tutorial demonstrated how to use pychemist’s custom mutate accessor to simplify data manipulation in pandas. It allows:

More concise and readable conditional logic
Easy chaining of multiple transformations
Reduced risk of typos and parentheses mismatches

By abstracting away the boilerplate of df.loc and np.where, pychemist helps make your data preparation code more expressive and maintainable.
