Mutate

When preparing data for analysis, it is often necessary to create new variables or modify existing values: whether to fix data entry errors, derive variables based on existing ones, or flag subsets of data. In pandas, this is typically done using df.loc or np.where, but these methods can lead to verbose, repetitive, and hard-to-read code.

In this tutorial, we introduce a more readable and expressive alternative using a custom pandas accessor provided by the pychemist library. By leveraging df.query under the hood, the .chem.mutate accessor allows you to perform chained, conditional assignments in a cleaner and more maintainable way.

To demonstrate how to modify variables, we’ll use the IBM HR Analytics Employee Attrition & Performance dataset, available on Kaggle:
https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset

We start by importing the required libraries and loading the dataset into a DataFrame:

import pandas as pd

import numpy as np

df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')

This creates a new DataFrame df.

We can examine the first five rows of the dataset to get a sense of the variables, their names, and the types of values they contain:

df.head()

	Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	EnvironmentSatisfaction	Gender	HourlyRate	JobInvolvement	JobLevel	JobRole	JobSatisfaction	MaritalStatus	MonthlyIncome	MonthlyRate	NumCompaniesWorked	Over18	OverTime	PercentSalaryHike	PerformanceRating	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
0	41	Yes	Travel_Rarely	1102	Sales	1	2	Life Sciences	1	1	2	Female	94	3	2	Sales Executive	4	Single	5993	19479	8	Y	Yes	11	3	1	80	0	8	0	1	6	4	0	5
1	49	No	Travel_Frequently	279	Research & Development	8	1	Life Sciences	1	2	3	Male	61	2	2	Research Scientist	2	Married	5130	24907	1	Y	No	23	4	4	80	1	10	3	3	10	7	1	7
2	37	Yes	Travel_Rarely	1373	Research & Development	2	2	Other	1	4	4	Male	92	2	1	Laboratory Technician	3	Single	2090	2396	6	Y	Yes	15	3	2	80	0	7	3	3	0	0	0	0
3	33	No	Travel_Frequently	1392	Research & Development	3	4	Life Sciences	1	5	4	Female	56	3	1	Research Scientist	3	Married	2909	23159	1	Y	Yes	11	3	3	80	0	8	3	3	8	7	3	0
4	27	No	Travel_Rarely	591	Research & Development	2	1	Medical	1	7	1	Male	40	3	1	Laboratory Technician	2	Married	3468	16632	9	Y	No	12	3	4	80	1	6	3	3	2	2	2	2

5 rows × 35 columns

Conditional Mutation with Base Pandas

Imagine we want to decide which employees are eligible for a bonus. As an example, we could use Numpy’s where function (np.where) to flag employees who haven’t been promoted in 3 or more years:

1	df['promotion'] = np.where(df['YearsSinceLastPromotion'] >= 3, 1, 0)

Alternatively, the same logic using df.loc:

1 2	df['promotion'] = 0 df.loc[df['YearsSinceLastPromotion'] >= 3, 'promotion'] = 1

Both approaches generate a new column promotion with a value of 1 for employees who haven’t been promoted in 3 or more years, and 0 otherwise.

These methods work well for simple logic. However, suppose we now want to promote employees who meet all the following conditions:

Haven’t been promoted in at least 3 years
Hold the position of Manager
Have a performance rating of 4 or higher

Using np.where, this looks like:

df['promotion'] = np.where(

(df['YearsSinceLastPromotion'] >= 3) &

(df['JobRole'] == "Manager") &

(df['PerformanceRating'] >= 4),

1, 0

)

Or using df.loc:

df['promotion'] = 0

df.loc[

(df['YearsSinceLastPromotion'] >= 3) &

(df['JobRole'] == "Manager") &

(df['PerformanceRating'] >= 4),

'promotion'

] = 1

While both work, they’re increasingly verbose and harder to read and maintain, especially as the complexity of the conditions grows.

Using `pychemist` for Cleaner Mutation

Instead of repeating df[...] and writing complex boolean logic, we can use a custom accessor for pandas built on top of df.query and df.loc. This method enables readable, query-style conditional logic. We’ll use the pychemist package for this.

Install the package using pip:

1	pip install pychemist

Then, import the library to register the accessor and perform the conditional mutation:

import pychemist

df = df.chem.mutate('YearsSinceLastPromotion >= 3 & JobRole == "Manager" & PerformanceRating >= 4', 'Promotion', 1, 0)

How `chem.mutate` works

The method requires:

Query string: A valid expression compatible with df.query
Column name: The name of the column to create or modify
Value if True: The value to assign to rows that match the query
Value if False (optional): The value to assign to rows that don’t match

This syntax is significantly more readable and easier to chain.

Advanced Example: Chain Multiple Mutations

You can also chain multiple mutate calls together:

df = (df

.chem.mutate('YearsSinceLastPromotion >= 3 & JobRole == "Manager" & PerformanceRating >= 4', 'Promotion', 1, 0)

.chem.mutate('WorkLifeBalance<2 & OverTime=="Yes"', 'HighPressure', 'Yes', 'No')

)

This generates two new columns:

Promotion: 1 for managers with a high performance rating (4 or higher) that haven’t received a promotion for a least 3 years, else 0
HighPressure: 'Yes' for employees who score low on work-life balance and work overtime, else 'No'

This chaining allows your transformations to remain tidy and declarative.

Summary

This tutorial demonstrated how to use pychemist’s custom mutate accessor to simplify data manipulation in pandas. It allows:

More concise and readable conditional logic
Easy chaining of multiple transformations
Reduced risk of typos and parentheses mismatches

By abstracting away the boilerplate of df.loc and np.where, pychemist helps make your data preparation code more expressive and maintainable.

Conditional Mutation with Base Pandas

Using pychemist for Cleaner Mutation

How chem.mutate works

Advanced Example: Chain Multiple Mutations

Summary

Leave a Comment Cancel Reply

Using `pychemist` for Cleaner Mutation

How `chem.mutate` works