Mutate
When preparing data for analysis, it is often necessary to create new variables or modify existing values: whether to fix data entry errors, derive variables based on existing ones, or flag subsets of data. In pandas, this is typically done using df.loc
or np.where
, but these methods can lead to verbose, repetitive, and hard-to-read code.
In this tutorial, we introduce a more readable and expressive alternative using a custom pandas accessor provided by the pychemist
library. By leveraging df.query
under the hood, the .chem.mutate
accessor allows you to perform chained, conditional assignments in a cleaner and more maintainable way.
To demonstrate how to modify variables, we’ll use the IBM HR Analytics Employee Attrition & Performance dataset, available on Kaggle:
https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset
We start by importing the required libraries and loading the dataset into a DataFrame:
1 2 3 4 |
import pandas as pd import numpy as np df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv') |
This creates a new DataFrame df
.
We can examine the first five rows of the dataset to get a sense of the variables, their names, and the types of values they contain:
1 |
df.head() |
Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | 2 | Female | 94 | 3 | 2 | Sales Executive | 4 | Single | 5993 | 19479 | 8 | Y | Yes | 11 | 3 | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | 3 | Male | 61 | 2 | 2 | Research Scientist | 2 | Married | 5130 | 24907 | 1 | Y | No | 23 | 4 | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | 4 | Male | 92 | 2 | 1 | Laboratory Technician | 3 | Single | 2090 | 2396 | 6 | Y | Yes | 15 | 3 | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | 4 | Female | 56 | 3 | 1 | Research Scientist | 3 | Married | 2909 | 23159 | 1 | Y | Yes | 11 | 3 | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | 1 | Male | 40 | 3 | 1 | Laboratory Technician | 2 | Married | 3468 | 16632 | 9 | Y | No | 12 | 3 | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
5 rows × 35 columns
Conditional Mutation with Base Pandas
Imagine we want to decide which employees are eligible for a bonus. As an example, we could use Numpy’s where function (np.where
) to flag employees who haven’t been promoted in 3 or more years:
1 |
df['promotion'] = np.where(df['YearsSinceLastPromotion'] >= 3, 1, 0) |
Alternatively, the same logic using df.loc
:
1 2 |
df['promotion'] = 0 df.loc[df['YearsSinceLastPromotion'] >= 3, 'promotion'] = 1 |
Both approaches generate a new column promotion
with a value of 1
for employees who haven’t been promoted in 3 or more years, and 0
otherwise.
These methods work well for simple logic. However, suppose we now want to promote employees who meet all the following conditions:
- Haven’t been promoted in at least 3 years
- Hold the position of Manager
- Have a performance rating of 4 or higher
Using np.where
, this looks like:
1 2 3 4 5 6 |
df['promotion'] = np.where( (df['YearsSinceLastPromotion'] >= 3) & (df['JobRole'] == "Manager") & (df['PerformanceRating'] >= 4), 1, 0 ) |
Or using df.loc
:
1 2 3 4 5 6 7 |
df['promotion'] = 0 df.loc[ (df['YearsSinceLastPromotion'] >= 3) & (df['JobRole'] == "Manager") & (df['PerformanceRating'] >= 4), 'promotion' ] = 1 |
While both work, they’re increasingly verbose and harder to read and maintain, especially as the complexity of the conditions grows.
Using pychemist
for Cleaner Mutation
Instead of repeating df[...]
and writing complex boolean logic, we can use a custom accessor for pandas
built on top of df.query
and df.loc
. This method enables readable, query-style conditional logic. We’ll use the pychemist
package for this.
Install the package using pip:
1 |
pip install pychemist |
Then, import the library to register the accessor and perform the conditional mutation:
1 2 3 |
import pychemist df = df.chem.mutate('YearsSinceLastPromotion >= 3 & JobRole == "Manager" & PerformanceRating >= 4', 'Promotion', 1, 0) |
How chem.mutate
works
The method requires:
- Query string: A valid expression compatible with
df.query
- Column name: The name of the column to create or modify
- Value if True: The value to assign to rows that match the query
- Value if False (optional): The value to assign to rows that don’t match
This syntax is significantly more readable and easier to chain.
Advanced Example: Chain Multiple Mutations
You can also chain multiple mutate
calls together:
1 2 3 4 |
df = (df .chem.mutate('YearsSinceLastPromotion >= 3 & JobRole == "Manager" & PerformanceRating >= 4', 'Promotion', 1, 0) .chem.mutate('WorkLifeBalance<2 & OverTime=="Yes"', 'HighPressure', 'Yes', 'No') ) |
This generates two new columns:
Promotion
:1
for managers with a high performance rating (4 or higher) that haven’t received a promotion for a least 3 years, else0
HighPressure
:'Yes'
for employees who score low on work-life balance and work overtime, else'No'
This chaining allows your transformations to remain tidy and declarative.
Summary
This tutorial demonstrated how to use pychemist
’s custom mutate
accessor to simplify data manipulation in pandas. It allows:
- More concise and readable conditional logic
- Easy chaining of multiple transformations
- Reduced risk of typos and parentheses mismatches
By abstracting away the boilerplate of df.loc
and np.where
, pychemist
helps make your data preparation code more expressive and maintainable.