When preparing data for analysis, it is often necessary to create new variables or modify existing values: whether to fix data entry errors, derive variables based on existing ones, or flag subsets of data. In pandas, this is typically done using df.loc or np.where, but these methods can lead to verbose, repetitive, and hard-to-read code.
In this tutorial, we introduce a more readable and expressive alternative using a custom pandas accessor provided by the pychemist library. By leveraging df.query under the hood, the .chem.mutate accessor allows you to perform chained, conditional assignments in a cleaner and more maintainable way.
To demonstrate how to modify variables, we’ll use the IBM HR Analytics Employee Attrition & Performance dataset, available on Kaggle:
https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset
We start by importing the required libraries and loading the dataset into a DataFrame:
|
1 2 3 4 |
import pandas as pd import numpy as np df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv') |
This creates a new DataFrame df.
We can examine the first five rows of the dataset to get a sense of the variables, their names, and the types of values they contain:
|
1 |
df.head() |
| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | 2 | Female | 94 | 3 | 2 | Sales Executive | 4 | Single | 5993 | 19479 | 8 | Y | Yes | 11 | 3 | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | 3 | Male | 61 | 2 | 2 | Research Scientist | 2 | Married | 5130 | 24907 | 1 | Y | No | 23 | 4 | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | 4 | Male | 92 | 2 | 1 | Laboratory Technician | 3 | Single | 2090 | 2396 | 6 | Y | Yes | 15 | 3 | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | 4 | Female | 56 | 3 | 1 | Research Scientist | 3 | Married | 2909 | 23159 | 1 | Y | Yes | 11 | 3 | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | 1 | Male | 40 | 3 | 1 | Laboratory Technician | 2 | Married | 3468 | 16632 | 9 | Y | No | 12 | 3 | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
5 rows × 35 columns
Conditional Mutation with Base Pandas
Imagine we want to decide which employees are eligible for a bonus. As an example, we could use Numpy’s where function (np.where) to flag employees who haven’t been promoted in 3 or more years:
|
1 |
df['promotion'] = np.where(df['YearsSinceLastPromotion'] >= 3, 1, 0) |
Alternatively, the same logic using df.loc:
|
1 2 |
df['promotion'] = 0 df.loc[df['YearsSinceLastPromotion'] >= 3, 'promotion'] = 1 |
Both approaches generate a new column promotion with a value of 1 for employees who haven’t been promoted in 3 or more years, and 0 otherwise.
These methods work well for simple logic. However, suppose we now want to promote employees who meet all the following conditions:
- Haven’t been promoted in at least 3 years
- Hold the position of Manager
- Have a performance rating of 4 or higher
Using np.where, this looks like:
|
1 2 3 4 5 6 |
df['promotion'] = np.where( (df['YearsSinceLastPromotion'] >= 3) & (df['JobRole'] == "Manager") & (df['PerformanceRating'] >= 4), 1, 0 ) |
Or using df.loc:
|
1 2 3 4 5 6 7 |
df['promotion'] = 0 df.loc[ (df['YearsSinceLastPromotion'] >= 3) & (df['JobRole'] == "Manager") & (df['PerformanceRating'] >= 4), 'promotion' ] = 1 |
While both work, they’re increasingly verbose and harder to read and maintain, especially as the complexity of the conditions grows.
Using pychemist for Cleaner Mutation
Instead of repeating df[...] and writing complex boolean logic, we can use a custom accessor for pandas built on top of df.query and df.loc. This method enables readable, query-style conditional logic. We’ll use the pychemist package for this.
Install the package using pip:
|
1 |
pip install pychemist |
Then, import the library to register the accessor and perform the conditional mutation:
|
1 2 3 |
import pychemist df = df.chem.mutate('YearsSinceLastPromotion >= 3 & JobRole == "Manager" & PerformanceRating >= 4', 'Promotion', 1, 0) |
How chem.mutate works
The method requires:
- Query string: A valid expression compatible with
df.query - Column name: The name of the column to create or modify
- Value if True: The value to assign to rows that match the query
- Value if False (optional): The value to assign to rows that don’t match
This syntax is significantly more readable and easier to chain.
Advanced Example: Chain Multiple Mutations
You can also chain multiple mutate calls together:
|
1 2 3 4 |
df = (df .chem.mutate('YearsSinceLastPromotion >= 3 & JobRole == "Manager" & PerformanceRating >= 4', 'Promotion', 1, 0) .chem.mutate('WorkLifeBalance<2 & OverTime=="Yes"', 'HighPressure', 'Yes', 'No') ) |
This generates two new columns:
Promotion:1for managers with a high performance rating (4 or higher) that haven’t received a promotion for a least 3 years, else0HighPressure:'Yes'for employees who score low on work-life balance and work overtime, else'No'
This chaining allows your transformations to remain tidy and declarative.
Summary
This tutorial demonstrated how to use pychemist’s custom mutate accessor to simplify data manipulation in pandas. It allows:
- More concise and readable conditional logic
- Easy chaining of multiple transformations
- Reduced risk of typos and parentheses mismatches
By abstracting away the boilerplate of df.loc and np.where, pychemist helps make your data preparation code more expressive and maintainable.
