Mutate

When preparing data for analysis, it is often necessary to create new variables or modify existing values: whether to fix data entry errors, derive variables based on existing ones, or flag subsets of data. In pandas, this is typically done using df.loc or np.where, but these methods can lead to verbose, repetitive, and hard-to-read code.

In this tutorial, we introduce a more readable and expressive alternative using a custom pandas accessor provided by the pychemist library. By leveraging df.query under the hood, the .chem.mutate accessor allows you to perform chained, conditional assignments in a cleaner and more maintainable way.

To demonstrate how to modify variables, we’ll use the IBM HR Analytics Employee Attrition & Performance dataset, available on Kaggle:
https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset

We start by importing the required libraries and loading the dataset into a DataFrame:

This creates a new DataFrame df.

We can examine the first five rows of the dataset to get a sense of the variables, their names, and the types of values they contain:

AgeAttritionBusinessTravelDailyRateDepartmentDistanceFromHomeEducationEducationFieldEmployeeCountEmployeeNumberEnvironmentSatisfactionGenderHourlyRateJobInvolvementJobLevelJobRoleJobSatisfactionMaritalStatusMonthlyIncomeMonthlyRateNumCompaniesWorkedOver18OverTimePercentSalaryHikePerformanceRatingRelationshipSatisfactionStandardHoursStockOptionLevelTotalWorkingYearsTrainingTimesLastYearWorkLifeBalanceYearsAtCompanyYearsInCurrentRoleYearsSinceLastPromotionYearsWithCurrManager
041YesTravel_Rarely1102Sales12Life Sciences112Female9432Sales Executive4Single5993194798YYes11318008016405
149NoTravel_Frequently279Research & Development81Life Sciences123Male6122Research Scientist2Married5130249071YNo2344801103310717
237YesTravel_Rarely1373Research & Development22Other144Male9221Laboratory Technician3Single209023966YYes15328007330000
333NoTravel_Frequently1392Research & Development34Life Sciences154Female5631Research Scientist3Married2909231591YYes11338008338730
427NoTravel_Rarely591Research & Development21Medical171Male4031Laboratory Technician2Married3468166329YNo12348016332222

5 rows × 35 columns


Conditional Mutation with Base Pandas

Imagine we want to decide which employees are eligible for a bonus. As an example, we could use Numpy’s where function (np.where) to flag employees who haven’t been promoted in 3 or more years:

Alternatively, the same logic using df.loc:

Both approaches generate a new column promotion with a value of 1 for employees who haven’t been promoted in 3 or more years, and 0 otherwise.

These methods work well for simple logic. However, suppose we now want to promote employees who meet all the following conditions:

  • Haven’t been promoted in at least 3 years
  • Hold the position of Manager
  • Have a performance rating of 4 or higher

Using np.where, this looks like:

Or using df.loc:

While both work, they’re increasingly verbose and harder to read and maintain, especially as the complexity of the conditions grows.


Using pychemist for Cleaner Mutation

Instead of repeating df[...] and writing complex boolean logic, we can use a custom accessor for pandas built on top of df.query and df.loc. This method enables readable, query-style conditional logic. We’ll use the pychemist package for this.

Install the package using pip:

Then, import the library to register the accessor and perform the conditional mutation:

How chem.mutate works

The method requires:

  • Query string: A valid expression compatible with df.query
  • Column name: The name of the column to create or modify
  • Value if True: The value to assign to rows that match the query
  • Value if False (optional): The value to assign to rows that don’t match

This syntax is significantly more readable and easier to chain.


Advanced Example: Chain Multiple Mutations

You can also chain multiple mutate calls together:

This generates two new columns:

  • Promotion: 1 for managers with a high performance rating (4 or higher) that haven’t received a promotion for a least 3 years, else 0
  • HighPressure: 'Yes' for employees who score low on work-life balance and work overtime, else 'No'

This chaining allows your transformations to remain tidy and declarative.


Summary

This tutorial demonstrated how to use pychemist’s custom mutate accessor to simplify data manipulation in pandas. It allows:

  • More concise and readable conditional logic
  • Easy chaining of multiple transformations
  • Reduced risk of typos and parentheses mismatches

By abstracting away the boilerplate of df.loc and np.where, pychemist helps make your data preparation code more expressive and maintainable.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top