Manuals

Manual: .chem.lead() Method for Creating Lead Variables

Pychemist — Tue, 29 Jul 2025 07:41:24 +0000

Overview

The .chem.lead() method is a custom pandas accessor that creates lead versions of one or more variables in a DataFrame. It works by moving the time index of the specified columns back by the requested number of periods, so that each row receives the value observed in a later period, and then merging the result back onto the original DataFrame.

This is especially useful for panel or time-series data where each row represents an observation at a time point for a particular unit (e.g., experiment, company, individual).

For a concrete example of usage, refer to https://pychemist.com/creating-lag-and-lead-variables/.

Accessor Registration

This method is registered with pandas under the accessor name .chem. Access it like this:

df.chem.lead(...)
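
For readers unfamiliar with pandas accessors, the following is a minimal sketch of how a DataFrame accessor can be registered with pandas' extension API. The accessor name chem_demo, the class ChemDemoAccessor, and the method n_rows are hypothetical and exist only for illustration; this is not the package's actual registration code.

```python
import pandas as pd


# "chem_demo" is a hypothetical name, chosen so this sketch does not
# clash with the real .chem accessor.
@pd.api.extensions.register_dataframe_accessor("chem_demo")
class ChemDemoAccessor:
    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        # pandas passes in the DataFrame the accessor was called on
        self._obj = pandas_obj

    def n_rows(self) -> int:
        """Hypothetical method: return the number of rows."""
        return len(self._obj)


df = pd.DataFrame({"mass": [10, 15, 20]})
print(df.chem_demo.n_rows())
```

Once a class is registered this way, every DataFrame exposes it as an attribute, which is what makes calls such as df.chem.lead(...) possible.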

Method Signature

df.chem.lead(variables, identifier, time, shift=1, *, replace=False)

Parameters

Parameter	Type	Description
variables	`str` or `list of str`	Column(s) for which to create lead versions.
identifier	`str`	The name of the column identifying individual units (e.g., subject ID or group).
time	`str`	The name of the time column. Used to shift values within groups.
shift	`int`, default `1`	Number of time periods to shift. Must be a positive integer.
replace	`bool`, default `False`	If `True`, overwrites existing lead columns. If `False`, raises an error if there’s a naming conflict.

Returns

A modified copy of the original pd.DataFrame, with lead versions of the specified variables added as new columns.

Behavior

Validates parameter types and column existence.
Creates a shifted version of the selected variables by subtracting the shift from the time column.
Merges this lead DataFrame back into the original, using identifier and time as keys.
New columns are suffixed with:
- _lead for shift=1
- _leadN for shift=N (e.g., _lead3 for shift=3)
If replace=False and a target lead column already exists, a ValueError is raised.

Notes

The original DataFrame remains unchanged.
Supports multiple variables and vectorized group-wise operations.
Useful for forecasting models or previewing future values.
For backward-looking operations, see .chem.lag().

Example

import pandas as pd  # the .chem accessor must also be registered before this runs

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "mass": [10, 15, 20, 5, 10, 15]
})

df2 = df.chem.lead(variables="mass", identifier="id", time="time", shift=1)

This will produce a new column called mass_lead with the mass value from the next time step (grouped by id).

Or more concisely:

df2 = df.chem.lead("mass", "id", "time")

Common Use Cases

Creating future-looking predictors in modeling.
Forecast validation (comparing current and future state).
Detecting upcoming transitions or changes in sequence.

Error Handling

TypeError if variables is not a string or list of strings.
TypeError if replace is not a boolean.
TypeError if shift is not a positive integer.
ValueError if a specified column does not exist.
ValueError if a lead column already exists and replace=False.
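
The checks listed above could be written roughly as follows. This is a sketch under stated assumptions: validate_lead_args is a hypothetical helper, the error messages are invented, and the order of the checks in the real method may differ.

```python
import pandas as pd


def validate_lead_args(df, variables, identifier, time, shift=1, replace=False):
    """Hypothetical helper mirroring the documented checks; not the library's code."""
    if isinstance(variables, str):
        variables = [variables]
    if not isinstance(variables, list) or not all(isinstance(v, str) for v in variables):
        raise TypeError("variables must be a string or a list of strings")
    if not isinstance(replace, bool):
        raise TypeError("replace must be a boolean")
    # bool is a subclass of int, so it is excluded explicitly
    if isinstance(shift, bool) or not isinstance(shift, int) or shift < 1:
        raise TypeError("shift must be a positive integer")
    missing = [c for c in [identifier, time, *variables] if c not in df.columns]
    if missing:
        raise ValueError(f"columns not found: {missing}")
    # Suffix rule from the Behavior section: _lead for 1, _leadN otherwise
    suffix = "_lead" if shift == 1 else f"_lead{shift}"
    clashes = [v + suffix for v in variables if v + suffix in df.columns]
    if clashes and not replace:
        raise ValueError(f"columns already exist: {clashes}")
    return variables, suffix


df = pd.DataFrame({"id": [1, 1], "time": [1, 2], "mass": [10, 15]})
print(validate_lead_args(df, "mass", "id", "time", shift=3))
```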

Internals

The method:

Copies the relevant subset of the DataFrame.
Shifts the time column backward (df[time] - shift) to align future values.
Applies suffixes such as _lead or _leadN.
Merges the lead values back into the original DataFrame using pd.merge(...).
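
The steps above can be sketched in plain pandas. The function lead_sketch is a hypothetical stand-in, not the library's implementation; in particular, the left join (how="left") is an assumption, and validation and the replace option are omitted.

```python
import pandas as pd


def lead_sketch(df, variables, identifier, time, shift=1):
    """Hypothetical re-creation of the documented shift-and-merge steps."""
    if isinstance(variables, str):
        variables = [variables]
    suffix = "_lead" if shift == 1 else f"_lead{shift}"

    # Copy only the key columns and the variables to be led
    shifted = df[[identifier, time] + variables].copy()
    # Move each observation back in time so it lines up with an earlier row
    shifted[time] = shifted[time] - shift
    shifted = shifted.rename(columns={v: v + suffix for v in variables})

    # Assumption: a left join keeps every original row; rows with no
    # later observation get NaN in the lead column
    return df.merge(shifted, on=[identifier, time], how="left")


df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "mass": [10, 15, 20, 5, 10, 15],
})
print(lead_sketch(df, "mass", "id", "time"))
```

Under these assumptions, the last period of each id has no later observation, so its mass_lead value is missing.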

Manual: .chem.lag() Method for Creating Lagged Variables

Pychemist — Tue, 29 Jul 2025 07:35:39 +0000

Overview

The .chem.lag() method is a custom pandas accessor that creates lagged versions of one or more variables in a DataFrame. It operates by shifting the specified columns by a given number of time periods and merging the result back onto the original DataFrame.

This is especially useful for time-series panel data where each row belongs to a unique unit (e.g., company, experiment, patient) over time.

For a concrete example of usage, refer to https://pychemist.com/creating-lag-and-lead-variables/.

Accessor Registration

This method is registered with pandas under the accessor name .chem. Access it like this:

df.chem.lag(...)

Method Signature

df.chem.lag(variables, identifier, time, shift=1, *, replace=False)

Parameters

Parameter	Type	Description
variables	`str` or `list of str`	Column(s) for which to create lagged versions.
identifier	`str`	The name of the column identifying individual units (e.g., subject ID or group).
time	`str`	The name of the time column. Used to shift values within groups.
shift	`int`, default `1`	Number of time periods to shift. Positive values create lags. Negative values are not allowed.
replace	`bool`, default `False`	If `True`, overwrites existing lagged columns. If `False`, raises an error if there’s a naming conflict.

Returns

A modified copy of the original pd.DataFrame, with lagged versions of the specified variables added as new columns.

Behavior

Verifies input types and column existence.
Constructs a lagged version of the selected variables by shifting the time column forward by the given shift amount.
Merges this lagged DataFrame back onto the original, based on the identifier and time.
New columns are suffixed with:
- _lag for shift=1
- _lagN for shift=N (e.g., _lag3 for shift=3)
If replace=False and any of the output columns already exist, the method raises a ValueError.

Notes

The original DataFrame remains unchanged.
Supports multiple variables and vectorized operations.
Designed for panel or longitudinal data.
For lead variables (the equivalent of a negative shift), refer to .chem.lead().

Example

import pandas as pd  # the .chem accessor must also be registered before this runs

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "mass": [10, 15, 20, 5, 10, 15]
})

df_lagged = df.chem.lag(variables="mass", identifier="id", time="time", shift=1)

This will produce a new column called mass_lag with the mass value from the previous time step (by id).

Or more concisely:

df_lagged = df.chem.lag("mass", "id", "time")

Common Use Cases

Creating lagged predictors in time-series regression.
Modeling delayed effects in experiments.
Comparing values across sequential periods.

Error Handling

TypeError if variables is not a string or list of strings.
TypeError if replace is not a boolean.
TypeError if shift is not a positive integer.
ValueError if any specified variable is missing from the DataFrame.
ValueError if a lagged column already exists and replace=False.

Internals

The method:

Creates a shifted copy of the target columns using df[time] + shift.
Applies suffixes like _lag, _lag2, etc.
Merges the lagged columns back onto the original DataFrame using pd.merge(...).
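
One consequence of merging on the time column, rather than shifting rows, is that gaps in the time series are respected. The sketch below uses plain pandas to contrast the two approaches on a unit that is missing period 3. The left join is an assumption about the merge, and the row-based groupby(...).shift(...) call is shown only for comparison; it is not part of .chem.lag().

```python
import pandas as pd

# Unit 1 has no observation at time 3
df = pd.DataFrame({"id": [1, 1, 1], "time": [1, 2, 4], "mass": [10, 15, 20]})

# Time-based lag, following the documented steps: df[time] + shift, then merge
shifted = df[["id", "time", "mass"]].copy()
shifted["time"] = shifted["time"] + 1
shifted = shifted.rename(columns={"mass": "mass_lag"})
merged = df.merge(shifted, on=["id", "time"], how="left")  # left join is assumed

# Row-based lag: takes the previous row regardless of its time value
row_based = df.groupby("id")["mass"].shift(1)

print(merged)
print(row_based)
```

At time 4 the time-based version yields a missing value, because there is no observation at time 3, whereas the row-based shift carries over the value from time 2.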

Manual: .chem.mutate() Method for Conditional DataFrame Updates

Pychemist — Mon, 28 Jul 2025 19:13:44 +0000

Overview

The .chem.mutate() method is a custom pandas accessor that enables conditional assignment of values to a column in a DataFrame using a query string. It allows for clean, chainable DataFrame transformations and always returns a modified copy of the DataFrame.

This is particularly useful when working with chemistry-related tabular data and transformations, but it can be applied more broadly.

For a concrete example of usage, refer to https://pychemist.com/mutate/.

Accessor Registration

This method is registered using the pandas API extensions system. This means you can access the method via:

df.chem.mutate(...)

Method Signature

df.chem.mutate(query_str, column, value, other=None)

Parameters

Parameter	Type	Description
query_str	`str`	A pandas query string used to select rows that satisfy the condition.
column	`str`	The column to update or create.
value	`scalar` or `array-like`	The value(s) assigned to rows that meet the query condition.
other	`scalar` or `array-like`, optional	The value(s) assigned to rows not meeting the condition. If `None`, rows not matching the condition are left unchanged.

Returns

A modified copy of the original pd.DataFrame.

This method does not modify the DataFrame in-place.

Behavior

The method evaluates the query_str on the DataFrame.
For rows matching the query, the specified column is set to value.
For rows not matching the query, the column is set to other only if other is provided.
The column is created if it does not already exist.

Notes

This method is part of a custom accessor named .chem.
The original DataFrame is left unchanged.
value and other can be scalars or array-like; array-like values must match the number of rows being assigned.
You can use it in method chains, e.g.:

df2 = df.chem.mutate(query_str="mass > 10", column="label", value="heavy", other="light")

Or simply:

df2 = df.chem.mutate("mass > 10", "label", "heavy", "light")

Example

See https://pychemist.com/mutate/ for a practical example using .chem.mutate().

Common Use Cases

Labeling chemical species by some threshold: df.chem.mutate("concentration > 1.0", "status", "high", "low")
Creating a new boolean flag: df.chem.mutate("pH < 7", "is_acidic", True, other=False)

Error Handling

Any syntax errors in query_str will raise a pandas exception at runtime.
If value or other are array-like but do not match the shape of the selected rows, a ValueError will be raised.

Internals

Internally, the method:

Copies the DataFrame (df = self._obj.copy()),
Applies the query string using df.query(query_str),
Locates matching and non-matching row indices,
Assigns value and optionally other to the specified column,
Returns the modified DataFrame.
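
The steps above can be approximated in plain pandas as follows. The function mutate_sketch is a hypothetical stand-in rather than the library's code, and it assumes the DataFrame has a unique index so that matching rows can be identified by label.

```python
import pandas as pd


def mutate_sketch(df, query_str, column, value, other=None):
    """Hypothetical re-creation of the documented steps."""
    out = df.copy()  # the original DataFrame is never modified
    # Index labels of the rows that satisfy the condition
    matched = out.query(query_str).index
    mask = out.index.isin(matched)  # assumes a unique index

    # Assigning through .loc creates the column if it does not exist yet
    out.loc[mask, column] = value
    if other is not None:
        out.loc[~mask, column] = other
    return out


df = pd.DataFrame({"mass": [10, 15, 20, 5, 10, 15]})
df2 = mutate_sketch(df, "mass > 10", "label", "heavy", "light")
print(df2)
```

Because the function works on a copy and returns it, calls can be chained without side effects on df.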
