Overview
The .chem.lag() method is a custom pandas accessor that creates lagged versions of one or more variables in a DataFrame. It operates by shifting the specified columns by a given number of time periods and merging the result back onto the original DataFrame.
This is especially useful for time-series panel data where each row belongs to a unique unit (e.g., company, experiment, patient) over time.
For a concrete example of usage, refer to https://pychemist.com/creating-lag-and-lead-variables/.
Accessor Registration
This method is registered with pandas under the accessor name .chem. Access the method like this:
df.chem.lag(...)
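For readers unfamiliar with accessors, the following is a minimal sketch of how a DataFrame accessor can be registered with pandas through `pd.api.extensions.register_dataframe_accessor`. The accessor name `demo_chem`, the class `DemoChemAccessor`, and the method `n_rows` are hypothetical and chosen so they do not clash with the real `.chem` accessor; the package's actual class is not shown in this document.

```python
import pandas as pd


# Hypothetical accessor, registered under "demo_chem" for illustration only.
@pd.api.extensions.register_dataframe_accessor("demo_chem")
class DemoChemAccessor:
    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        # pandas passes in the DataFrame the accessor is called on.
        self._obj = pandas_obj

    def n_rows(self) -> int:
        """Toy method showing how accessor methods reach the DataFrame."""
        return len(self._obj)


df = pd.DataFrame({"id": [1, 1, 2], "mass": [10, 15, 5]})
print(df.demo_chem.n_rows())
```

Once the module containing the decorated class has been imported, every DataFrame exposes the accessor as an attribute, which is why the method is called as df.chem.lag(...) rather than as a standalone function.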
Method Signature
df.chem.lag(variables, identifier, time, shift=1, *, replace=False)
Parameters
Parameter | Type | Description |
---|---|---|
variables | str or list of str | Column(s) for which to create lagged versions. |
identifier | str | The name of the column identifying individual units (e.g., subject ID or group). |
time | str | The name of the time column. Used to shift values within groups. |
shift | int, default 1 | Number of time periods to shift. Positive values create lags. Negative values are not allowed. |
replace | bool, default False | If True, overwrites existing lagged columns. If False, raises an error if there’s a naming conflict. |
Returns
- A modified copy of the original pd.DataFrame, with lagged versions of the specified variables added as new columns.
Behavior
- Verifies input types and column existence.
- Constructs a lagged version of the selected variables by shifting the time column forward by the given shift amount.
- Merges this lagged DataFrame back onto the original, based on the identifier and time.
- New columns are suffixed with:
  - _lag for shift=1
  - _lagN for shift=N (e.g., _lag3 for shift=3)
- If replace=False and any of the output columns already exist, the method raises a ValueError.
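The behavior described above can be approximated in plain pandas. The function below is a hypothetical stand-in written from this description, not the package's source; the name `lag_sketch`, the use of a left join, and the way existing columns are dropped when replace=True are assumptions.

```python
import pandas as pd


def lag_sketch(df, variables, identifier, time, shift=1, replace=False):
    """Hypothetical approximation of df.chem.lag(), based on the documented behavior."""
    variables = [variables] if isinstance(variables, str) else list(variables)
    suffix = "_lag" if shift == 1 else f"_lag{shift}"
    new_names = {v: f"{v}{suffix}" for v in variables}

    # Refuse to clobber existing columns unless replace=True.
    conflicts = [name for name in new_names.values() if name in df.columns]
    if conflicts and not replace:
        raise ValueError(f"Columns already exist: {conflicts}")

    # Move each observation forward in time so it lines up with the
    # later row that should receive it as a lagged value.
    lagged = df[[identifier, time] + variables].copy()
    lagged[time] = lagged[time] + shift
    lagged = lagged.rename(columns=new_names)

    # Assumption: a left join keeps every original row; drop() returns a copy,
    # so the input DataFrame is not modified.
    base = df.drop(columns=conflicts)
    return base.merge(lagged, on=[identifier, time], how="left")


df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "mass": [10, 15, 20, 5, 10, 15],
})
print(lag_sketch(df, "mass", "id", "time"))
```

Under these assumptions, the first period of each id has no earlier observation to match, so its lagged value is missing (NaN).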
Notes
- The original DataFrame remains unchanged.
- Supports multiple variables and vectorized operations.
- Designed for panel or longitudinal data.
- For negative shifts (lead variables), refer to .chem.lead().
Example
import pandas as pd
# The package that registers the .chem accessor must also be imported.

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "mass": [10, 15, 20, 5, 10, 15]
})

df_lagged = df.chem.lag(variables="mass", identifier="id", time="time", shift=1)
This will produce a new column called mass_lag with the mass value from the previous time step (by id).
Or more concisely:
df_lagged = df.chem.lag("mass", "id", "time")
Common Use Cases
- Creating lagged predictors in time-series regression.
- Modeling delayed effects in experiments.
- Comparing values across sequential periods.
Error Handling
- TypeError if variables is not a string or list of strings.
- TypeError if replace is not a boolean.
- TypeError if shift is not a positive integer.
- ValueError if any specified variable is missing from the DataFrame.
- ValueError if a lagged column already exists and replace=False.
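As a rough illustration of these checks, the sketch below mirrors the documented error cases. The function name `validate_lag_args`, the error messages, and the decision to reject booleans passed as shift are assumptions; the package's real validation code is not shown here.

```python
import pandas as pd


def validate_lag_args(df, variables, shift, replace):
    """Hypothetical validator mirroring the documented error cases."""
    if isinstance(variables, str):
        variables = [variables]
    elif not (isinstance(variables, list)
              and all(isinstance(v, str) for v in variables)):
        raise TypeError("variables must be a string or a list of strings")

    if not isinstance(replace, bool):
        raise TypeError("replace must be a boolean")

    # bool is a subclass of int in Python, so it is excluded explicitly (assumption).
    if isinstance(shift, bool) or not isinstance(shift, int) or shift < 1:
        raise TypeError("shift must be a positive integer")

    missing = [v for v in variables if v not in df.columns]
    if missing:
        raise ValueError(f"variables not found in DataFrame: {missing}")
    return variables


demo = pd.DataFrame({"id": [1], "time": [1], "mass": [10]})
for args in [("mass", 1, False), ("mass", 0, False), ("volume", 1, False)]:
    try:
        print(args, "->", validate_lag_args(demo, *args))
    except (TypeError, ValueError) as err:
        print(args, "->", type(err).__name__, err)
```

The check for an already existing lagged column depends on the suffix derived from shift, so it is left out of this sketch.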
Internals
The method:
- Creates a shifted copy of the target columns using df[time] + shift.
- Applies suffixes like _lag, _lag2, etc.
- Merges the lagged columns back onto the original DataFrame using pd.merge(...).
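One consequence of shifting the time key and merging, rather than shifting rows, shows up when a unit has gaps in its time series. The comparison below is a sketch in plain pandas, assuming the merge is a left join on the identifier and time columns; the column name `mass_rowlag` is made up for the row-based variant.

```python
import pandas as pd

# Unit 1 has no observation at time 3, so time 4 has no "previous period" value.
df = pd.DataFrame({
    "id": [1, 1, 1],
    "time": [1, 2, 4],
    "mass": [10, 15, 20],
})

# Merge-based lag: shift the time key, then join back on (id, time).
lagged = df[["id", "time", "mass"]].copy()
lagged["time"] = lagged["time"] + 1
lagged = lagged.rename(columns={"mass": "mass_lag"})
by_merge = df.merge(lagged, on=["id", "time"], how="left")

# Row-based lag: take the previous row within each id, whatever its time value.
by_rows = df.sort_values(["id", "time"]).assign(
    mass_rowlag=lambda d: d.groupby("id")["mass"].shift(1)
)

print(by_merge)  # mass_lag is missing at time 4 because time 3 was never observed
print(by_rows)   # mass_rowlag at time 4 is 15, taken from time 2
```

With the merge-based approach a lagged value is only filled when an observation exists exactly shift periods earlier, whereas a row-based shift silently pairs the row with whatever observation precedes it.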
See Also
- .chem.mutate() – for conditional column assignment
- pd.DataFrame.shift() – basic shifting