<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title></title>
	<atom:link href="https://pychemist.com/feed/" rel="self" type="application/rss+xml" />
	<link>https://pychemist.com</link>
	<description>Pychemist</description>
	<lastBuildDate>Tue, 29 Jul 2025 10:12:24 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.8.2</generator>

<image>
	<url>https://pychemist.com/wp-content/uploads/2025/07/cropped-mini-logo1-01-32x32.png</url>
	<title></title>
	<link>https://pychemist.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Manual: .chem.lead() Method for Creating Lead Variables</title>
		<link>https://pychemist.com/manual-chem-lead/</link>
					<comments>https://pychemist.com/manual-chem-lead/#respond</comments>
		
		<dc:creator><![CDATA[Pychemist]]></dc:creator>
		<pubDate>Tue, 29 Jul 2025 07:41:24 +0000</pubDate>
				<category><![CDATA[Manuals]]></category>
		<guid isPermaLink="false">https://pychemist.com/?p=265</guid>

					<description><![CDATA[Overview The .chem.lead() method is a custom pandas accessor that creates lead versions of one or more variables in a [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Overview</h2>



<p>The <code>.chem.lead()</code> method is a custom pandas accessor that creates <strong>lead</strong> versions of one or more variables in a DataFrame. It works by shifting the time key of the specified columns <strong>backward</strong> by <code>shift</code> periods, so that each row is paired with the value observed <code>shift</code> periods later, and then merging the result back onto the original DataFrame.</p>



<p>This is especially useful for panel or time-series data where each row represents an observation at a time point for a particular unit (e.g., experiment, company, individual).</p>



<p>For a concrete example of usage, refer to <strong><a href="https://pychemist.com/creating-lag-and-lead-variables/">https://pychemist.com/creating-lag-and-lead-variables/</a></strong>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Accessor Registration</h2>



<p>This method is registered with pandas under the accessor name <code>.chem</code>. Access it like this:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.chem.lead(...)</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Method Signature</h2>



<pre class="urvanov-syntax-highlighter-plain-tag">df.chem.lead(variables, identifier, time, shift=1, *, replace=False)</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Parameters</h2>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Parameter</th><th>Type</th><th>Description</th></tr></thead><tbody><tr><td><strong>variables</strong></td><td><code>str</code> or <code>list of str</code></td><td>Column(s) for which to create lead versions.</td></tr><tr><td><strong>identifier</strong></td><td><code>str</code></td><td>The name of the column identifying individual units (e.g., subject ID or group).</td></tr><tr><td><strong>time</strong></td><td><code>str</code></td><td>The name of the time column. Used to shift values within groups.</td></tr><tr><td><strong>shift</strong></td><td><code>int</code>, default <code>1</code></td><td>Number of time periods to shift. Must be a <strong>positive</strong> integer.</td></tr><tr><td><strong>replace</strong></td><td><code>bool</code>, default <code>False</code></td><td>If <code>True</code>, overwrites existing lead columns. If <code>False</code>, raises an error if there&#8217;s a naming conflict.</td></tr></tbody></table></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Returns</h2>



<ul class="wp-block-list">
<li>A modified copy of the original <code>pd.DataFrame</code>, with lead versions of the specified variables added as new columns.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Behavior</h2>



<ol class="wp-block-list">
<li>Validates parameter types and column existence.</li>



<li>Creates a shifted version of the selected variables by <strong>subtracting</strong> the <code>shift</code> from the <code>time</code> column.</li>



<li>Merges this lead DataFrame back into the original, using <code>identifier</code> and <code>time</code> as keys.</li>



<li>New columns are suffixed with:
<ul class="wp-block-list">
<li><code>_lead</code> for <code>shift=1</code></li>



<li><code>_leadN</code> for <code>shift=N</code> (e.g., <code>_lead3</code> for <code>shift=3</code>)</li>
</ul>
</li>



<li>If <code>replace=False</code> and a target lead column already exists, a <code>ValueError</code> is raised.</li>
</ol>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Notes</h2>



<ul class="wp-block-list">
<li>The original DataFrame remains unchanged.</li>



<li>Supports multiple variables and vectorized group-wise operations.</li>



<li>Useful for forecasting models or previewing future values.</li>



<li>For backward-looking operations, see <code>.chem.lag()</code>.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Example</h2>



<pre class="urvanov-syntax-highlighter-plain-tag">df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "mass": [10, 15, 20, 5, 10, 15]
})

df2 = df.chem.lead(variables="mass", identifier="id", time="time", shift=1)</pre>



<p>This will produce a new column called <code>mass_lead</code> with the mass value from the <strong>next</strong> time step (grouped by <code>id</code>).</p>



<p>Or more concisely:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df2 = df.chem.lead("mass", "id", "time")</pre>
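
<p>For gap-free data such as the example above, the expected <code>mass_lead</code> values can be cross-checked with plain pandas. The snippet below is only a sketch under the assumption that the method behaves as described on this page; it does not call pychemist, and the name <code>expected</code> is illustrative.</p>

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "mass": [10, 15, 20, 5, 10, 15],
})

# Plain-pandas cross-check for gap-free data: within each id,
# take the value from the next row after sorting by time.
expected = df.sort_values(["id", "time"]).copy()
expected["mass_lead"] = expected.groupby("id")["mass"].shift(-1)
print(expected)
```

<p>The last time step of each <code>id</code> has no following observation, so its lead value is missing.</p>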



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Common Use Cases</h2>



<ul class="wp-block-list">
<li>Creating future-looking predictors in modeling.</li>



<li>Forecast validation (comparing current and future state).</li>



<li>Detecting upcoming transitions or changes in sequence.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Error Handling</h2>



<ul class="wp-block-list">
<li><strong>TypeError</strong> if <code>variables</code> is not a string or list of strings.</li>



<li><strong>TypeError</strong> if <code>replace</code> is not a boolean.</li>



<li><strong>TypeError</strong> if <code>shift</code> is not a positive integer.</li>



<li><strong>ValueError</strong> if a specified column does not exist.</li>



<li><strong>ValueError</strong> if a lead column already exists and <code>replace=False</code>.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Internals</h2>



<p>The method:</p>



<ol class="wp-block-list">
<li>Copies the relevant subset of the DataFrame.</li>



<li>Shifts the <code>time</code> column <strong>backward</strong> (<code>df[time] - shift</code>) to align future values.</li>



<li>Applies suffixes such as <code>_lead</code> or <code>_leadN</code>.</li>



<li>Merges the lead values back into the original DataFrame using <code>pd.merge(...)</code>.</li>
</ol>
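
<p>The four steps above can be sketched in plain pandas as follows. This is an illustration of the described technique, not the pychemist source code: the function name <code>lead_sketch</code> is hypothetical, and the use of a left merge is an assumption.</p>

```python
import pandas as pd

def lead_sketch(df, variables, identifier, time, shift=1):
    """Illustrative merge-based lead; not the pychemist implementation."""
    variables = [variables] if isinstance(variables, str) else list(variables)
    suffix = "_lead" if shift == 1 else f"_lead{shift}"
    # 1. Copy the key columns and the variables to be led.
    future = df[[identifier, time] + variables].copy()
    # 2. Move the time key backward so the row for t + shift is keyed at t.
    future[time] = future[time] - shift
    # 3. Apply the suffix to the value columns.
    future = future.rename(columns={v: v + suffix for v in variables})
    # 4. Left-merge onto the original (assumed); unmatched periods become NaN.
    return df.merge(future, on=[identifier, time], how="left")

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "mass": [10, 15, 20, 5, 10, 15],
})
out = lead_sketch(df, "mass", "id", "time", shift=2)
print(out)
```

<p>Because the alignment happens on the time key rather than on row position, a period without a matching future observation simply receives a missing value.</p>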



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">See Also</h2>



<ul class="wp-block-list">
<li><code>.chem.lag()</code> – for creating lagged (past) variables</li>



<li><code>.chem.mutate()</code> – for conditional column assignment</li>



<li><code>pd.DataFrame.shift()</code> – basic shifting</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>
]]></content:encoded>
					
					<wfw:commentRss>https://pychemist.com/manual-chem-lead/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Manual: .chem.lag() Method for Creating Lagged Variables</title>
		<link>https://pychemist.com/manual-chem-lag/</link>
					<comments>https://pychemist.com/manual-chem-lag/#respond</comments>
		
		<dc:creator><![CDATA[Pychemist]]></dc:creator>
		<pubDate>Tue, 29 Jul 2025 07:35:39 +0000</pubDate>
				<category><![CDATA[Manuals]]></category>
		<guid isPermaLink="false">https://pychemist.com/?p=262</guid>

					<description><![CDATA[Overview The .chem.lag() method is a custom pandas accessor that creates lagged versions of one or more variables in a [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Overview</h2>



<p>The <code>.chem.lag()</code> method is a custom pandas accessor that creates lagged versions of one or more variables in a DataFrame. It operates by shifting the specified columns by a given number of time periods and merging the result back onto the original DataFrame.</p>



<p>This is especially useful for time-series panel data where each row belongs to a unique unit (e.g., company, experiment, patient) over time.</p>



<p>For a concrete example of usage, refer to <strong><a href="https://pychemist.com/creating-lag-and-lead-variables/">https://pychemist.com/creating-lag-and-lead-variables/</a></strong>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Accessor Registration</h2>



<p>This method is registered with pandas under the accessor name <code>.chem</code>. Access it like this:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.chem.lag(...)</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Method Signature</h2>



<pre class="urvanov-syntax-highlighter-plain-tag">df.chem.lag(variables, identifier, time, shift=1, *, replace=False)</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Parameters</h2>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Parameter</th><th>Type</th><th>Description</th></tr></thead><tbody><tr><td><strong>variables</strong></td><td><code>str</code> or <code>list of str</code></td><td>Column(s) for which to create lagged versions.</td></tr><tr><td><strong>identifier</strong></td><td><code>str</code></td><td>The name of the column identifying individual units (e.g., subject ID or group).</td></tr><tr><td><strong>time</strong></td><td><code>str</code></td><td>The name of the time column. Used to shift values within groups.</td></tr><tr><td><strong>shift</strong></td><td><code>int</code>, default <code>1</code></td><td>Number of time periods to shift. Positive values create lags. Negative values are not allowed.</td></tr><tr><td><strong>replace</strong></td><td><code>bool</code>, default <code>False</code></td><td>If <code>True</code>, overwrites existing lagged columns. If <code>False</code>, raises an error if there&#8217;s a naming conflict.</td></tr></tbody></table></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Returns</h2>



<ul class="wp-block-list">
<li>A modified copy of the original <code>pd.DataFrame</code>, with lagged versions of the specified variables added as new columns.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Behavior</h2>



<ol class="wp-block-list">
<li>Verifies input types and column existence.</li>



<li>Constructs a lagged version of the selected variables by shifting the <code>time</code> column forward by the given <code>shift</code> amount.</li>



<li>Merges this lagged DataFrame back onto the original, based on the <code>identifier</code> and <code>time</code>.</li>



<li>New columns are suffixed with:
<ul class="wp-block-list">
<li><code>_lag</code> for <code>shift=1</code></li>



<li><code>_lagN</code> for <code>shift=N</code> (e.g., <code>_lag3</code> for <code>shift=3</code>)</li>
</ul>
</li>



<li>If <code>replace=False</code> and any of the output columns already exist, the method raises a <code>ValueError</code>.</li>
</ol>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Notes</h2>



<ul class="wp-block-list">
<li>The original DataFrame remains unchanged.</li>



<li>Supports multiple variables and vectorized operations.</li>



<li>Designed for panel or longitudinal data.</li>



<li>For lead variables (the equivalent of a negative shift), refer to <code>.chem.lead()</code>.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Example</h2>



<pre class="urvanov-syntax-highlighter-plain-tag">df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "time": [1, 2, 3, 1, 2, 3],
    "mass": [10, 15, 20, 5, 10, 15]
})

df_lagged = df.chem.lag(variables="mass", identifier="id", time="time", shift=1)</pre>



<p>This will produce a new column called <code>mass_lag</code> with the mass value from the previous time step (by <code>id</code>).</p>



<p>Or more concisely:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df_lagged = df.chem.lag("mass", "id", "time")</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Common Use Cases</h2>



<ul class="wp-block-list">
<li>Creating lagged predictors in time-series regression.</li>



<li>Modeling delayed effects in experiments.</li>



<li>Comparing values across sequential periods.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Error Handling</h2>



<ul class="wp-block-list">
<li><strong>TypeError</strong> if <code>variables</code> is not a string or list of strings.</li>



<li><strong>TypeError</strong> if <code>replace</code> is not a boolean.</li>



<li><strong>TypeError</strong> if <code>shift</code> is not a positive integer.</li>



<li><strong>ValueError</strong> if any specified variable is missing from the DataFrame.</li>



<li><strong>ValueError</strong> if a lagged column already exists and <code>replace=False</code>.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Internals</h2>



<p>The method:</p>



<ol class="wp-block-list">
<li>Creates a shifted copy of the target columns using <code>df[time] + shift</code>.</li>



<li>Applies suffixes like <code>_lag</code>, <code>_lag2</code>, etc.</li>



<li>Merges the lagged columns back onto the original DataFrame using <code>pd.merge(...)</code>.</li>
</ol>
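
<p>A minimal plain-pandas sketch of these steps is shown below, using a series with a missing period to show why merging on the time key matters. It is an illustration under the assumption of a left merge, not the pychemist source; the names <code>past</code> and <code>lagged</code> are illustrative.</p>

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1],
    "time": [2020, 2021, 2023],  # 2022 is missing
    "mass": [10, 15, 20],
})

# Key each observation at the period in which it is "the previous value".
past = df[["id", "time", "mass"]].copy()
past["time"] = past["time"] + 1
past = past.rename(columns={"mass": "mass_lag"})

# Left merge (assumed): rows without a match on (id, time) get NaN.
lagged = df.merge(past, on=["id", "time"], how="left")
print(lagged)
```

<p>The row for 2023 receives a missing lag because no observation exists for 2022, whereas a purely positional shift would have filled in the 2021 value.</p>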



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">See Also</h2>



<ul class="wp-block-list">
<li><code>.chem.mutate()</code> – for conditional column assignment</li>



<li><code>pd.DataFrame.shift()</code> – basic shifting</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>
]]></content:encoded>
					
					<wfw:commentRss>https://pychemist.com/manual-chem-lag/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Manual: .chem.mutate() Method for Conditional DataFrame Updates</title>
		<link>https://pychemist.com/manual-chem-mutate/</link>
					<comments>https://pychemist.com/manual-chem-mutate/#respond</comments>
		
		<dc:creator><![CDATA[Pychemist]]></dc:creator>
		<pubDate>Mon, 28 Jul 2025 19:13:44 +0000</pubDate>
				<category><![CDATA[Manuals]]></category>
		<guid isPermaLink="false">https://pychemist.com/?p=259</guid>

					<description><![CDATA[Overview The .chem.mutate() method is a custom pandas accessor that enables conditional assignment of values to a column in a [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Overview</h2>



<p>The <code>.chem.mutate()</code> method is a custom pandas accessor that enables conditional assignment of values to a column in a DataFrame using a query string. It allows for clean, chainable DataFrame transformations and always returns a modified copy of the DataFrame.</p>



<p>This is particularly useful when working with chemistry-related tabular data and transformations, but it can be applied more broadly.</p>



<p>For a concrete example of usage, refer to <strong><a href="https://pychemist.com/mutate/">https://pychemist.com/mutate/</a></strong>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Accessor Registration</h2>



<p>This method is registered using the pandas API extensions system. This means you can access the method via:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.chem.mutate(...)</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Method Signature</h2>



<pre class="urvanov-syntax-highlighter-plain-tag">df.chem.mutate(query_str, column, value, other=None)</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Parameters</h2>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th>Parameter</th><th>Type</th><th>Description</th></tr></thead><tbody><tr><td><strong>query_str</strong></td><td><code>str</code></td><td>A <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html">pandas query string</a> used to select rows that satisfy the condition.</td></tr><tr><td><strong>column</strong></td><td><code>str</code></td><td>The column to update or create.</td></tr><tr><td><strong>value</strong></td><td><code>scalar</code> or <code>array-like</code></td><td>The value(s) assigned to rows that meet the query condition.</td></tr><tr><td><strong>other</strong></td><td><code>scalar</code> or <code>array-like</code>, optional</td><td>The value(s) assigned to rows <strong>not</strong> meeting the condition. If <code>None</code>, rows not matching the condition are left unchanged.</td></tr></tbody></table></figure>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Returns</h2>



<ul class="wp-block-list">
<li>A modified copy of the original <code>pd.DataFrame</code>.</li>
</ul>



<p>This method does <strong>not</strong> modify the DataFrame in-place.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Behavior</h2>



<ol class="wp-block-list">
<li>The method evaluates the <code>query_str</code> on the DataFrame.</li>



<li>For rows <strong>matching the query</strong>, the specified column is set to <code>value</code>.</li>



<li>For rows <strong>not matching the query</strong>, the column is set to <code>other</code> <strong>only if <code>other</code> is provided</strong>.</li>



<li>The column is created if it does not already exist.</li>
</ol>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Notes</h2>



<ul class="wp-block-list">
<li>This method is part of a custom accessor named <code>.chem</code>.</li>



<li>The original DataFrame is left unchanged.</li>



<li>You can use it in method chains, e.g.: <code>df2 = df.chem.mutate(query_str="mass &gt; 10", column="label", value="heavy", other="light")</code></li>



<li><code>value</code> and <code>other</code> can be scalars or array-like, but must match the number of rows being assigned.</li>
</ul>



<p>Or simply:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df2 = df.chem.mutate("mass &gt; 10", "label", "heavy", "light")</pre>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Example</h2>



<p><em>See <a href="https://pychemist.com/mutate/">https://pychemist.com/mutate/</a> for a practical example using <code>.chem.mutate()</code>.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Common Use Cases</h2>



<ul class="wp-block-list">
<li>Labeling chemical species by some threshold: <code>df.chem.mutate("concentration &gt; 1.0", "status", "high", "low")</code></li>



<li>Creating a new boolean flag: <code>df.chem.mutate("pH &lt; 7", "is_acidic", True, other=False)</code></li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Error Handling</h2>



<ul class="wp-block-list">
<li>Any syntax errors in <code>query_str</code> will raise a <code>pandas</code> exception at runtime.</li>



<li>If <code>value</code> or <code>other</code> are array-like but do not match the shape of the selected rows, a <code>ValueError</code> will be raised.</li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Internals</h2>



<p>Internally, the method:</p>



<ol class="wp-block-list">
<li>Copies the DataFrame (<code>df = self._obj.copy()</code>),</li>



<li>Applies the query string using <code>df.query(query_str)</code>,</li>



<li>Locates matching and non-matching row indices,</li>



<li>Assigns <code>value</code> and optionally <code>other</code> to the specified column,</li>



<li>Returns the modified DataFrame.</li>
</ol>
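
<p>The steps above could look roughly like the following in plain pandas. This is a sketch, not the pychemist implementation: the function name <code>mutate_sketch</code> is hypothetical, and it assumes the DataFrame has a unique index.</p>

```python
import pandas as pd

def mutate_sketch(df, query_str, column, value, other=None):
    """Illustrative only; the actual pychemist code may differ."""
    out = df.copy()  # never modify the caller's DataFrame
    # Boolean mask of rows selected by the query (assumes a unique index).
    mask = out.index.isin(out.query(query_str).index)
    out.loc[mask, column] = value        # creates the column if needed
    if other is not None:
        out.loc[~mask, column] = other   # only touched when other is given
    return out

df = pd.DataFrame({"mass": [5, 12, 20]})
df2 = mutate_sketch(df, "mass > 10", "label", "heavy", "light")
df3 = mutate_sketch(df, "mass > 10", "label", "heavy")
print(df2)
print(df3)
```

<p>In <code>df3</code>, where <code>other</code> is omitted, the non-matching row of the newly created column is left as a missing value.</p>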



<hr class="wp-block-separator has-alpha-channel-opacity"/>
]]></content:encoded>
					
					<wfw:commentRss>https://pychemist.com/manual-chem-mutate/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Creating Lag and Lead Variables</title>
		<link>https://pychemist.com/creating-lag-and-lead-variables/</link>
					<comments>https://pychemist.com/creating-lag-and-lead-variables/#respond</comments>
		
		<dc:creator><![CDATA[Pychemist]]></dc:creator>
		<pubDate>Mon, 28 Jul 2025 18:35:57 +0000</pubDate>
				<category><![CDATA[Guides]]></category>
		<guid isPermaLink="false">https://pychemist.com/?p=254</guid>

					<description><![CDATA[Introduction In this tutorial, we examine how to create lagged and lead variables: essential tools for time series and panel [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Introduction</h2>



<p>In this tutorial, we examine how to create <strong>lagged</strong> and <strong>lead</strong> variables, two basic tools for time series and panel data analysis. Whether you’re modeling financial trends or running regressions, these transformations are often needed to compute derived variables such as <em>Return on Assets</em> or <em>Revenue Growth</em>. We’ll also explore how the <code>pychemist</code> library simplifies this task.</p>



<p>We use <code>pandas</code> and the <code>pychemist</code> library to demonstrate this. For this tutorial we rely on a dataset of financial information for 20 large US companies, scraped from <em>EDGAR</em>. Some missing values in the dataset have been imputed for demonstration purposes.</p>



<p>While pandas lets you create lagged values using <code>.shift()</code>, a plain shift moves values by row position, so it ignores group boundaries and gaps in the time index. Pychemist provides a simpler and more reliable alternative through its <code>.chem</code> accessor.</p>



<h2 class="wp-block-heading">Step 1: Load the Data and Libraries</h2>



<p>We start by importing the necessary libraries and loading the dataset. <code>pandas</code> and <code>numpy</code> are used for working with panel datasets. Additionally, we import <code>pychemist</code> to download the financial dataset and to enable the custom pandas extensions (accessors) used in this example.</p>



<p><strong>Note:</strong> If you haven’t installed pychemist yet, you can do so with the following command:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">pip install pychemist</pre>



<p>Import the required libraries and load the dataset into a DataFrame (df):</p>



<pre class="urvanov-syntax-highlighter-plain-tag">import pandas as pd
import numpy as np
import pychemist

df = pychemist.load('financials')</pre>



<p>Let’s inspect the first few rows:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.head()</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th></tr></thead><tbody><tr><td>0</td><td>11</td><td>AAPL</td><td>2020</td><td>5.953100e+10</td><td>3.385160e+11</td><td>NaN</td></tr><tr><td>1</td><td>12</td><td>AAPL</td><td>2021</td><td>5.525600e+10</td><td>3.238880e+11</td><td>NaN</td></tr><tr><td>2</td><td>13</td><td>AAPL</td><td>2022</td><td>5.741100e+10</td><td>3.510020e+11</td><td>NaN</td></tr><tr><td>3</td><td>14</td><td>AAPL</td><td>2023</td><td>NaN</td><td>3.527550e+11</td><td>NaN</td></tr><tr><td>4</td><td>15</td><td>AAPL</td><td>2024</td><td>9.980300e+10</td><td>3.525830e+11</td><td>NaN</td></tr></tbody></table></figure>



<p>5 rows × 6 columns</p>



<h2 class="wp-block-heading">Step 2: Create a Lag Manually</h2>



<p>First, we sort the dataset by company (<code>ticker</code>) and time (<code>year</code>). Then we create a lagged variable <code>total_assets_lag</code> using pandas’ built-in <code>.shift()</code> method.</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.sort_values(['ticker', 'year'])
df['total_assets_lag'] = df['total_assets'].shift()</pre>



<p>Next, we inspect the DataFrame:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.head(10)</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th><th>total_assets_lag</th></tr></thead><tbody><tr><td>0</td><td>11</td><td>AAPL</td><td>2020</td><td>5.953100e+10</td><td>3.385160e+11</td><td>NaN</td><td>NaN</td></tr><tr><td>1</td><td>12</td><td>AAPL</td><td>2021</td><td>5.525600e+10</td><td>3.238880e+11</td><td>NaN</td><td>3.385160e+11</td></tr><tr><td>2</td><td>13</td><td>AAPL</td><td>2022</td><td>5.741100e+10</td><td>3.510020e+11</td><td>NaN</td><td>3.238880e+11</td></tr><tr><td>3</td><td>14</td><td>AAPL</td><td>2023</td><td>NaN</td><td>3.527550e+11</td><td>NaN</td><td>3.510020e+11</td></tr><tr><td>4</td><td>15</td><td>AAPL</td><td>2024</td><td>9.980300e+10</td><td>3.525830e+11</td><td>NaN</td><td>3.527550e+11</td></tr><tr><td>5</td><td>27</td><td>ADBE</td><td>2020</td><td>2.591000e+09</td><td>2.076200e+10</td><td>9.030000e+09</td><td>3.525830e+11</td></tr><tr><td>6</td><td>28</td><td>ADBE</td><td>2021</td><td>2.951000e+09</td><td>2.428400e+10</td><td>1.117100e+10</td><td>2.076200e+10</td></tr><tr><td>7</td><td>29</td><td>ADBE</td><td>2022</td><td>5.260000e+09</td><td>2.724100e+10</td><td>1.286800e+10</td><td>2.428400e+10</td></tr><tr><td>8</td><td>30</td><td>ADBE</td><td>2023</td><td>4.822000e+09</td><td>2.716500e+10</td><td>1.578500e+10</td><td>2.724100e+10</td></tr><tr><td>9</td><td>31</td><td>ADBE</td><td>2024</td><td>4.756000e+09</td><td>2.977900e+10</td><td>1.760600e+10</td><td>2.716500e+10</td></tr></tbody></table></figure>



<p>10 rows × 7 columns</p>



<p><strong>Problem:</strong> A lagged version of <code>total_assets</code> has indeed been added to the DataFrame. However, this method does <strong>not</strong> take into account whether the previous row belongs to the same company. It simply shifts row-wise, even across companies. For example, <code>total_assets_lag</code> for Adobe in 2020 equals Apple’s total assets for 2024. Obviously, this is not what we want.</p>



<h2 class="wp-block-heading">Step 3: Fix Grouping Issues Manually</h2>



<p>To address this, we can conditionally shift values only if the previous row belongs to the same <code>ticker</code>:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.sort_values(['ticker', 'year'])
df['total_assets_lag'] = np.where(
    df['ticker'] == df['ticker'].shift(),
    df['total_assets'].shift(),
    np.nan
)</pre>



<p>Inspection of the first 10 rows of the DataFrame indicates that this indeed resolved the issue.</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.head(10)</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th><th>total_assets_lag</th></tr></thead><tbody><tr><td>0</td><td>11</td><td>AAPL</td><td>2020</td><td>5.953100e+10</td><td>3.385160e+11</td><td>NaN</td><td>NaN</td></tr><tr><td>1</td><td>12</td><td>AAPL</td><td>2021</td><td>5.525600e+10</td><td>3.238880e+11</td><td>NaN</td><td>3.385160e+11</td></tr><tr><td>2</td><td>13</td><td>AAPL</td><td>2022</td><td>5.741100e+10</td><td>3.510020e+11</td><td>NaN</td><td>3.238880e+11</td></tr><tr><td>3</td><td>14</td><td>AAPL</td><td>2023</td><td>NaN</td><td>3.527550e+11</td><td>NaN</td><td>3.510020e+11</td></tr><tr><td>4</td><td>15</td><td>AAPL</td><td>2024</td><td>9.980300e+10</td><td>3.525830e+11</td><td>NaN</td><td>3.527550e+11</td></tr><tr><td>5</td><td>27</td><td>ADBE</td><td>2020</td><td>2.591000e+09</td><td>2.076200e+10</td><td>9.030000e+09</td><td>NaN</td></tr><tr><td>6</td><td>28</td><td>ADBE</td><td>2021</td><td>2.951000e+09</td><td>2.428400e+10</td><td>1.117100e+10</td><td>2.076200e+10</td></tr><tr><td>7</td><td>29</td><td>ADBE</td><td>2022</td><td>5.260000e+09</td><td>2.724100e+10</td><td>1.286800e+10</td><td>2.428400e+10</td></tr><tr><td>8</td><td>30</td><td>ADBE</td><td>2023</td><td>4.822000e+09</td><td>2.716500e+10</td><td>1.578500e+10</td><td>2.724100e+10</td></tr><tr><td>9</td><td>31</td><td>ADBE</td><td>2024</td><td>4.756000e+09</td><td>2.977900e+10</td><td>1.760600e+10</td><td>2.716500e+10</td></tr></tbody></table></figure>



<p>10 rows × 7 columns</p>
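
<p>As a side note, pandas’ <code>groupby(...).shift()</code> gives the same result as the <code>np.where</code> condition for this case. The sketch below compares the two on a small hand-built frame that reuses four values from the table above; the frame <code>toy</code> and the column names <code>lag_where</code> and <code>lag_groupby</code> are illustrative and not part of the tutorial dataset.</p>

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "ADBE", "ADBE"],
    "year": [2023, 2024, 2020, 2021],
    "total_assets": [3.527550e+11, 3.525830e+11, 2.076200e+10, 2.428400e+10],
}).sort_values(["ticker", "year"])

# Approach from this step: blank the lag where the ticker changes.
toy["lag_where"] = np.where(
    toy["ticker"] == toy["ticker"].shift(),
    toy["total_assets"].shift(),
    np.nan,
)

# Built-in alternative: shift within each ticker group.
toy["lag_groupby"] = toy.groupby("ticker")["total_assets"].shift()
print(toy)
```

<p>Both variants shift by row position within a company, so neither of them checks whether the previous row is actually the previous year.</p>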



<p>However, there are still cases where the calculation did not happen as expected. For example, take a look at the observations for Nvidia and Tesla:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.query('ticker=="NVDA" or ticker=="TSLA"')</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th><th>total_assets_lag</th></tr></thead><tbody><tr><td>71</td><td>209</td><td>NVDA</td><td>2020</td><td>3.047000e+09</td><td>1.329200e+10</td><td>NaN</td><td>NaN</td></tr><tr><td>72</td><td>210</td><td>NVDA</td><td>2021</td><td>4.141000e+09</td><td>1.731500e+10</td><td>NaN</td><td>1.329200e+10</td></tr><tr><td>73</td><td>212</td><td>NVDA</td><td>2023</td><td>4.332000e+09</td><td>4.418700e+10</td><td>1.667500e+10</td><td>1.731500e+10</td></tr><tr><td>74</td><td>213</td><td>NVDA</td><td>2024</td><td>9.752000e+09</td><td>4.118200e+10</td><td>2.691400e+10</td><td>4.418700e+10</td></tr><tr><td>75</td><td>214</td><td>NVDA</td><td>2025</td><td>4.368000e+09</td><td>6.572800e+10</td><td>2.697400e+10</td><td>4.118200e+10</td></tr><tr><td>87</td><td>256</td><td>TSLA</td><td>2020</td><td>-9.760000e+08</td><td>3.430900e+10</td><td>2.146100e+10</td><td>NaN</td></tr><tr><td>88</td><td>258</td><td>TSLA</td><td>2022</td><td>7.210000e+08</td><td>6.213100e+10</td><td>3.153600e+10</td><td>3.430900e+10</td></tr><tr><td>89</td><td>259</td><td>TSLA</td><td>2023</td><td>5.519000e+09</td><td>8.233800e+10</td><td>5.382300e+10</td><td>6.213100e+10</td></tr><tr><td>90</td><td>260</td><td>TSLA</td><td>2024</td><td>1.255600e+10</td><td>1.066180e+11</td><td>8.146200e+10</td><td>8.233800e+10</td></tr></tbody></table></figure>



<p>9 rows × 7 columns</p>



<p>When a year is missing, the current logic still takes the value from the previous row, which may be two or more years old. Because the 2022 data for NVIDIA is missing, the lagged value <code>total_assets_lag</code> for 2023 is incorrectly set equal to the total assets for 2021. Similarly, the lagged total assets for Tesla for 2022 are incorrectly set equal to the value for 2020.</p>



<h2 class="wp-block-heading">Step 4: Handle Missing Years</h2>



<p>Let’s refine the condition further. Our condition should verify that the previous row belongs to the same <code>ticker</code> and that the <code>year</code> increment equals one:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.sort_values(['ticker', 'year'])
df['total_assets_lag'] = np.where(
    (df['ticker'] == df['ticker'].shift()) &amp; 
    (df['year'] - df['year'].shift() == 1),
    df['total_assets'].shift(),
    np.nan
)</pre>



<p>Let’s inspect the DataFrame again:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.query('ticker=="NVDA" or ticker=="TSLA"')</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th><th>total_assets_lag</th></tr></thead><tbody><tr><td>71</td><td>209</td><td>NVDA</td><td>2020</td><td>3.047000e+09</td><td>1.329200e+10</td><td>NaN</td><td>NaN</td></tr><tr><td>72</td><td>210</td><td>NVDA</td><td>2021</td><td>4.141000e+09</td><td>1.731500e+10</td><td>NaN</td><td>1.329200e+10</td></tr><tr><td>73</td><td>212</td><td>NVDA</td><td>2023</td><td>4.332000e+09</td><td>4.418700e+10</td><td>1.667500e+10</td><td>NaN</td></tr><tr><td>74</td><td>213</td><td>NVDA</td><td>2024</td><td>9.752000e+09</td><td>4.118200e+10</td><td>2.691400e+10</td><td>4.418700e+10</td></tr><tr><td>75</td><td>214</td><td>NVDA</td><td>2025</td><td>4.368000e+09</td><td>6.572800e+10</td><td>2.697400e+10</td><td>4.118200e+10</td></tr><tr><td>87</td><td>256</td><td>TSLA</td><td>2020</td><td>-9.760000e+08</td><td>3.430900e+10</td><td>2.146100e+10</td><td>NaN</td></tr><tr><td>88</td><td>258</td><td>TSLA</td><td>2022</td><td>7.210000e+08</td><td>6.213100e+10</td><td>3.153600e+10</td><td>NaN</td></tr><tr><td>89</td><td>259</td><td>TSLA</td><td>2023</td><td>5.519000e+09</td><td>8.233800e+10</td><td>5.382300e+10</td><td>6.213100e+10</td></tr><tr><td>90</td><td>260</td><td>TSLA</td><td>2024</td><td>1.255600e+10</td><td>1.066180e+11</td><td>8.146200e+10</td><td>8.233800e+10</td></tr></tbody></table></figure>



<p>9 rows × 7 columns</p>



<p>This approach works, but the code becomes verbose and difficult to scale. If we want a 2-year lag, the condition becomes even more complex:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.sort_values(['ticker', 'year'])
df['total_assets_lag2'] = np.where(
    (df['ticker'] == df['ticker'].shift(2)) &amp; 
    (df['year'] - df['year'].shift(2) == 2),
    df['total_assets'].shift(2),
    np.nan
)</pre>



<h2 class="wp-block-heading">Step 5: Let Pychemist Do the Work</h2>



<p>To simplify and generalize this process, we can use <code>Pychemist</code>’s built-in lag function.</p>



<p>Note: this requires installing and importing the Pychemist library as done at the start of this tutorial.</p>



<p>We can now generate a lagged variable as follows:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.chem.lag('total_assets', 'ticker', 'year')</pre>



<p>This automatically handles company grouping and time consistency. The <code>chem.lag</code> method also handles important edge cases automatically. For instance, it only assigns lag values when both the grouping variable (e.g., <code>ticker</code>) matches <strong>and</strong> the <code>year</code> variable increments by exactly the lag interval. This means that if a year is missing in the dataset, the function will <strong>not</strong> incorrectly carry over data from a non-consecutive year, and will instead return <code>NaN</code> as expected. If the lagged column already exists, it will not be overwritten, and a warning will be issued to prevent accidental data loss.</p>
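


<p>The behaviour described above can be approximated in plain pandas. The following is a minimal sketch, not Pychemist’s actual implementation: the helper name <code>lag_sketch</code>, its signature, and the toy data are assumptions made for illustration. It relabels each observation with the year it should feed into and left-merges the result back, so a missing year yields <code>NaN</code>. It assumes each combination of <code>ticker</code> and <code>year</code> occurs only once.</p>



<pre class="urvanov-syntax-highlighter-plain-tag">import pandas as pd

def lag_sketch(df, column, group, time, shift=1):
    """Illustrative stand-in, not Pychemist's actual code."""
    # Copy the keys and the value, then move each observation forward
    # by `shift` periods so it lines up with the row it should feed.
    shifted = df[[group, time, column]].copy()
    shifted[time] = shifted[time] + shift
    suffix = '_lag' if shift == 1 else f'_lag{shift}'
    shifted = shifted.rename(columns={column: column + suffix})
    # A left merge keeps every original row; rows without an exact
    # (group, time) match receive NaN.
    return df.merge(shifted, on=[group, time], how='left')

# Hypothetical toy panel: firm "A" has no row for 2022.
toy = pd.DataFrame({
    'ticker': ['A', 'A', 'A', 'B'],
    'year': [2020, 2021, 2023, 2021],
    'total_assets': [10.0, 12.0, 15.0, 7.0],
})
result = lag_sketch(toy, 'total_assets', 'ticker', 'year')
print(result)</pre>



<p>In this toy panel only the 2021 row of firm A receives a lagged value. The 2023 row stays <code>NaN</code> because there is no 2022 observation to draw from, and firm B has no earlier year at all.</p>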



<h2 class="wp-block-heading">Step 6: Lag Multiple Columns</h2>



<p>To create multiple lagged variables at once, we can pass a list of the columns for which lagged variables should be computed. We set <code>replace=True</code> because the DataFrame already contains the lagged variable for <code>total_assets</code>. Without this argument, the function would issue a warning that this variable already exists in the DataFrame and leave it unchanged. Alternatively, we could drop this column manually first.</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.chem.lag(['total_assets', 'revenue'], 'ticker', 'year', replace=True)</pre>



<p>We can see that the results are as expected:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.query('ticker=="NVDA" or ticker=="TSLA"')</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th><th>total_assets_lag</th><th>revenue_lag</th></tr></thead><tbody><tr><td>71</td><td>209</td><td>NVDA</td><td>2020</td><td>3.047000e+09</td><td>1.329200e+10</td><td>NaN</td><td>NaN</td><td>NaN</td></tr><tr><td>72</td><td>210</td><td>NVDA</td><td>2021</td><td>4.141000e+09</td><td>1.731500e+10</td><td>NaN</td><td>1.329200e+10</td><td>NaN</td></tr><tr><td>73</td><td>212</td><td>NVDA</td><td>2023</td><td>4.332000e+09</td><td>4.418700e+10</td><td>1.667500e+10</td><td>NaN</td><td>NaN</td></tr><tr><td>74</td><td>213</td><td>NVDA</td><td>2024</td><td>9.752000e+09</td><td>4.118200e+10</td><td>2.691400e+10</td><td>4.418700e+10</td><td>1.667500e+10</td></tr><tr><td>75</td><td>214</td><td>NVDA</td><td>2025</td><td>4.368000e+09</td><td>6.572800e+10</td><td>2.697400e+10</td><td>4.118200e+10</td><td>2.691400e+10</td></tr><tr><td>87</td><td>256</td><td>TSLA</td><td>2020</td><td>-9.760000e+08</td><td>3.430900e+10</td><td>2.146100e+10</td><td>NaN</td><td>NaN</td></tr><tr><td>88</td><td>258</td><td>TSLA</td><td>2022</td><td>7.210000e+08</td><td>6.213100e+10</td><td>3.153600e+10</td><td>NaN</td><td>NaN</td></tr><tr><td>89</td><td>259</td><td>TSLA</td><td>2023</td><td>5.519000e+09</td><td>8.233800e+10</td><td>5.382300e+10</td><td>6.213100e+10</td><td>3.153600e+10</td></tr><tr><td>90</td><td>260</td><td>TSLA</td><td>2024</td><td>1.255600e+10</td><td>1.066180e+11</td><td>8.146200e+10</td><td>8.233800e+10</td><td>5.382300e+10</td></tr></tbody></table></figure>



<p>9 rows × 8 columns</p>



<p>To generate 2-year lags (or more), pass a <code>shift</code> value, here as the fourth positional argument:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.chem.lag(['total_assets', 'revenue'], 'ticker', 'year', 2)</pre>



<p>This results in the following DataFrame:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.head()</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th><th>total_assets_lag</th><th>revenue_lag</th><th>total_assets_lag2</th><th>revenue_lag2</th></tr></thead><tbody><tr><td>0</td><td>11</td><td>AAPL</td><td>2020</td><td>5.953100e+10</td><td>3.385160e+11</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr><tr><td>1</td><td>12</td><td>AAPL</td><td>2021</td><td>5.525600e+10</td><td>3.238880e+11</td><td>NaN</td><td>3.385160e+11</td><td>NaN</td><td>NaN</td><td>NaN</td></tr><tr><td>2</td><td>13</td><td>AAPL</td><td>2022</td><td>5.741100e+10</td><td>3.510020e+11</td><td>NaN</td><td>3.238880e+11</td><td>NaN</td><td>3.385160e+11</td><td>NaN</td></tr><tr><td>3</td><td>14</td><td>AAPL</td><td>2023</td><td>NaN</td><td>3.527550e+11</td><td>NaN</td><td>3.510020e+11</td><td>NaN</td><td>3.238880e+11</td><td>NaN</td></tr><tr><td>4</td><td>15</td><td>AAPL</td><td>2024</td><td>9.980300e+10</td><td>3.525830e+11</td><td>NaN</td><td>3.527550e+11</td><td>NaN</td><td>3.510020e+11</td><td>NaN</td></tr></tbody></table></figure>



<p>5 rows × 10 columns</p>



<h2 class="wp-block-heading">Step 7: Create Lead Variables</h2>



<p>Creating <strong>lead</strong> variables (i.e., future values) is just as easy. Simply use <code>.chem.lead</code> instead:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.chem.lead(['total_assets', 'revenue'], 'ticker', 'year')</pre>



<p>The DataFrame now looks as follows:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.head()</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th><th>total_assets_lag</th><th>revenue_lag</th><th>total_assets_lag2</th><th>revenue_lag2</th><th>total_assets_lead</th><th>revenue_lead</th></tr></thead><tbody><tr><td>0</td><td>11</td><td>AAPL</td><td>2020</td><td>5.953100e+10</td><td>3.385160e+11</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>3.238880e+11</td><td>NaN</td></tr><tr><td>1</td><td>12</td><td>AAPL</td><td>2021</td><td>5.525600e+10</td><td>3.238880e+11</td><td>NaN</td><td>3.385160e+11</td><td>NaN</td><td>NaN</td><td>NaN</td><td>3.510020e+11</td><td>NaN</td></tr><tr><td>2</td><td>13</td><td>AAPL</td><td>2022</td><td>5.741100e+10</td><td>3.510020e+11</td><td>NaN</td><td>3.238880e+11</td><td>NaN</td><td>3.385160e+11</td><td>NaN</td><td>3.527550e+11</td><td>NaN</td></tr><tr><td>3</td><td>14</td><td>AAPL</td><td>2023</td><td>NaN</td><td>3.527550e+11</td><td>NaN</td><td>3.510020e+11</td><td>NaN</td><td>3.238880e+11</td><td>NaN</td><td>3.525830e+11</td><td>NaN</td></tr><tr><td>4</td><td>15</td><td>AAPL</td><td>2024</td><td>9.980300e+10</td><td>3.525830e+11</td><td>NaN</td><td>3.527550e+11</td><td>NaN</td><td>3.510020e+11</td><td>NaN</td><td>NaN</td><td>NaN</td></tr></tbody></table></figure>



<p>5 rows × 12 columns</p>
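


<p>A lead is the mirror image of a lag: the value for year <em>t+1</em> is attached to the row for year <em>t</em>. The following plain-pandas sketch illustrates the idea on a hypothetical toy panel. It is an assumption about the general technique (relabel, then merge), not Pychemist’s actual code:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">import pandas as pd

# Hypothetical toy panel: firm "A" has no row for 2022.
toy = pd.DataFrame({
    'ticker': ['A', 'A', 'A'],
    'year': [2020, 2021, 2023],
    'revenue': [5.0, 6.0, 9.0],
})

# Relabel each observation with the previous year, so that the value
# for year t+1 lines up with the row for year t.
future = toy[['ticker', 'year', 'revenue']].copy()
future['year'] = future['year'] - 1
future = future.rename(columns={'revenue': 'revenue_lead'})

# The left merge keeps all original rows; no exact match means NaN.
toy = toy.merge(future, on=['ticker', 'year'], how='left')
print(toy)</pre>



<p>Only the 2020 row receives a lead value (the 2021 revenue). The 2021 row stays <code>NaN</code> because 2022 is missing, and 2023 is the last year in the panel.</p>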



<h2 class="wp-block-heading">Step 8: Compute Derived Metrics</h2>



<p>Now that we have lags, we can compute variables such as <em>Return on Assets (ROA)</em> or <em>Revenue Growth</em>:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.eval("""
    roa = net_income / ((total_assets + total_assets_lag) / 2)
    growth = (revenue - revenue_lag) / revenue_lag
    """)</pre>



<p>Inspect the results:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.query('ticker=="NVDA" or ticker=="TSLA"')</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th><th>total_assets_lag</th><th>revenue_lag</th><th>total_assets_lag2</th><th>revenue_lag2</th><th>total_assets_lead</th><th>revenue_lead</th><th>roa</th><th>growth</th></tr></thead><tbody><tr><td>71</td><td>209</td><td>NVDA</td><td>2020</td><td>3.047000e+09</td><td>1.329200e+10</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>1.731500e+10</td><td>NaN</td><td>NaN</td><td>NaN</td></tr><tr><td>72</td><td>210</td><td>NVDA</td><td>2021</td><td>4.141000e+09</td><td>1.731500e+10</td><td>NaN</td><td>1.329200e+10</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0.270592</td><td>NaN</td></tr><tr><td>73</td><td>212</td><td>NVDA</td><td>2023</td><td>4.332000e+09</td><td>4.418700e+10</td><td>1.667500e+10</td><td>NaN</td><td>NaN</td><td>1.731500e+10</td><td>NaN</td><td>4.118200e+10</td><td>2.691400e+10</td><td>NaN</td><td>NaN</td></tr><tr><td>74</td><td>213</td><td>NVDA</td><td>2024</td><td>9.752000e+09</td><td>4.118200e+10</td><td>2.691400e+10</td><td>4.418700e+10</td><td>1.667500e+10</td><td>NaN</td><td>NaN</td><td>6.572800e+10</td><td>2.697400e+10</td><td>0.228467</td><td>0.614033</td></tr><tr><td>75</td><td>214</td><td>NVDA</td><td>2025</td><td>4.368000e+09</td><td>6.572800e+10</td><td>2.697400e+10</td><td>4.118200e+10</td><td>2.691400e+10</td><td>4.418700e+10</td><td>1.667500e+10</td><td>NaN</td><td>NaN</td><td>0.081714</td><td>0.002229</td></tr><tr><td>87</td><td>256</td><td>TSLA</td><td>2020</td><td>-9.760000e+08</td><td>3.430900e+10</td><td>2.146100e+10</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr><tr><td>88</td><td>258</td><td>TSLA</td><td>2022</td><td>7.210000e+08</td><td>6.213100e+10</td><td>3.153600e+10</td><td>NaN</td><td>NaN</td><td>3.430900e+10</td><td>2.146100e+10</td><td>8.233800e+10</td><td
>5.382300e+10</td><td>NaN</td><td>NaN</td></tr><tr><td>89</td><td>259</td><td>TSLA</td><td>2023</td><td>5.519000e+09</td><td>8.233800e+10</td><td>5.382300e+10</td><td>6.213100e+10</td><td>3.153600e+10</td><td>NaN</td><td>NaN</td><td>1.066180e+11</td><td>8.146200e+10</td><td>0.076404</td><td>0.706716</td></tr><tr><td>90</td><td>260</td><td>TSLA</td><td>2024</td><td>1.255600e+10</td><td>1.066180e+11</td><td>8.146200e+10</td><td>8.233800e+10</td><td>5.382300e+10</td><td>6.213100e+10</td><td>3.153600e+10</td><td>NaN</td><td>NaN</td><td>0.132899</td><td>0.513517</td></tr></tbody></table></figure>



<p>9 rows × 14 columns</p>
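


<p>As a sanity check, two of the values can be recomputed by hand from the figures displayed in the table above. The variable names below are only used for this illustration:</p>



<pre class="urvanov-syntax-highlighter-plain-tag"># NVDA 2021, values as displayed in the table.
net_income = 4.141e9
total_assets = 1.7315e10
total_assets_lag = 1.3292e10

# ROA divides net income by the average of opening and closing assets.
roa = net_income / ((total_assets + total_assets_lag) / 2)
print(round(roa, 6))  # 0.270592

# NVDA 2024 revenue growth relative to 2023.
revenue = 2.6914e10
revenue_lag = 1.6675e10
growth = (revenue - revenue_lag) / revenue_lag
print(round(growth, 6))  # 0.614033</pre>



<p>Rows without a valid lag, such as NVDA 2023 or TSLA 2022, yield <code>NaN</code> for both metrics because any arithmetic involving <code>NaN</code> returns <code>NaN</code>.</p>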



<h2 class="wp-block-heading">Conclusion</h2>



<p>In this tutorial, we learned how to create <strong>lagged</strong> and <strong>lead</strong> variables manually using pandas, and why that approach often falls short, especially in grouped, irregular panel data.</p>



<p>Then, we saw how the <strong>Pychemist</strong> library solves these problems elegantly:</p>



<ul class="wp-block-list">
<li>Accurate grouping</li>



<li>Handles gaps in time</li>



<li>Clean, concise API</li>



<li>Easy multi-period shifts</li>



<li>Works seamlessly with pandas</li>
</ul>



<p>Whether you’re building financial models or prepping data for regression analysis, <code>Pychemist</code> streamlines your workflow and eliminates common mistakes.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://pychemist.com/creating-lag-and-lead-variables/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Mutate</title>
		<link>https://pychemist.com/mutate/</link>
					<comments>https://pychemist.com/mutate/#respond</comments>
		
		<dc:creator><![CDATA[Pychemist]]></dc:creator>
		<pubDate>Sun, 20 Jul 2025 21:38:18 +0000</pubDate>
				<category><![CDATA[Guides]]></category>
		<guid isPermaLink="false">https://pychemist.com/?p=73</guid>

					<description><![CDATA[When preparing data for analysis, it is often necessary to create new variables or modify existing values: whether to fix [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>When preparing data for analysis, it is often necessary to create new variables or modify existing values, whether to fix data entry errors, derive variables from existing ones, or flag subsets of data. In <em>pandas</em>, this is typically done using <code>df.loc</code> or <code>np.where</code>, but these methods can lead to verbose, repetitive, and hard-to-read code.</p>



<p>In this tutorial, we introduce a more readable and expressive alternative using a custom pandas accessor provided by the <code>pychemist</code> library. By leveraging <code>df.query</code> under the hood, the <code>.chem.mutate</code> accessor allows you to perform chained, conditional assignments in a cleaner and more maintainable way.</p>



<p>To demonstrate how to modify variables, we’ll use the <strong>IBM HR Analytics Employee Attrition &amp; Performance</strong> dataset, available on Kaggle:<br><a href="https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset" target="_blank" rel="noreferrer noopener">https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset</a></p>



<p>We start by importing the required libraries and loading the dataset into a DataFrame:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">import pandas as pd
import numpy as np

df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')</pre>



<p>This creates a new DataFrame <code>df</code>.</p>



<p>We can examine the first five rows of the dataset to get a sense of the variables, their names, and the types of values they contain:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.head()</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>Age</th><th>Attrition</th><th>BusinessTravel</th><th>DailyRate</th><th>Department</th><th>DistanceFromHome</th><th>Education</th><th>EducationField</th><th>EmployeeCount</th><th>EmployeeNumber</th><th>EnvironmentSatisfaction</th><th>Gender</th><th>HourlyRate</th><th>JobInvolvement</th><th>JobLevel</th><th>JobRole</th><th>JobSatisfaction</th><th>MaritalStatus</th><th>MonthlyIncome</th><th>MonthlyRate</th><th>NumCompaniesWorked</th><th>Over18</th><th>OverTime</th><th>PercentSalaryHike</th><th>PerformanceRating</th><th>RelationshipSatisfaction</th><th>StandardHours</th><th>StockOptionLevel</th><th>TotalWorkingYears</th><th>TrainingTimesLastYear</th><th>WorkLifeBalance</th><th>YearsAtCompany</th><th>YearsInCurrentRole</th><th>YearsSinceLastPromotion</th><th>YearsWithCurrManager</th></tr></thead><tbody><tr><td>0</td><td>41</td><td>Yes</td><td>Travel_Rarely</td><td>1102</td><td>Sales</td><td>1</td><td>2</td><td>Life Sciences</td><td>1</td><td>1</td><td>2</td><td>Female</td><td>94</td><td>3</td><td>2</td><td>Sales Executive</td><td>4</td><td>Single</td><td>5993</td><td>19479</td><td>8</td><td>Y</td><td>Yes</td><td>11</td><td>3</td><td>1</td><td>80</td><td>0</td><td>8</td><td>0</td><td>1</td><td>6</td><td>4</td><td>0</td><td>5</td></tr><tr><td>1</td><td>49</td><td>No</td><td>Travel_Frequently</td><td>279</td><td>Research &amp; Development</td><td>8</td><td>1</td><td>Life Sciences</td><td>1</td><td>2</td><td>3</td><td>Male</td><td>61</td><td>2</td><td>2</td><td>Research Scientist</td><td>2</td><td>Married</td><td>5130</td><td>24907</td><td>1</td><td>Y</td><td>No</td><td>23</td><td>4</td><td>4</td><td>80</td><td>1</td><td>10</td><td>3</td><td>3</td><td>10</td><td>7</td><td>1</td><td>7</td></tr><tr><td>2</td><td>37</td><td>Yes</td><td>Travel_Rarely</td><td>1373</td><td>Research &amp; 
Development</td><td>2</td><td>2</td><td>Other</td><td>1</td><td>4</td><td>4</td><td>Male</td><td>92</td><td>2</td><td>1</td><td>Laboratory Technician</td><td>3</td><td>Single</td><td>2090</td><td>2396</td><td>6</td><td>Y</td><td>Yes</td><td>15</td><td>3</td><td>2</td><td>80</td><td>0</td><td>7</td><td>3</td><td>3</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><td>3</td><td>33</td><td>No</td><td>Travel_Frequently</td><td>1392</td><td>Research &amp; Development</td><td>3</td><td>4</td><td>Life Sciences</td><td>1</td><td>5</td><td>4</td><td>Female</td><td>56</td><td>3</td><td>1</td><td>Research Scientist</td><td>3</td><td>Married</td><td>2909</td><td>23159</td><td>1</td><td>Y</td><td>Yes</td><td>11</td><td>3</td><td>3</td><td>80</td><td>0</td><td>8</td><td>3</td><td>3</td><td>8</td><td>7</td><td>3</td><td>0</td></tr><tr><td>4</td><td>27</td><td>No</td><td>Travel_Rarely</td><td>591</td><td>Research &amp; Development</td><td>2</td><td>1</td><td>Medical</td><td>1</td><td>7</td><td>1</td><td>Male</td><td>40</td><td>3</td><td>1</td><td>Laboratory Technician</td><td>2</td><td>Married</td><td>3468</td><td>16632</td><td>9</td><td>Y</td><td>No</td><td>12</td><td>3</td><td>4</td><td>80</td><td>1</td><td>6</td><td>3</td><td>3</td><td>2</td><td>2</td><td>2</td><td>2</td></tr></tbody></table></figure>



<p>5 rows × 35 columns</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Conditional Mutation with Base Pandas</h2>






<p>Imagine we want to decide which employees are eligible for a promotion. As an example, we could use NumPy’s <code>np.where</code> function to flag employees who haven’t been promoted in 3 or more years:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df['promotion'] = np.where(df['YearsSinceLastPromotion'] &gt;= 3, 1, 0)</pre>



<p>Alternatively, the same logic using <code>df.loc</code>:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df['promotion'] = 0
df.loc[df['YearsSinceLastPromotion'] &gt;= 3, 'promotion'] = 1</pre>



<p>Both approaches generate a new column <code>promotion</code> with a value of <code>1</code> for employees who haven’t been promoted in 3 or more years, and <code>0</code> otherwise.</p>



<p>These methods work well for simple logic. However, suppose we now want to promote employees who meet all the following conditions:</p>



<ul class="wp-block-list">
<li>Haven’t been promoted in <em>at least 3 years</em></li>



<li>Hold the position of <em>Manager</em></li>



<li>Have a performance rating of <em>4 or higher</em></li>
</ul>



<p>Using <code>np.where</code>, this looks like:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df['promotion'] = np.where(
    (df['YearsSinceLastPromotion'] &gt;= 3) &amp; 
    (df['JobRole'] == "Manager") &amp; 
    (df['PerformanceRating'] &gt;= 4),
    1, 0
    )</pre>



<p>Or using <code>df.loc</code>:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df['promotion'] = 0
df.loc[
    (df['YearsSinceLastPromotion'] &gt;= 3) &amp; 
    (df['JobRole'] == "Manager") &amp; 
    (df['PerformanceRating'] &gt;= 4),
    'promotion'
    ] = 1</pre>



<p>While both work, they’re increasingly verbose and harder to read and maintain, especially as the complexity of the conditions grows.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Using <code>pychemist</code> for Cleaner Mutation</h2>






<p>Instead of repeating <code>df[...]</code> and writing complex boolean logic, we can use a custom accessor for <code>pandas</code> built on top of <code>df.query</code> and <code>df.loc</code>. This method enables readable, query-style conditional logic. We’ll use the <code>pychemist</code> package for this.</p>



<p>Install the package using pip:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">pip install pychemist</pre>



<p>Then, import the library to register the accessor and perform the conditional mutation:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">import pychemist

df = df.chem.mutate('YearsSinceLastPromotion &gt;= 3 &amp; JobRole == "Manager" &amp; PerformanceRating &gt;= 4', 'Promotion', 1, 0)</pre>



<h3 class="wp-block-heading">How <code>chem.mutate</code> works</h3>






<p>The method requires:</p>



<ul class="wp-block-list">
<li><strong>Query string</strong>: A valid expression compatible with <code>df.query</code></li>



<li><strong>Column name</strong>: The name of the column to create or modify</li>



<li><strong>Value if True</strong>: The value to assign to rows that match the query</li>



<li><strong>Value if False (optional)</strong>: The value to assign to rows that <em>don’t</em> match</li>
</ul>
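


<p>To make these four arguments concrete, here is a minimal plain-pandas sketch of how such a method could be built on <code>df.query</code> and <code>df.loc</code>. The function name <code>mutate_sketch</code>, its signature, the toy data, and the behaviour when the fourth argument is omitted are assumptions for illustration, not <code>pychemist</code>’s actual code:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">import pandas as pd

def mutate_sketch(df, query, column, value_true, value_false=None):
    """Illustrative stand-in, not pychemist's actual code."""
    out = df.copy()
    # Assumption: only pre-fill the column when a fallback is given,
    # so existing values outside the match are otherwise left alone.
    if value_false is not None:
        out[column] = value_false
    # df.query returns the matching rows; their index labels tell
    # df.loc where to write (this relies on a unique index).
    matches = out.query(query).index
    out.loc[matches, column] = value_true
    return out

# Hypothetical toy data using two column names from the dataset.
toy = pd.DataFrame({
    'JobRole': ['Manager', 'Manager', 'Sales Executive'],
    'OverTime': ['Yes', 'No', 'Yes'],
})
result = mutate_sketch(toy, 'JobRole == "Manager" and OverTime == "Yes"',
                       'Flag', 1, 0)
print(result)</pre>



<p>Only the first row matches both conditions, so <code>Flag</code> is <code>1</code> there and <code>0</code> elsewhere. Returning a new DataFrame rather than modifying the input in place is what makes calls of this kind chainable.</p>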



<p>This syntax is significantly more readable and easier to chain.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Advanced Example: Chain Multiple Mutations</h2>






<p>You can also chain multiple <code>mutate</code> calls together:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = (df
    .chem.mutate('YearsSinceLastPromotion &gt;= 3 &amp; JobRole == "Manager" &amp; PerformanceRating &gt;= 4', 'Promotion', 1, 0)
    .chem.mutate('WorkLifeBalance&lt;2 &amp; OverTime=="Yes"', 'HighPressure', 'Yes', 'No')
    )</pre>



<p>This generates two new columns:</p>



<ul class="wp-block-list">
<li><code>Promotion</code>: <code>1</code> for managers with a high performance rating (4 or higher) who haven’t received a promotion for at least 3 years, else <code>0</code></li>



<li><code>HighPressure</code>: <code>'Yes'</code> for employees who score low on work-life balance and work overtime, else <code>'No'</code></li>
</ul>



<p>This chaining allows your transformations to remain tidy and declarative.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Summary</h2>






<p>This tutorial demonstrated how to use <code>pychemist</code>’s <code>.chem.mutate</code> method to simplify data manipulation in pandas. It allows:</p>



<ul class="wp-block-list">
<li>More concise and readable conditional logic</li>



<li>Easy chaining of multiple transformations</li>



<li>Reduced risk of typos and parentheses mismatches</li>
</ul>



<p>By abstracting away the boilerplate of <code>df.loc</code> and <code>np.where</code>, <code>pychemist</code> helps make your data preparation code more expressive and maintainable.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://pychemist.com/mutate/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
