<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Guides</title>
	<atom:link href="https://pychemist.com/category/guides/feed/" rel="self" type="application/rss+xml" />
	<link>https://pychemist.com</link>
	<description>Pychemist</description>
	<lastBuildDate>Tue, 29 Jul 2025 10:10:36 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.8.2</generator>

<image>
	<url>https://pychemist.com/wp-content/uploads/2025/07/cropped-mini-logo1-01-32x32.png</url>
	<title>Guides</title>
	<link>https://pychemist.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Creating Lag and Lead Variables</title>
		<link>https://pychemist.com/creating-lag-and-lead-variables/</link>
					<comments>https://pychemist.com/creating-lag-and-lead-variables/#respond</comments>
		
		<dc:creator><![CDATA[Pychemist]]></dc:creator>
		<pubDate>Mon, 28 Jul 2025 18:35:57 +0000</pubDate>
				<category><![CDATA[Guides]]></category>
		<guid isPermaLink="false">https://pychemist.com/?p=254</guid>

					<description><![CDATA[Introduction In this tutorial, we examine how to create lagged and lead variables: essential tools for time series and panel [&#8230;]]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Introduction</h2>



<p>In this tutorial, we examine how to create <strong>lagged</strong> and <strong>lead</strong> variables: essential tools for time series and panel data analysis. Whether you’re modeling financial trends or running regressions, these transformations are often needed to compute derived variables such as <em>Return on Assets</em> or <em>Revenue Growth</em>. We’ll also explore how the <code>Pychemist</code> library simplifies this task.</p>



<p>We use <code>pandas</code> and the <code>pychemist</code> library to demonstrate this. For this tutorial we rely on a dataset of financial information for 20 large US companies, scraped from <em>EDGAR</em>. Some missing values in the dataset have been imputed for demonstration purposes.</p>



<p>While pandas lets you create lagged values using <code>.shift()</code>, a plain shift is purely positional: it ignores group boundaries and gaps in the time index, which matters for grouped data and irregular time series. Pychemist provides a simpler and more reliable alternative through its <code>.chem</code> accessor.</p>



<h2 class="wp-block-heading">Step 1: Load the Data and Libraries</h2>



<p>We start by importing the necessary libraries and loading the dataset. <code>pandas</code> and <code>numpy</code> are used for working with panel datasets. Additionally, we import <code>pychemist</code> to download the financial dataset and to enable the custom pandas extensions (accessors) used in this example.</p>



<p><strong>Note:</strong> If you haven’t installed pychemist yet, you can do so with the following command:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">pip install pychemist</pre>



<p>Import the required libraries and load the dataset into a DataFrame (df):</p>



<pre class="urvanov-syntax-highlighter-plain-tag">import pandas as pd
import numpy as np
import pychemist

df = pychemist.load('financials')</pre>



<p>Let’s inspect the first few rows:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.head()</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th></tr></thead><tbody><tr><td>0</td><td>11</td><td>AAPL</td><td>2020</td><td>5.953100e+10</td><td>3.385160e+11</td><td>NaN</td></tr><tr><td>1</td><td>12</td><td>AAPL</td><td>2021</td><td>5.525600e+10</td><td>3.238880e+11</td><td>NaN</td></tr><tr><td>2</td><td>13</td><td>AAPL</td><td>2022</td><td>5.741100e+10</td><td>3.510020e+11</td><td>NaN</td></tr><tr><td>3</td><td>14</td><td>AAPL</td><td>2023</td><td>NaN</td><td>3.527550e+11</td><td>NaN</td></tr><tr><td>4</td><td>15</td><td>AAPL</td><td>2024</td><td>9.980300e+10</td><td>3.525830e+11</td><td>NaN</td></tr></tbody></table></figure>



<p>5 rows × 6 columns</p>



<h2 class="wp-block-heading">Step 2: Create a Lag Manually</h2>



<p>First, we sort the dataset by company (<code>ticker</code>) and time (<code>year</code>). Second, we create a lagged variable <code>total_assets_lag</code> using Pandas’ built-in <code>.shift()</code> method.</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.sort_values(['ticker', 'year'])
df['total_assets_lag'] = df['total_assets'].shift()</pre>



<p>Next, we inspect the DataFrame:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.head(10)</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th><th>total_assets_lag</th></tr></thead><tbody><tr><td>0</td><td>11</td><td>AAPL</td><td>2020</td><td>5.953100e+10</td><td>3.385160e+11</td><td>NaN</td><td>NaN</td></tr><tr><td>1</td><td>12</td><td>AAPL</td><td>2021</td><td>5.525600e+10</td><td>3.238880e+11</td><td>NaN</td><td>3.385160e+11</td></tr><tr><td>2</td><td>13</td><td>AAPL</td><td>2022</td><td>5.741100e+10</td><td>3.510020e+11</td><td>NaN</td><td>3.238880e+11</td></tr><tr><td>3</td><td>14</td><td>AAPL</td><td>2023</td><td>NaN</td><td>3.527550e+11</td><td>NaN</td><td>3.510020e+11</td></tr><tr><td>4</td><td>15</td><td>AAPL</td><td>2024</td><td>9.980300e+10</td><td>3.525830e+11</td><td>NaN</td><td>3.527550e+11</td></tr><tr><td>5</td><td>27</td><td>ADBE</td><td>2020</td><td>2.591000e+09</td><td>2.076200e+10</td><td>9.030000e+09</td><td>3.525830e+11</td></tr><tr><td>6</td><td>28</td><td>ADBE</td><td>2021</td><td>2.951000e+09</td><td>2.428400e+10</td><td>1.117100e+10</td><td>2.076200e+10</td></tr><tr><td>7</td><td>29</td><td>ADBE</td><td>2022</td><td>5.260000e+09</td><td>2.724100e+10</td><td>1.286800e+10</td><td>2.428400e+10</td></tr><tr><td>8</td><td>30</td><td>ADBE</td><td>2023</td><td>4.822000e+09</td><td>2.716500e+10</td><td>1.578500e+10</td><td>2.724100e+10</td></tr><tr><td>9</td><td>31</td><td>ADBE</td><td>2024</td><td>4.756000e+09</td><td>2.977900e+10</td><td>1.760600e+10</td><td>2.716500e+10</td></tr></tbody></table></figure>



<p>10 rows × 7 columns</p>



<p><strong>Problem:</strong> A lagged version of <code>total_assets</code> has indeed been added to the DataFrame. However, this method does <strong>not</strong> take into account whether the previous row belongs to the same company. It simply shifts row-wise, even across companies. For example, <code>total_assets_lag</code> for Adobe in 2020 equals Apple’s total assets for 2024. Obviously, this is not what we want.</p>



<h2 class="wp-block-heading">Step 3: Fix Grouping Issues Manually</h2>



<p>To address this, we can conditionally shift values only if the previous row belongs to the same <code>ticker</code>:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.sort_values(['ticker', 'year'])
df['total_assets_lag'] = np.where(
    df['ticker'] == df['ticker'].shift(),
    df['total_assets'].shift(),
    np.nan
)</pre>



<p>Inspection of the first 10 rows of the DataFrame indicates that this indeed resolved the issue.</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.head(10)</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th><th>total_assets_lag</th></tr></thead><tbody><tr><td>0</td><td>11</td><td>AAPL</td><td>2020</td><td>5.953100e+10</td><td>3.385160e+11</td><td>NaN</td><td>NaN</td></tr><tr><td>1</td><td>12</td><td>AAPL</td><td>2021</td><td>5.525600e+10</td><td>3.238880e+11</td><td>NaN</td><td>3.385160e+11</td></tr><tr><td>2</td><td>13</td><td>AAPL</td><td>2022</td><td>5.741100e+10</td><td>3.510020e+11</td><td>NaN</td><td>3.238880e+11</td></tr><tr><td>3</td><td>14</td><td>AAPL</td><td>2023</td><td>NaN</td><td>3.527550e+11</td><td>NaN</td><td>3.510020e+11</td></tr><tr><td>4</td><td>15</td><td>AAPL</td><td>2024</td><td>9.980300e+10</td><td>3.525830e+11</td><td>NaN</td><td>3.527550e+11</td></tr><tr><td>5</td><td>27</td><td>ADBE</td><td>2020</td><td>2.591000e+09</td><td>2.076200e+10</td><td>9.030000e+09</td><td>NaN</td></tr><tr><td>6</td><td>28</td><td>ADBE</td><td>2021</td><td>2.951000e+09</td><td>2.428400e+10</td><td>1.117100e+10</td><td>2.076200e+10</td></tr><tr><td>7</td><td>29</td><td>ADBE</td><td>2022</td><td>5.260000e+09</td><td>2.724100e+10</td><td>1.286800e+10</td><td>2.428400e+10</td></tr><tr><td>8</td><td>30</td><td>ADBE</td><td>2023</td><td>4.822000e+09</td><td>2.716500e+10</td><td>1.578500e+10</td><td>2.724100e+10</td></tr><tr><td>9</td><td>31</td><td>ADBE</td><td>2024</td><td>4.756000e+09</td><td>2.977900e+10</td><td>1.760600e+10</td><td>2.716500e+10</td></tr></tbody></table></figure>



<p>10 rows × 7 columns</p>
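


<p>As an aside, pandas can often produce the same result with a grouped shift. The sketch below uses a small made-up panel (the <code>toy</code> DataFrame and its values are assumptions for illustration, not the tutorial dataset) to compare a plain shift with <code>groupby(...).shift()</code>:</p>



```python
import pandas as pd

# Toy panel with made-up values, already sorted by ticker and year.
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "year": [2020, 2021, 2022, 2020, 2021],
    "total_assets": [10.0, 11.0, 12.0, 50.0, 55.0],
})

# Plain shift: the first BBB row inherits the last AAA value.
toy["lag_plain"] = toy["total_assets"].shift()

# Grouped shift: each ticker is shifted on its own, so the first
# row of every ticker is NaN.
toy["lag_grouped"] = toy.groupby("ticker")["total_assets"].shift()

print(toy)
```



<p>Like the conditional version above, a grouped shift is still positional within each company, so it does not check whether the previous row is actually the previous year.</p>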



<p>However, there are still cases where the calculation did not happen as expected. For example, take a look at the observations for Nvidia and Tesla:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.query('ticker=="NVDA" or ticker=="TSLA"')</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th><th>total_assets_lag</th></tr></thead><tbody><tr><td>71</td><td>209</td><td>NVDA</td><td>2020</td><td>3.047000e+09</td><td>1.329200e+10</td><td>NaN</td><td>NaN</td></tr><tr><td>72</td><td>210</td><td>NVDA</td><td>2021</td><td>4.141000e+09</td><td>1.731500e+10</td><td>NaN</td><td>1.329200e+10</td></tr><tr><td>73</td><td>212</td><td>NVDA</td><td>2023</td><td>4.332000e+09</td><td>4.418700e+10</td><td>1.667500e+10</td><td>1.731500e+10</td></tr><tr><td>74</td><td>213</td><td>NVDA</td><td>2024</td><td>9.752000e+09</td><td>4.118200e+10</td><td>2.691400e+10</td><td>4.418700e+10</td></tr><tr><td>75</td><td>214</td><td>NVDA</td><td>2025</td><td>4.368000e+09</td><td>6.572800e+10</td><td>2.697400e+10</td><td>4.118200e+10</td></tr><tr><td>87</td><td>256</td><td>TSLA</td><td>2020</td><td>-9.760000e+08</td><td>3.430900e+10</td><td>2.146100e+10</td><td>NaN</td></tr><tr><td>88</td><td>258</td><td>TSLA</td><td>2022</td><td>7.210000e+08</td><td>6.213100e+10</td><td>3.153600e+10</td><td>3.430900e+10</td></tr><tr><td>89</td><td>259</td><td>TSLA</td><td>2023</td><td>5.519000e+09</td><td>8.233800e+10</td><td>5.382300e+10</td><td>6.213100e+10</td></tr><tr><td>90</td><td>260</td><td>TSLA</td><td>2024</td><td>1.255600e+10</td><td>1.066180e+11</td><td>8.146200e+10</td><td>8.233800e+10</td></tr></tbody></table></figure>



<p>9 rows × 7 columns</p>



<p>When a year is missing, the current logic still takes the value from the previous available row, which may be from two or more years ago. Because NVIDIA’s data for 2022 is missing, the lagged value <code>total_assets_lag</code> for 2023 is incorrectly set equal to the total assets for 2021. Similarly, the lagged total assets for Tesla for 2022 are incorrectly set equal to the value for 2020.</p>
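


<p>Before fixing the lag, it can help to find where such gaps occur. A minimal sketch, assuming a made-up <code>toy</code> panel with the same <code>ticker</code> and <code>year</code> columns, uses a grouped <code>diff()</code> to list rows whose predecessor is more than one year back:</p>



```python
import pandas as pd

# Toy panel with made-up values; AAA is missing 2022 and BBB is missing 2021.
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "year": [2020, 2021, 2023, 2020, 2022],
    "total_assets": [10.0, 11.0, 13.0, 50.0, 52.0],
}).sort_values(["ticker", "year"])

# Year difference to the previous row of the same ticker
# (NaN for the first row of each ticker).
toy["year_gap"] = toy.groupby("ticker")["year"].diff()

# Rows whose predecessor is more than one year back.
gaps = toy[toy["year_gap"].gt(1)]
print(gaps[["ticker", "year", "year_gap"]])
```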



<h2 class="wp-block-heading">Step 4: Handle Missing Years</h2>



<p>Let’s refine the condition further. Our condition should verify that the previous row belongs to the same <code>ticker</code> and that the <code>year</code> increment equals one:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.sort_values(['ticker', 'year'])
df['total_assets_lag'] = np.where(
    (df['ticker'] == df['ticker'].shift()) &amp; 
    (df['year'] - df['year'].shift() == 1),
    df['total_assets'].shift(),
    np.nan
)</pre>



<p>Let’s inspect the DataFrame again:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.query('ticker=="NVDA" or ticker=="TSLA"')</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th><th>total_assets_lag</th></tr></thead><tbody><tr><td>71</td><td>209</td><td>NVDA</td><td>2020</td><td>3.047000e+09</td><td>1.329200e+10</td><td>NaN</td><td>NaN</td></tr><tr><td>72</td><td>210</td><td>NVDA</td><td>2021</td><td>4.141000e+09</td><td>1.731500e+10</td><td>NaN</td><td>1.329200e+10</td></tr><tr><td>73</td><td>212</td><td>NVDA</td><td>2023</td><td>4.332000e+09</td><td>4.418700e+10</td><td>1.667500e+10</td><td>NaN</td></tr><tr><td>74</td><td>213</td><td>NVDA</td><td>2024</td><td>9.752000e+09</td><td>4.118200e+10</td><td>2.691400e+10</td><td>4.418700e+10</td></tr><tr><td>75</td><td>214</td><td>NVDA</td><td>2025</td><td>4.368000e+09</td><td>6.572800e+10</td><td>2.697400e+10</td><td>4.118200e+10</td></tr><tr><td>87</td><td>256</td><td>TSLA</td><td>2020</td><td>-9.760000e+08</td><td>3.430900e+10</td><td>2.146100e+10</td><td>NaN</td></tr><tr><td>88</td><td>258</td><td>TSLA</td><td>2022</td><td>7.210000e+08</td><td>6.213100e+10</td><td>3.153600e+10</td><td>NaN</td></tr><tr><td>89</td><td>259</td><td>TSLA</td><td>2023</td><td>5.519000e+09</td><td>8.233800e+10</td><td>5.382300e+10</td><td>6.213100e+10</td></tr><tr><td>90</td><td>260</td><td>TSLA</td><td>2024</td><td>1.255600e+10</td><td>1.066180e+11</td><td>8.146200e+10</td><td>8.233800e+10</td></tr></tbody></table></figure>



<p>9 rows × 7 columns</p>



<p>This approach works, but the code becomes verbose and difficult to scale. If we want a 2-year lag, the condition becomes even more complex:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.sort_values(['ticker', 'year'])
df['total_assets_lag2'] = np.where(
    (df['ticker'] == df['ticker'].shift(2)) &amp; 
    (df['year'] - df['year'].shift(2) == 2),
    df['total_assets'].shift(2),
    np.nan
)</pre>
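


<p>Note that a positional <code>shift(2)</code> only finds the value from two years earlier when the intermediate year is also present in the data. One way to avoid this in plain pandas is to match on the keys instead of on row position. The sketch below does this with a merge; <code>lag_by_merge</code> is a hypothetical helper written for this illustration (it is not part of pandas or pychemist), the data is made up, and it assumes each combination of group and time appears only once:</p>



```python
import pandas as pd


def lag_by_merge(df, column, group, time, periods=1):
    """Add `{column}_lag{periods}` by matching on (group, time - periods).

    Hypothetical helper for illustration only.
    """
    # Copy the key columns and the value, then move the time key forward
    # so each source row lines up with the row `periods` later.
    shifted = df[[group, time, column]].copy()
    shifted[time] = shifted[time] + periods
    shifted = shifted.rename(columns={column: f"{column}_lag{periods}"})
    # A left merge keeps every original row; unmatched rows get NaN.
    return df.merge(shifted, on=[group, time], how="left")


# Toy panel with made-up values; AAA is missing 2022 and BBB is missing 2021.
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "year": [2020, 2021, 2023, 2020, 2022],
    "total_assets": [10.0, 11.0, 13.0, 50.0, 52.0],
})

out = lag_by_merge(toy, "total_assets", "ticker", "year", periods=2)
print(out)
```



<p>Because the match is on <code>ticker</code> and <code>year</code>, the 2023 row for AAA picks up the 2021 value even though 2022 is absent, and rows without a matching earlier year stay <code>NaN</code>.</p>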



<h2 class="wp-block-heading">Step 5: Let Pychemist Do the Work</h2>



<p>To simplify and generalize this process, we can use <code>Pychemist</code>’s built-in lag function.</p>



<p>Note: this requires installing and importing the Pychemist library as done at the start of this tutorial.</p>



<p>We can now generate a lagged variable as follows:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.chem.lag('total_assets', 'ticker', 'year')</pre>



<p>This automatically handles company grouping and time consistency. The <code>chem.lag</code> method also handles important edge cases automatically. For instance, it only assigns lag values when both the grouping variable (e.g., <code>ticker</code>) matches <strong>and</strong> the <code>year</code> variable increments by exactly the lag interval. This means that if a year is missing in the dataset, the function will <strong>not</strong> incorrectly carry over data from a non-consecutive year, and will instead return <code>NaN</code> as expected. If the lagged column already exists, it will not be overwritten, and a warning will be issued to prevent accidental data loss.</p>



<h2 class="wp-block-heading">Step 6: Lag Multiple Columns</h2>



<p>To create multiple lagged variables at once, we can pass a list of the variables to lag. We set <code>replace=True</code> because the DataFrame already contains the lagged variable for <code>total_assets</code>. Without this argument, the function would issue a warning that this variable already exists in the DataFrame and leave it unchanged. Alternatively, we could drop this column manually.</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.chem.lag(['total_assets', 'revenue'], 'ticker', 'year', replace=True)</pre>



<p>We can see that the results are as expected:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.query('ticker=="NVDA" or ticker=="TSLA"')</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th><th>total_assets_lag</th><th>revenue_lag</th></tr></thead><tbody><tr><td>71</td><td>209</td><td>NVDA</td><td>2020</td><td>3.047000e+09</td><td>1.329200e+10</td><td>NaN</td><td>NaN</td><td>NaN</td></tr><tr><td>72</td><td>210</td><td>NVDA</td><td>2021</td><td>4.141000e+09</td><td>1.731500e+10</td><td>NaN</td><td>1.329200e+10</td><td>NaN</td></tr><tr><td>73</td><td>212</td><td>NVDA</td><td>2023</td><td>4.332000e+09</td><td>4.418700e+10</td><td>1.667500e+10</td><td>NaN</td><td>NaN</td></tr><tr><td>74</td><td>213</td><td>NVDA</td><td>2024</td><td>9.752000e+09</td><td>4.118200e+10</td><td>2.691400e+10</td><td>4.418700e+10</td><td>1.667500e+10</td></tr><tr><td>75</td><td>214</td><td>NVDA</td><td>2025</td><td>4.368000e+09</td><td>6.572800e+10</td><td>2.697400e+10</td><td>4.118200e+10</td><td>2.691400e+10</td></tr><tr><td>87</td><td>256</td><td>TSLA</td><td>2020</td><td>-9.760000e+08</td><td>3.430900e+10</td><td>2.146100e+10</td><td>NaN</td><td>NaN</td></tr><tr><td>88</td><td>258</td><td>TSLA</td><td>2022</td><td>7.210000e+08</td><td>6.213100e+10</td><td>3.153600e+10</td><td>NaN</td><td>NaN</td></tr><tr><td>89</td><td>259</td><td>TSLA</td><td>2023</td><td>5.519000e+09</td><td>8.233800e+10</td><td>5.382300e+10</td><td>6.213100e+10</td><td>3.153600e+10</td></tr><tr><td>90</td><td>260</td><td>TSLA</td><td>2024</td><td>1.255600e+10</td><td>1.066180e+11</td><td>8.146200e+10</td><td>8.233800e+10</td><td>5.382300e+10</td></tr></tbody></table></figure>



<p>9 rows × 8 columns</p>



<p>To generate 2-year lags (or more), simply pass a ‘shift’ parameter:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.chem.lag(['total_assets', 'revenue'], 'ticker', 'year', 2)</pre>



<p>This results in the following DataFrame:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.head()</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th><th>total_assets_lag</th><th>revenue_lag</th><th>total_assets_lag2</th><th>revenue_lag2</th></tr></thead><tbody><tr><td>0</td><td>11</td><td>AAPL</td><td>2020</td><td>5.953100e+10</td><td>3.385160e+11</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr><tr><td>1</td><td>12</td><td>AAPL</td><td>2021</td><td>5.525600e+10</td><td>3.238880e+11</td><td>NaN</td><td>3.385160e+11</td><td>NaN</td><td>NaN</td><td>NaN</td></tr><tr><td>2</td><td>13</td><td>AAPL</td><td>2022</td><td>5.741100e+10</td><td>3.510020e+11</td><td>NaN</td><td>3.238880e+11</td><td>NaN</td><td>3.385160e+11</td><td>NaN</td></tr><tr><td>3</td><td>14</td><td>AAPL</td><td>2023</td><td>NaN</td><td>3.527550e+11</td><td>NaN</td><td>3.510020e+11</td><td>NaN</td><td>3.238880e+11</td><td>NaN</td></tr><tr><td>4</td><td>15</td><td>AAPL</td><td>2024</td><td>9.980300e+10</td><td>3.525830e+11</td><td>NaN</td><td>3.527550e+11</td><td>NaN</td><td>3.510020e+11</td><td>NaN</td></tr></tbody></table></figure>



<p>5 rows × 10 columns</p>



<h2 class="wp-block-heading">Step 7: Create Lead Variables</h2>



<p>Creating <strong>lead</strong> variables (i.e., future values) is just as easy. Simply use <code>.chem.lead</code> instead:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.chem.lead(['total_assets', 'revenue'], 'ticker', 'year')</pre>



<p>The DataFrame would look as follows:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.head()</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th><th>total_assets_lag</th><th>revenue_lag</th><th>total_assets_lag2</th><th>revenue_lag2</th><th>total_assets_lead</th><th>revenue_lead</th></tr></thead><tbody><tr><td>0</td><td>11</td><td>AAPL</td><td>2020</td><td>5.953100e+10</td><td>3.385160e+11</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>3.238880e+11</td><td>NaN</td></tr><tr><td>1</td><td>12</td><td>AAPL</td><td>2021</td><td>5.525600e+10</td><td>3.238880e+11</td><td>NaN</td><td>3.385160e+11</td><td>NaN</td><td>NaN</td><td>NaN</td><td>3.510020e+11</td><td>NaN</td></tr><tr><td>2</td><td>13</td><td>AAPL</td><td>2022</td><td>5.741100e+10</td><td>3.510020e+11</td><td>NaN</td><td>3.238880e+11</td><td>NaN</td><td>3.385160e+11</td><td>NaN</td><td>3.527550e+11</td><td>NaN</td></tr><tr><td>3</td><td>14</td><td>AAPL</td><td>2023</td><td>NaN</td><td>3.527550e+11</td><td>NaN</td><td>3.510020e+11</td><td>NaN</td><td>3.238880e+11</td><td>NaN</td><td>3.525830e+11</td><td>NaN</td></tr><tr><td>4</td><td>15</td><td>AAPL</td><td>2024</td><td>9.980300e+10</td><td>3.525830e+11</td><td>NaN</td><td>3.527550e+11</td><td>NaN</td><td>3.510020e+11</td><td>NaN</td><td>NaN</td><td>NaN</td></tr></tbody></table></figure>



<p>5 rows × 12 columns</p>
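


<p>For comparison, a lead can also be built by hand in plain pandas with a negative grouped shift plus a check on the year. This is only a sketch on made-up data (the <code>toy</code> DataFrame is an assumption), not a description of how <code>.chem.lead</code> is implemented:</p>



```python
import pandas as pd

# Toy panel with made-up values; AAA is missing 2022.
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "year": [2020, 2021, 2023, 2020, 2021],
    "total_assets": [10.0, 11.0, 13.0, 50.0, 55.0],
}).sort_values(["ticker", "year"])

grouped = toy.groupby("ticker")
next_year = grouped["year"].shift(-1)           # year of the following row
next_value = grouped["total_assets"].shift(-1)  # value of the following row

# Keep the following value only when it is exactly one year ahead.
is_consecutive = next_year.sub(toy["year"]).eq(1)
toy["total_assets_lead"] = next_value.where(is_consecutive)

print(toy)
```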



<h2 class="wp-block-heading">Step 8: Compute Derived Metrics</h2>



<p>Now that we have lags, we can compute variables such as <em>Return on Assets (ROA)</em> or <em>Revenue Growth</em>:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = df.eval("""
    roa = net_income / ((total_assets + total_assets_lag) / 2)
    growth = (revenue - revenue_lag) / revenue_lag
    """)</pre>



<p>Inspect the results:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.query('ticker=="NVDA" or ticker=="TSLA"')</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>index</th><th>ticker</th><th>year</th><th>net_income</th><th>total_assets</th><th>revenue</th><th>total_assets_lag</th><th>revenue_lag</th><th>total_assets_lag2</th><th>revenue_lag2</th><th>total_assets_lead</th><th>revenue_lead</th><th>roa</th><th>growth</th></tr></thead><tbody><tr><td>71</td><td>209</td><td>NVDA</td><td>2020</td><td>3.047000e+09</td><td>1.329200e+10</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>1.731500e+10</td><td>NaN</td><td>NaN</td><td>NaN</td></tr><tr><td>72</td><td>210</td><td>NVDA</td><td>2021</td><td>4.141000e+09</td><td>1.731500e+10</td><td>NaN</td><td>1.329200e+10</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0.270592</td><td>NaN</td></tr><tr><td>73</td><td>212</td><td>NVDA</td><td>2023</td><td>4.332000e+09</td><td>4.418700e+10</td><td>1.667500e+10</td><td>NaN</td><td>NaN</td><td>1.731500e+10</td><td>NaN</td><td>4.118200e+10</td><td>2.691400e+10</td><td>NaN</td><td>NaN</td></tr><tr><td>74</td><td>213</td><td>NVDA</td><td>2024</td><td>9.752000e+09</td><td>4.118200e+10</td><td>2.691400e+10</td><td>4.418700e+10</td><td>1.667500e+10</td><td>NaN</td><td>NaN</td><td>6.572800e+10</td><td>2.697400e+10</td><td>0.228467</td><td>0.614033</td></tr><tr><td>75</td><td>214</td><td>NVDA</td><td>2025</td><td>4.368000e+09</td><td>6.572800e+10</td><td>2.697400e+10</td><td>4.118200e+10</td><td>2.691400e+10</td><td>4.418700e+10</td><td>1.667500e+10</td><td>NaN</td><td>NaN</td><td>0.081714</td><td>0.002229</td></tr><tr><td>87</td><td>256</td><td>TSLA</td><td>2020</td><td>-9.760000e+08</td><td>3.430900e+10</td><td>2.146100e+10</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr><tr><td>88</td><td>258</td><td>TSLA</td><td>2022</td><td>7.210000e+08</td><td>6.213100e+10</td><td>3.153600e+10</td><td>NaN</td><td>NaN</td><td>3.430900e+10</td><td>2.146100e+10</td><td>8.233800e+10</td><td
>5.382300e+10</td><td>NaN</td><td>NaN</td></tr><tr><td>89</td><td>259</td><td>TSLA</td><td>2023</td><td>5.519000e+09</td><td>8.233800e+10</td><td>5.382300e+10</td><td>6.213100e+10</td><td>3.153600e+10</td><td>NaN</td><td>NaN</td><td>1.066180e+11</td><td>8.146200e+10</td><td>0.076404</td><td>0.706716</td></tr><tr><td>90</td><td>260</td><td>TSLA</td><td>2024</td><td>1.255600e+10</td><td>1.066180e+11</td><td>8.146200e+10</td><td>8.233800e+10</td><td>5.382300e+10</td><td>6.213100e+10</td><td>3.153600e+10</td><td>NaN</td><td>NaN</td><td>0.132899</td><td>0.513517</td></tr></tbody></table></figure>



<p>9 rows × 14 columns</p>
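


<p>As a quick sanity check, the two formulas can be recomputed by hand for a single row. The snippet below copies the NVDA 2024 inputs from the table above and should reproduce the <code>roa</code> and <code>growth</code> values shown for that row:</p>



```python
# Inputs copied from the NVDA 2024 row of the table above.
net_income = 9.752e9
total_assets = 4.1182e10
total_assets_lag = 4.4187e10
revenue = 2.6914e10
revenue_lag = 1.6675e10

# ROA uses average total assets over the current and previous year.
roa = net_income / ((total_assets + total_assets_lag) / 2)

# Revenue growth relative to the previous year.
growth = (revenue - revenue_lag) / revenue_lag

print(round(roa, 6), round(growth, 6))
```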



<h2 class="wp-block-heading">Conclusion</h2>



<p>In this tutorial, we learned how to create <strong>lagged</strong> and <strong>lead</strong> variables manually using pandas, and why that approach often falls short, especially in grouped, irregular panel data.</p>



<p>Then, we saw how the <strong>Pychemist</strong> library solves these problems elegantly:</p>



<ul class="wp-block-list">
<li>Accurate grouping</li>



<li>Handles gaps in time</li>



<li>Clean, concise API</li>



<li>Easy multi-period shifts</li>



<li>Works seamlessly with pandas</li>
</ul>



<p>Whether you’re building financial models or prepping data for regression analysis, <code>Pychemist</code> streamlines your workflow and eliminates common mistakes.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://pychemist.com/creating-lag-and-lead-variables/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Mutate</title>
		<link>https://pychemist.com/mutate/</link>
					<comments>https://pychemist.com/mutate/#respond</comments>
		
		<dc:creator><![CDATA[Pychemist]]></dc:creator>
		<pubDate>Sun, 20 Jul 2025 21:38:18 +0000</pubDate>
				<category><![CDATA[Guides]]></category>
		<guid isPermaLink="false">https://pychemist.com/?p=73</guid>

					<description><![CDATA[When preparing data for analysis, it is often necessary to create new variables or modify existing values: whether to fix [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>When preparing data for analysis, it is often necessary to create new variables or modify existing values: whether to fix data entry errors, derive variables based on existing ones, or flag subsets of data. In <em>pandas</em>, this is typically done using <code>df.loc</code> or <code>np.where</code>, but these methods can lead to verbose, repetitive, and hard-to-read code.</p>



<p>In this tutorial, we introduce a more readable and expressive alternative using a custom pandas accessor provided by the <code>pychemist</code> library. By leveraging <code>df.query</code> under the hood, the <code>.chem.mutate</code> accessor allows you to perform chained, conditional assignments in a cleaner and more maintainable way.</p>



<p>To demonstrate how to modify variables, we’ll use the <strong>IBM HR Analytics Employee Attrition &amp; Performance</strong> dataset, available on Kaggle:<br><a href="https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset" target="_blank" rel="noreferrer noopener">https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset</a></p>



<p>We start by importing the required libraries and loading the dataset into a DataFrame:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">import pandas as pd
import numpy as np

df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')</pre>



<p>This creates a new DataFrame <code>df</code>.</p>



<p>We can examine the first five rows of the dataset to get a sense of the variables, their names, and the types of values they contain:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df.head()</pre>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><th></th><th>Age</th><th>Attrition</th><th>BusinessTravel</th><th>DailyRate</th><th>Department</th><th>DistanceFromHome</th><th>Education</th><th>EducationField</th><th>EmployeeCount</th><th>EmployeeNumber</th><th>EnvironmentSatisfaction</th><th>Gender</th><th>HourlyRate</th><th>JobInvolvement</th><th>JobLevel</th><th>JobRole</th><th>JobSatisfaction</th><th>MaritalStatus</th><th>MonthlyIncome</th><th>MonthlyRate</th><th>NumCompaniesWorked</th><th>Over18</th><th>OverTime</th><th>PercentSalaryHike</th><th>PerformanceRating</th><th>RelationshipSatisfaction</th><th>StandardHours</th><th>StockOptionLevel</th><th>TotalWorkingYears</th><th>TrainingTimesLastYear</th><th>WorkLifeBalance</th><th>YearsAtCompany</th><th>YearsInCurrentRole</th><th>YearsSinceLastPromotion</th><th>YearsWithCurrManager</th></tr></thead><tbody><tr><td>0</td><td>41</td><td>Yes</td><td>Travel_Rarely</td><td>1102</td><td>Sales</td><td>1</td><td>2</td><td>Life Sciences</td><td>1</td><td>1</td><td>2</td><td>Female</td><td>94</td><td>3</td><td>2</td><td>Sales Executive</td><td>4</td><td>Single</td><td>5993</td><td>19479</td><td>8</td><td>Y</td><td>Yes</td><td>11</td><td>3</td><td>1</td><td>80</td><td>0</td><td>8</td><td>0</td><td>1</td><td>6</td><td>4</td><td>0</td><td>5</td></tr><tr><td>1</td><td>49</td><td>No</td><td>Travel_Frequently</td><td>279</td><td>Research &amp; Development</td><td>8</td><td>1</td><td>Life Sciences</td><td>1</td><td>2</td><td>3</td><td>Male</td><td>61</td><td>2</td><td>2</td><td>Research Scientist</td><td>2</td><td>Married</td><td>5130</td><td>24907</td><td>1</td><td>Y</td><td>No</td><td>23</td><td>4</td><td>4</td><td>80</td><td>1</td><td>10</td><td>3</td><td>3</td><td>10</td><td>7</td><td>1</td><td>7</td></tr><tr><td>2</td><td>37</td><td>Yes</td><td>Travel_Rarely</td><td>1373</td><td>Research &amp; 
Development</td><td>2</td><td>2</td><td>Other</td><td>1</td><td>4</td><td>4</td><td>Male</td><td>92</td><td>2</td><td>1</td><td>Laboratory Technician</td><td>3</td><td>Single</td><td>2090</td><td>2396</td><td>6</td><td>Y</td><td>Yes</td><td>15</td><td>3</td><td>2</td><td>80</td><td>0</td><td>7</td><td>3</td><td>3</td><td>0</td><td>0</td><td>0</td><td>0</td></tr><tr><td>3</td><td>33</td><td>No</td><td>Travel_Frequently</td><td>1392</td><td>Research &amp; Development</td><td>3</td><td>4</td><td>Life Sciences</td><td>1</td><td>5</td><td>4</td><td>Female</td><td>56</td><td>3</td><td>1</td><td>Research Scientist</td><td>3</td><td>Married</td><td>2909</td><td>23159</td><td>1</td><td>Y</td><td>Yes</td><td>11</td><td>3</td><td>3</td><td>80</td><td>0</td><td>8</td><td>3</td><td>3</td><td>8</td><td>7</td><td>3</td><td>0</td></tr><tr><td>4</td><td>27</td><td>No</td><td>Travel_Rarely</td><td>591</td><td>Research &amp; Development</td><td>2</td><td>1</td><td>Medical</td><td>1</td><td>7</td><td>1</td><td>Male</td><td>40</td><td>3</td><td>1</td><td>Laboratory Technician</td><td>2</td><td>Married</td><td>3468</td><td>16632</td><td>9</td><td>Y</td><td>No</td><td>12</td><td>3</td><td>4</td><td>80</td><td>1</td><td>6</td><td>3</td><td>3</td><td>2</td><td>2</td><td>2</td><td>2</td></tr></tbody></table></figure>



<p>5 rows × 35 columns</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Conditional Mutation with Base Pandas</h2>






<p>Imagine we want to decide which employees are eligible for a promotion. As an example, we could use NumPy’s where function (<code>np.where</code>) to flag employees who haven’t been promoted in 3 or more years:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df['promotion'] = np.where(df['YearsSinceLastPromotion'] &gt;= 3, 1, 0)</pre>



<p>Alternatively, the same logic using <code>df.loc</code>:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df['promotion'] = 0
df.loc[df['YearsSinceLastPromotion'] &gt;= 3, 'promotion'] = 1</pre>



<p>Both approaches generate a new column <code>promotion</code> with a value of <code>1</code> for employees who haven’t been promoted in 3 or more years, and <code>0</code> otherwise.</p>



<p>These methods work well for simple logic. However, suppose we now want to promote employees who meet all the following conditions:</p>



<ul class="wp-block-list">
<li>Haven’t been promoted in <em>at least 3 years</em></li>



<li>Hold the position of <em>Manager</em></li>



<li>Have a performance rating of <em>4 or higher</em></li>
</ul>



<p>Using <code>np.where</code>, this looks like:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df['promotion'] = np.where(
    (df['YearsSinceLastPromotion'] &gt;= 3) &amp; 
    (df['JobRole'] == "Manager") &amp; 
    (df['PerformanceRating'] &gt;= 4),
    1, 0
    )</pre>



<p>Or using <code>df.loc</code>:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df['promotion'] = 0
df.loc[
    (df['YearsSinceLastPromotion'] &gt;= 3) &amp; 
    (df['JobRole'] == "Manager") &amp; 
    (df['PerformanceRating'] &gt;= 4),
    'promotion'
    ] = 1</pre>



<p>While both work, they become increasingly verbose and harder to read and maintain as the conditions grow more complex.</p>
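<p>One common way to soften this in plain pandas is to give each condition its own name before combining them. The sketch below uses made-up toy rows and hypothetical variable names (<code>overdue</code>, <code>is_manager</code>, <code>high_rating</code>); it still repeats the DataFrame name for every condition.</p>

```python
import pandas as pd

# Toy data (made up); column names mirror the tutorial's dataset.
toy = pd.DataFrame({
    "YearsSinceLastPromotion": [5, 4, 1, 3],
    "JobRole": ["Manager", "Research Scientist", "Manager", "Manager"],
    "PerformanceRating": [4, 4, 4, 3],
})

# Name each condition so the final line reads like the list of requirements.
overdue = toy["YearsSinceLastPromotion"] >= 3
is_manager = toy["JobRole"] == "Manager"
high_rating = toy["PerformanceRating"] >= 4

# Element-wise AND of the three boolean masks, converted to 1/0.
toy["promotion"] = (overdue & is_manager & high_rating).astype(int)

print(toy["promotion"].tolist())  # [1, 0, 0, 0]
```

<p>Only the first row satisfies all three conditions; each of the others fails exactly one of them.</p>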



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Using <code>pychemist</code> for Cleaner Mutation</h2>






<p>Instead of repeating <code>df[...]</code> and writing complex boolean logic, we can use a custom accessor for <code>pandas</code> built on top of <code>df.query</code> and <code>df.loc</code>. This method enables readable, query-style conditional logic. We’ll use the <code>pychemist</code> package for this.</p>



<p>Install the package using pip:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">pip install pychemist</pre>



<p>Then, import the library to register the accessor and perform the conditional mutation:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">import pychemist

df = df.chem.mutate('YearsSinceLastPromotion &gt;= 3 &amp; JobRole == "Manager" &amp; PerformanceRating &gt;= 4', 'Promotion', 1, 0)</pre>



<h3 class="wp-block-heading">How <code>chem.mutate</code> works</h3>






<p>The method takes the following arguments, in order:</p>



<ul class="wp-block-list">
<li><strong>Query string</strong>: A valid expression compatible with <code>df.query</code></li>



<li><strong>Column name</strong>: The name of the column to create or modify</li>



<li><strong>Value if True</strong>: The value to assign to rows that match the query</li>



<li><strong>Value if False (optional)</strong>: The value to assign to rows that <em>don’t</em> match</li>
</ul>



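<p>Because the condition is written once as a string, without repeated <code>df[...]</code> lookups or nested parentheses, this syntax is easier to read. And since <code>mutate</code> returns a DataFrame, calls are easy to chain.</p>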
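<p>To make the idea of an accessor built on <code>df.query</code> and <code>df.loc</code> concrete, here is a rough sketch of how such a method could be written with pandas’ <code>register_dataframe_accessor</code>. This is not <code>pychemist</code>’s actual implementation: the accessor name <code>sketch</code>, the class name, and the handling of an omitted false value are all assumptions made for illustration.</p>

```python
import pandas as pd


# Hypothetical accessor name "sketch"; NOT pychemist's real code.
@pd.api.extensions.register_dataframe_accessor("sketch")
class SketchAccessor:
    def __init__(self, pandas_obj):
        self._df = pandas_obj

    def mutate(self, query, column, value_true, value_false=None):
        # Work on a copy so the caller's DataFrame is left untouched.
        out = self._df.copy()
        # Assumption: when value_false is omitted, non-matching rows
        # keep whatever they had (or NaN if the column is new).
        if value_false is not None:
            out[column] = value_false
        # df.query selects the matching rows; df.loc writes to them.
        # This relies on the index labels being unique.
        matched = out.query(query).index
        out.loc[matched, column] = value_true
        return out


# Toy data (made up); column names mirror the tutorial's dataset.
toy = pd.DataFrame({
    "YearsSinceLastPromotion": [5, 4, 1, 3],
    "JobRole": ["Manager", "Research Scientist", "Manager", "Manager"],
    "PerformanceRating": [4, 4, 4, 3],
})

result = toy.sketch.mutate(
    'YearsSinceLastPromotion >= 3 & JobRole == "Manager" & PerformanceRating >= 4',
    "Promotion", 1, 0,
)
print(result["Promotion"].tolist())  # [1, 0, 0, 0]
```

<p>The query string needs no parentheses around each comparison because the default parser of <code>df.query</code> gives <code>&amp;</code> the same precedence as <code>and</code>, unlike regular Python expressions.</p>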



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Advanced Example: Chain Multiple Mutations</h2>






<p>You can also chain multiple <code>mutate</code> calls together:</p>



<pre class="urvanov-syntax-highlighter-plain-tag">df = (df
    .chem.mutate('YearsSinceLastPromotion &gt;= 3 &amp; JobRole == "Manager" &amp; PerformanceRating &gt;= 4', 'Promotion', 1, 0)
    .chem.mutate('WorkLifeBalance&lt;2 &amp; OverTime=="Yes"', 'HighPressure', 'Yes', 'No')
    )</pre>



<p>This generates two new columns:</p>



<ul class="wp-block-list">
<li><code>Promotion</code>: <code>1</code> for managers with a high performance rating (4 or higher) who haven’t received a promotion for at least 3 years, else <code>0</code></li>



<li><code>HighPressure</code>: <code>'Yes'</code> for employees who score low on work-life balance and work overtime, else <code>'No'</code></li>
</ul>



<p>This chaining allows your transformations to remain tidy and declarative.</p>
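<p>If you want to sanity-check what the two columns should contain, the same logic can be reproduced with plain pandas on a few made-up rows. The sketch below uses <code>DataFrame.assign</code> with <code>DataFrame.eval</code> and <code>np.where</code>; it is a comparison aid under those assumptions, not how <code>pychemist</code> computes the columns.</p>

```python
import numpy as np
import pandas as pd

# Toy data (made up); column names mirror the tutorial's dataset.
toy = pd.DataFrame({
    "YearsSinceLastPromotion": [5, 0, 4],
    "JobRole": ["Manager", "Laboratory Technician", "Manager"],
    "PerformanceRating": [4, 3, 3],
    "WorkLifeBalance": [1, 1, 3],
    "OverTime": ["Yes", "No", "Yes"],
})

checked = toy.assign(
    # eval turns the query-style string into a boolean mask.
    Promotion=lambda d: np.where(
        d.eval('YearsSinceLastPromotion >= 3 & JobRole == "Manager" & PerformanceRating >= 4'),
        1, 0,
    ),
    HighPressure=lambda d: np.where(
        d.eval('WorkLifeBalance < 2 & OverTime == "Yes"'),
        "Yes", "No",
    ),
)

print(checked["Promotion"].tolist())     # [1, 0, 0]
print(checked["HighPressure"].tolist())  # ['Yes', 'No', 'No']
```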



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading">Summary</h2>






<p>This tutorial demonstrated how to use the <code>mutate</code> method of <code>pychemist</code>’s <code>chem</code> accessor to simplify conditional column creation in pandas. It offers:</p>



<ul class="wp-block-list">
<li>More concise and readable conditional logic</li>



<li>Easy chaining of multiple transformations</li>



<li>Reduced risk of typos and mismatched parentheses</li>
</ul>



<p>By abstracting away the boilerplate of <code>df.loc</code> and <code>np.where</code>, <code>pychemist</code> helps make your data preparation code more expressive and maintainable.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://pychemist.com/mutate/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
