
How can I conditionally transform a pandas dataframe column

I have 2 columns I want to loop through, 'Volume_hedge' and 'Unit_hedge'. For each row, if the data in 'Unit_hedge' says "Thousands of Barrels per Day", I want to divide the number in "Volume_hedge" (which is in the same row as the 'Unit_hedge' that equals "Thousands of Barrels per Day") by 1000.

I've tried enumerating over both columns with an if statement inside the loop. It works for the first 2 rows but not for the rest.

df2 = DataFrame(x)
columns_to_select = ['Volume_hedge', 'Unit_hedge']
for i, row in enumerate(columns_to_select):
    if df2['Unit_hedge'].loc[i] == 'Thousands of Barrels per Day':
        new_row = df2['Volume_hedge'].loc[i] / 1000
    else:
        none
    df2['Volume_hedge'].loc[i] = new_row
print(df2[columns_to_select].loc[0:8])

Expected results:

  Volume_hedge                    Unit_hedge
0         0.03  Thousands of Barrels per Day
1        0.024  Thousands of Barrels per Day
2        0.024  Thousands of Barrels per Day
3        0.024  Thousands of Barrels per Day
4        0.024  Thousands of Barrels per Day
5        0.024  Thousands of Barrels per Day
6        0.024  Thousands of Barrels per Day
7     32850000                   (MMBtu/Bbl)
8      4404000                   (MMBtu/Bbl)

Actual Results:

 Volume_hedge                    Unit_hedge
0         0.03  Thousands of Barrels per Day
1        0.024  Thousands of Barrels per Day
2           24  Thousands of Barrels per Day
3           24  Thousands of Barrels per Day
4           24  Thousands of Barrels per Day
5           24  Thousands of Barrels per Day
6           24  Thousands of Barrels per Day
7     32850000                   (MMBtu/Bbl)
8      4404000                   (MMBtu/Bbl)

You should use np.select here:

import numpy as np

df2["Volume_hedge"] = np.select(
    [df2["Unit_hedge"].eq("Thousands of Barrels per Day")], 
    [df2["Volume_hedge"].div(1000)], 
    df2["Volume_hedge"]
)

This will divide all rows where Unit_hedge equals "Thousands of Barrels per Day" by 1000, and leave all the other rows the same.

This also has the advantage of being vectorized rather than iterative, which is faster when using pandas and numpy.
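As a quick check, here is a minimal, self-contained sketch of the same idea on made-up sample data. The values mirror a few rows from the question, and the names sample, is_kbpd, via_select and via_where are assumptions for illustration only. With a single condition, np.where should be an equivalent and slightly shorter alternative to np.select.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring a few rows from the question.
sample = pd.DataFrame({
    "Volume_hedge": [30.0, 24.0, 24.0, 32850000.0, 4404000.0],
    "Unit_hedge": [
        "Thousands of Barrels per Day",
        "Thousands of Barrels per Day",
        "Thousands of Barrels per Day",
        "(MMBtu/Bbl)",
        "(MMBtu/Bbl)",
    ],
})

# Boolean mask: True where the unit needs converting.
is_kbpd = sample["Unit_hedge"].eq("Thousands of Barrels per Day")

# np.select with one condition/choice pair plus a default ...
via_select = np.select(
    [is_kbpd], [sample["Volume_hedge"].div(1000)], sample["Volume_hedge"]
)
# ... versus np.where(condition, value_if_true, value_if_false).
via_where = np.where(is_kbpd, sample["Volume_hedge"] / 1000, sample["Volume_hedge"])

sample["Volume_hedge"] = via_where
print(sample)
```

Rows with any other unit keep their original value in both variants.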

columns_to_select is a two-element list. When you enumerate it, i only takes the values 0 and 1, so the loop body runs twice and only the first two rows are modified.
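To see why, here is a minimal sketch of what that loop actually iterates over. The list is the one from the question; the name pairs is only for illustration.

```python
columns_to_select = ['Volume_hedge', 'Unit_hedge']

# enumerate() walks the list of column names, not the rows of the dataframe,
# so it yields exactly two (index, name) pairs.
pairs = list(enumerate(columns_to_select))
print(pairs)  # [(0, 'Volume_hedge'), (1, 'Unit_hedge')]
```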

If you want to iterate through the rows, use the iterrows function instead, and only assign inside the if so that rows with other units are left untouched. Do something like:

for i, row in df2.iterrows():
    # i is the index label here, so write back with .loc on the dataframe
    if row['Unit_hedge'] == 'Thousands of Barrels per Day':
        df2.loc[i, 'Volume_hedge'] = row['Volume_hedge'] / 1000

However, a vectorized assignment with a boolean mask is a better bet than looping through each row, because iterating is very slow. Setting column values while iterating through a dataframe is also not preferred.

mask = df2['Unit_hedge'] == 'Thousands of Barrels per Day'
df2.loc[mask, 'Volume_hedge'] = df2.loc[mask, 'Volume_hedge'] / 1000

