Write a function to perform calculations on multiple columns of a Pandas dataframe

Question

I have the following dataframe (the real one has many more columns and rows, so this is just an example):

 {'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
 'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
 'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
 'taste': {0: 1.2, 1: 4.6, 2: 35.4, 3: 0.005},
 'smell': {0: 20.0, 1: 23.0, 2: 2.1, 3: 5.3},
 'shape': {0: 0.004, 1: 0.2, 2: 0.12, 3: 11.0},
 'volume': {0: 23, 1: 23, 2: 23, 3: 23},
 'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}}

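I want to write a function that performs a calculation on specific columns of the dataframe. The calculation is in the code below. Because I only want to apply it to specific columns, I set up a list of columns, and because there is a predefined "factor" that the calculation has to take into account, I set that up as well: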

cols = ['taste', 'smell', 'shape']
factor = 72

def multiply_columns(row):
    return ((row[cols] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)

Then I apply the function to the dataframe. I want to overwrite the original column values with the new values, so I do this:

for cols in df.columns:
    df[cols] = df[cols].apply(multiply_columns)

But I get the following error:

~\AppData\Local\Temp/ipykernel_8544/3939806184.py in multiply_columns(row)
      3 
      4 def multiply_columns(row):
----> 5     return ((row[cols] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)
      6 
      7 

TypeError: string indices must be integers

But the values I am using in the calculation are not strings:

sample        object
sample id      int64
replicate      int64
taste        float64
smell        float64
shape        float64
volume         int64
weight       float64
dtype: object

The desired output would be:

{'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
 'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
 'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
 'taste': {0: 0.0074, 1: 0.028366667, 2: 0.2183, 3: 3.08333e-05},
 'smell': {0: 0.123333333, 1: 0.141833333, 2: 0.01295, 3: 0.032683333},
 'shape': {0: 2.46667e-05, 1: 0.001233333, 2: 0.00074, 3: 0.067833333},
 'volume': {0: 23, 1: 23, 2: 23, 3: 23},
 'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}}

Can anyone tell me the error of my ways?

Answer 1

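There are a few problems here.

If you want to index elements in a row, the index you are using is a string (a column name) rather than an integer (a position). Note also that df[cols].apply(...) on a single column passes each value of that column to the function, not a whole row; for the 'sample' column that value is a string, which is why indexing it with a column name raises the TypeError. To get the integer positions of the column names you are interested in, you can use: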

cols = ['taste', 'smell', 'shape']
cols_idx = [df.columns.get_loc(col) for col in cols]

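However, if I understand your question, you can do this directly on the columns, on the understanding that the operation is carried out for every row. Here is a test case that worked for me: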

import pandas as pd

df = pd.DataFrame({'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
 'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
 'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
 'taste': {0: 1.2, 1: 4.6, 2: 35.4, 3: 0.005},
 'smell': {0: 20.0, 1: 23.0, 2: 2.1, 3: 5.3},
 'shape': {0: 0.004, 1: 0.2, 2: 0.12, 3: 11.0},
 'volume': {0: 23, 1: 23, 2: 23, 3: 23},
 'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}})

cols = ['taste', 'smell', 'shape']

factor = 72

for col in cols:
    df[col] = ((df[col] / df['volume']) * (factor * df['volume'] / df['weight']) / 1000)
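The loop above can also be written without iterating over the columns. The following is a minimal sketch, not part of the original answer: it uses only the numeric columns from the example data, and the intermediate name `scale` is introduced here purely for illustration.

```python
import pandas as pd

# Numeric columns from the example data (the id/label columns are omitted here).
df = pd.DataFrame({
    "taste": [1.2, 4.6, 35.4, 0.005],
    "smell": [20.0, 23.0, 2.1, 5.3],
    "shape": [0.004, 0.2, 0.12, 11.0],
    "volume": [23, 23, 23, 23],
    "weight": [12.0, 1.3, 2.4, 3.2],
})

cols = ["taste", "smell", "shape"]
factor = 72

# Per-row multiplier (hypothetical helper name). `volume` cancels out
# algebraically, but it is kept so the code mirrors the formula as posted.
scale = factor * df["volume"] / df["weight"] / 1000

# axis=0 aligns the Series with the row index, so every selected column
# is divided and multiplied row by row in a single step.
df[cols] = df[cols].div(df["volume"], axis=0).mul(scale, axis=0)
print(df[cols])
```

Selecting `df[cols]` and using `div`/`mul` with `axis=0` keeps the calculation column-wise and avoids calling a Python function once per row.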

Note that your line

for cols in df.columns:

says to run this on every column (cols becomes the loop variable and is no longer your list).
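To make the two effects concrete, here is a small sketch on a cut-down, two-column version of the data. The helper `record_type` and the list `seen_types` are hypothetical names used only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"sample": ["orange", "banana"], "taste": [1.2, 35.4]})
cols = ["taste"]

# Series.apply calls the function once per element, so for the
# 'sample' column the argument is a plain string, not a row.
seen_types = []

def record_type(value):
    seen_types.append(type(value).__name__)
    return value

df["sample"].apply(record_type)
print(seen_types)  # every entry is 'str'

# Reusing `cols` as the loop variable rebinds the name to each column label.
for cols in df.columns:
    pass
print(cols)  # now the last column label, no longer the list

# Indexing a string with a string raises the TypeError seen in the question.
try:
    "orange"[cols]
    raised = False
except TypeError:
    raised = True
print(raised)
```

Choosing a different loop variable (such as `col`) and operating on whole columns, as in the test case above, avoids both problems.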

Answer 2

You also have to pass the column to the function.

cols = ['taste', 'smell', 'shape']
factor = 72
def multiply_columns(row,col):
    return ((row[col]/ row['volume']) * (factor * row['volume'] / row['weight']) / 1000)

for col in cols:
    df[col] = df.apply(lambda x:multiply_columns(x,col),axis=1)
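As a variation on the code above, the original one-argument function can be kept if it is applied row-wise once, with `axis=1`, so that `row[cols]` selects all three values of a row at the same time. This is a sketch under that assumption rather than the answer's own method; the `.astype(float)` call is a precaution added here because rows of a mixed-type dataframe are handled as object dtype.

```python
import pandas as pd

df = pd.DataFrame({
    "sample": ["orange", "orange", "banana", "banana"],
    "taste": [1.2, 4.6, 35.4, 0.005],
    "smell": [20.0, 23.0, 2.1, 5.3],
    "shape": [0.004, 0.2, 0.12, 11.0],
    "volume": [23, 23, 23, 23],
    "weight": [12.0, 1.3, 2.4, 3.2],
})

cols = ["taste", "smell", "shape"]
factor = 72

def multiply_columns(row):
    # With axis=1, `row` is a Series indexed by column name, so
    # row[cols] is a Series holding the three selected values.
    return (row[cols] / row["volume"]) * (factor * row["volume"] / row["weight"]) / 1000

# One pass over the rows; the result has one column per entry in `cols`.
df[cols] = df.apply(multiply_columns, axis=1).astype(float)
print(df)
```

Row-wise `apply` is usually slower than column arithmetic on large dataframes, so this form is mainly useful when the per-row logic cannot easily be expressed on whole columns.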

Also, the output I get is a bit different from the output you want, even though I used the same formula.

   sample  sample id  replicate          taste          smell          shape  volume          weight
0  orange          1          1  0.00720000000  0.12000000000  0.00002400000      23  12.00000000000
1  orange          1          2  0.25476923077  1.27384615385  0.01107692308      23   1.30000000000
2  banana          5          1  1.06200000000  0.06300000000  0.00360000000      23   2.40000000000
3  banana          5          2  0.00011250000  0.11925000000  0.24750000000      23   3.20000000000
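The difference can be checked by hand for the taste column. The sketch below evaluates the formula as posted (factor 72, each row's own weight) and then, purely as a guess, tries factor 74 with a constant weight of 12.0. Those two alternative inputs are an assumption made here, not something stated in the question; they are included only because they happen to reproduce the desired taste values.

```python
import math

taste = [1.2, 4.6, 35.4, 0.005]
weight = [12.0, 1.3, 2.4, 3.2]
volume = 23

def formula(value, volume, weight, factor):
    # Same expression as in the question, written for a single value.
    return (value / volume) * (factor * volume / weight) / 1000

# Formula as posted: factor 72 and the weight of each row.
as_posted = [formula(t, volume, w, 72) for t, w in zip(taste, weight)]
print(as_posted)

# Desired taste values copied from the question.
desired = [0.0074, 0.028366667, 0.2183, 3.08333e-05]

# Hypothetical inputs (an assumption): factor 74, weight fixed at 12.0.
guess = [formula(t, volume, 12.0, 74) for t in taste]
matches = all(math.isclose(g, d, rel_tol=1e-5) for g, d in zip(guess, desired))
print(matches)
```

If the guess holds, the desired output in the question was probably produced with different inputs from the ones shown, which would explain why the posted formula gives other numbers.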

Answer 1
1, accepted, 2022-09-28 18:20:40

Answer 2
1, 2022-09-28 18:23:16