Multiply and sum certain columns based on their names in pandas (Python)

Question

I have a small sample dataset:

import pandas as pd
d = {
  'measure1_x': [10,12,20,30,21],
  'measure2_x':[11,12,10,3,3],
  'measure3_x':[10,0,12,1,1],
  'measure1_y': [1,2,2,3,1],
  'measure2_y':[1,1,1,3,3],
  'measure3_y':[1,0,2,1,1]
}
df = pd.DataFrame(d)
df = df.reindex([
    'measure1_x','measure2_x', 'measure3_x','measure1_y','measure2_y','measure3_y'
], axis=1)

which looks like this:

      measure1_x  measure2_x  measure3_x  measure1_y  measure2_y  measure3_y
          10          11          10           1           1           1
          12          12           0           2           1           0
          20          10          12           2           1           2
          30           3           1           3           3           1
          21           3           1           1           3           1

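I gave the columns nearly identical names, differing only in the '_x' and '_y' suffixes, to identify which pairs should be multiplied. I want to multiply the pairs whose names match once '_x' and '_y' are ignored, then sum the products to get a total. Keep in mind that my actual dataset is huge and its columns are not in this perfect order, so the naming is how the correct pairs are identified: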

total = measure1_x * measure1_y + measure2_x * measure2_y + measure3_x * measure3_y

So the desired output is:

measure1_x  measure2_x  measure3_x  measure1_y  measure2_y  measure3_y  total
        10          11          10           1           1           1     31
        12          12           0           2           1           0     36
        20          10          12           2           1           2     74
        30           3           1           3           3           1    100
        21           3           1           1           3           1     31

My attempt and thought process so far, though I could not work out the syntax to continue:

#first identify the column names that has '_x' and '_y', then identify if 
#the column names are the same after removing '_x' and '_y', if the pair has 
#the same name then multiply them, do that for all pairs and sum the results 
#up to get the total number

for colname in df.columns:
    if "_x".lower() in colname.lower() or "_y".lower() in colname.lower():
        if "_x".lower() in colname.lower():
            colnamex = colname
        if "_y".lower() in colname.lower():
            colnamey = colname

        #if colnamex[:-2] are the same for colnamex and colnamey then multiply and sum

Answer 1

Use df.columns.str.split to generate a new MultiIndex
Use prod with the axis and level arguments
Use sum with the axis argument
Use assign to create the new column

df.assign(
    Total=df.set_axis(
        df.columns.str.split('_', expand=True),
        axis=1, inplace=False
    ).prod(axis=1, level=0).sum(1)
)

   measure1_x  measure2_x  measure3_x  measure1_y  measure2_y  measure3_y  Total
0          10          11          10           1           1           1     31
1          12          12           0           2           1           0     36
2          20          10          12           2           1           2     74
3          30           3           1           3           3           1    100
4          21           3           1           1           3           1     31
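
For readers on a recent pandas release: as far as I know, the `level=` argument of `prod` and the `inplace=` argument of `set_axis` were removed in pandas 2.0, so the one-liner above may raise a `TypeError` there. The sketch below reaches the same totals step by step by transposing and grouping on the first level of the split column names. The names `sample`, `split_cols`, `relabelled`, and `per_measure` are illustrative and not from the original answer.

```python
import pandas as pd

sample = pd.DataFrame({
    'measure1_x': [10, 12, 20, 30, 21],
    'measure2_x': [11, 12, 10, 3, 3],
    'measure3_x': [10, 0, 12, 1, 1],
    'measure1_y': [1, 2, 2, 3, 1],
    'measure2_y': [1, 1, 1, 3, 3],
    'measure3_y': [1, 0, 2, 1, 1],
})

# Split 'measure1_x' into ('measure1', 'x'); expand=True yields a MultiIndex.
split_cols = sample.columns.str.split('_', expand=True)

# set_axis returns a relabelled copy, so no inplace argument is needed.
relabelled = sample.set_axis(split_cols, axis=1)

# Transpose so the column levels become the row index, multiply within each
# measure, then transpose back: one product column per measure.
per_measure = relabelled.T.groupby(level=0).prod().T

result = sample.assign(Total=per_measure.sum(axis=1))
print(result['Total'].tolist())  # [31, 36, 74, 100, 31]
```

Because the pairing is done on the name before the underscore, the column order in the frame does not matter here.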

To restrict the data to only the columns that look like `'measure[i]_[j]'`:

df.assign(
    Total=df.filter(regex=r'^measure\d+_\w+$').pipe(
        lambda d: d.set_axis(
            d.columns.str.split('_', expand=True),
            axis=1, inplace=False
        )
    ).prod(axis=1, level=0).sum(1)
)

Debugging

See whether this gives you the correct totals:

d_ = df.copy()
d_.columns = d_.columns.str.split('_', expand=True)

d_.prod(axis=1, level=0).sum(1)

0     31
1     36
2     74
3    100
4     31
dtype: int64

Answer 2

`filter` + `np.einsum`

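Thought I'd try something a little different this time:

Get the _x and _y columns separately
Compute a sum of products. This is easy to specify (and fast) with einsum.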

import numpy as np

df = df.sort_index(axis=1) # optional, do this if your columns aren't sorted

i = df.filter(like='_x') 
j = df.filter(like='_y')
df['Total'] = np.einsum('ij,ij->i', i, j) # (i.values * j).sum(axis=1)

df
   measure1_x  measure2_x  measure3_x  measure1_y  measure2_y  measure3_y  Total
0          10          11          10           1           1           1     31
1          12          12           0           2           1           0     36
2          20          10          12           2           1           2     74
3          30           3           1           3           3           1    100
4          21           3           1           1           3           1     31
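
The subscript string `'ij,ij->i'` multiplies the two arrays element by element and sums over the column axis `j`, leaving one value per row `i`. A small NumPy-only sketch of that equivalence, using the sample values from the question (the array names `x` and `y` are illustrative):

```python
import numpy as np

# _x and _y values from the question, one row per record,
# columns ordered measure1, measure2, measure3.
x = np.array([[10, 11, 10], [12, 12, 0], [20, 10, 12], [30, 3, 1], [21, 3, 1]])
y = np.array([[1, 1, 1], [2, 1, 0], [2, 1, 2], [3, 3, 1], [1, 3, 1]])

via_einsum = np.einsum('ij,ij->i', x, y)  # row-wise dot product
via_sum = (x * y).sum(axis=1)             # the same computation, written out

print(via_einsum.tolist())  # [31, 36, 74, 100, 31]
```

Both forms only give the intended totals when column k of `x` and column k of `y` belong to the same measure, which is why the answer sorts the columns first.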

A slightly more robust version that filters out non-numeric columns and performs an assertion beforehand:

df = df.sort_index(axis=1).select_dtypes(exclude=[object])
i = df.filter(regex='.*_x') 
j = df.filter(regex='.*_y')

assert i.shape == j.shape

df['Total'] = np.einsum('ij,ij->i', i, j)

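If the assertion fails, then the assumptions that 1) your columns are numeric and 2) the numbers of x and y columns are equal, as your question suggests, do not hold for your actual dataset.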
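
The `einsum` approach relies on the `_x` and `_y` columns lining up by position after sorting. On a large, unordered dataset, a variant that pairs columns explicitly by name may be easier to audit. The sketch below is closer to the loop in the question. The helper name `paired_total` and the strict `endswith` suffix match are assumptions for illustration, not part of either answer.

```python
import pandas as pd

def paired_total(frame, x_suffix='_x', y_suffix='_y'):
    """Sum frame[base + x_suffix] * frame[base + y_suffix] over all base names.

    Raises KeyError if an x column has no y partner; y columns without an
    x partner are ignored.
    """
    # Base names that have an x column, in their original order.
    bases = [c[:-len(x_suffix)] for c in frame.columns if c.endswith(x_suffix)]
    total = pd.Series(0, index=frame.index)
    for base in bases:
        y_col = base + y_suffix
        if y_col not in frame.columns:
            raise KeyError(f'no partner column {y_col!r} for {base + x_suffix!r}')
        total = total + frame[base + x_suffix] * frame[y_col]
    return total

# Columns deliberately shuffled to show that order does not matter.
shuffled = pd.DataFrame({
    'measure3_y': [1, 0, 2, 1, 1],
    'measure1_x': [10, 12, 20, 30, 21],
    'measure2_y': [1, 1, 1, 3, 3],
    'measure3_x': [10, 0, 12, 1, 1],
    'measure1_y': [1, 2, 2, 3, 1],
    'measure2_x': [11, 12, 10, 3, 3],
})
shuffled['total'] = paired_total(shuffled)
print(shuffled['total'].tolist())  # [31, 36, 74, 100, 31]
```

This is a plain Python loop over column pairs, so it will likely be slower than the vectorised answers when there are very many pairs, but a missing partner column is reported by name instead of silently shifting the alignment.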

Answer 1: 3, 2018-05-16 17:59:51
Answer 2: 3, accepted, 2018-05-16 18:07:36