Populate DataFrame columns based on a condition in pandas

Question

I have two DataFrames, as shown below:

         df_input                                    df_output
id       POLL_X  POLL_Y  POLL_Z ..     id   Pass_01  Pass_02  Pass_03 .....
110101       1       2       4       110101             
110102       2       1       3       110102

The requirement is to fill in df_output based on the values in df_input:

            df_input                                    df_output
  id   POLL_X  POLL_Y  POLL_Z ....       id   Pass_01  Pass_02  Pass_03 .....
110101     1       2       3            110101     X       Y         Z  
110102     2       1       3            110102     Y       X         Z

So basically, the column names from df_input (the part after POLL_) become the cell values in df_output, with rows matched on df_input.id == df_output.id.

This is what I am trying:

def function1(df_input, number):
       dfwithCols = df_input[df_input.columns[pd.Series(df_input.columns).str.startswith('POLL_')]]
       list_cols = dfwithCols .columns[(dfwithCols == float(number)).iloc[0]]
       colValue = (dfReduced == float(index)).idxmax(axis=1)[0]
       return colValue

-- driver code --

for i in range(1,number_of_columnswithPass):
      df_output['Pass_'+i] = function1(df_input,i)

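number_of_columnswithPass is a constant that gives the total number of columns whose names start with Pass.

I also cannot iterate over every row, because that would take far too long; the solution has to be column-based or lambda-based.

Both DataFrames also contain other columns, and df_input.id == df_output.id must still match.

There can be around 40 columns in total. Some test names include POLL_DNW, POLL_DO, POLL_DOES, POLL_SIG:2,
so I have to take whatever comes after the '_', and the column numbers run like 01, 02, 03, 04 ... 10, 11 ... 21 ... 39, 40.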

Answer 1

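I assume that df_output already has the proper column names to begin with (the ones it should have after being filled).

To accomplish your task:

import re (it will be needed later).

Define the following function, which generates an output row from a source row: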

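 def genRow(row):
     ind = []   # index labels of the output row
     vals = []  # values of the output row
     # items() replaces iteritems(), which was removed in pandas 2.0
     for k, v in row.items():
         mtch = re.match('POLL_(.+)', k)
         if mtch:
             # the cell value picks the Pass_NN column; the name suffix becomes the value
             ind.append('Pass_' + str(v).zfill(2))
             vals.append(mtch.group(1))
         else:
             # any other column is passed through unchanged
             ind.append(k)
             vals.append(v)
     return pd.Series(vals, index=ind).rename(row.name)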

Note that this function "replaces" each POLL_... column with the corresponding Pass_... column and leaves the other columns unchanged.

Apply it:

 df_output = df_input.apply(genRow, axis=1).reindex(columns=df_output.columns)

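Steps:

df_input.apply(...) - generates a "preliminary" output DataFrame. Note that at this point the columns are in alphabetical order.
reindex(...) - reindexes the above DataFrame using the column names from df_output, which gives the proper column order.
df_output = - overwrites df_output with the above result.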
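
As a rough end-to-end check, the steps above can be run on a small sample. This is only a sketch: the two-row df_input and the empty df_output below are made-up data modeled on the question, and row.items() is used in place of iteritems(), which newer pandas versions no longer provide.

```python
import re
import pandas as pd

# Made-up sample data modeled on the question (an assumption, not the asker's real data)
df_input = pd.DataFrame({'id': [110101, 110102],
                         'POLL_X': [1, 2],
                         'POLL_Y': [2, 1],
                         'POLL_Z': [3, 3]})
# df_output only needs to carry the target column names here
df_output = pd.DataFrame(columns=['id', 'Pass_01', 'Pass_02', 'Pass_03'])

def genRow(row):
    ind, vals = [], []
    for k, v in row.items():
        mtch = re.match('POLL_(.+)', k)
        if mtch:
            # cell value -> target column name, column suffix -> cell value
            ind.append('Pass_' + str(v).zfill(2))
            vals.append(mtch.group(1))
        else:
            # non-POLL columns are kept as they are
            ind.append(k)
            vals.append(v)
    return pd.Series(vals, index=ind).rename(row.name)

result = df_input.apply(genRow, axis=1).reindex(columns=df_output.columns)
print(result)
```

Because genRow is called once per row, this is still row-wise work under the hood, so it may be slow on very large inputs.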

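Edit

If your input DataFrame contains repeated values in the POLL_... columns, a small modification is needed. Such a case produces an output row in which two (or more) elements share the same index label, so the whole DataFrame cannot be constructed if it contains such a row.

The remedy is to "compress" these elements into a single element under the original index label, with all the values combined into one string, e.g. a comma-separated list of the original values.

To do this, change the last line of the genRow function to: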

out = pd.Series(vals, index=ind).rename(row.name)
return out.groupby(lambda key: key).apply(lambda lst: (', '.join(sorted(lst))))
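
To see what the grouping step does in isolation, here is a minimal sketch on a hand-built Series. The labels and values are invented for illustration; they mimic a row where POLL_X and POLL_Y both held the value 1.

```python
import pandas as pd

# Hypothetical output row in which two elements share the label 'Pass_01'
out = pd.Series(['X', 'Y', 'Z'], index=['Pass_01', 'Pass_01', 'Pass_02'])

# Group by index label and join the values of each group into one string
merged = out.groupby(lambda key: key).apply(lambda lst: ', '.join(sorted(lst)))
print(merged)
```

Note that str.join only accepts strings, so if the row also carries non-string values (a numeric id, for example), they would probably need to be converted with str first.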

Answer 2

Create two DataFrames from df_input and df_output, merge them, and pivot back to get the final DataFrame:

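# create first dataframe: reshape the input from wide to long
# (judging by the output below, df is the input frame with value columns named Pass_X, Pass_Y, ...)
res1 = pd.wide_to_long(df,
                       stubnames='Pass',
                       i='id',
                       j='letters',
                       sep='_',
                       suffix='[A-Z]').reset_index()
res1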

     id letters Pass
0   110101  X   1
1   110102  X   2
2   110101  Y   2
3   110102  Y   1
4   110101  Z   4
5   110102  Z   3

#create second dataframe
res2 = (df1
        .melt('id')
        .drop('value', axis=1)
        .assign(numbers=lambda x: x.variable.str.split('_').str.get(-1))
        .astype( {'numbers': int})
       )

res2

      id    variable    numbers
0   110101  Pass_01       1
1   110102  Pass_01       1
2   110101  Pass_02       2
3   110102  Pass_02       2
4   110101  Pass_03       3
5   110102  Pass_03       3

# merge the two dataframes and pivot to get the final output

outcome = (res1
           .merge(res2,
                  left_on=['id','Pass'],
                  right_on=['id','numbers'],
                  how='right')
           .pivot(columns='variable',values='letters',index='id')
           .bfill()
           .reset_index()
           .rename_axis(columns=None)
          )

outcome

      id    Pass_01 Pass_02 Pass_03
0   110101     X       Y       Z
1   110102     Y       X       Z
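
A shorter variation on the same wide-to-long-then-pivot idea, offered as a sketch rather than as this answer's method: build the target column name directly from the numeric value and pivot, skipping the merge. It assumes the input columns use the question's POLL_ prefix and that no id repeats a value across its POLL_ columns (pivot raises an error on duplicate id/column pairs). The sample data and the helper column name variable are made up for illustration.

```python
import pandas as pd

# Made-up sample data modeled on the question
df_input = pd.DataFrame({'id': [110101, 110102],
                         'POLL_X': [1, 2],
                         'POLL_Y': [2, 1],
                         'POLL_Z': [3, 3]})

# Wide to long: one row per (id, letters) pair, numeric value in column 'POLL'
long_df = pd.wide_to_long(df_input,
                          stubnames='POLL',
                          i='id',
                          j='letters',
                          sep='_',
                          suffix='[A-Z]+').reset_index()

# Turn the numeric value into the target column name, e.g. 1 -> 'Pass_01'
long_df['variable'] = 'Pass_' + long_df['POLL'].astype(str).str.zfill(2)

# Pivot back to wide: letters become the cell values
outcome = (long_df
           .pivot(index='id', columns='variable', values='letters')
           .reset_index()
           .rename_axis(columns=None))
print(outcome)
```

The suffix pattern '[A-Z]+' is chosen so that multi-letter names such as POLL_DNW would also match.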

Answer 3

You can use stack and unstack, plus a few set_index and reset_index calls, to control which columns take part in the logic.

df2 = (df1.set_index('id') # set as index any columns that are not part of the pass logic
          # remove the prefix Pass_
          .rename(columns=lambda col: col.replace('Pass_', ''))
          # stack the dataframe into a Series and sort by pass number
          .stack().sort_values()
          # the next two methods swap the old pass index for the new pass index
          .reset_index(level=1, name='Pass')
          .set_index('Pass', append=True)['level_1']
          # reshape from a Series back to a dataframe
          .unstack()
          # rename the columns with the prefix Pass_
          .rename(columns=lambda col: f'Pass_{col:02}')
          # rename the columns axis to None
          .rename_axis(None, axis=1)
          # put the id back as a column
          .reset_index())

print (df2)
       id Pass_01 Pass_02 Pass_03
0  110101       X       Y       Z
1  110102       Y       X       Z

Note: if there are other columns that you do not want to include in the process, first set them as the index together with id, e.g. set_index(['id','col1', ...])
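
The note can be sketched as follows, assuming one extra column that should pass through untouched. The column name region, the level name letter, and the sample data are all hypothetical, and the POLL_ prefix follows the question. Naming the stacked level explicitly avoids relying on the auto-generated level_N label, which shifts when more index columns are added.

```python
import pandas as pd

# Made-up sample data with one extra column ('region') outside the pass logic
df_input = pd.DataFrame({'id': [110101, 110102],
                         'region': ['N', 'S'],
                         'POLL_X': [1, 2],
                         'POLL_Y': [2, 1],
                         'POLL_Z': [3, 3]})

df2 = (df_input.set_index(['id', 'region'])   # keep id and the extra column out of the logic
       .rename(columns=lambda col: col.replace('POLL_', ''))
       .stack()                               # Series indexed by (id, region, <letter>)
       .rename_axis(['id', 'region', 'letter'])
       # swap roles: the number becomes an index level, the letter becomes the value
       .reset_index(level='letter', name='Pass')
       .set_index('Pass', append=True)['letter']
       .unstack()                             # one column per pass number
       .rename(columns=lambda col: f'Pass_{col:02}')
       .rename_axis(None, axis=1)
       .reset_index())
print(df2)
```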
