Create a new column based on several lookup tables in Python Pandas

Question

I have a large pandas DataFrame (df_orig) and several lookup tables (also DataFrames), one for each segment in df_orig.

Here is a small sample of df_orig:

segment score1 score2 
 B3         0   700
 B1         0   120
 B1       400   950
 B1       100   220
 B1       200   320
 B1       650   340
 B5       300   400
 B5         0   320
 B1         0   240
 B1       100   360
 B1       940   700
 B3       100   340

Here is the complete lookup table for segment B5, called thresholds_b5 (there is one lookup table for each segment in the large dataset):

score1 score2   
990     220
980     280
970     200
960     260
950     260
940     200
930     240
920     220
910     220
900     220
850     120
800     220
750     220
700     120
650     200
600     220
550     220
500     240
400     240
300     260
200     300
100     320
  0     400

I want to create a new column in my large dataset, analogous to this SQL logic:

case when segment = 'B5' then
   case when score1 = 990 and score2 >= 220 then 1
        when score1 = 980 and score2 >= 280 then 1
   .
   .
   .
        else 0 end
     when segment = 'B1' then
.
.
.
else 0 end as indicator

I was able to get the correct output using a loop, based on the solution to this question:

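# work on a copy so the assignment below does not raise SettingWithCopyWarning
df_b5 = df_orig[df_orig['segment'] == 'B5'].copy()
df_b5['indicator'] = 0

# iterate over the threshold rows (enumerate(thresholds_b5) would yield column names)
for value1, value2 in thresholds_b5[['score1', 'score2']].itertuples(index=False):
    df_b5.loc[(df_b5['score1'] == value1) & (df_b5['score2'] >= value2), 'indicator'] = 1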

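However, I would need another loop to run this for every segment and then append all the resulting DataFrames together, which gets a bit messy. Also, although I only have three segments now (B1, B3, B5), I will have 20 or more in the future.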

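Is there a more concise way to do this, preferably without loops? I have been warned that loops over DataFrames tend to be slow, and given the size of my dataset, I think speed matters.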

Answer 1

If you can sort the DataFrames ahead of time, you can replace your loop example with the asof join that is new in pandas 0.19:

import pandas as pd

# select the B5 rows and sort by the asof key ahead of time
df_b5 = df_orig.query('segment == "B5"').sort_values('score2')

# sort the thresholds and set the default indicator to 1
thresholds_b5 = thresholds_b5.sort_values('score2').assign(indicator=1)

# join the tables: each row takes the threshold row with the same score1
# whose score2 is the largest value <= the row's score2
df = pd.merge_asof(df_b5, thresholds_b5, on='score2', by='score1')

# rows without a match get indicator 0
df['indicator'] = df['indicator'].fillna(0).astype(int)

Here is what I get:

  segment  score1  score2  indicator
0      B5       0     320          0
1      B5     300     400          1

If you need the original order, save the index in a new column of df_orig, then re-sort the final DataFrame by that column.
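The re-sorting step can be sketched as follows. This is a minimal example, not the only way to do it: the two B5 rows and their thresholds come from the question, while the index labels 6 and 7 and the column name orig_index are assumptions chosen for illustration.

```python
import pandas as pd

# The two B5 rows from the question; the index labels stand in for
# their (hypothetical) positions in df_orig
df_b5 = pd.DataFrame(
    {"segment": ["B5", "B5"], "score1": [300, 0], "score2": [400, 320]},
    index=[6, 7],
)

# Only the two threshold rows relevant to these score1 values
thresholds_b5 = pd.DataFrame({"score1": [300, 0], "score2": [260, 400]})

# Save the original row labels in an ordinary column before sorting
left = df_b5.reset_index().rename(columns={"index": "orig_index"})
left = left.sort_values("score2")
right = thresholds_b5.assign(indicator=1).sort_values("score2")

merged = pd.merge_asof(left, right, on="score2", by="score1")
merged["indicator"] = merged["indicator"].fillna(0).astype(int)

# Sort back on the saved column and restore it as the index
restored = merged.sort_values("orig_index").set_index("orig_index")
print(restored)
```

Because merge_asof returns a fresh 0..n-1 index, the saved column is the only record of where each row came from.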

pandas 0.19.2 added support for multiple by parameters, so you could concat all the thresholds, with a segment column set for each, and then invoke:

pd.merge_asof(df_orig, thresholds, on='score2', by=['segment', 'score1'])
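To make the multi-segment call concrete, here is a hedged end-to-end sketch. The B5 thresholds and the sample rows come from the question; the B1 and B3 thresholds, the lookup dictionary, and the variable names are made up for illustration. Both frames are sorted by score2 first, since merge_asof requires the on key to be sorted.

```python
import pandas as pd

# A few rows in the shape of df_orig from the question
df_orig = pd.DataFrame({
    "segment": ["B3", "B1", "B5", "B5"],
    "score1": [0, 400, 300, 0],
    "score2": [700, 950, 400, 320],
})

# One lookup table per segment; only the B5 values are from the question,
# the B1 and B3 thresholds are hypothetical
lookup = {
    "B5": pd.DataFrame({"score1": [300, 0], "score2": [260, 400]}),
    "B1": pd.DataFrame({"score1": [400, 0], "score2": [900, 300]}),
    "B3": pd.DataFrame({"score1": [0], "score2": [800]}),
}

# Stack the lookup tables, tagging each row with its segment
thresholds = pd.concat(
    [table.assign(segment=seg) for seg, table in lookup.items()],
    ignore_index=True,
).assign(indicator=1)

# Keep the original row order in an "index" column, then sort both sides
left = df_orig.reset_index().sort_values("score2")
right = thresholds.sort_values("score2")

merged = pd.merge_asof(left, right, on="score2", by=["segment", "score1"])
merged["indicator"] = merged["indicator"].fillna(0).astype(int)

# Restore the original row order
result = merged.sort_values("index").set_index("index")
print(result)
```

With 20 or more segments, only the lookup dictionary grows; the join itself stays a single call.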

Answer 1: score 2, accepted, 2016-10-12 18:44:45
