
Is there a faster way to split a pandas dataframe into two complementary parts?

Good evening all,

I have a situation where I need to split a dataframe into two complementary parts based on the value of one feature.

What I mean by this is that for every row in dataframe 1, I need a complementary row in dataframe 2 that takes on the opposite value of that specific feature.

In my source dataframe, the feature I'm referring to is stored under column "773", and it can take on values of either 0.0 or 1.0.

I came up with the following code that does this sufficiently, but it is remarkably slow. It takes about a minute to split 10,000 rows, even on my all-powerful EC2 instance.

data = chunk.iloc[:, 1:776]  # feature columns, including "773"
listy1 = []
listy2 = []

for i in range(len(data)):
    # draw a random row for dataframe 1
    random_row = data.sample(n=1).iloc[0]
    listy1.append(random_row.tolist())

    # draw a row with the opposite "773" value for dataframe 2
    if random_row["773"] == 0.0:
        x = data[data["773"] == 1.0].sample(n=1).iloc[0]
    else:
        x = data[data["773"] == 0.0].sample(n=1).iloc[0]
    listy2.append(x.tolist())

df1 = pd.DataFrame(listy1)
df2 = pd.DataFrame(listy2)

Note: I don't care about duplicate rows, because this data is being used to train a model that compares two objects to tell which one is "better."

Do you have some insight into why this is so slow, or any suggestions for making it faster?

A key concept in efficient numpy / scipy / pandas coding is using library-shipped vectorized functions whenever possible. Try to process multiple rows at once instead of iterating explicitly over rows, i.e. avoid for loops and .iterrows().
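As a toy illustration of that difference (the column name and size here are made up, not the asker's data), compare a Python-level loop over rows with a single vectorized operation on the whole column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(1_000)})

# Python-level loop: one interpreter round-trip per row (slow)
slow = [row["x"] * 2 for _, row in df.iterrows()]

# Vectorized: a single NumPy operation over the whole column (fast)
fast = df["x"].to_numpy() * 2

assert (np.asarray(slow) == fast).all()  # identical results
```

Both produce the same values, but the vectorized version dispatches one C-level operation instead of thousands of per-row Python calls.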

The implementation below is a little subtle in terms of indexing, but the vectorization idea itself is straightforward:

  1. Draw the main dataset all at once.
  2. For the complementary dataset: draw the 0-rows at once, draw the complementary 1-rows at once, then place them into the corresponding rows at once.

Code:

import pandas as pd
import numpy as np
from datetime import datetime

np.random.seed(52)  # reproducibility
n = 10000
df = pd.DataFrame(
    data={
        "773": [0,1]*int(n/2),
        "dummy1": list(range(n)),
        "dummy2": list(range(0, 10*n, 10))
    }
)

t0 = datetime.now()
print("Program begins...")
# 1. draw the main dataset
draw_idx = np.random.choice(n, n)  # draw n row positions with replacement
df_main = df.iloc[draw_idx, :].reset_index(drop=True)

# 2. draw the complementary dataset

# (1) count number of 1's and 0's
n_1 = np.count_nonzero(df_main["773"].values)  # number of 1's drawn into the main dataset
n_0 = n - n_1

# (2) split data for drawing
df_0 = df[df["773"] == 0].reset_index(drop=True)
df_1 = df[df["773"] == 1].reset_index(drop=True)

# (3) draw n_1 indexes in df_0 and n_0 indexes in df_1
idx_0 = np.random.choice(len(df_0), n_1)
idx_1 = np.random.choice(len(df_1), n_0)

# (4) broadcast the drawn rows into the complementary dataset
df_comp = df_main.copy()
mask_0 = (df_main["773"] == 0).values
df_comp.iloc[mask_0, :] = df_1.iloc[idx_1, :].values   # 1-rows where df_main has a 0
df_comp.iloc[~mask_0, :] = df_0.iloc[idx_0, :].values  # 0-rows where df_main has a 1

print(f"Program ends in {(datetime.now() - t0).total_seconds():.3f}s...")

Check:

print(df_main.head(5))
   773  dummy1  dummy2
0    0      28     280
1    1      11     110
2    1      13     130
3    1      23     230
4    0      86     860

print(df_comp.head(5))
   773  dummy1  dummy2
0    1      19     190
1    0      74     740
2    0      28     280  <- this row is complementary to df_main
3    0      60     600
4    1      37     370
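Beyond eyeballing the heads, the complementarity can be verified programmatically. The snippet below is a self-contained toy reconstruction of the answer's steps (the frame size, seed, and dummy column are made up for the check); the final assertion confirms that every row of df_comp flips the "773" value of its counterpart in df_main:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # toy seed, not the answer's
df = pd.DataFrame({"773": [0, 1] * 50, "dummy1": range(100)})

# redo the answer's steps on the toy frame
draw_idx = rng.integers(0, len(df), len(df))
df_main = df.iloc[draw_idx].reset_index(drop=True)
df_0 = df[df["773"] == 0].reset_index(drop=True)
df_1 = df[df["773"] == 1].reset_index(drop=True)

mask_0 = (df_main["773"] == 0).values
pos_0 = np.flatnonzero(mask_0)   # positions needing a 1-row
pos_1 = np.flatnonzero(~mask_0)  # positions needing a 0-row

df_comp = df_main.copy()
df_comp.iloc[pos_0] = df_1.iloc[rng.integers(0, len(df_1), len(pos_0))].values
df_comp.iloc[pos_1] = df_0.iloc[rng.integers(0, len(df_0), len(pos_1))].values

# every complementary row flips the "773" feature of its counterpart
assert (df_comp["773"].values == 1 - df_main["773"].values).all()
```

The same assertion against the full-size run above should also hold row for row.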

Efficiency gain: 14.23s -> 0.011s (ca. 1300x)
