
Is there a faster way to split a pandas dataframe into two complementary parts?

Good evening all,

I have a situation where I need to split a dataframe into two complementary parts based on the value of one feature.

What I mean by this is that for every row in dataframe 1, I need a complementary row in dataframe 2 that takes on the opposite value of that specific feature.

In my source dataframe, the feature I'm referring to is stored under column "773", and it can take on values of either 0.0 or 1.0.

I came up with the following code that does this sufficiently, but it is remarkably slow. It takes about a minute to split 10,000 rows, even on my all-powerful EC2 instance.

data = chunk.iloc[:, 1:776]  # feature columns, including "773"
listy1 = []
listy2 = []

for i in range(len(data)):
    # draw a random row for dataframe 1
    random_row = data.sample(n=1).iloc[0]
    listy1.append(random_row.tolist())

    # draw a row with the opposite "773" value for dataframe 2
    if random_row["773"] == 0.0:
        x = data[data["773"] == 1.0].sample(n=1).iloc[0]
    else:
        x = data[data["773"] == 0.0].sample(n=1).iloc[0]
    listy2.append(x.tolist())

df1 = pd.DataFrame(listy1)
df2 = pd.DataFrame(listy2)

Note: I don't care about duplicate rows, because this data is being used to train a model that compares two objects to tell which one is "better."

Do you have some insight into why this is so slow, or any suggestions for making it faster?

A key concept in efficient numpy / scipy / pandas coding is using library-shipped vectorized functions whenever possible. Try to process multiple rows at once instead of iterating explicitly over rows, i.e. avoid for loops and .iterrows().
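As a toy illustration of that difference (the column name and size here are made up, not the asker's data), compare a Python-level loop over rows with a single vectorized operation on the whole column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(1_000)})

# Python-level loop: one interpreter round-trip per row (slow)
slow = [row["x"] * 2 for _, row in df.iterrows()]

# Vectorized: a single NumPy operation over the whole column (fast)
fast = df["x"].to_numpy() * 2

assert (np.asarray(slow) == fast).all()  # identical results
```

Both produce the same values, but the vectorized version dispatches one C-level operation instead of thousands of per-row Python calls.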

The implementation below is a little subtle in terms of indexing, but the vectorization idea itself is straightforward:

  1. Draw the main dataset all at once.
  2. For the complementary dataset: draw the 0-rows at once, draw the complementary 1-rows at once, then place them into the corresponding rows at once.

Code:

import pandas as pd
import numpy as np
from datetime import datetime

np.random.seed(52)  # reproducibility
n = 10000
df = pd.DataFrame(
    data={
        "773": [0,1]*int(n/2),
        "dummy1": list(range(n)),
        "dummy2": list(range(0, 10*n, 10))
    }
)

t0 = datetime.now()
print("Program begins...")
# 1. draw the main dataset
draw_idx = np.random.choice(n, n)  # draw n row positions with replacement
df_main = df.iloc[draw_idx, :].reset_index(drop=True)

# 2. draw the complementary dataset

# (1) count number of 1's and 0's
n_1 = np.count_nonzero(df_main["773"].values)  # number of 1's drawn into the main dataset
n_0 = n - n_1

# (2) split data for drawing
df_0 = df[df["773"] == 0].reset_index(drop=True)
df_1 = df[df["773"] == 1].reset_index(drop=True)

# (3) draw n_1 indexes in df_0 and n_0 indexes in df_1
idx_0 = np.random.choice(len(df_0), n_1)
idx_1 = np.random.choice(len(df_1), n_0)

# (4) broadcast the drawn rows into the complementary dataset
df_comp = df_main.copy()
mask_0 = (df_main["773"] == 0).values
df_comp.iloc[mask_0, :] = df_1.iloc[idx_1, :].values   # 1-rows where df_main has a 0
df_comp.iloc[~mask_0, :] = df_0.iloc[idx_0, :].values  # 0-rows where df_main has a 1

print(f"Program ends in {(datetime.now() - t0).total_seconds():.3f}s...")

Check:

print(df_main.head(5))
   773  dummy1  dummy2
0    0      28     280
1    1      11     110
2    1      13     130
3    1      23     230
4    0      86     860

print(df_comp.head(5))
   773  dummy1  dummy2
0    1      19     190
1    0      74     740
2    0      28     280  <- this row is complementary to df_main
3    0      60     600
4    1      37     370
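Beyond eyeballing the heads, the complementarity can be verified programmatically. The snippet below is a self-contained toy reconstruction of the answer's steps (the frame size, seed, and dummy column are made up for the check); the final assertion confirms that every row of df_comp flips the "773" value of its counterpart in df_main:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # toy seed, not the answer's
df = pd.DataFrame({"773": [0, 1] * 50, "dummy1": range(100)})

# redo the answer's steps on the toy frame
draw_idx = rng.integers(0, len(df), len(df))
df_main = df.iloc[draw_idx].reset_index(drop=True)
df_0 = df[df["773"] == 0].reset_index(drop=True)
df_1 = df[df["773"] == 1].reset_index(drop=True)

mask_0 = (df_main["773"] == 0).values
pos_0 = np.flatnonzero(mask_0)   # positions needing a 1-row
pos_1 = np.flatnonzero(~mask_0)  # positions needing a 0-row

df_comp = df_main.copy()
df_comp.iloc[pos_0] = df_1.iloc[rng.integers(0, len(df_1), len(pos_0))].values
df_comp.iloc[pos_1] = df_0.iloc[rng.integers(0, len(df_0), len(pos_1))].values

# every complementary row flips the "773" feature of its counterpart
assert (df_comp["773"].values == 1 - df_main["773"].values).all()
```

The same assertion against the full-size run above should also hold row for row.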

Efficiency gain: 14.23s -> 0.011s (ca. 1300x)
