
Pandas alternative to apply - to create new column based on multiple columns

I have a Pandas dataframe and I would like to add a new column based on the values of the other columns. A minimal example illustrating my use case is below.

import random
import pandas as pd

df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])
df

    a   b   c
---------------
0   4   5   19
1   1   2   0
2   2   5   9
3   8   2   5

x = df.sample(n=2)
x

    a   b   c
---------------
3   8   2   5
1   1   2   0

def get_new(row):
    # Pick a random 'c' from rows that match on 'b' but differ in 'a' and 'c'.
    a, b, c = row
    return random.choice(df[(df['a'] != a) & (df['b'] == b) & (df['c'] != c)]['c'].values)

y = x.apply(lambda row: get_new(row), axis=1)
x['new'] = y
x

    a   b   c   new
--------------------
3   8   2   5   0
1   1   2   0   5

Note: The original dataframe has ~4 million rows and ~6 columns. The number of rows in the sample may vary between 50 and 500. I am running on a 64-bit machine with 8 GB RAM.

The above works, except that it is quite slow (it takes about 15 seconds for me). I also tried using x.itertuples() instead of apply, and there is not much of an improvement in this case.

  1. It seems that apply (with axis=1) is slow since it does not make use of vectorized operations. Is there some way I could achieve this faster?

  2. Can the filtering (in the get_new function) be modified or made more efficient, compared to the conditional boolean masks I currently use?

  3. Can I use numpy here in some way for a speedup?

Edit: df.sample() is also quite slow, and I cannot use .iloc or .loc since I am further modifying the sample and do not wish for this to affect the original dataframe.
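(For what it's worth, an explicit .copy() is one way to take a modifiable subset of rows without affecting the original dataframe — a minimal sketch, not necessarily faster than df.sample():)

```python
import pandas as pd

df = pd.DataFrame([[4, 5, 19], [1, 2, 0], [2, 5, 9], [8, 2, 5]],
                  columns=['a', 'b', 'c'])

# .copy() detaches the selection from df, so later writes to x do not
# propagate back to the original (and avoid SettingWithCopyWarning).
x = df.iloc[[3, 1]].copy()
x['new'] = 0

# df still has only the columns 'a', 'b', 'c'.
```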

I see a reasonable performance improvement by using .loc rather than chained indexing:

import random, pandas as pd, numpy as np

df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])

df = pd.concat([df]*1000000)

x = df.sample(n=2)

def get_new(row):
    a, b, c = row
    return random.choice(df[(df['a'] != a) & (df['b'] == b) & (df['c'] != c)]['c'].values)

def get_new2(row):
    a, b, c = row
    return random.choice(df.loc[(df['a'] != a) & (df['b'] == b) & (df['c'] != c), 'c'].values)


%timeit x.apply(lambda row: get_new(row), axis=1)   # 159ms
%timeit x.apply(lambda row: get_new2(row), axis=1)  # 119ms
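Beyond the .loc change, one way to attack the per-row filtering cost (a sketch, assuming — as in the question's example — that equality on 'b' is the selective filter) is to group the candidate rows by 'b' once up front, then sample from the much smaller per-group NumPy arrays instead of scanning the full dataframe on every call:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[4, 5, 19], [1, 2, 0], [2, 5, 9], [8, 2, 5]],
                  columns=['a', 'b', 'c'])
x = df.iloc[[3, 1]].copy()

# Build the b -> candidate (a, c) index once, instead of filtering the
# whole dataframe for every sampled row.
groups = {b: g[['a', 'c']].to_numpy() for b, g in df.groupby('b')}

rng = np.random.default_rng()

def get_new_fast(row):
    a, b, c = row['a'], row['b'], row['c']
    cand = groups[b]                              # rows with matching 'b'
    mask = (cand[:, 0] != a) & (cand[:, 1] != c)  # differing 'a' and 'c'
    return rng.choice(cand[mask, 1])

x['new'] = x.apply(get_new_fast, axis=1)
```

The apply itself is still per-row, but each call now touches only the rows sharing that 'b' value, which is where the original spends most of its time on a 4-million-row dataframe.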
