
How to create a new column based on other rows in a pandas DataFrame?

I have a DataFrame with 200k rows, and I am trying to add columns computed from other rows that satisfy some conditions. My current approach works, but it takes a long time (about 2 hours).

Here is my code:

for index in dataset.index:
    A_id = dataset.loc[index, 'A_id']
    B_id = dataset.loc[index, 'B_id']
    C_date = dataset.loc[index, 'C_date']
    subset = dataset[
        (dataset['A_id'] == A_id) & (dataset['B_id'] == B_id) & (
                dataset['C_date'] < C_date)]
    dataset.at[index, 'D_mean'] = subset['D'].mean()
    dataset.at[index, 'E_mean'] = subset['E'].mean()

My DataFrame looks like this:

import pandas as pd

A = [1, 2, 1, 2, 1, 2]
B = [10, 20, 10, 20, 10, 20]
C = ["22-02-2019", "28-02-19", "07-03-2019", "14-03-2019", "21-12-2019", "11-10-2019"]
D = [10, 12, 21, 81, 20, 1]
E = [7, 10, 14, 31, 61, 9]

dataset = pd.DataFrame({
    'A_id': A,
    'B_id': B,
    'C_date': C,
    'D': D,
    'E': E,
})

dataset.C_date = pd.to_datetime(dataset.C_date)
dataset
Out[27]: 
   A_id  B_id     C_date   D   E
0     1    10 2019-02-22  10   7
1     2    20 2019-02-28  12  10
2     1    10 2019-07-03  21  14
3     2    20 2019-03-14  81  31
4     1    10 2019-12-21  20  61
5     2    20 2019-11-10   1   9

I would like to get this result more efficiently than with my solution:

   A_id  B_id     C_date   D   E  D_mean  E_mean
0     1    10 2019-02-22  10   7     NaN     NaN
1     2    20 2019-02-28  12  10     NaN     NaN
2     1    10 2019-07-03  21  14    10.0     7.0
3     2    20 2019-03-14  81  31    12.0    10.0
4     1    10 2019-12-21  20  61    15.5    10.5
5     2    20 2019-11-10   1   9    46.5    20.5

Do you have any ideas?

We can use a combination of functions to achieve this, most notably pd.DataFrame.rolling, which computes the moving average.

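import numpy as np

def custom_agg(group):
    group = group.copy()  # work on a copy so the caller's data is not modified
    cols = ['D', 'E']
    for col in cols:
        name = '{}_mean'.format(col)
        # Mean of all previous rows: shift by one, then average over a window
        # as long as the group; a row with one previous value falls back to it.
        group[name] = group[col].shift() \
                                .rolling(len(group[col]), min_periods=2) \
                                .mean() \
                                .fillna(group[col].iloc[0])
        # The first row of each group has no earlier rows.
        # (pd.np was removed in pandas 2.0, so numpy is used directly.)
        group.iloc[0, group.columns.get_loc(name)] = np.nan
    return group

# group_keys=False keeps the original row index on the result
dataset.groupby(['A_id', 'B_id'], as_index=False, group_keys=False).apply(custom_agg)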

   A_id  B_id     C_date   D   E  D_mean  E_mean
0     1    10 2019-02-22  10   7     NaN     NaN
1     2    20 2019-02-28  12  10     NaN     NaN
2     1    10 2019-07-03  21  14    10.0     7.0
3     2    20 2019-03-14  81  31    12.0    10.0
4     1    10 2019-12-21  20  61    15.5    10.5
5     2    20 2019-11-10   1   9    46.5    20.5

There might be an even more elegant way of doing this, but you should already see a performance increase with this method. Just make sure the data is sorted by C_date ahead of time, since this is a moving average.
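One possibly simpler variant of the same idea is sketched below: sort by C_date, shift each group by one row so the current row is excluded, and take an expanding mean. It assumes dates are distinct within each (A_id, B_id) group (rows with tied dates would see each other, unlike the strict `<` in the question). The helper name add_prior_means and the ISO-formatted sample dates are illustrative, not part of the original answer.

```python
import pandas as pd

# Sample data from the question, with dates written as they appear in its output.
dataset = pd.DataFrame({
    "A_id": [1, 2, 1, 2, 1, 2],
    "B_id": [10, 20, 10, 20, 10, 20],
    "C_date": pd.to_datetime(["2019-02-22", "2019-02-28", "2019-07-03",
                              "2019-03-14", "2019-12-21", "2019-11-10"]),
    "D": [10, 12, 21, 81, 20, 1],
    "E": [7, 10, 14, 31, 61, 9],
})

def add_prior_means(df, cols=("D", "E")):
    """Hypothetical helper: per-group mean of each column over earlier rows."""
    out = df.sort_values("C_date").copy()
    grouped = out.groupby(["A_id", "B_id"], sort=False)
    for col in cols:
        # shift() excludes the current row; expanding() averages everything before it.
        out[f"{col}_mean"] = grouped[col].transform(
            lambda s: s.shift().expanding().mean()
        )
    # Restore the original row order.
    return out.sort_index()

result = add_prior_means(dataset)
print(result)
```

The first row of each group has no earlier rows, so its mean comes out as NaN without any special handling.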

I suspected that creating subset inside the loop was expensive, and my testing showed that your algorithm processes roughly 11,000 indices per minute. I came up with an alternative algorithm that pre-sorts the data so that computing the subset becomes trivial; running it over a 200k-row dataset of random data takes under 5 minutes.

dataset.sort_values(by=['A_id', 'B_id', 'C_date'], inplace=True)
dataset.reset_index(drop=True, inplace=True)

last_A = None
last_B = None
first_index = -1
for index in dataset.index:
    A_id = dataset.loc[index, 'A_id']
    B_id = dataset.loc[index, 'B_id']
    C_date = dataset.loc[index, 'C_date']

    if (last_A != A_id) | (last_B != B_id):
        first_index = index
        last_A = A_id
        last_B = B_id

    subset = dataset[first_index:index]
    dataset.at[index, 'D_mean'] = subset['D'].mean()
    dataset.at[index, 'E_mean'] = subset['E'].mean()
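The pre-sorting idea can probably be pushed further by dropping the Python loop. Once rows are sorted within each group, the mean of all earlier rows is (running sum minus the current value) divided by the number of earlier rows. The sketch below assumes distinct dates within each group and no missing values in D or E; prior_means_cumsum is a made-up name, and the sample dates are written in ISO form for clarity.

```python
import pandas as pd

dataset = pd.DataFrame({
    "A_id": [1, 2, 1, 2, 1, 2],
    "B_id": [10, 20, 10, 20, 10, 20],
    "C_date": pd.to_datetime(["2019-02-22", "2019-02-28", "2019-07-03",
                              "2019-03-14", "2019-12-21", "2019-11-10"]),
    "D": [10, 12, 21, 81, 20, 1],
    "E": [7, 10, 14, 31, 61, 9],
})

def prior_means_cumsum(df, cols=("D", "E")):
    """Hypothetical helper: mean of earlier rows per group via running sums."""
    # Same pre-sort as the loop version, so earlier rows always come first.
    out = df.sort_values(["A_id", "B_id", "C_date"]).reset_index(drop=True)
    grouped = out.groupby(["A_id", "B_id"])
    # Number of earlier rows in the same group: 0, 1, 2, ...
    n_prior = grouped.cumcount()
    for col in cols:
        # Running sum including the current row, minus the current row itself.
        prior_sum = grouped[col].cumsum() - out[col]
        # Rows with no earlier rows get NaN instead of a division by zero.
        out[f"{col}_mean"] = prior_sum / n_prior.where(n_prior > 0)
    return out

result = prior_means_cumsum(dataset)
print(result)
```

Like the loop version, this returns the rows in sorted order with a fresh index rather than in the original order.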

Here's one way to do it using .apply:

dataset[['D_mean', 'E_mean']] = (dataset
                                 .apply(lambda row: dataset[(dataset['A_id'] == row['A_id']) &
                                                            (dataset['B_id'] == row['B_id']) &
                                                            (dataset['C_date'] < row['C_date'])
                                                            ][['D', 'E']].mean(axis=0), axis=1))

   A_id  B_id     C_date   D   E  D_mean  E_mean
0     1    10 2019-02-22  10   7     NaN     NaN
1     2    20 2019-02-28  12  10     NaN     NaN
2     1    10 2019-07-03  21  14    10.0     7.0
3     2    20 2019-03-14  81  31    12.0    10.0
4     1    10 2019-12-21  20  61    15.5    10.5
5     2    20 2019-11-10   1   9    46.5    20.5


