
Optimizing the runtime of splitting a DataFrame into sub-DataFrames in Python

I have a pandas DataFrame (df_main) which I am trying to split into different subsets. The dataset looks something like this:

row  a  b  c  d  e  f
1.   1  1  1  2  1  2
2.   2  3  2  1  2  1
3.   3  1  3  1  3  1
4.   3  2  1  3  4  1
5.   3  1  3  4  2  1
6.   2  1  2  3  4  2
7.   1  2  3  4  5  3

I want to split the complete df into 3 subsets, based on each element of column a and the element that follows it.

Subset 1: the value of col(a) increases, so rows 1., 2., 3.

Subset 2: the value of col(a) stays constant, so rows 3., 4., 5.

Subset 3: the value of col(a) decreases, so rows 5., 6., 7.

At the moment my code looks like this:

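import pandas as pd

df1_new = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e', 'f'])
df2_new = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e', 'f'])
df3_new = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e', 'f'])

# Compare each row with the next one; stop one row early so j + 1 stays in range.
# Note: DataFrame.append was removed in pandas 2.0, so this loop needs an older pandas.
for j in range(len(df_main['a']) - 1):
    if df_main['a'][j] == df_main['a'][j + 1]:    # a stays constant
        df1_new = df1_new.append(df_main.iloc[j])
    if df_main['a'][j] > df_main['a'][j + 1]:     # a decreases
        df2_new = df2_new.append(df_main.iloc[j])
    if df_main['a'][j] < df_main['a'][j + 1]:     # a increases
        df3_new = df3_new.append(df_main.iloc[j])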

Because df_main has 1,353,419 rows, a single run currently takes around 15 hours.

Are there any options to reduce the time it takes to run through the df and split it?

I have read a bit about numpy vectorization, but I am not sure whether it would be a suitable approach here.

The pattern, based on incrementing, decrementing and constant values, can be seen here:

[image: enter image description here]

Use Series.gt, Series.lt and Series.eq together with Series.shift to create the boolean masks m1, m2 and m3, then use these masks to filter/split the dataframe into the corresponding categories increasing, decreasing and constant:

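# df is the DataFrame to split (df_main in the question)
# s1: value of a in the previous row, s2: value of a in the next row
s1, s2 = df['a'].shift(), df['a'].shift(-1)

m1 = df['a'].gt(s1) | df['a'].lt(s2)  # increasing
m2 = df['a'].lt(s1) | df['a'].gt(s2)  # decreasing
m3 = df['a'].eq(s1) | df['a'].eq(s2)  # constant

incr, decr, const = df[m1], df[m2], df[m3]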

Result:

print(incr)
   a  b  c  d  e  f  g
0  1  1  1  2  1  2  1
1  2  3  2  1  2  1  2
2  3  1  3  1  3  1  2

print(decr)
   a  b  c  d  e  f  g
4  3  1  3  4  2  1  4
5  2  1  2  3  4  2  1
6  1  2  3  4  5  3  1

print(const)
   a  b  c  d  e  f  g
2  3  1  3  1  3  1  2
3  3  2  1  3  4  1  3
4  3  1  3  4  2  1  4
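
For anyone who wants to try this on the sample from the question, here is a minimal self-contained sketch of the same shift-and-mask idea. The function name split_by_trend and its col parameter are my own names for illustration (they are not part of pandas), and the frame is rebuilt by hand from the table in the question, so it has no column g.

```python
import pandas as pd


def split_by_trend(df, col="a"):
    """Return (increasing, decreasing, constant) subsets of df based on col.

    Hypothetical helper for illustration; it wraps the masks from the answer.
    """
    prev_val = df[col].shift()     # value in the previous row (NaN for the first row)
    next_val = df[col].shift(-1)   # value in the next row (NaN for the last row)

    # A row belongs to a run if it fits the trend relative to either neighbour.
    # Comparisons against NaN evaluate to False, so the edges need no special case.
    increasing = df[col].gt(prev_val) | df[col].lt(next_val)
    decreasing = df[col].lt(prev_val) | df[col].gt(next_val)
    constant = df[col].eq(prev_val) | df[col].eq(next_val)
    return df[increasing], df[decreasing], df[constant]


# Sample data copied from the table in the question (rows 1. to 7.).
df_main = pd.DataFrame({
    "a": [1, 2, 3, 3, 3, 2, 1],
    "b": [1, 3, 1, 2, 1, 1, 2],
    "c": [1, 2, 3, 1, 3, 2, 3],
    "d": [2, 1, 1, 3, 4, 3, 4],
    "e": [1, 2, 3, 4, 2, 4, 5],
    "f": [2, 1, 1, 1, 1, 2, 3],
})

incr, decr, const = split_by_trend(df_main)
print(incr.index.tolist())
print(decr.index.tolist())
print(const.index.tolist())
```

A row that ends one run and starts the next (index 2 and index 4 here) lands in two subsets, which matches the overlap described in the question. Since the whole thing is a handful of column-wise operations, there is no Python-level loop over the rows.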
