Optimizing the runtime of splitting a DataFrame into sub-DataFrames in Python
I have a pandas DataFrame (df_main) that I am trying to split into different subsets. The dataset looks something like this:
a b c d e f
1 1 1 2 1 2 1.
2 3 2 1 2 1 2.
3 1 3 1 3 1 3.
3 2 1 3 4 1 4.
3 1 3 4 2 1 5.
2 1 2 3 4 2 6.
1 2 3 4 5 3 7.
I want to split the complete df into 3 subsets, based on each element of column a and the element that follows it.
Subset 1: increasing values of col(a), so rows 1., 2., 3.
Subset 2: value of col(a) stays constant, so rows 3., 4., 5.
Subset 3: decreasing values of col(a), so rows 5., 6., 7.
At the moment my code looks like this:
df1_new = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e', 'f'])
df2_new = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e', 'f'])
df3_new = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e', 'f'])
for j in range(len(df_main['a'])):
    if df_main['a'][j] == df_main['a'][j + 1]:
        df1_new = df1_new.append(df_main.iloc[j])
    if df_main['a'][j] > df_main['a'][j + 1]:
        df2_new = df2_new.append(df_main.iloc[j])
    if df_main['a'][j] < df_main['a'][j + 1]:
        df3_new = df3_new.append(df_main.iloc[j])
Because df_main has 1 353 419 rows, a single run currently takes around 15 hours.
Are there any options to reduce the time needed to run through the df and split it?
I have read a bit about numpy vectorization, but I am not sure whether it would be a proper workaround here.
The pattern of incrementing, decrementing and constant values can be seen here
Use Series.gt, Series.lt and Series.eq along with Series.shift to create boolean masks m1, m2 and m3, then use these masks to filter/split the dataframe into the corresponding categories increasing, decreasing and constant:
s1, s2 = df['a'].shift(), df['a'].shift(-1)
m1 = df['a'].gt(s1) | df['a'].lt(s2)
m2 = df['a'].lt(s1) | df['a'].gt(s2)
m3 = df['a'].eq(s1) | df['a'].eq(s2)
incr, decr, const = df[m1], df[m2], df[m3]
Result:
print(incr)
a b c d e f g
0 1 1 1 2 1 2 1
1 2 3 2 1 2 1 2
2 3 1 3 1 3 1 2
print(decr)
a b c d e f g
4 3 1 3 4 2 1 4
5 2 1 2 3 4 2 1
6 1 2 3 4 5 3 1
print(const)
a b c d e f g
2 3 1 3 1 3 1 2
3 3 2 1 3 4 1 3
4 3 1 3 4 2 1 4
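For anyone who wants to try this end to end, here is a minimal self-contained sketch of the mask approach, run on the sample rows from the question. A few things in it are assumptions rather than part of the original posts: the trailing row number is stored as a column named g (as the printed output above suggests), and the helper names split_by_trend and rows_touching are made up for illustration. The NumPy part is only a cross-check that the same masks can be built from np.diff; it assumes column a contains no missing values.

```python
import numpy as np
import pandas as pd

# Sample rows from the question. Storing the trailing row number as
# column "g" is an assumption made for this sketch.
df_main = pd.DataFrame({
    "a": [1, 2, 3, 3, 3, 2, 1],
    "b": [1, 3, 1, 2, 1, 1, 2],
    "c": [1, 2, 3, 1, 3, 2, 3],
    "d": [2, 1, 1, 3, 4, 3, 4],
    "e": [1, 2, 3, 4, 2, 4, 5],
    "f": [2, 1, 1, 1, 1, 2, 3],
    "g": [1, 2, 3, 4, 5, 6, 7],
})


def split_by_trend(df, col="a"):
    """Hypothetical helper: split df by how `col` changes between neighbours."""
    s = df[col]
    prev, nxt = s.shift(), s.shift(-1)  # previous and next value of each row
    # A row belongs to a category if it takes part in such a step on either
    # side. Comparisons against the NaN introduced by shift() are False.
    incr = df[s.gt(prev) | s.lt(nxt)]
    decr = df[s.lt(prev) | s.gt(nxt)]
    const = df[s.eq(prev) | s.eq(nxt)]
    return incr, decr, const


incr, decr, const = split_by_trend(df_main)

# NumPy cross-check: sign of the difference between adjacent rows
# (+1 rising, 0 flat, -1 falling), one entry per adjacent pair.
step = np.sign(np.diff(df_main["a"].to_numpy()))


def rows_touching(sign):
    """Hypothetical helper: mask of rows on either end of a step with this sign."""
    hit = step == sign
    mask = np.zeros(len(df_main), dtype=bool)
    mask[:-1] |= hit  # row where the step starts
    mask[1:] |= hit   # row where the step ends
    return mask


assert incr.index.tolist() == np.flatnonzero(rows_touching(1)).tolist()
assert decr.index.tolist() == np.flatnonzero(rows_touching(-1)).tolist()
assert const.index.tolist() == np.flatnonzero(rows_touching(0)).tolist()
```

Both variants work on whole columns at once instead of appending one row per loop iteration, which is where the original loop spends its time. Note that boundary rows (for example the row where a rising run turns flat) appear in two subsets, which matches the overlap described in the question.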