Fastest way to transform dataframe
I have a dataframe with 3 columns:
reading_df:
c1 c2 c3
1 1 0.104986
1 1 0.628024
0 0 0.507727
1 1 0.445931
0 1 0.867830
1 1 0.455478
1 0 0.271283
0 1 0.759124
1 0 0.382079
0 1 0.572290
For each element in the third column (c3), I must count how many rows have the same c1 and c2 and a c3 value that is smaller by less than 0.3 (these are the conditions used in my code below).
For example, with the answer written in column c4:
c1 c2 c3 c4
1 1 0.104986 0
1 1 0.628024 2
0 0 0.507727 0
1 1 0.445931 0
0 1 0.867830 2
1 1 0.455478 1
1 0 0.271283 0
0 1 0.759124 1
1 0 0.382079 1
0 1 0.572290 0
I convert the dataframe into a NumPy array and use the map function with a lambda to get the best performance.
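import numpy as np
import pandas as pd

reading_df['c4'] = np.zeros(reading_df.shape[0])
X = np.array(reading_df)
# column positions in X
n_feature1 = 0  # c1
n_feature2 = 1  # c2
n_time = 2      # c3
dT = 0.3
# for each row, count the rows with the same c1 and c2 whose c3 lies in (c3 - dT, c3)
res_map = map(lambda el: len( X[
( X[:,n_time] > (el[n_time]-dT) )
& ( X[:,n_time] < (el[n_time]) )
& ( X[:,n_feature2] == (el[n_feature2]) )
& ( X[:,n_feature1] == (el[n_feature1]) )
][:,n_time]), X)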
But when I try to convert the map object res_map into a list:
result=list(res_map)
result_dataframe=pd.DataFrame({'c4':result })
my code becomes very slow. It runs for a very long time on a big dataframe with more than 1*10^6 elements.
Which function should I use? And what are the best practices for making Python run faster?
I'm not sure what the exact logic behind your question is, but I think you want to use groupby and then calculate the diff.
If I understand your problem correctly, it's a many-to-many comparison within each group of c1 and c2.
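To make the many-to-many comparison concrete, here is a minimal sketch that compares every value of c3 with every other value in the same (c1, c2) group using NumPy broadcasting. It assumes the rule implied by the question's code (same c1 and c2, and c3 strictly between c3 - dT and c3); the helper name count_in_window is made up for illustration. The pairwise matrix needs memory proportional to the square of the group size, so this is probably only suitable for small groups.

```python
import numpy as np
import pandas as pd

# sample data copied from the question
reading_df = pd.DataFrame({
    "c1": [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    "c2": [1, 1, 0, 1, 1, 1, 0, 1, 0, 1],
    "c3": [0.104986, 0.628024, 0.507727, 0.445931, 0.867830,
           0.455478, 0.271283, 0.759124, 0.382079, 0.572290],
})
dT = 0.3  # window width, as in the question


def count_in_window(c3: pd.Series) -> pd.Series:
    """Hypothetical helper: count group members inside (x - dT, x) for each x."""
    v = c3.to_numpy()
    # pairwise[i, j] is True when v[i] - dT < v[j] < v[i]
    pairwise = (v[None, :] > v[:, None] - dT) & (v[None, :] < v[:, None])
    return pd.Series(pairwise.sum(axis=1), index=c3.index)


# apply the helper to c3 within each (c1, c2) group; the index keeps row order
reading_df["c4"] = (
    reading_df.groupby(["c1", "c2"])["c3"].transform(count_in_window).astype(int)
)
print(reading_df)
```

On the sample data this should reproduce the c4 column shown in the question.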
Here's a start for your problem which you can build on:
# first calculate the difference between rows in c3 column while applying groupby
df['difference'] = df.groupby(['c1', 'c2']).c3.diff()
# then add a count column which counts the size of each group
df['count'] = df.groupby(['c1', 'c2']).c1.transform('count')
# after that create a conditional field based on the values in the other columns
df['c4'] = np.where((df.c1 == df.c2) & (df.difference < 0.3), 1, 0)
Hope this helps in terms of speed (vectorization) and gets you closer to solving your problem.