Fastest way to transform a dataframe

Question

I have a dataframe with 3 columns:

reading_df:

    c1  c2  c3
    1   1   0.104986
    1   1   0.628024
    0   0   0.507727
    1   1   0.445931
    0   1   0.867830
    1   1   0.455478
    1   0   0.271283
    0   1   0.759124
    1   0   0.382079
    0   1   0.572290

For each element in the third column (c3), I must find how many other rows:

have the same value for c1
have the same value for c2
have a c3 value that differs from the c3 value in the given row by less than 0.3

For example, with the answer written in column c4:

    c1  c2  c3        c4
    1   1   0.104986  0
    1   1   0.628024  2
    0   0   0.507727  0
    1   1   0.445931  0
    0   1   0.867830  2
    1   1   0.455478  1
    1   0   0.271283  0
    0   1   0.759124  1
    1   0   0.382079  1
    0   1   0.572290  0

I convert the dataframe into a NumPy array and use the map function with a lambda to get the best performance.

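import numpy as np
import pandas as pd

reading_df['c4'] = np.zeros(reading_df.shape[0])

X = np.array(reading_df)

# column positions in X
n_feature1 = 0  # c1
n_feature2 = 1  # c2
n_time = 2      # c3
dT = 0.3

# for each row, count the rows with the same c1 and c2
# whose c3 lies in the open interval (c3 - dT, c3)
res_map = map(lambda el: len(X[
    (X[:, n_time] > (el[n_time] - dT))
    & (X[:, n_time] < el[n_time])
    & (X[:, n_feature2] == el[n_feature2])
    & (X[:, n_feature1] == el[n_feature1])
][:, n_time]), X)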

But when I try to convert the map object res_map into a list:

result=list(res_map)
result_dataframe=pd.DataFrame({'c4':result })

my code becomes very slow. It takes a very long time on a big dataframe with more than 1*10^6 elements.

Which function should I use? And what are the best practices for making Python run faster?

Answer 1

Not sure what the exact logic behind your question is, but I think you want to group by c1 and c2 and then calculate the diff.

If I understand your problem correctly, it's a many-to-many comparison within each group of c1 and c2.
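To make the many-to-many idea concrete, here is a minimal sketch that pairs every row with every other row of the same (c1, c2) group through a self-merge. It assumes the rule is the one the question's code implements (count the rows whose c3 lies in the open interval (c3 - dT, c3)); the column names `row` and `c3_other` are only illustrative. The merge materialises every pair inside a group, so this is probably only practical for small groups, not for 10^6 rows in four groups.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "c1": [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    "c2": [1, 1, 0, 1, 1, 1, 0, 1, 0, 1],
    "c3": [0.104986, 0.628024, 0.507727, 0.445931, 0.867830,
           0.455478, 0.271283, 0.759124, 0.382079, 0.572290],
})
dT = 0.3

# Pair every row with every row of the same (c1, c2) group.
# "row" remembers which original row each pair belongs to.
left = df.reset_index().rename(columns={"index": "row"})
pairs = left.merge(df, on=["c1", "c2"], suffixes=("", "_other"))

# Keep the pairs where the other c3 lies in (c3 - dT, c3).
in_window = (pairs["c3_other"] > pairs["c3"] - dT) & (pairs["c3_other"] < pairs["c3"])
close = pairs[in_window]

# Count the surviving pairs per original row; rows without a match get 0.
df["c4"] = close.groupby("row").size().reindex(df.index, fill_value=0)
print(df)
```

On the ten sample rows this should reproduce the c4 column from the question.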

Here's a start for your problem which you can build on:

# first calculate the difference between rows in c3 column while applying groupby
df['difference'] = df.groupby(['c1', 'c2']).c3.diff()

# then add a count column which counts the size of each group
df['count'] = df.groupby(['c1', 'c2']).c1.transform('count')

# after that create a conditional field based on the values in the other columns
df['c4'] = np.where((df.c1 == df.c2) & (df.difference < 0.3), 1, 0)

Hope this helps in terms of speed (vectorization) and gets you closer to solving your problem.
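If the goal is exactly the count the question's code computes (rows in the same (c1, c2) group whose c3 lies in the open interval (c3 - dT, c3)), one possible way to avoid scanning the whole array once per row is to sort each group's c3 values and use `np.searchsorted`. The sketch below is written under that assumption; the helper name `count_close_smaller` is made up for illustration.

```python
import numpy as np
import pandas as pd

def count_close_smaller(c3: pd.Series, dT: float = 0.3) -> pd.Series:
    """For each value x in the group, count the group's values in (x - dT, x)."""
    x = c3.to_numpy()
    v = np.sort(x)
    # Number of group values strictly below x.
    below = np.searchsorted(v, x, side="left")
    # Number of group values less than or equal to x - dT (too far away).
    too_far = np.searchsorted(v, x - dT, side="right")
    return pd.Series(below - too_far, index=c3.index)

# Data copied from the question.
df = pd.DataFrame({
    "c1": [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    "c2": [1, 1, 0, 1, 1, 1, 0, 1, 0, 1],
    "c3": [0.104986, 0.628024, 0.507727, 0.445931, 0.867830,
           0.455478, 0.271283, 0.759124, 0.382079, 0.572290],
})

# Apply the helper to each (c1, c2) group; the result keeps the original row order.
df["c4"] = df.groupby(["c1", "c2"])["c3"].transform(count_close_smaller).astype(int)
print(df)
```

Sorting costs O(n log n) per group and each lookup is a binary search, so this should scale much better than comparing every row against the full array, although I have not timed it on 10^6 rows.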

Answer 1
0 2019-02-21 13:51:07
