
Fastest way to transform dataframe

I have a dataframe with 3 columns:

reading_df:

    c1  c2  c3
    1   1   0.104986
    1   1   0.628024
    0   0   0.507727
    1   1   0.445931
    0   1   0.867830
    1   1   0.455478
    1   0   0.271283
    0   1   0.759124
    1   0   0.382079
    0   1   0.572290

For each element in the 3rd column (c3), I must find how many rows:

  • have the same value for c1,
  • have the same value for c2,
  • and have a c3 value whose difference from the given row's c3 is less than 0.3 (more precisely, a c3 value strictly between c3 - 0.3 and the given row's c3, as the expected output below shows).

For example, the answer is written in column c4:

    c1  c2  c3        c4
    1   1   0.104986  0
    1   1   0.628024  2
    0   0   0.507727  0
    1   1   0.445931  0
    0   1   0.867830  2
    1   1   0.455478  1
    1   0   0.271283  0
    0   1   0.759124  1
    1   0   0.382079  1
    0   1   0.572290  0
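
To make the rule concrete, here is a direct (quadratic) reference version that reproduces the c4 column above. It is far too slow for real data, but it spells the counting logic out explicitly (count_neighbours is just an illustrative name):

import numpy as np
import pandas as pd

def count_neighbours(df, dT=0.3):
    # for each row, count rows with the same (c1, c2) whose c3
    # falls strictly inside the window (c3 - dT, c3)
    c1 = df['c1'].to_numpy()
    c2 = df['c2'].to_numpy()
    c3 = df['c3'].to_numpy()
    out = np.empty(len(df), dtype=int)
    for i in range(len(df)):
        mask = (c1 == c1[i]) & (c2 == c2[i]) & (c3 > c3[i] - dT) & (c3 < c3[i])
        out[i] = np.count_nonzero(mask)
    return out

reading_df['c4'] = count_neighbours(reading_df)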

I transform the dataframe into a numpy array and use the map function with a lambda, hoping for the best performance:

import numpy as np
import pandas as pd

reading_df['c4'] = np.zeros(reading_df.shape[0])

X = np.array(reading_df)

# column indices into X
n_feature1 = 0  # c1
n_feature2 = 1  # c2
n_time = 2      # c3
dT = 0.3

# for each row el, count rows with the same c1 and c2 whose c3 lies
# strictly inside the window (el[n_time] - dT, el[n_time])
res_map = map(lambda el: len(X[
      (X[:, n_time] > (el[n_time] - dT))
    & (X[:, n_time] < el[n_time])
    & (X[:, n_feature2] == el[n_feature2])
    & (X[:, n_feature1] == el[n_feature1])
][:, n_time]), X)

But when I try to convert the map object res_map into a list:

result = list(res_map)
result_dataframe = pd.DataFrame({'c4': result})

my code becomes very slow, and runs for a very long time on a big dataframe with more than 10^6 rows. (map is lazy in Python 3, so the map() call itself returns instantly; all the per-row work actually happens here, when list() consumes it.)

Which function should I use? And what are the best practices for making this kind of Python code faster?

Not sure what the exact logic is behind your question, but I think you want to groupby and then calculate the diff.

If I understand your problem correctly, it's a many-to-many comparison within each group of c1 and c2.

Here's a start for your problem which you can build on:

# first calculate the difference between rows in c3 column while applying groupby
df['difference'] = df.groupby(['c1', 'c2']).c3.diff()

# then add a count column which counts the size of each group
df['count'] = df.groupby(['c1', 'c2']).c1.transform('count')

# after that create a conditional field based on the values in the other columns
df['c4'] = np.where((df.c1 == df.c2) & (df.difference < 0.3), 1, 0)

Hope this helps in terms of speed (vectorization) and gets you closer to solving your problem.
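
Building on the same groupby: if the rule really is "count rows in the same (c1, c2) group whose c3 lies in the open window (c3 - 0.3, c3)", as the expected c4 column in the question suggests, then sorting each group once and using np.searchsorted avoids the all-pairs comparison entirely. A sketch under that assumption (window_counts is just an illustrative name):

import numpy as np
import pandas as pd

dT = 0.3

def window_counts(s):
    # s is the c3 column of one (c1, c2) group
    vals = s.to_numpy()
    order = np.sort(vals)
    lo = np.searchsorted(order, vals - dT, side='right')  # first index with entry > vals - dT
    hi = np.searchsorted(order, vals, side='left')        # first index with entry >= vals
    return hi - lo                                        # entries strictly inside (vals - dT, vals)

df['c4'] = df.groupby(['c1', 'c2'])['c3'].transform(window_counts).astype(int)

This costs one sort per group instead of a full pairwise comparison, so it should stay fast even for the 10^6-row case mentioned in the question.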
