For loop equivalent to reduce execution time in pandas data frame operation

I have written a for loop like this:

for i in newc2sdf.Source.unique():
    ydf=newc2sdf[newc2sdf.Source==i]
    for j in newc2sdf.Destination.unique():
        ydf1=ydf[ydf.Destination==j]

Because there are so many unique values, this takes a very long time to execute.

For each ydf1, I perform some basic operations that return a single value, and I append that value to a list.

I also want to calculate the sum of the values in another column for each unique source and destination pair.

I have another column called timestamp (e.g. 2016-08-01 00:10:01), stored as numpy.datetime64. I want the sum of those values where the timestamp is 5 minutes more than the minimum timestamp for a particular source and destination pair.

Are there any alternatives that would reduce the execution time?

Given the following sample dataframe:

import pandas as pd

newc2sdf = pd.DataFrame([['Home','Seattle',3],['Vacation','San Francisco',74],['Work','Portland',9],
                        ['Vacation','Seattle',24],['Work','Portland',4],['Home','Seattle',5],
                        ['Work','Portland',31],['Vacation','San Francisco',19],['Work','San Francisco',38],
                        ['Home','Seattle',85],['Work','San Francisco',32],['Vacation','Seattle',73]],
                        columns=['Source','Destination','Value'])

Which gives:

      Source    Destination  Value
0       Home        Seattle      3
1   Vacation  San Francisco     74
2       Work       Portland      9
3   Vacation        Seattle     24
4       Work       Portland      4
5       Home        Seattle      5
6       Work       Portland     31
7   Vacation  San Francisco     19
8       Work  San Francisco     38
9       Home        Seattle     85
10      Work  San Francisco     32
11  Vacation        Seattle     73

To calculate "the sum of values from another column where the source and destinations will be unique", I would imagine you are looking for groupby() and agg():

newc2sdf.groupby(['Source','Destination']).agg({'Value': 'sum'})

Yields:

                        Value
Source   Destination         
Home     Seattle           93
Vacation San Francisco     93
         Seattle           97
Work     Portland          44
         San Francisco     70
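
The same grouping can also stand in for the nested loop when the per-pair calculation is something other than a sum. The sketch below is only an illustration: `summarize` is a hypothetical placeholder for the "basic operations" mentioned in the question (here, the range of the values), and the data is a shortened copy of the sample frame.

```python
import pandas as pd

newc2sdf = pd.DataFrame(
    [['Home', 'Seattle', 3], ['Vacation', 'San Francisco', 74], ['Work', 'Portland', 9],
     ['Vacation', 'Seattle', 24], ['Work', 'Portland', 4], ['Home', 'Seattle', 5]],
    columns=['Source', 'Destination', 'Value'])

def summarize(values):
    # Hypothetical stand-in for the question's "basic operations":
    # reduces the Value column of one (Source, Destination) group to one number.
    return values.max() - values.min()

# groupby visits only the pairs that actually occur in the data,
# so no empty Source/Destination combinations are filtered.
per_pair = newc2sdf.groupby(['Source', 'Destination'])['Value'].agg(summarize)
results = per_pair.tolist()
```

Unlike the double loop, this never builds a sub-frame for a Source and Destination combination that has no rows, which is likely where much of the original run time goes.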

Finally, if you want to store the summed values in a list:

newc2sdf.groupby(['Source','Destination']).agg({'Value': 'sum'})['Value'].tolist()
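
For the timestamp part of the question, one possible approach is to broadcast each pair's earliest timestamp back to its rows with transform('min') and filter before summing. This is a rough sketch under assumptions: the 'Timestamp' column name and the sample times are made up for illustration, and the condition is read as "no later than 5 minutes after the pair's minimum timestamp"; flip the comparison if the opposite was meant.

```python
import pandas as pd

# Hypothetical sample; the Timestamp values are not from the question.
df = pd.DataFrame({
    'Source': ['Home', 'Home', 'Home', 'Work', 'Work'],
    'Destination': ['Seattle', 'Seattle', 'Seattle', 'Portland', 'Portland'],
    'Value': [3, 5, 85, 9, 4],
    'Timestamp': pd.to_datetime([
        '2016-08-01 00:10:01', '2016-08-01 00:12:30', '2016-08-01 00:20:00',
        '2016-08-01 01:00:00', '2016-08-01 01:05:00']),
})

keys = ['Source', 'Destination']

# Earliest timestamp of each (Source, Destination) pair, aligned to every row
first_seen = df.groupby(keys)['Timestamp'].transform('min')

# Assumed reading: keep rows at most 5 minutes after that earliest timestamp
in_window = df['Timestamp'] <= first_seen + pd.Timedelta(minutes=5)

# Sum the remaining values per pair
window_sums = df[in_window].groupby(keys)['Value'].sum()
```

Both steps are vectorized, so no Python-level loop over sources or destinations is needed.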
