
Fastest way to set a value in pandas

In [118]: %timeit df['A'].ix[df['Id']=='000f00003'] = 3
10 loops, best of 3: 54.9 ms per loop

In [119]: %timeit df.loc[df['Id']=='000f00003','A'] = 4
10 loops, best of 3: 55.4 ms per loop

In [126]: %timeit df.ix[df['Id']=='000f00003','A'] = 5
10 loops, best of 3: 55.8 ms per loop

I'm running an operation that sets values like this about 20k times, and I'm trying to find something better than any of the three options above. Given the filtering I need to do before each assignment, is there a faster way to set a value?

I know the fastest approach would be something vectorized, but I don't think I can vectorize this. Basically I need to take a slice of the DataFrame (50 microseconds around a specified time), find the rows that match my criteria (I filter on 3 columns), and then update those rows with the data I find, as above.
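To make the workflow concrete, here is a minimal sketch of the steps described above: one boolean mask combines the time window with the three filters, followed by a single .loc assignment. The column names (time, Id, side, venue), the synthetic data, and the helper update_window are all assumptions for illustration, not the real data, and nothing here is claimed to be faster than the timings above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the column names and values are assumptions for illustration.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    # one row every 10 microseconds
    "time": pd.Timestamp("2020-01-01") + pd.to_timedelta(np.arange(n) * 10, unit="us"),
    "Id": rng.choice(["000f00003", "000f00004"], size=n),
    "side": rng.choice(["B", "S"], size=n),
    "venue": rng.choice(["X", "Y"], size=n),
    "A": 0,
})

def update_window(df, when, half_width, id_, side, venue, value):
    """Set column 'A' for rows near `when` that match all three filters.

    Returns the number of rows that were updated.
    """
    lo, hi = when - half_width, when + half_width
    # Build the full mask once, then assign once.
    mask = (
        df["time"].between(lo, hi)
        & (df["Id"] == id_)
        & (df["side"] == side)
        & (df["venue"] == venue)
    )
    df.loc[mask, "A"] = value
    return int(mask.sum())

when = df["time"].iloc[500]
half_width = pd.Timedelta(microseconds=50)
n_set = update_window(df, when, half_width, "000f00003", "B", "X", 7)
```

If the frame is sorted by time, narrowing to the window first (for example with searchsorted on the time column) before applying the other filters may reduce the per-call cost, but that depends on the real data.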

It looks like you're running into slow value setting when using slicing and conditionals. I ran into something similar and found that using the .where() method can be much, much faster.

Of course you don't show your data so this may or may not apply and I apologize if not, but for a large dataframe I'm dealing with I see a speedup of, well, 24 million times!

%timeit a[np.isnan(a)]=df2
1 loops, best of 3: 1 s per loop

def time1():
    b = a.where(np.isfinite(a),df2)
    a=b

%timeit time1
10000000 loops, best of 3: 41.5 ns per loop

When I dug into the profiling, it looked like the difference is that the first version spends large amounts of time in __setitem__ and _check_setitem_copy, and then in collect (garbage collection). When I refactored my code to use the second approach, that entire part of the code was so fast it hardly registered.

I think the important thing here is that the second approach, despite looking a bit silly in assigning to b and then back to a, separates identifying the locations to set from the actual setting operation. This appears to be because .where() returns a whole frame of the same shape as the original, which can then be assigned back to the original all at once. Note that if you eliminate the assignment to b and back to a by using inplace=True, most of the gains go away!
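A minimal sketch for reproducing the comparison on made-up data follows. The frame sizes, the NaN pattern, and the helper names fill_with_setitem and fill_with_where are assumptions, not the data or code from the application described here. One caveat when timing: the function has to be called, as in time1(); %timeit time1 without parentheses only measures looking up the name. Passing the frames as arguments also avoids the local-variable problem that a=b inside the function would cause.

```python
import timeit

import numpy as np
import pandas as pd

# Made-up data: sizes and NaN pattern are assumptions for illustration.
rng = np.random.default_rng(0)
a = pd.DataFrame(rng.standard_normal((1000, 50)))
a = a.where(a <= 1.0)  # values above 1.0 become NaN
df2 = pd.DataFrame(rng.standard_normal((1000, 50)))

def fill_with_setitem(a, df2):
    """Boolean-mask assignment, as in the first timing."""
    out = a.copy()
    out[np.isnan(out)] = df2
    return out

def fill_with_where(a, df2):
    """Keep finite values and take the rest from df2, as in the second timing."""
    return a.where(np.isfinite(a), df2)

# Call the functions inside the timer so the actual work is measured.
t_setitem = timeit.timeit(lambda: fill_with_setitem(a, df2), number=5)
t_where = timeit.timeit(lambda: fill_with_where(a, df2), number=5)
print(f"setitem: {t_setitem / 5:.6f} s per call")
print(f"where:   {t_where / 5:.6f} s per call")
```

The ratio between the two will depend on the pandas version, the dtypes, and whether the frame is a slice of another frame, so it is worth measuring on the real data.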

I've simplified things for display, but in my application and tests "a" is actually a MultiIndex slice on multiple axes, as is df2.

Also, note that .where() replaces values where the condition is False, so I inverted the logic from np.isnan to np.isfinite when going from the first version to the second.
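A small sketch of that inversion, using a made-up Series (the values and variable names are assumptions for illustration): .where() keeps entries where the condition is True, while .mask() replaces entries where the condition is True, so the two calls below give the same result.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
fill = pd.Series([10.0, 20.0, 30.0])

# where: keep s where the condition is True, take `fill` where it is False.
kept = s.where(s.notna(), fill)

# mask: take `fill` where the condition is True, keep s elsewhere.
masked = s.mask(s.isna(), fill)

print(kept.tolist())    # [1.0, 20.0, 3.0]
print(masked.tolist())  # [1.0, 20.0, 3.0]
```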
