We have a pandas DataFrame df and a set of values set_vals. For a particular column (let's say 'name'), I would now like to compute a new column which is True whenever the value of df['name'] is in set_vals and False otherwise.
One way to do this is to write:
df['name'].apply(lambda x: x in set_vals)
but when both df and set_vals become large, this method is very slow. Is there a more efficient way of creating this new column?
The real problem is that the complexity of df['name'].apply(lambda x: x in set_vals) is O(M*N), where M is the length of df and N is the length of set_vals, if set_vals is a list (or another type for which membership testing is linear). The complexity can be improved to O(M) if set_vals is hashed (turned into a set or dict), because then each membership test is O(1).
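A minimal sketch of the hashing idea, using made-up example data (the names and values are assumptions for illustration). pandas also ships Series.isin, a vectorized built-in that performs the same hashed membership check:

```python
import pandas as pd

# Hypothetical example data; column contents are assumptions for illustration
df = pd.DataFrame({"name": ["alice", "bob", "carol", "dave"]})
set_vals = ["alice", "carol"]  # a list: each membership test is O(N)

# Hash the values once so each lookup is O(1), giving O(M) overall
hashed_vals = set(set_vals)
df["in_set"] = df["name"].apply(lambda x: x in hashed_vals)

# Vectorized equivalent: pandas hashes the values internally
df["in_set_vec"] = df["name"].isin(set_vals)
```

Either way, the one-time cost of hashing set_vals is amortized over all M lookups.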
Another option is MapReduce. In general, it is a programming model for processing big data in parallel on multiple nodes. For this problem the idea is simple: split the loop into chunks, let's say [0:i], [i+1:j], [j+1:k], and so on, run the membership test for each chunk in its own thread (Python's concurrent.futures module makes this straightforward), and then combine the partial results.
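A minimal sketch of this chunk-and-combine idea using a thread pool; the example data and chunk count are assumptions. Note that for a plain membership test, hashing set_vals into a set is usually faster than adding threads, so treat this as an illustration of the pattern rather than a recommended optimization:

```python
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

# Hypothetical example data; names and worker count chosen for illustration
df = pd.DataFrame({"name": ["alice", "bob", "carol", "dave", "erin", "frank"]})
set_vals = {"alice", "carol", "erin"}  # already hashed for O(1) lookups

def check_chunk(chunk):
    # "Map" step: membership test over one contiguous slice of the column
    return chunk.isin(set_vals)

# Split the column into contiguous chunks: [0:i], [i:j], [j:k]
chunks = np.array_split(df["name"], 3)

with ThreadPoolExecutor(max_workers=3) as pool:
    partial = list(pool.map(check_chunk, chunks))

# "Reduce" step: reassemble the partial results, aligned by index
df["in_set"] = pd.concat(partial)
```

For CPU-bound work like this, a process pool (ProcessPoolExecutor) avoids Python's GIL, at the cost of serializing each chunk between processes.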