
Check, for all elements of a pandas DataFrame column, whether they are in a set of values

We have a pandas DataFrame df and a set of values set_vals.

For a particular column (let's say 'name'), I would now like to compute a new column which is True whenever the value of df['name'] is in set_vals and False otherwise.

One way to do this is to write:

df['name'].apply(lambda x: x in set_vals)

but when both df and set_vals become large this method is very slow. Is there a more efficient way of creating this new column?

The real problem is that the complexity of df['name'].apply(lambda x: x in set_vals) is O(M*N), where M is the length of df and N is the length of set_vals, if set_vals is a list (or another type for which the membership search is linear).

The complexity can be improved to O(M) if set_vals is hashed (turned into a set or dict), so that each membership lookup is O(1).
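As a minimal sketch of that hashed-lookup idea (the example data and the in_set column name are made up for illustration, not from the question):

```python
import pandas as pd

# Hypothetical data standing in for the real df and set_vals from the question.
df = pd.DataFrame({'name': ['alice', 'bob', 'carol', 'dave'] * 250_000})
set_vals = ['alice', 'carol']            # membership test on a list is O(N)

set_vals_hashed = set(set_vals)          # hash-based container: O(1) lookups

# Same apply as before, but each "x in ..." check is now constant time.
df['in_set'] = df['name'].apply(lambda x: x in set_vals_hashed)
```

pandas also provides a vectorized Series.isin, which hashes the values internally, so df['name'].isin(set_vals) is usually the idiomatic way to get the same boolean column.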

I found a solution for you: it is called MapReduce.

You can read about it HERE

In general, it is a programming model for processing big data in parallel on multiple nodes.

There is a video that explains and shows an example of MapReduce: MapReduce Video
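As a rough sketch of how the map/reduce idea could apply to this membership check (a minimal illustration only; the column name, example data, chunk count, and the multiprocessing-based approach are assumptions, not taken from the answer):

```python
import pandas as pd
from multiprocessing import Pool

SET_VALS = {'alice', 'carol'}                      # hypothetical values to test against

def map_chunk(chunk):
    # "Map" step: compute the boolean membership mask for one chunk of rows.
    return chunk['name'].apply(lambda x: x in SET_VALS)

if __name__ == '__main__':
    df = pd.DataFrame({'name': ['alice', 'bob', 'carol', 'dave'] * 250_000})

    # Split the rows into chunks, map each chunk in a separate worker process,
    # then "reduce" by concatenating the partial results back together.
    n_chunks = 4
    step = -(-len(df) // n_chunks)                 # ceiling division
    chunks = [df.iloc[i:i + step] for i in range(0, len(df), step)]
    with Pool(n_chunks) as pool:
        parts = pool.map(map_chunk, chunks)
    df['in_set'] = pd.concat(parts)
```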

It is a complex problem with a simple solution: you can run multiple threads, each handling one slice of the loop,

let's say [0:i], [i+1:j], [j+1:k], etc. (see the sketch below).

Here is a very good explanation of how to do multiple threads
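For instance, a minimal sketch of slicing the rows and checking each slice in a worker thread (the function names, slice count, and use of ThreadPoolExecutor are assumptions; note that for a pure-Python lambda the GIL can limit how much threads actually help, so a process pool may be needed to see a real speedup):

```python
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def check_slice(df, set_vals, start, stop):
    # Check one slice of rows, e.g. [0:i], [i:j], [j:k], ...
    return df['name'].iloc[start:stop].apply(lambda x: x in set_vals)

def check_in_parallel(df, set_vals, n_workers=4):
    set_vals = set(set_vals)                       # O(1) lookups inside each worker
    step = -(-len(df) // n_workers)                # ceiling division -> slice length
    bounds = [(i, min(i + step, len(df))) for i in range(0, len(df), step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(lambda b: check_slice(df, set_vals, *b), bounds)
    return pd.concat(list(parts))

# Usage with hypothetical data:
df = pd.DataFrame({'name': ['alice', 'bob', 'carol', 'dave'] * 250_000})
df['in_set'] = check_in_parallel(df, ['alice', 'carol'])
```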

Also, if you are interested in more details about performance and efficiency, check this out.
