简体   繁体   中英

Find total number of unique elements in two columns combined

I have a dataframe names_df with 800 million rows with two columns. firstname and lastname. I need to find the total number of unique names from the two columns combined.

           first_name last_name
0          john       doe
1          jane       doe
2          doe        john
3          doe        jane
:
799999999  Levi       Ackerman
800000000  Gojo       Satoru

I can simply do:

unique_names = np.concatenate((names_df.first_name.unique(), names_df.last_name.unique()), axis=None)
unique_names=set(unique_names.tolist())
print(len(unique_names))

However, this takes a lot of time and is inefficient, what is a more efficient of finding the total number of unique values from the two columns combined? the unique_names would look like this =

>>>print(unique_names)
>>> {'john','jane','doe','levi','ackerman','Gojo','satoru'}

use this(This is faster than your method):

set(names_df['first_name'].unique().tolist()+names_df['last_name'].unique().tolist())

Edit upon edited question If you want all unique values from multiple columns you can use:

names= names_df[["first_name", "last_name"]].values.ravel()
unique_names =  pd.unique(names)
n_unique_names = len(n_unique_names)

Please see here


Old answer:

You could use drop_duplicates() from pandas and then look at the shape of the returned pandas.DataFrame. If / How much faster is to be seen though.

IF you have a multicore machine then, sort the data and divide it into 26 batches use python multiprocessing library . Get the unique from each batch and then you can merge all those unique df.

Creating a temporary third column and computing the length of its unique values should be less computationally expensive.

names_df['full_name'] = names_df.first_name + names_df.last_name
total_unique_length = len(names_df.full_name.unique())
names_df = names_df.drop(columns='full_name')
print(total_unique_length)

EDIT - based on your edit, you just want unique names from both lists. If there is a Jane Doe and a John Doe, you want [John, Jane, Doe]

In this case it's much easier.

total_unique_length = len(names_df.first_name.unique()) + len(names_df.last_name.unique())
print(total_unique_length)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM