Find total number of unique elements in two columns combined

Question

I have a dataframe names_df with 800 million rows with two columns. firstname and lastname. I need to find the total number of unique names from the two columns combined.

           first_name last_name
0          john       doe
1          jane       doe
2          doe        john
3          doe        jane
:
799999999  Levi       Ackerman
800000000  Gojo       Satoru

I can simply do:

unique_names = np.concatenate((names_df.first_name.unique(), names_df.last_name.unique()), axis=None)
unique_names=set(unique_names.tolist())
print(len(unique_names))

However, this takes a lot of time and is inefficient, what is a more efficient of finding the total number of unique values from the two columns combined? the unique_names would look like this =

>>>print(unique_names)
>>> {'john','jane','doe','levi','ackerman','Gojo','satoru'}

Answer 1

use this(This is faster than your method):

set(names_df['first_name'].unique().tolist()+names_df['last_name'].unique().tolist())

Answer 2

Edit upon edited question If you want all unique values from multiple columns you can use:

names= names_df[["first_name", "last_name"]].values.ravel()
unique_names =  pd.unique(names)
n_unique_names = len(n_unique_names)

Please see here

Old answer:

You could use drop_duplicates() from pandas and then look at the shape of the returned pandas.DataFrame. If / How much faster is to be seen though.

Answer 3

IF you have a multicore machine then, sort the data and divide it into 26 batches use python multiprocessing library . Get the unique from each batch and then you can merge all those unique df.

Answer 4

Creating a temporary third column and computing the length of its unique values should be less computationally expensive.

names_df['full_name'] = names_df.first_name + names_df.last_name
total_unique_length = len(names_df.full_name.unique())
names_df = names_df.drop(columns='full_name')
print(total_unique_length)

EDIT - based on your edit, you just want unique names from both lists. If there is a Jane Doe and a John Doe, you want [John, Jane, Doe]

In this case it's much easier.

total_unique_length = len(names_df.first_name.unique()) + len(names_df.last_name.unique())
print(total_unique_length)

Find total number of unique elements in two columns combined

Question

4 answers

solution1
1 2021-04-01 13:48:18

solution2
0 2021-04-01 13:46:42

solution3
0 2021-04-01 13:49:39

solution4
0 2021-04-01 14:15:34

Find total number of unique elements in two columns combined

Question

4 answers

solution1 1 2021-04-01 13:48:18

solution2 0 2021-04-01 13:46:42

solution3 0 2021-04-01 13:49:39

solution4 0 2021-04-01 14:15:34

solution1
1 2021-04-01 13:48:18

solution2
0 2021-04-01 13:46:42

solution3
0 2021-04-01 13:49:39

solution4
0 2021-04-01 14:15:34