I have a dataframe names_df with 800 million rows with two columns. firstname and lastname. I need to find the total number of unique names from the two columns combined.
first_name last_name
0 john doe
1 jane doe
2 doe john
3 doe jane
:
799999999 Levi Ackerman
800000000 Gojo Satoru
I can simply do:
unique_names = np.concatenate((names_df.first_name.unique(), names_df.last_name.unique()), axis=None)
unique_names=set(unique_names.tolist())
print(len(unique_names))
However, this takes a lot of time and is inefficient, what is a more efficient of finding the total number of unique values from the two columns combined? the unique_names would look like this =
>>>print(unique_names)
>>> {'john','jane','doe','levi','ackerman','Gojo','satoru'}
use this(This is faster than your method):
set(names_df['first_name'].unique().tolist()+names_df['last_name'].unique().tolist())
Edit upon edited question If you want all unique values from multiple columns you can use:
names= names_df[["first_name", "last_name"]].values.ravel()
unique_names = pd.unique(names)
n_unique_names = len(n_unique_names)
Old answer:
You could use drop_duplicates() from pandas and then look at the shape of the returned pandas.DataFrame. If / How much faster is to be seen though.
IF you have a multicore machine then, sort the data and divide it into 26 batches use python multiprocessing library . Get the unique from each batch and then you can merge all those unique df.
Creating a temporary third column and computing the length of its unique values should be less computationally expensive.
names_df['full_name'] = names_df.first_name + names_df.last_name
total_unique_length = len(names_df.full_name.unique())
names_df = names_df.drop(columns='full_name')
print(total_unique_length)
EDIT - based on your edit, you just want unique names from both lists. If there is a Jane Doe and a John Doe, you want [John, Jane, Doe]
In this case it's much easier.
total_unique_length = len(names_df.first_name.unique()) + len(names_df.last_name.unique())
print(total_unique_length)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.