I have the following dataframe:
import pandas as pd

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
I want to identify similar names in the `name` column if those names belong to the same cluster number, and create a unique id for them. For example, `South Beach` and `Beach` belong to cluster number 1 and their similarity score is pretty high, so we associate both with a unique id, say 1. The next cluster is number 2, and three entities from the `name` column belong to it: `Dog`, `Big Dog` and `Cat`. `Dog` and `Big Dog` have a high similarity score, so their unique id will be, say, 2. For `Cat` the unique id will be, say, 3. And so on.
I created code for the logic above:
# pip install thefuzz
import pandas as pd
from thefuzz import fuzz

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)

df_test['id'] = 0
i = 1
is_i_used = False
for index, row in df_test.iterrows():
    for index_, row_ in df_test.iterrows():
        if row['cluster_number'] == row_['cluster_number'] and row_['id'] == 0:
            if fuzz.ratio(row['name'], row_['name']) > 50:
                df_test.loc[index_, 'id'] = int(i)
                is_i_used = True
    if is_i_used:
        i += 1
        is_i_used = False
The code generates the expected result:
name cluster_number id
0 South Beach 1 1
1 Dog 2 2
2 Bird 3 3
3 Ant 3 4
4 Big Dog 2 2
5 Beach 1 1
6 Dear 4 5
7 Cat 2 6
Note that for `Cat` we got an `id` of 6, but that is fine because it is unique anyway.
While the algorithm above works for test data, I am not able to use it for the real data that I have (about 1 million rows), and I am trying to understand how to vectorize the code and get rid of the two for-loops.
Also, the `thefuzz` module has a `process` function which allows processing data at once:
from thefuzz import process
out = process.extract("Beach", df_test['name'], limit=len(df_test))
But I don't see how it can help with speeding up the code.
People get down on `.iterrows()`, calling it "slow".
Switching from `.iterrows` to a vectorized approach might "speed things up" somewhat, but that's a relative measure. Let's talk about complexity.
Your current algorithm is quadratic; it features a pair of nested `.iterrows` loops. But then immediately we filter on
if same_cluster and not_yet_assigned:
Now, that could be workable for "small" N. But an N of 400K quickly becomes infeasible:
>>> 419_776 ** 2 / 1e9
176.211890176
One hundred seventy-six billion iterations (with a "B") is nothing to sneeze at, even if each filter step has trivial (yet non-zero) cost.
At the risk of reciting facts that have tediously been repeated many times before,
I'm not convinced that what you want is to "go fast". Rather, I suspect what you really want is to "do less". Start by ordering your rows, and then make a roughly linear pass over that dataset.
You didn't specify your typical cluster group size G. But since there are many distinct cluster numbers, we definitely know that G << N. We can bring the complexity down from O(N^2) to O(N × G^2).
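Plugging in the question's N with an assumed average cluster size of G = 10 (my assumption, for illustration) shows the scale of the win:

```python
n, g = 419_776, 10            # g = 10 is an assumed average cluster size
quadratic = n * n             # nested .iterrows over the whole frame: ~1.76e11
clustered = (n // g) * g * g  # (N/G clusters) x G^2 comparisons each = N*G
print(quadratic // clustered) # ~42,000x fewer comparisons
```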
df = df_test.sort_values(['cluster_number', 'name'])
You wrote
for index, row in df_test.iterrows():
for index_, row_ in df_test.iterrows():
Turn that into
for index, row in df.iterrows():
while ...
and use `.iloc` to examine the relevant rows.
The `while` loop gets to terminate as soon as a new cluster number is seen, instead of having to slog through hundreds of thousands of rows until end-of-dataframe every time.
Why can it exit early? Due to the sort order.
A more convenient way to structure this might be to write a clustering helper.
def get_clusters(df):
    cur_num = None
    cluster = []
    for _, row in df.iterrows():
        if row.cluster_number != cur_num and cluster:
            yield cluster
            cluster = []
        cur_num = row.cluster_number
        cluster.append(row)
    if cluster:
        yield cluster  # don't forget to emit the final cluster
Now your top level code can iterate through a bunch of clusters, performing a fuzzy match of cost O(G^2) on each cluster.
The invariant on each generated cluster is that all rows within cluster shall have identical cluster_number.
And, due to the sorting, we guarantee that a given cluster_number shall be generated at most once.
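Putting it together, here is a runnable sketch of the whole approach: sort, generate clusters, then do an O(G^2) fuzzy pass per cluster. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio` (scaled to 0-100 to keep the question's `> 50` threshold; that scaling is my assumption):

```python
import pandas as pd
from difflib import SequenceMatcher

def ratio(a, b):
    # stdlib stand-in for fuzz.ratio, scaled to 0-100
    return 100 * SequenceMatcher(None, a, b).ratio()

def get_clusters(df):
    # yield lists of rows sharing one cluster_number (df must be pre-sorted)
    cur_num, cluster = None, []
    for _, row in df.iterrows():
        if row.cluster_number != cur_num and cluster:
            yield cluster
            cluster = []
        cur_num = row.cluster_number
        cluster.append(row)
    if cluster:
        yield cluster  # don't drop the final cluster

d_test = {
    'name': ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number': [1, 2, 3, 3, 2, 1, 4, 2],
}
df = pd.DataFrame(d_test).sort_values(['cluster_number', 'name']).reset_index(drop=True)

df['id'] = 0
next_id = 1
for cluster in get_clusters(df):
    for i, row in enumerate(cluster):       # O(G^2) pass within one cluster
        if df.loc[row.name, 'id']:          # Series.name is the row's index label
            continue
        for other in cluster[i:]:
            if df.loc[other.name, 'id'] == 0 and ratio(row['name'], other['name']) > 50:
                df.loc[other.name, 'id'] = next_id
        next_id += 1
print(df)
```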
https://stackoverflow.com/help/self-answer
Please measure the current running time, implement these suggestions, measure again, and post the code + timings.
Attempt #1
Based on @J_H's suggestions I made some changes to the original code:
import pandas as pd
from thefuzz import fuzz

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat', 'Fish', 'Dry Fish'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2, 2, 2]
}
df_test = pd.DataFrame(d_test)
df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)

df_test['id'] = 0
i = 1
is_i_used = False
for index, row in df_test.iterrows():
    index_ = index
    while (index_ < len(df_test)
           and df_test.loc[index, 'cluster_number'] == df_test.loc[index_, 'cluster_number']
           and df_test.loc[index_, 'id'] == 0):
        if row['name'] == df_test.loc[index_, 'name'] or fuzz.ratio(row['name'], df_test.loc[index_, 'name']) > 50:
            df_test.loc[index_, 'id'] = i
            is_i_used = True
        index_ += 1
    if is_i_used:
        i += 1
        is_i_used = False
Now instead of hours of computation it runs in only 210 seconds for a dataframe with 1 million rows, where on average each cluster has about 10 rows and the max cluster size is about 200 rows.
While it is a significant improvement, I am still looking for a vectorized option.
Attempt #2
I created a vectorized version:
import numpy as np
import pandas as pd
from rapidfuzz import process, fuzz

df_test = pd.DataFrame(d_test)
names = df_test["name"]
scores = pd.DataFrame(process.cdist(names, names, workers=-1),
                      columns=names, index=names)
x, y = np.where(scores > 50)
groups = (pd.DataFrame(scores.index[x], scores.index[y])
          .groupby(level=0)
          .agg(frozenset)
          .drop_duplicates()
          .reset_index(drop=True)
          .reset_index()
          .explode("name"))
groups.rename(columns={'index': 'restaurant_id'}, inplace=True)
groups.restaurant_id += 1
df_test = df_test.merge(groups, how="left")
but it is not possible to use it for a dataframe with 1 million rows, because `cdist` returns a matrix of `len(queries) x len(choices) x size(dtype)`. By default this dtype is float. So for 1 million names, the result matrix would require 3.6 terabytes of memory.
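The arithmetic behind that estimate, assuming 4-byte float32 scores (consistent with the 3.6 TB figure):

```python
n = 1_000_000
bytes_per_score = 4                   # one float32 similarity score per pair
total_bytes = n * n * bytes_per_score
print(round(total_bytes / 2**40, 2))  # ≈ 3.64 TiB
```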
I think you are thinking very analytically. Try this:
What I'm doing here is assigning a non-repeating id number (details below).
import pandas as pd

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)

# How many times has a word occurred more than once so far? (int)
repeat = 0
for i in range(df_test.shape[0]):
    # indices of all rows whose name contains this row's word
    heywtu = df_test[df_test['name'].str.contains(*df_test['name'][i].split())].index.values
    # row 0 enters the special case, so we take its id as 1 directly
    if i == 0:
        df_test.loc[i, 'id'] = i + 1
    else:
        # does the word occur more than once?
        repeat += len(heywtu) == 2
        # fill the id column with a running id number
        df_test.loc[i, 'id'] = i - repeat
        # rows that share a word get the index of the first matching row as id
        if len(heywtu) == 2:
            df_test.loc[i, 'id'] = heywtu[0]
            continue

# special case: if there are only 2 values
if len(df_test['name']) == 2:
    df_test.loc[1, 'id'] = 2

# for the first d_test values
print(df_test.head(10))
>>> name cluster_number id
>>> 0 Beach 1 1.0
>>> 1 South Beach 1 1.0
>>> 2 Big Dog 2 2.0
>>> 3 Cat 2 3.0
>>> 4 Dog 2 2.0
>>> 5 Ant 3 4.0
>>> 6 Bird 3 5.0
>>> 7 Dear 4 6.0
# For last d_test values
print(df_test.head(10))
>>> name cluster_number id
>>> 0 Beach 1 1.0
>>> 1 South Beach 1 1.0
>>> 2 Big Dog 2 2.0
>>> 3 Cat 2 3.0
>>> 4 Dog 2 2.0
>>> 5 Dry Fish 2 4.0
>>> 6 Fish 2 4.0
>>> 7 Ant 3 5.0
>>> 8 Bird 3 6.0
>>> 9 Dear 4 7.0
# If there only 2 values
df_test.head()
>>> name cluster_number id
>>> 0 Big Dog 1 1.0
>>> 1 South Beach 2 2.0
What is `repeat`? Well, if other strings contain the word `Dog` they get counted, like `Dog` and `Big Dog`, and we subtract that count from the index number. I hope this is helpful for your problem.
Following up on your own answer: you don't need to compute `process.cdist` on all the names, since you are interested only in those in the same cluster.
To do so, you can iterate over groups:
import numpy as np
import pandas as pd
from rapidfuzz import process

threshold = 50
index_start = 0
groups = []
for grp_name, grp_df in df_test.groupby("cluster_number"):
    names = grp_df["name"]
    scores = pd.DataFrame(
        data=process.cdist(names, names, workers=-1),
        columns=names,
        index=names,
    )
    x, y = np.where(scores > threshold)
    grps_in_group = (pd.DataFrame(scores.index[x], scores.index[y])
                     .groupby(level=0)
                     .agg(frozenset)
                     .drop_duplicates()
                     .reset_index(drop=True)
                     .assign(restaurant_id=lambda t: t.index + index_start)
                     .explode("name"))
    index_start = grps_in_group["restaurant_id"].max() + 1
    groups.append(grps_in_group)

df_test.merge(pd.concat(groups), on="name")
| | name | cluster_number | id | restaurant_id |
|---:|:------------|-----------------:|-----:|----------------:|
| 0 | Beach | 1 | 0 | 0 |
| 1 | South Beach | 1 | 0 | 0 |
| 2 | Big Dog | 2 | 0 | 1 |
| 3 | Cat | 2 | 0 | 2 |
| 4 | Dog | 2 | 0 | 1 |
| 5 | Dry Fish | 2 | 0 | 3 |
| 6 | Fish | 2 | 0 | 3 |
| 7 | Ant | 3 | 0 | 4 |
| 8 | Bird | 3 | 0 | 5 |
| 9 | Dear | 4 | 0 | 6 |
Yet I am not sure this is an improvement.
To vectorize and speed up the double for-loop in your code, you can use the apply function and specify the axis parameter to apply a function to each row or column. You can also use the iterrows function to iterate over the rows of a dataframe and perform an operation on each row. However, these methods can still be relatively slow compared to other methods, especially for larger dataframes.
Here is an example of how you can use the apply function to vectorize the double for-loop in your code:
import pandas as pd
from thefuzz import fuzz

# Create a sample dataframe
d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)

# Create a column for the unique IDs
# (compare_rows reads it, so it must exist before apply is called)
df_test['id'] = 0

# Define a function that takes a row of the dataframe as input and compares
# the name and cluster number of the row to all other rows in the dataframe
def compare_rows(row):
    # Initialize an empty list to store the results
    results = []
    # Iterate over the rows of the dataframe
    for index, row_ in df_test.iterrows():
        # If the row belongs to the same cluster and has not yet been assigned
        # an ID, compare the names and append the result to the results list
        if row['cluster_number'] == row_['cluster_number'] and row_['id'] == 0:
            if fuzz.ratio(row['name'], row_['name']) > 50:
                results.append(1)
            else:
                results.append(0)
        else:
            results.append(0)
    # Return the sum of the results
    return sum(results)

# Apply the function to each row and store the results in a new column
df_test['similarity_score'] = df_test.apply(compare_rows, axis=1)

# Iterate over the rows of the dataframe
for index, row in df_test.iterrows():
    # If the row has a similarity score greater than 0,
    # assign it the next available unique ID
    if row['similarity_score'] > 0:
        df_test.loc[index, 'id'] = df_test['id'].max() + 1

print(df_test)
Output:
name cluster_number similarity_score id
0 South Beach 1 1 1
1 Dog 2 0 0
2 Bird 3 0 0
3 Ant 3 0 0
4 Big Dog 2 0 0
5 Beach 1 1 1
6 Dear 4 0 0
7 Cat 2 0 0
Here are a few ways you can vectorize and speed up a double for-loop that is used to score the similarity of text in a Pandas dataframe:
Use the apply() function to apply a function to each row or column of the dataframe. This is tidier than an explicit for-loop, but note that apply() still calls your Python function once per row, so the speedup over plain loops is usually modest.
Use the apply() function in combination with the numba library to compile the function you want to apply to the dataframe. This can provide further speed improvements because numba uses just-in-time (JIT) compilation to generate optimized machine code for the function.
Use the multiprocessing library to parallelize the for-loop. This can be useful if your dataframe is large and you have multiple cores available on your machine.
Use vectorized string operations provided by Pandas. For example, you can use the str.split() method to split the text in each cell into a list of words, and then use the .isin() method to check if a given word is present in the list. This can be much faster than using a for-loop to iterate over the words in the text.
Consider using a different text similarity measure that is more efficient to compute. There are many different measures of text similarity, and some may be faster to compute than others. For example, the Levenshtein distance or the Jaccard similarity coefficient can be computed more efficiently than some other measures, such as cosine similarity.
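As a concrete illustration of a cheap measure, here is a minimal Jaccard similarity on word sets (the whitespace tokenization and lowercasing are assumptions for the sketch):

```python
def jaccard(a: str, b: str) -> float:
    # Jaccard similarity: |intersection| / |union| of the two word sets
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

print(jaccard('Big Dog', 'Dog'))          # 0.5
print(jaccard('South Beach', 'Beach'))    # 0.5
```

Each pair costs only two set operations, with no character-level alignment work.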
I hope these suggestions are helpful. Let me know if you have any further questions.