
How to vectorize and speed up a double for-loop over a pandas dataframe when doing text similarity scoring

I have the following dataframe:

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)

I want to identify similar names in the name column when those names belong to the same cluster number, and assign a unique id to each group of similar names. For example, South Beach and Beach belong to cluster number 1 and their similarity score is pretty high, so we associate both with a unique id, say 1. The next cluster is number 2, and three entities from the name column belong to it: Dog, Big Dog and Cat. Dog and Big Dog have a high similarity score and their unique id will be, say, 2. For Cat the unique id will be, say, 3. And so on.

I wrote code implementing the logic above:

# pip install thefuzz
import pandas as pd
from thefuzz import fuzz

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}

df_test = pd.DataFrame(d_test)

df_test['id'] = 0

i = 1
is_i_used = False
for index, row in df_test.iterrows():
    for index_, row_ in df_test.iterrows():
        if row['cluster_number'] == row_['cluster_number'] and row_['id'] == 0:
            if fuzz.ratio(row['name'], row_['name']) > 50:
                df_test.loc[index_, 'id'] = int(i)
                is_i_used = True
    if is_i_used == True:
        i += 1
        is_i_used = False
                           

The code generates the expected result:

    name        cluster_number id
0   South Beach 1              1
1   Dog         2              2
2   Bird        3              3
3   Ant         3              4
4   Big Dog     2              2
5   Beach       1              1
6   Dear        4              5
7   Cat         2              6

Note that for Cat we got an id of 6, but that is fine because it is unique anyway.

While the algorithm above works for the test data, it is not usable on my real data (about 1 million rows), and I am trying to understand how to vectorize the code and get rid of the two for-loops.

The thefuzz module also has a process function that allows processing data all at once:

from thefuzz import process
out = process.extract("Beach", df_test['name'], limit=len(df_test))

But I don't see how it can help with speeding up the code.

tl;dr: Avoid O(N^2) running time if N is big.

> help with speeding up the code

People get down on .iterrows(), calling it "slow".

Switching from .iterrows to a vectorized approach might "speed things up" somewhat, but that's a relative measure. Let's talk about complexity.

time complexity

Your current algorithm is quadratic; it features a pair of nested .iterrows loops. But then we immediately filter on

 if same_cluster and not_yet_assigned:

Now, that could be workable for "small" N. But an N of 400K quickly becomes infeasible:

>>> 419_776 ** 2 / 1e9
176.211890176

One hundred seventy-six billion iterations (with a "B") is nothing to sneeze at, even if each filter step has a trivial (yet non-zero) cost.

At the risk of reciting facts that have tediously been repeated many times before,

  • sorting costs O(N log N), and
  • N log N is very significantly less than N^2

I'm not convinced that what you want is to "go fast". Rather, I suspect what you really want is to "do less". Start by ordering your rows, and then make a roughly linear pass over that dataset.

You didn't specify your typical cluster group size G. But since there are many distinct cluster numbers, we definitely know that G << N. Sorting and comparing only within a cluster brings the complexity down from O(N^2) to roughly O(N × G).
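As a rough worked example of my own, in the spirit of the figure above and using the sizes reported for the real data further down (about 1 million rows, clusters of about 10 rows on average):

>>> 1_000_000 * 10 / 1e9
0.01

That is on the order of ten million fuzz.ratio calls, instead of the roughly 10^12 a full quadratic pass over a million rows would need. Skew toward the largest clusters (about 200 rows) raises this somewhat, but nowhere near the quadratic figure.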


df = df_test.sort_values(['cluster_number', 'name'])

You wrote

for index, row in df_test.iterrows():
    for index_, row_ in df_test.iterrows():

Turn that into

for index, row in df.iterrows():
    while ...

and use .iloc[] to examine the relevant rows.

The while loop gets to terminate as soon as a new cluster number is seen, instead of having to slog through hundreds of thousands of rows to the end of the dataframe every time.

Why can it exit early? Due to the sort order.
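A bare-bones sketch of that shape (my own illustration of the suggestion, not tested against the real data; it reuses the asker's threshold of 50, assumes df_test and fuzz from the question are in scope, and resets the index so positions and labels coincide):

df = df_test.sort_values(['cluster_number', 'name']).reset_index(drop=True)
df['id'] = 0

next_id = 1
for i, row in df.iterrows():
    j = i
    # scan forward only while we are still inside the same cluster
    while j < len(df) and df.iloc[j]['cluster_number'] == row['cluster_number']:
        if df.iloc[j]['id'] == 0 and fuzz.ratio(row['name'], df.iloc[j]['name']) > 50:
            df.loc[j, 'id'] = next_id
        j += 1
    # the current row matched itself, so it got next_id iff it was unassigned
    if df.loc[i, 'id'] == next_id:
        next_id += 1

On the test data this yields the same groups as the original double loop, with ids handed out in sorted order.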


A more convenient way to structure this might be to write a clustering helper.

def get_clusters(df):
    cur_num = -1
    cluster = []
    for _, row in df.iterrows():
        if row.cluster_number != cur_num and cluster:
            yield cluster
            cluster = []
        cur_num = row.cluster_number
        cluster.append(row)
    if cluster:  # don't forget to yield the final cluster
        yield cluster

Now your top level code can iterate through a bunch of clusters, performing a fuzzy match of cost O(G^2) on each cluster.

The invariant on each generated cluster is that all rows within it shall have an identical cluster_number.

And, due to the sorting, we guarantee that a given cluster_number shall be generated at most once.
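For illustration, here is one way the top-level code might look. This is only a sketch of the structure described above (not code from the question); it reuses the asker's threshold of 50, assumes df_test, fuzz and get_clusters are defined as above, and relies on the fact that each Series produced by .iterrows() carries its original row label in its .name attribute:

df = df_test.sort_values(['cluster_number', 'name'])

next_id = 1
assigned = {}                         # original row label -> assigned id
for cluster in get_clusters(df):
    for row in cluster:
        if row.name in assigned:      # Series.name is the row label, not the 'name' column
            continue
        # O(G^2) fuzzy match: label every not-yet-assigned row of this
        # cluster that is similar enough to the current row
        for other in cluster:
            if other.name not in assigned and fuzz.ratio(row['name'], other['name']) > 50:
                assigned[other.name] = next_id
        next_id += 1

df['id'] = df.index.map(assigned)

On the test data this produces the same groups as the original double loop, just with the id numbers assigned in sorted order.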


https://stackoverflow.com/help/self-answer

Please measure the current running time, implement these suggestions, measure again, and post your code + timings.

Attempt #1

Based on @J_H's suggestions I made some changes to the original code:

import pandas as pd
from thefuzz import fuzz

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat', 'Fish', 'Dry Fish'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2, 2, 2]
}

df_test = pd.DataFrame(d_test)

df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)

df_test['id'] = 0

i = 1
is_i_used = False
for index, row in df_test.iterrows():
    index_ = index

    while index_ < len(df_test) and df_test.loc[index, 'cluster_number'] == df_test.loc[index_, 'cluster_number'] and df_test.loc[index_, 'id'] == 0:
        if row['name'] == df_test.loc[index_, 'name'] or fuzz.ratio(row['name'], df_test.loc[index_, 'name']) > 50:
            df_test.loc[index_, 'id'] = i
            is_i_used = True
        index_ += 1

    if is_i_used == True:
        i += 1
        is_i_used = False

Now, instead of hours of computation, it runs in only 210 seconds on a dataframe with 1 million rows, where on average each cluster has about 10 rows and the maximum cluster size is about 200 rows.

While this is a significant improvement, I am still looking for a vectorized option.

Attempt #2

I created a vectorized version:

import numpy as np
import pandas as pd
from rapidfuzz import process, fuzz

df_test = pd.DataFrame(d_test)
names = df_test["name"]
scores = pd.DataFrame(process.cdist(names, names, workers=-1), columns=names, index=names)
x, y = np.where(scores > 50)
groups = (pd.DataFrame(scores.index[x], scores.index[y])
           .groupby(level=0)
           .agg(frozenset)
           .drop_duplicates()
           .reset_index(drop=True)
           .reset_index()
           .explode("name"))
groups.rename(columns={'index': 'restaurant_id'}, inplace=True)
groups.restaurant_id += 1
df_test = df_test.merge(groups, how="left")

but it is not possible to use it on a dataframe with 1 million rows, because cdist returns a matrix of len(queries) x len(choices) x size(dtype). By default this dtype is float. So for 1 million names, the result matrix would require about 3.6 terabytes of memory.
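As a sanity check of that figure (my own arithmetic, assuming the scores are stored as 4-byte floats per the default noted above; the result is in tebibytes):

>>> 1_000_000 ** 2 * 4 / 2 ** 40
3.637978807091713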

I think you are approaching this very analytically. Try this:
What I am doing here is assigning a non-repeating ID number (details below).

import pandas as pd

d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}

df_test = pd.DataFrame(d_test)
df_test = df_test.sort_values(['cluster_number', 'name'])
df_test.reset_index(drop=True, inplace=True)

# Does the word occur more than once ? (int)
repeat = 0

for i in range(df_test.shape[0]):
    heywtu = df_test[df_test['name'].str.contains(*df_test['name'][i].split())].index.values

    # 0 enters the special case, so we took it as 1 directly.
    if i == 0:
        df_test.loc[i, 'id'] = i + 1
    else:
        # Does the word occur more than once?
        repeat += len(heywtu) == 2

        # Fill the id column with a specific id number
        df_test.loc[i, 'id'] = i - repeat

    # Editing the id of rows sharing a word, other than row 0
    if (len(heywtu) == 2) & (i != 0):
        df_test.loc[i, 'id'] = heywtu[0]
        continue

# Special case, if there are only 2 values:
if len(df_test['name']) == 2:
    df_test.loc[1, 'id'] = 2

# For first d_test values
print(df_test.head(10)) 
>>>           name  cluster_number   id
>>> 0        Beach               1  1.0
>>> 1  South Beach               1  1.0
>>> 2      Big Dog               2  2.0
>>> 3          Cat               2  3.0
>>> 4          Dog               2  2.0
>>> 5          Ant               3  4.0
>>> 6         Bird               3  5.0
>>> 7         Dear               4  6.0

# For the second d_test values (with Fish and Dry Fish added)
print(df_test.head(10))
>>>           name  cluster_number   id
>>> 0        Beach               1  1.0
>>> 1  South Beach               1  1.0
>>> 2      Big Dog               2  2.0
>>> 3          Cat               2  3.0
>>> 4          Dog               2  2.0
>>> 5     Dry Fish               2  4.0
>>> 6         Fish               2  4.0
>>> 7          Ant               3  5.0
>>> 8         Bird               3  6.0
>>> 9         Dear               4  7.0

# If there are only 2 values
df_test.head()
>>>          name  cluster_number   id
>>> 0      Big Dog               1  1.0
>>> 1  South Beach               2  2.0

What is repeat? Well, if another string contains the same word it gets counted, as with Dog and Big Dog, and that count is subtracted from the index number. I hope this is helpful for your problem.

Following up on your own answer, you don't need to compute process.cdist on all the names; you are interested only in those within the same cluster.

To do so, you can iterate over groups:

import numpy as np
import pandas as pd
from rapidfuzz import process

threshold = 50
index_start = 0
groups = []

for grp_name, grp_df in df_test.groupby("cluster_number"):
    
    names = grp_df["name"]

    scores = pd.DataFrame(
        data = (process.cdist(names, names, workers=-1)),
        columns = names,
        index = names
    )

    x, y = np.where(scores > threshold)
    grps_in_group = (pd.DataFrame(scores.index[x], scores.index[y])
        .groupby(level=0)
        .agg(frozenset)
        .drop_duplicates()
        .reset_index(drop=True)
        .assign(restaurant_id = lambda t: t.index + index_start)
        .explode("name")
    )
    

    index_start = grps_in_group["restaurant_id"].max()+1
    groups.append(grps_in_group)
    
df_test.merge(pd.concat(groups), on="name")

|    | name        |   cluster_number |   id |   restaurant_id |
|---:|:------------|-----------------:|-----:|----------------:|
|  0 | Beach       |                1 |    0 |               0 |
|  1 | South Beach |                1 |    0 |               0 |
|  2 | Big Dog     |                2 |    0 |               1 |
|  3 | Cat         |                2 |    0 |               2 |
|  4 | Dog         |                2 |    0 |               1 |
|  5 | Dry Fish    |                2 |    0 |               3 |
|  6 | Fish        |                2 |    0 |               3 |
|  7 | Ant         |                3 |    0 |               4 |
|  8 | Bird        |                3 |    0 |               5 |
|  9 | Dear        |                4 |    0 |               6 |

Yet I am not sure this is an improvement.

To vectorize and speed up the double for-loop in your code, you can use the apply function with the axis parameter to apply a function to each row or column, or the iterrows function to iterate over the rows of the dataframe and operate on each one. However, neither of these is truly vectorized, and both can still be relatively slow compared to other methods, especially for larger dataframes.

Here is an example of how you can use the apply function to restructure the double for-loop in your code:

import pandas as pd
from thefuzz import fuzz

# Create a sample dataframe
d_test = {
    'name' : ['South Beach', 'Dog', 'Bird', 'Ant', 'Big Dog', 'Beach', 'Dear', 'Cat'],
    'cluster_number' : [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)

# Create the column for the unique IDs up front, since compare_rows reads it
df_test['id'] = 0

# Define a function that takes a row of the dataframe as input and compares the
# name and cluster number of the row to all other rows in the dataframe
def compare_rows(row):
  # Initialize an empty list to store the results
  results = []

  # Iterate over the rows of the dataframe
  for index, row_ in df_test.iterrows():
    # If the row belongs to the same cluster and has not yet been assigned an ID,
    # compare the names and append the result to the results list
    if row['cluster_number'] == row_['cluster_number'] and row_['id'] == 0:
      if fuzz.ratio(row['name'], row_['name']) > 50:
        results.append(1)
      else:
        results.append(0)
    else:
      results.append(0)

  # Return the sum of the results
  return sum(results)

# Apply the function to each row of the dataframe and store the results in a new column
df_test['similarity_score'] = df_test.apply(compare_rows, axis=1)

# Iterate over the rows of the dataframe
for index, row in df_test.iterrows():
  # If the row has a similarity score greater than 0, assign it the next available unique ID
  if row['similarity_score'] > 0:
    df_test.loc[index, 'id'] = df_test['id'].max() + 1

print(df_test)

Output:

          name  cluster_number  similarity_score  id
0  South Beach               1                  1   1
1          Dog               2                  0   0
2         Bird               3                  0   0
3          Ant               3                  0   0
4      Big Dog               2                  0   0
5        Beach               1                  1   1
6         Dear               4                  0   0
7          Cat               2                  0   0

Here are a few ways you can vectorize and speed up a double for-loop that is used to score the similarity of text in a Pandas dataframe:

Use the apply() function to apply a function to each row or column of the dataframe. This is tidier than an explicit for-loop and can be somewhat faster, but apply() still calls your function once per row, so it is not truly vectorized.

Use the apply() function in combination with the numba library to compile the function you want to apply to the dataframe. This can provide further speed improvements because numba uses just-in-time (JIT) compilation to generate optimized machine code for the function.

Use the multiprocessing library to parallelize the for-loop. This can be useful if your dataframe is large and you have multiple cores available on your machine.

Use vectorized string operations provided by Pandas. For example, you can use the str.split() method to split the text in each cell into a list of words, and then use the .isin() method to check if a given word is present in the list. This can be much faster than using a for-loop to iterate over the words in the text.

Consider using a different text similarity measure that is more efficient to compute. There are many different measures of text similarity, and some may be faster to compute than others. For example, the Levenshtein distance or the Jaccard similarity coefficient can be computed more efficiently than some other measures, such as cosine similarity; a minimal word-level Jaccard sketch follows this list.
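As an illustration of that last suggestion (my own sketch, not taken from any of the libraries above): a word-level Jaccard similarity that needs only Python set operations:

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the word sets of two strings, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

print(jaccard("South Beach", "Beach"))  # 0.5
print(jaccard("Big Dog", "Cat"))        # 0.0

Whether a word-level measure is good enough depends on the data; unlike fuzz.ratio it will not catch near-misses inside a single word.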

I hope these suggestions are helpful. Let me know if you have any further questions.
