
Fuzzy search in pyspark dataframe

I have a large csv file (>96 million rows) with seven columns. I want to do a fuzzy search on one of the columns and retrieve the records with the highest similarity to the input string. The file is managed by Spark and I load it via pyspark into a dataframe. Now I want to use something like fuzzywuzzy to extract the rows that match best.

But the fuzzywuzzy extract function returns something that I cannot work with:

process.extract("appel", df.select(df['lowercase']), limit=10)

Result: [(Column<'lowercase'>, 44)]

df is the pyspark dataframe (loaded using spark.read.csv), the column I want to search on is 'lowercase', and I want to retrieve all other columns of the matching rows plus the similarity score.
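For reference, the dataframe is loaded roughly like this (the path and read options here are placeholders, not from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fuzzy-search").getOrCreate()

# Placeholder path and options -- the real file has >96 million rows and seven columns.
df = spark.read.csv("/path/to/data.csv", header=True, inferSchema=True)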

Any suggestions?

You can try other Python libraries such as RapidFuzz, which computes fuzzy string matches given an input string and a list of strings. You can choose whichever string-matching algorithm (scorer) suits your data to compute the matches.

The code would look something like this:

# pip install rapidfuzz
from rapidfuzz import fuzz, process

input_string = 'example string'

# Collect the column into a plain Python list of strings.
# Passing the DataFrame/Column object itself is what produced the
# [(Column<'lowercase'>, 44)] result in the question.
# Note: with >96 million rows, collecting to the driver may not be feasible;
# see the UDF approach below for a fully distributed alternative.
query_list = [row['lowercase'] for row in df.select('lowercase').collect()]

results = process.extract(input_string, query_list, scorer=fuzz.token_ratio, limit=1)

# output format: [('string1', confidence_score, index_in_list), ...]
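To get back the full rows (all columns) plus the score from the original dataframe, one option is to turn the RapidFuzz results into a small dataframe and join on the matched string; a rough sketch, assuming spark is the active SparkSession and the matched values identify their rows:

# Build a small dataframe of (matched string, score) pairs and join it back,
# so every matching row returns with all of its columns plus the similarity score.
matches_df = spark.createDataFrame(
    [(m[0], float(m[1])) for m in results],
    ["lowercase", "similarity_score"],
)
best_rows = df.join(matches_df, on="lowercase", how="inner")
best_rows.show()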
Alternatively, the similarity score can be computed inside Spark with a UDF, so nothing has to be collected to the driver:

from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
from fuzzywuzzy import fuzz


def match_string(s1, s2):
    # fuzz.token_sort_ratio returns an integer score between 0 and 100
    return fuzz.token_sort_ratio(s1, s2)


# Register the UDF with IntegerType, since token_sort_ratio returns an int
MatchUDF = udf(match_string, IntegerType())

# name_1 and name_2 are example column names; adapt them to your dataframe
scores_df = df.withColumn("similarity_score", MatchUDF(F.col("name_1"), F.col("name_2")))\
              .withColumn("run_date", F.current_date())

scores_df.show()
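For the scenario in the question (one fixed search string against the 'lowercase' column), the same UDF can be applied against a literal and the best matches taken by ordering on the score; a sketch using the example string "appel" from the question:

# Score every row of the 'lowercase' column against the fixed search string,
# then show the ten best matches with all of their columns.
results_df = (
    df.withColumn("similarity_score", MatchUDF(F.lit("appel"), F.col("lowercase")))
      .orderBy(F.col("similarity_score").desc())
)
results_df.show(10)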
