
Python string matching with Spark dataframe

I have a Spark dataframe:

id  | city  | fruit | quantity
-------------------------
0   |  CA   | apple | 300
1   |  CA   | appel | 100
2   |  CA   | orange| 20
3   |  CA   | berry | 10

I want to get rows where the fruit is apple or orange. So I use Spark SQL:

SELECT * FROM table WHERE fruit LIKE '%apple%' OR fruit LIKE '%orange%';

It returns

id  | city  | fruit | quantity
-------------------------
0   |  CA   | apple | 300
2   |  CA   | orange| 20

But it is supposed to return

id  | city  | fruit | quantity
-------------------------
0   |  CA   | apple | 300
1   |  CA   | appel | 100
2   |  CA   | orange| 20

since the fruit in row 1 is just a misspelling of apple.

So I plan on using fuzzywuzzy for string matching. I know that:

from fuzzywuzzy import fuzz

print(fuzz.partial_ratio('apple', 'apple'))  # 100
print(fuzz.partial_ratio('apple', 'appel'))  # 83

But I am not sure how to apply this to a column in the dataframe to get the relevant rows.

Since you are interested in implementing fuzzy matching as a filter, you must first decide on a threshold for how similar you would like the matches to be.
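
For reference, here is a minimal sketch that recreates the sample dataframe from your question (assuming a standard SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the question
df = spark.createDataFrame(
    [
        (0, "CA", "apple", 300),
        (1, "CA", "appel", 100),
        (2, "CA", "orange", 20),
        (3, "CA", "berry", 10),
    ],
    ["id", "city", "fruit", "quantity"],
)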

Approach 1

For your fuzzywuzzy import, this could be 80 for the purpose of this demonstration (adjust based on your needs). You could then implement a UDF to apply your imported fuzzy matching logic, e.g.

from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(T.BooleanType())
def is_fuzzy_match(field_value, search_value, threshold=80):
    # Import inside the function so it is available on the executors
    from fuzzywuzzy import fuzz
    # Keep the row when the partial-ratio score clears the threshold
    return fuzz.partial_ratio(field_value, search_value) > threshold

Then apply your UDF as a filter on your dataframe:

df = (
    df.where(
        is_fuzzy_match(F.col("fruit"), F.lit("apple")) |
        is_fuzzy_match(F.col("fruit"), F.lit("orange"))
    )
)
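
With the sample data, this filter should keep rows 0, 1 and 2: fuzz.partial_ratio('appel', 'apple') is 83, which clears the threshold of 80, while berry scores well below it against both search terms:

df.show()
# Expected output, matching the levenshtein result further below:
# +---+----+------+--------+
# | id|city| fruit|quantity|
# +---+----+------+--------+
# |  0|  CA| apple|     300|
# |  1|  CA| appel|     100|
# |  2|  CA|orange|      20|
# +---+----+------+--------+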

Approach 2: Recommended

However, UDFs can be expensive to execute on Spark, and Spark already implements the levenshtein function, which is also useful here. You may want to read more about how the Levenshtein distance accomplishes fuzzy matching.
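
As a quick sanity check, you can evaluate the built-in function directly (a sketch, assuming the spark session from above):

spark.sql("SELECT levenshtein('apple', 'appel') AS dist").show()
# +----+
# |dist|
# +----+
# |   2|
# +----+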

With this approach, your code could look like the following, using a threshold of 3:

from pyspark.sql import functions as F

df = df.where(
    (F.levenshtein(F.col("fruit"), F.lit("apple")) < 3) |
    (F.levenshtein(F.col("fruit"), F.lit("orange")) < 3)
)

df.show()
+---+----+------+--------+
| id|city| fruit|quantity|
+---+----+------+--------+
|  0|  CA| apple|     300|
|  1|  CA| appel|     100|
|  2|  CA|orange|      20|
+---+----+------+--------+
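
Since your original attempt used Spark SQL, note that the same filter can be expressed directly in SQL, as levenshtein is available there too (a sketch against the original, unfiltered dataframe; fruits is a hypothetical view name):

df.createOrReplaceTempView("fruits")  # hypothetical view name
spark.sql("""
    SELECT * FROM fruits
    WHERE levenshtein(fruit, 'apple') < 3
       OR levenshtein(fruit, 'orange') < 3
""").show()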

For debugging purposes, the result of the levenshtein comparison against 'apple' is included below:

df.withColumn("diff",
    F.levenshtein(
        F.col("fruit"),
        F.lit("apple")
    )
).show()
+---+----+------+--------+----+
| id|city| fruit|quantity|diff|
+---+----+------+--------+----+
|  0|  CA| apple|     300|   0|
|  1|  CA| appel|     100|   2|
|  2|  CA|orange|      20|   5|
|  3|  CA| berry|      10|   5|
+---+----+------+--------+----+
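
If the list of search terms grows, you can build the combined condition programmatically instead of chaining | by hand. A minimal sketch, where search_terms and the threshold of 3 are assumptions to adjust:

from functools import reduce
from pyspark.sql import functions as F

search_terms = ["apple", "orange"]  # assumed list; extend as needed

# OR together one levenshtein condition per search term
condition = reduce(
    lambda acc, cond: acc | cond,
    [F.levenshtein(F.col("fruit"), F.lit(term)) < 3 for term in search_terms],
)
df = df.where(condition)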
