
PySpark - how to replace similar strings in the same column?

I have a dataframe column containing similar strings. How can I replace each of them with the similar value that appears first in the column? (I tried Levenshtein distance and fuzzywuzzy, but they only give me similarity ratios; they don't replace the values.)

Key  Value
1     A
1     AA
1     A,AAB
1     AAB
2     B
2     BA

Output should be

Key  Value
1     A
1     A
1     A
1     A
2     B
2     B

Every time I get the same result as the input.

Extract the first alphanumeric character using regex.

from pyspark.sql.functions import regexp_extract, col

df = df.withColumn('New_Value', regexp_extract(col('Value'), r'(^\w)', 1))


+---+-----+---------+
|Key|Value|New_Value|
+---+-----+---------+
|  1|    A|        A|
|  1|   AA|        A|
|  1|A,AAB|        A|
|  1|  AAB|        A|
|  2|    B|        B|
|  2|   BA|        B|
+---+-----+---------+
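If you literally want "the value I see first in the column" per Key rather than the first character, a window function is another option. Below is a minimal sketch of that idea, not the answer's method: it assumes the row order shown in the question, and since Spark dataframes have no inherent order it captures that order with monotonically_increasing_id before partitioning by Key.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import first, monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 'A'), (1, 'AA'), (1, 'A,AAB'), (1, 'AAB'), (2, 'B'), (2, 'BA')],
    ['Key', 'Value'],
)

# Capture the current row order so "first" is well defined.
df = df.withColumn('_order', monotonically_increasing_id())

# For each Key, take the Value of the earliest row and broadcast it
# to every row of that Key.
w = Window.partitionBy('Key').orderBy('_order')
df = df.withColumn('New_Value', first('Value').over(w)).drop('_order')

df.show()

This gives the same New_Value column as the regex approach for the sample data, but it keeps whole first values (e.g. it would keep 'AB' rather than truncating it to 'A'), so choose whichever matches your actual data.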

