
PySpark - how to replace similar strings in the same column?

I have a dataframe column containing similar strings. How can I replace each of them with the similar value that appears first in the column? (I tried Levenshtein distance and fuzzywuzzy, but they only give me similarity ratios; they don't replace the values.)

Key  Value
1     A
1     AA
1     A,AAB
1     AAB
2     B
2     BA

Output should be

Key  Value
1     A
1     A
1     A
1     A
2     B
2     B

Every time I get the same result as the input.

Extract the first alphanumeric character using regex.

from pyspark.sql.functions import regexp_extract, col

df = df.withColumn('New_Value', regexp_extract(col('Value'), r'(^\w)', 1))


+---+-----+---------+
|Key|Value|New_Value|
+---+-----+---------+
|  1|    A|        A|
|  1|   AA|        A|
|  1|A,AAB|        A|
|  1|  AAB|        A|
|  2|    B|        B|
|  2|   BA|        B|
+---+-----+---------+
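If you literally want "the value I see first in the column" per Key rather than the first character, a window function is another option. Below is a minimal sketch of that idea, not the answer's method: it assumes the row order shown in the question, and since Spark dataframes have no inherent order it captures that order with monotonically_increasing_id before partitioning by Key.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import first, monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 'A'), (1, 'AA'), (1, 'A,AAB'), (1, 'AAB'), (2, 'B'), (2, 'BA')],
    ['Key', 'Value'],
)

# Capture the current row order so "first" is well defined.
df = df.withColumn('_order', monotonically_increasing_id())

# For each Key, take the Value of the earliest row and broadcast it
# to every row of that Key.
w = Window.partitionBy('Key').orderBy('_order')
df = df.withColumn('New_Value', first('Value').over(w)).drop('_order')

df.show()

This gives the same New_Value column as the regex approach for the sample data, but it keeps whole first values (e.g. it would keep 'AB' rather than truncating it to 'A'), so choose whichever matches your actual data.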

