
Tokenizing and ranking a string column into multiple columns in PySpark

I have a PySpark DataFrame with a string column that contains a comma-separated, unsorted list of up to 5 values, like this:

+----+----------------------+
|col1|col2                  |
+----+----------------------+
|1   | 'b1, a1, c1'         |
|2   | 'a2, b2'             |
|3   | 'e3, d3, a3, c3, b3' |
+----+----------------------+

I want to tokenize col2, rank the tokens according to a criterion, and spread them across 5 new columns, with null values where tokenization yields fewer than 5 tokens. The ranking is simple: if a token is in set1, it goes in the first new column (col3); if it is in set2, it goes in the second new column (col4); and so on.

Let's say:

set1 = ['a1', 'a2', 'a3', 'a4', 'a5'], 
set2 = ['b1', 'b2', 'b3', 'b4', 'b5'], 
set3 = ['c1', 'c2', 'c3', 'c4', 'c5'], 
set4 = ['d1', 'd2', 'd3', 'd4', 'd5'], 
set5 = ['e1', 'e2', 'e3', 'e4', 'e5']

Then applying the change on the dataframe above will result in the following dataframe:

+----+----+----+----+----+----+
|col1|col3|col4|col5|col6|col7|
+----+----+----+----+----+----+
|1   |'a1'|'b1'|'c1'|null|null|
|2   |'a2'|'b2'|null|null|null|
|3   |'a3'|'b3'|'c3'|'d3'|'e3'|
+----+----+----+----+----+----+

I know how to do tokenization:

from pyspark.sql.functions import col, split

df.withColumn('col2', split('col2', ', ')) \
  .select(col('col1'), *[col('col2')[i].alias('col' + str(i + 3)) for i in range(0, 5)]) \
  .show()

but I can't figure out how to perform the ranking before creating the new columns. Any help would be much appreciated.

I found a solution for this. We can use a UDF that sorts the list of strings in the column based on the sets, then apply the tokenization on top of the UDF's output and create the new columns from it.

set1 = {'a1', 'a2', 'a3', 'a4', 'a5'}
set2 = {'b1', 'b2', 'b3', 'b4', 'b5'}
set3 = {'c1', 'c2', 'c3', 'c4', 'c5'}
set4 = {'d1', 'd2', 'd3', 'd4', 'd5'}
set5 = {'e1', 'e2', 'e3', 'e4', 'e5'}

def sortCategories(x):
    # One slot per category; 'unknown' marks a category with no token.
    resultArray = ['unknown' for i in range(5)]
    # Strip whitespace so tokens like ' a1' (from 'b1, a1') match the sets.
    tokens = [t.strip() for t in x.split(',')]
    for token in tokens:
        if token in set1:
            resultArray[0] = token
        elif token in set2:
            resultArray[1] = token
        elif token in set3:
            resultArray[2] = token
        elif token in set4:
            resultArray[3] = token
        elif token in set5:
            resultArray[4] = token
    return ','.join(resultArray)
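
For example, sortCategories('b1, a1, c1') returns 'a1,b1,c1,unknown,unknown', so after splitting this string again each token lands in the slot of its category.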

from pyspark.sql.functions import col, split, udf
from pyspark.sql.types import StringType

orderUdfString = udf(sortCategories, StringType())
df = df.withColumn('col2', orderUdfString('col2'))
df = df.withColumn('col_temp', split('col2', ',')) \
  .select([col(c) for c in df.columns] + [col('col_temp')[i].alias('col' + str(i + 3)) for i in range(0, 5)])
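
Note that the round trip through ','.join leaves the literal string 'unknown' in empty slots rather than the nulls shown in the desired output. A minimal follow-up sketch, assuming the new columns are named col3 through col7 as above, that replaces the placeholders with real nulls:

from pyspark.sql.functions import when

# when() without an otherwise() yields null where the condition is false,
# turning each 'unknown' placeholder into a real null.
for c in ['col3', 'col4', 'col5', 'col6', 'col7']:
    df = df.withColumn(c, when(col(c) != 'unknown', col(c)))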
