
Tokenizing and ranking a string column into multiple columns in PySpark

I have a PySpark DataFrame with a string column that contains a comma-separated, unsorted list of up to 5 values, like this:

+----+----------------------+
|col1|col2                  |
+----+----------------------+
|1   | 'b1, a1, c1'         |
|2   | 'a2, b2'             |
|3   | 'e3, d3, a3, c3, b3' |
+----+----------------------+

I want to tokenize col2, rank the tokens according to a criterion, and spread them across 5 new columns, with null values where tokenization yields fewer than 5 tokens. The ranking is simple: if a token is in set1, it goes in the first new column (col3); if it is in set2, it goes in the second new column (col4); and so on.

Let's say:

set1 = ['a1', 'a2', 'a3', 'a4', 'a5'], 
set2 = ['b1', 'b2', 'b3', 'b4', 'b5'], 
set3 = ['c1', 'c2', 'c3', 'c4', 'c5'], 
set4 = ['d1', 'd2', 'd3', 'd4', 'd5'], 
set5 = ['e1', 'e2', 'e3', 'e4', 'e5']

Then applying the change on the dataframe above will result in the following dataframe:

+----+----+----+----+----+----+
|col1|col3|col4|col5|col6|col7|
+----+----+----+----+----+----+
|1   |'a1'|'b1'|'c1'|null|null|
|2   |'a2'|'b2'|null|null|null|
|3   |'a3'|'b3'|'c3'|'d3'|'e3'|
+----+----+----+----+----+----+

I know how to do tokenization:

from pyspark.sql.functions import col, split

df.withColumn('col2', split('col2', ', ')) \
  .select(col('col1'), *[col('col2')[i].alias('col' + str(i + 3)) for i in range(0, 5)]) \
  .show()

but I can't figure out how to perform the ranking before creating the new columns. Any help would be much appreciated.

I found a solution for this. We can use a UDF that sorts the list of strings in the column based on the sets, then apply the tokenization on top of the UDF's output and create the new columns from it.

set1 = {'a1', 'a2', 'a3', 'a4', 'a5'}
set2 = {'b1', 'b2', 'b3', 'b4', 'b5'}
set3 = {'c1', 'c2', 'c3', 'c4', 'c5'}
set4 = {'d1', 'd2', 'd3', 'd4', 'd5'}
set5 = {'e1', 'e2', 'e3', 'e4', 'e5'}

def sortCategories(x):
    # One slot per category; 'unknown' marks a category with no token.
    resultArray = ['unknown' for i in range(5)]
    # Strip whitespace so tokens like ' a1' (from 'b1, a1') match the sets.
    tokens = [t.strip() for t in x.split(',')]
    for token in tokens:
        if token in set1:
            resultArray[0] = token
        elif token in set2:
            resultArray[1] = token
        elif token in set3:
            resultArray[2] = token
        elif token in set4:
            resultArray[3] = token
        elif token in set5:
            resultArray[4] = token
    return ','.join(resultArray)
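
For example, sortCategories('b1, a1, c1') returns 'a1,b1,c1,unknown,unknown', so after splitting this string again each token lands in the slot of its category.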

from pyspark.sql.functions import col, split, udf
from pyspark.sql.types import StringType

orderUdfString = udf(sortCategories, StringType())
df = df.withColumn('col2', orderUdfString('col2'))
df = df.withColumn('col_temp', split('col2', ',')) \
  .select([col(c) for c in df.columns] + [col('col_temp')[i].alias('col' + str(i + 3)) for i in range(0, 5)])
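
Note that the round trip through ','.join leaves the literal string 'unknown' in empty slots rather than the nulls shown in the desired output. A minimal follow-up sketch, assuming the new columns are named col3 through col7 as above, that replaces the placeholders with real nulls:

from pyspark.sql.functions import when

# when() without an otherwise() yields null where the condition is false,
# turning each 'unknown' placeholder into a real null.
for c in ['col3', 'col4', 'col5', 'col6', 'col7']:
    df = df.withColumn(c, when(col(c) != 'unknown', col(c)))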
