
Transform rows and columns and create a similarity dataframe using PySpark

I already have a normalized dataset:

from pyspark.sql import SparkSession

# assumes a local session; in a notebook `spark` usually already exists
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([('red apple', 'ripe banana', 0.3),
                            ('red apple', 'hot pepper', 0.4),
                            ('red apple', 'sweet kiwi', 0.5),
                            ('ripe banana', 'hot pepper', 0.6),
                            ('ripe banana', 'sweet kiwi', 0.7),
                            ('hot pepper', 'sweet kiwi', 0.8)],
                           ['phrase1', 'phrase2', 'common_persent'])
df.show()
df.show()

+-----------+-----------+--------------+
|    phrase1|    phrase2|common_persent|
+-----------+-----------+--------------+
|  red apple|ripe banana|           0.3|
|  red apple| hot pepper|           0.4|
|  red apple| sweet kiwi|           0.5|
|ripe banana| hot pepper|           0.6|
|ripe banana| sweet kiwi|           0.7|
| hot pepper| sweet kiwi|           0.8|
+-----------+-----------+--------------+

I want to create a similarity matrix using PySpark. The result should look like this:

+-----------+---------+-----------+----------+----------+
|    phrases|red apple|ripe banana|hot pepper|sweet kiwi|
+-----------+---------+-----------+----------+----------+
|  red apple|      1.0|        0.3|       0.4|       0.5|
|ripe banana|      0.3|        1.0|       0.6|       0.7|
| hot pepper|      0.4|        0.6|       1.0|       0.8|
| sweet kiwi|      0.5|        0.7|       0.8|       1.0|
+-----------+---------+-----------+----------+----------+

Thus, there are ones (1.0) on the diagonal, and above/below it the corresponding common_persent values. My problem is that I don't quite understand how to get the ones onto the diagonal.

You can simply pivot your dataframe.

from pyspark.sql import functions as f

df.groupBy('phrase1').pivot('phrase2').agg(f.first('common_persent')).show()

+-----------+----------+-----------+----------+
|    phrase1|hot pepper|ripe banana|sweet kiwi|
+-----------+----------+-----------+----------+
|  red apple|       0.4|        0.3|       0.5|
|ripe banana|       0.6|       null|       0.7|
| hot pepper|      null|       null|       0.8|
+-----------+----------+-----------+----------+
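As a side note (not part of the original code): pivot also accepts an explicit list of values. Providing it skips the extra job Spark otherwise runs to collect the distinct phrase2 values, and makes the column order deterministic. A minimal sketch, using the phrase2 values from the sample data:

from pyspark.sql import functions as f

# passing the pivot values explicitly avoids an extra pass over the data
# and fixes the column order
values = ['ripe banana', 'hot pepper', 'sweet kiwi']
df.groupBy('phrase1').pivot('phrase2', values).agg(f.first('common_persent')).show()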

The pivot alone, as shown above, yields an incomplete matrix: each pair appears in only one direction and the diagonal is absent, so some cells stay null. To transform your dataframe into a similarity matrix, you have to follow several steps:

  • first, create the list of permutations of phrase1/phrase2 by creating an array and then using explode on this array (a sketch of this step follows the list)
  • then, pivot your dataframe as explained in Lamanus' answer
  • finally, fill null values with 1.0 using fillna
  • optionally, reorder your phrases column
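To make the first step concrete, here is a minimal sketch of it in isolation (the `pairs` name is just for illustration):

from pyspark.sql import functions as F

# step 1 only: emit each pair in both directions so the later pivot
# fills both triangles of the matrix (6 rows in, 12 rows out)
pairs = df.select(
    F.explode(
        F.array(
            F.struct(F.col('phrase1').alias('first'), F.col('phrase2').alias('second')),
            F.struct(F.col('phrase2').alias('first'), F.col('phrase1').alias('second'))
        )
    ).alias('pair'),
    'common_persent'
).select('pair.*', 'common_persent')
pairs.show()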

The complete code is as follows:

from pyspark.sql import functions as F

result = (
    df
    # step 1: each (phrase1, phrase2) row becomes two rows, one per direction
    .withColumn(
        'permutations',
        F.array(
            F.struct(F.col('phrase1').alias('first'), F.col('phrase2').alias('second')),
            F.struct(F.col('phrase2').alias('first'), F.col('phrase1').alias('second'))
        )
    )
    .drop('phrase1', 'phrase2')
    .select(F.explode('permutations').alias('pair'), 'common_persent')
    .select('pair.*', 'common_persent')
    # step 2: pivot into a square matrix
    .groupBy(F.col('first').alias('phrases'))
    .pivot('second')
    .agg(F.first('common_persent'))
    # step 3: the only nulls left are the diagonal cells, so fill them with 1.0
    .fillna(1.0)
    # step 4: order the rows
    .orderBy('phrases')
)

With your df dataframe, it gives you the following result dataframe:

+-----------+----------+---------+-----------+----------+
|phrases    |hot pepper|red apple|ripe banana|sweet kiwi|
+-----------+----------+---------+-----------+----------+
|hot pepper |1.0       |0.4      |0.6        |0.8       |
|red apple  |0.4       |1.0      |0.3        |0.5       |
|ripe banana|0.6       |0.3      |1.0        |0.7       |
|sweet kiwi |0.8       |0.5      |0.7        |1.0       |
+-----------+----------+---------+-----------+----------+
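The pivot emits its columns in alphabetical order; if you want the layout from the question instead, you can select the columns explicitly. A small sketch, assuming the result dataframe from above:

# reorder the pivoted columns to match the layout asked for in the question
order = ['red apple', 'ripe banana', 'hot pepper', 'sweet kiwi']
result.select('phrases', *order).show()

The same idea applies to the rows if you need a non-alphabetical row order, e.g. by sorting on a mapped key instead of orderBy('phrases').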
