
Transform rows and columns and create a similarity dataframe using PySpark

I already have a normalized dataset:

from pyspark.sql import SparkSession

# assumes a local session; in a notebook `spark` usually already exists
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([('red apple', 'ripe banana', 0.3),
                            ('red apple', 'hot pepper', 0.4),
                            ('red apple', 'sweet kiwi', 0.5),
                            ('ripe banana', 'hot pepper', 0.6),
                            ('ripe banana', 'sweet kiwi', 0.7),
                            ('hot pepper', 'sweet kiwi', 0.8)],
                           ['phrase1', 'phrase2', 'common_persent'])
df.show()
df.show()

+-----------+-----------+--------------+
|    phrase1|    phrase2|common_persent|
+-----------+-----------+--------------+
|  red apple|ripe banana|           0.3|
|  red apple| hot pepper|           0.4|
|  red apple| sweet kiwi|           0.5|
|ripe banana| hot pepper|           0.6|
|ripe banana| sweet kiwi|           0.7|
| hot pepper| sweet kiwi|           0.8|
+-----------+-----------+--------------+

I want to create a similarity matrix using PySpark. The result should look like this:

+-----------+---------+-----------+----------+----------+
|    phrases|red apple|ripe banana|hot pepper|sweet kiwi|
+-----------+---------+-----------+----------+----------+
|  red apple|      1.0|        0.3|       0.4|       0.5|
|ripe banana|      0.3|        1.0|       0.6|       0.7|
| hot pepper|      0.4|        0.6|       1.0|       0.8|
| sweet kiwi|      0.5|        0.7|       0.8|       1.0|
+-----------+---------+-----------+----------+----------+

Thus, there are ones (1.0) on the diagonal, and above/below it the corresponding common_persent values. My problem is that I don't quite understand how to get the ones onto the diagonal.

You can simply pivot your dataframe.

from pyspark.sql import functions as f

df.groupBy('phrase1').pivot('phrase2').agg(f.first('common_persent')).show()

+-----------+----------+-----------+----------+
|    phrase1|hot pepper|ripe banana|sweet kiwi|
+-----------+----------+-----------+----------+
|  red apple|       0.4|        0.3|       0.5|
|ripe banana|       0.6|       null|       0.7|
| hot pepper|      null|       null|       0.8|
+-----------+----------+-----------+----------+
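As a side note (not part of the original code): pivot also accepts an explicit list of values. Providing it skips the extra job Spark otherwise runs to collect the distinct phrase2 values, and makes the column order deterministic. A minimal sketch, using the phrase2 values from the sample data:

from pyspark.sql import functions as f

# passing the pivot values explicitly avoids an extra pass over the data
# and fixes the column order
values = ['ripe banana', 'hot pepper', 'sweet kiwi']
df.groupBy('phrase1').pivot('phrase2', values).agg(f.first('common_persent')).show()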

The pivot alone, as shown above, yields an incomplete matrix: each pair appears in only one direction and the diagonal is absent, so some cells stay null. To transform your dataframe into a similarity matrix, you have to follow several steps:

  • first, create the list of permutations of phrase1/phrase2 by creating an array and then using explode on this array (a sketch of this step follows the list)
  • then, pivot your dataframe as explained in Lamanus' answer
  • finally, fill null values with 1.0 using fillna
  • optionally, reorder your phrases column
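To make the first step concrete, here is a minimal sketch of it in isolation (the `pairs` name is just for illustration):

from pyspark.sql import functions as F

# step 1 only: emit each pair in both directions so the later pivot
# fills both triangles of the matrix (6 rows in, 12 rows out)
pairs = df.select(
    F.explode(
        F.array(
            F.struct(F.col('phrase1').alias('first'), F.col('phrase2').alias('second')),
            F.struct(F.col('phrase2').alias('first'), F.col('phrase1').alias('second'))
        )
    ).alias('pair'),
    'common_persent'
).select('pair.*', 'common_persent')
pairs.show()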

The complete code is as follows:

from pyspark.sql import functions as F

result = (
    df
    # step 1: each (phrase1, phrase2) row becomes two rows, one per direction
    .withColumn(
        'permutations',
        F.array(
            F.struct(F.col('phrase1').alias('first'), F.col('phrase2').alias('second')),
            F.struct(F.col('phrase2').alias('first'), F.col('phrase1').alias('second'))
        )
    )
    .drop('phrase1', 'phrase2')
    .select(F.explode('permutations').alias('pair'), 'common_persent')
    .select('pair.*', 'common_persent')
    # step 2: pivot into a square matrix
    .groupBy(F.col('first').alias('phrases'))
    .pivot('second')
    .agg(F.first('common_persent'))
    # step 3: the only nulls left are the diagonal cells, so fill them with 1.0
    .fillna(1.0)
    # step 4: order the rows
    .orderBy('phrases')
)

With your df dataframe, it gives you the following result dataframe:

+-----------+----------+---------+-----------+----------+
|phrases    |hot pepper|red apple|ripe banana|sweet kiwi|
+-----------+----------+---------+-----------+----------+
|hot pepper |1.0       |0.4      |0.6        |0.8       |
|red apple  |0.4       |1.0      |0.3        |0.5       |
|ripe banana|0.6       |0.3      |1.0        |0.7       |
|sweet kiwi |0.8       |0.5      |0.7        |1.0       |
+-----------+----------+---------+-----------+----------+
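The pivot emits its columns in alphabetical order; if you want the layout from the question instead, you can select the columns explicitly. A small sketch, assuming the result dataframe from above:

# reorder the pivoted columns to match the layout asked for in the question
order = ['red apple', 'ripe banana', 'hot pepper', 'sweet kiwi']
result.select('phrases', *order).show()

The same idea applies to the rows if you need a non-alphabetical row order, e.g. by sorting on a mapped key instead of orderBy('phrases').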
