I already have a normalized dataset:
df = spark.createDataFrame([
    ('red apple', 'ripe banana', 0.3),
    ('red apple', 'hot pepper', 0.4),
    ('red apple', 'sweet kiwi', 0.5),
    ('ripe banana', 'hot pepper', 0.6),
    ('ripe banana', 'sweet kiwi', 0.7),
    ('hot pepper', 'sweet kiwi', 0.8)
], ['phrase1', 'phrase2', 'common_persent'])
df.show()
+-----------+-----------+--------------+
| phrase1| phrase2|common_persent|
+-----------+-----------+--------------+
| red apple|ripe banana| 0.3|
| red apple| hot pepper| 0.4|
| red apple| sweet kiwi| 0.5|
|ripe banana| hot pepper| 0.6|
|ripe banana| sweet kiwi| 0.7|
| hot pepper| sweet kiwi| 0.8|
+-----------+-----------+--------------+
I want to create a matrix of similarity using pyspark. The result should look like this:
+-----------+---------+-----------+----------+----------+
| phrases|red apple|ripe banana|hot pepper|sweet kiwi|
+-----------+---------+-----------+----------+----------+
| red apple| 1.0| 0.3| 0.4| 0.5|
|ripe banana| 0.3| 1.0| 0.6| 0.7|
| hot pepper| 0.4| 0.6| 1.0| 0.8|
| sweet kiwi| 0.5| 0.7| 0.8| 1.0|
+-----------+---------+-----------+----------+----------+
Thus, there are ones on the diagonal, and above/below the diagonal the values from the common_persent column. My problem is that I don't quite understand how to get the ones on the diagonal.
You can simply pivot your dataframe.
from pyspark.sql import functions as f
df.groupBy('phrase1').pivot('phrase2').agg(f.first('common_persent')).show()
+-----------+----------+-----------+----------+
| phrase1|hot pepper|ripe banana|sweet kiwi|
+-----------+----------+-----------+----------+
| red apple| 0.4| 0.3| 0.5|
|ripe banana| 0.6| null| 0.7|
| hot pepper| null| null| 0.8|
+-----------+----------+-----------+----------+
To transform your dataframe into a full similarity matrix, you have to follow several steps:
- build an array containing both permutations of each pair, (phrase1, phrase2) and (phrase2, phrase1), so every pair appears in both directions
- explode on this array
- pivot the result
- replace the null values on the diagonal with 1.0 using fillna
- order by the phrases column

The complete code is as follows:
from pyspark.sql import functions as F
result = df.withColumn(
'permutations',
F.array(
F.struct(F.col('phrase1').alias('first'), F.col('phrase2').alias('second')),
F.struct(F.col('phrase2').alias('first'), F.col('phrase1').alias('second'))
)
) \
.drop('phrase1', 'phrase2') \
.select(F.explode('permutations'), F.col('common_persent')) \
.select("col.*", "common_persent") \
.groupBy(F.col('first').alias('phrases')) \
.pivot('second') \
.agg(F.first('common_persent')) \
.fillna(1.0) \
.orderBy('phrases')
With your df dataframe, it gives you the following result:
+-----------+----------+---------+-----------+----------+
|phrases |hot pepper|red apple|ripe banana|sweet kiwi|
+-----------+----------+---------+-----------+----------+
|hot pepper |1.0 |0.4 |0.6 |0.8 |
|red apple |0.4 |1.0 |0.3 |0.5 |
|ripe banana|0.6 |0.3 |1.0 |0.7 |
|sweet kiwi |0.8 |0.5 |0.7 |1.0 |
+-----------+----------+---------+-----------+----------+