Recently I've been working on a project using PySpark and I came across a problem that I don't know how to solve. The manipulation involves 3 files, each of which looks like the following:
File 1: maps each id in one id set (idset1) to an id in another id set (idset2)
lines look like
[000001, 15120001]
[000002, 15120002]
...
File 2: maps each id in idset2 to the items contained in that id
lines look like
[15120001, 600001]
[15120001, 600002]
[15120002, 601988]
...
File 3: a series of numbers corresponding to each item
lines look like
[600001, 1.11, 1.12, 1.32, 1.42, ..., 1.51]
[600002, 5.12, 5.21, 5.23, 5.21, ..., 5.21]
[601988, 52.1, 52.1, 52.2, 52.4, ..., 52.1]
...
What I need to do is to get something like
[000001, (1.11+5.12)/2, (1.12+5.21)/2, ..., (1.51+5.21)/2]
[000002, 52.1, 52.1, 52.2, 52.4, ..., 52.1]
...
i.e. map each id in idset1 to the equally weighted average over the items of the idset2 ids corresponding to it.
If someone understands what I mean, please help me with this. By the way, the ids are not auto-incremented; they are pre-assigned. Thanks in advance to everyone who tries to help.
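To make the intended computation concrete, here is a minimal plain-Python sketch of the same averaging (no Spark; the dicts simply mirror the sample rows above, and all names are illustrative):

```python
# Plain-Python sketch of the intended computation, using the sample rows
# from the three files. Dict names are illustrative, not from the question.
file1 = {"000001": "15120001", "000002": "15120002"}       # idset1 -> idset2
file2 = {"15120001": ["600001", "600002"],                 # idset2 -> item ids
         "15120002": ["601988"]}
file3 = {"600001": [1.11, 1.12, 1.32, 1.42, 1.51],         # item -> numbers
         "600002": [5.12, 5.21, 5.23, 5.21, 5.21],
         "601988": [52.1, 52.1, 52.2, 52.4, 52.1]}

result = {}
for id1, id2 in file1.items():
    rows = [file3[item] for item in file2[id2]]
    # equally weighted average, position by position
    result[id1] = [sum(col) / len(col) for col in zip(*rows)]
```

For `"000001"` this averages the `600001` and `600002` rows element-wise; for `"000002"` there is a single item, so its row passes through unchanged.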
Let's start by creating example data. I assume all ids are actually strings, but it doesn't really affect further computations.
rdd1 = sc.parallelize([["000001", "15120001"], ["000002", "15120002"]])
rdd2 = sc.parallelize([
["15120001", "600001"], ["15120001", "600002"],
["15120002", "601988"]
])
rdd3 = sc.parallelize([
["600001", 1.11, 1.12, 1.32, 1.42, 1.51],
["600002", 5.12, 5.21, 5.23, 5.21, 5.21],
["601988", 52.1, 52.1, 52.2, 52.4, 52.1]
])
Next, let's convert all RDDs to DataFrames:
df1 = rdd1.toDF(("id1", "id2"))
df2 = rdd2.toDF(("id2_", "item_id"))
n_features = len(rdd3.first()) - 1
feature_names = ["x_{0}".format(i) for i in range(n_features)]
df3 = rdd3.toDF(["item_id_"] + feature_names)
Join data:
from pyspark.sql.functions import col
combined = (df1
.join(df2, col("id2") == col("id2_"))
.join(df3, col("item_id") == col("item_id_")))
and aggregate:
from pyspark.sql.functions import avg
exprs = [avg(x).alias(x) for x in feature_names]
aggregated = combined.groupBy(col("id1")).agg(*exprs)
aggregated.show()
## +------+-----+-----+------------------+-----+----+
## | id1| x_0| x_1| x_2| x_3| x_4|
## +------+-----+-----+------------------+-----+----+
## |000001|3.115|3.165|3.2750000000000004|3.315|3.36|
## |000002| 52.1| 52.1| 52.2| 52.4|52.1|
## +------+-----+-----+------------------+-----+----+
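The odd-looking `x_2` value is ordinary binary floating-point rounding, not an aggregation bug; the same arithmetic in plain Python produces the same tiny artifact:

```python
# (1.32 + 5.23) / 2 is not exactly representable in binary floating point,
# so the mean carries a tiny rounding artifact (something like 3.2750000000000004).
mean = (1.32 + 5.23) / 2
print(mean)
```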
The aggregated data can be converted back to an RDD if needed:
aggregated.rdd.map(tuple).collect()
## [('000001', 3.115, 3.165, 3.2750000000000004, 3.315, 3.36),
## ('000002', 52.1, 52.1, 52.2, 52.4, 52.1)]