Split a PySpark RDD into (key, 0) and (value, 1) pairs
I have an RDD as below:
['1:2','3:4','5:1 2 3']
and want to split it like this:
[1,0], [2,1], [3,0], [4,1], [5,0], [1,1], [2,1], [3,1]
Logic: for x:y, the left side of the colon should produce (x, 0) and the right side should produce (y, 1).
For x: y a b c, if the right side of the colon contains multiple values separated by spaces, then every one of those values should produce a pair: (y,1) (a,1) (b,1) (c,1)
How can I get the above result in PySpark?
You can achieve this with the code below:
from pyspark.sql import SparkSession

data = ['1:2', '3:4', '5:1 2 3']

spark = SparkSession.builder.master("local[4]").appName("Q71346701") \
    .getOrCreate()

def generate_output(row):
    """Split 'x:y ...' into (x, 0) plus (token, 1) for each right-side token."""
    final_elements = []
    items = row.split(':')
    for idx, elm in enumerate(items):
        inner_list = elm.split(' ')
        if len(inner_list) == 1:
            # Single value: pair it with its position relative to the colon
            # (idx 0 = left side, idx 1 = right side)
            final_elements.append((int(elm), idx))
        else:
            # Multiple space-separated values only occur on the right side,
            # so each one is paired with 1
            for el in inner_list:
                final_elements.append((int(el), 1))
    return final_elements

rdd = spark.sparkContext.parallelize(data)
final_rdd = rdd.flatMap(generate_output)
print(final_rdd.collect())
# [(1, 0), (2, 1), (3, 0), (4, 1), (5, 0), (1, 1), (2, 1), (3, 1)]
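Since every input row has exactly one colon with a single value on the left, the per-row logic can also be sketched more compactly. This is a hypothetical `split_row` helper (not from the original answer) that produces the same pairs as `generate_output` and could be passed to `rdd.flatMap(split_row)` in the same way:

```python
def split_row(row):
    # Split once on the colon: the left side becomes (x, 0), and each
    # space-separated token on the right becomes (y, 1).
    left, right = row.split(':')
    return [(int(left), 0)] + [(int(tok), 1) for tok in right.split()]

print(split_row('1:2'))      # [(1, 0), (2, 1)]
print(split_row('5:1 2 3'))  # [(5, 0), (1, 1), (2, 1), (3, 1)]
```

Using `str.split()` with no argument on the right side also tolerates repeated spaces, which the explicit `split(' ')` in the answer does not.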