
Split pyspark rdd in key and value as key,0 and value,1

I have an RDD as below:

['1:2','3:4','5:1 2 3']

and want to split it like this:

[1,0], [2,1], [3,0], [4,1], [5,0], [1,1], [2,1], [3,1]

Logic: for x:y, the left side of the colon should produce (x,0) and the right side should produce (y,1).

x: y a b c

If the right side of the colon contains multiple values separated by spaces, then every one of those values should produce a pair with 1: (y,1) (a,1) (b,1) (c,1)

How can I get the above result in PySpark?

You can achieve this as follows:

from pyspark.sql import SparkSession

data = ['1:2', '3:4', '5:1 2 3']
spark = SparkSession.builder.master("local[4]").appName("Q71346701") \
    .getOrCreate()

def generate_output(row):
    # Split "left:right" on the colon; index 0 is the left side,
    # index 1 is the right side.
    final_elements = []
    items = row.split(':')
    for idx, elm in enumerate(items):
        inner_list = elm.split(' ')
        if len(inner_list) == 1:
            # Single value: pair it with its side index (0 for left, 1 for right).
            final_elements.append((int(elm), idx))
        else:
            # Multiple space-separated values (these only occur on the right
            # side of the colon): each one gets paired with 1.
            for el in inner_list:
                final_elements.append((int(el), 1))
    return final_elements


rdd = spark.sparkContext.parallelize(data)
final_rdd = rdd.flatMap(generate_output)
print(final_rdd.collect())
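As a side note, the per-row logic can also be written without the length check by splitting each row once on the colon. A minimal sketch of that variant in plain Python (`split_row` is a hypothetical helper name, and no Spark session is needed just to test the split logic; with Spark you would pass it to `rdd.flatMap(split_row)`):

```python
data = ['1:2', '3:4', '5:1 2 3']

def split_row(row):
    # Split once on the colon: the left side always becomes (left, 0),
    # and every space-separated value on the right becomes (value, 1).
    left, right = row.split(':', 1)
    return [(int(left), 0)] + [(int(v), 1) for v in right.split()]

# The same transformation over a plain list, mirroring what flatMap does:
result = [pair for row in data for pair in split_row(row)]
print(result)
# [(1, 0), (2, 1), (3, 0), (4, 1), (5, 0), (1, 1), (2, 1), (3, 1)]
```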
