
pySpark how to access the values in a tuple in a (key,tuple) RDD (python)

I am trying to access the values contained in a PipelinedRDD. Here is what I am starting with:

1. RDD = (key, code, value)

data = [(11720, (u'I50800', 0.08229813664596274)), (11720, (u'I50801', 0.03076923076923077))]

2. I need it grouped by the first value and turned into (key, tuple), where tuple = (code, value):

testFeatures = lab_FeatureTuples = labEvents.select('ITEMID', 'SUBJECT_ID', 'NORM_ITEM_CNT')\
    .orderBy('SUBJECT_ID', 'ITEMID')\
    .rdd.map(lambda (ITEMID, SUBJECT_ID, NORM_ITEM_CNT): (SUBJECT_ID, (ITEMID, NORM_ITEM_CNT)))\
    .groupByKey()

testFeatures = [(11720, [(u'I50800', 0.08229813664596274), (u'I50801', 0.03076923076923077)])]
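As a side note, the lambda in the snippet above relies on Python 2 tuple unpacking, which is no longer valid syntax in Python 3. A minimal, self-contained sketch of the same grouping step (the column layout follows the question; the toy rows and names are purely illustrative):

import pyspark

sc = pyspark.SparkContext()

# toy rows in the question's (ITEMID, SUBJECT_ID, NORM_ITEM_CNT) layout
rows = sc.parallelize([
    (u'I50800', 11720, 0.08229813664596274),
    (u'I50801', 11720, 0.03076923076923077),
])

# key by SUBJECT_ID, keep (ITEMID, NORM_ITEM_CNT) as the value, then group
testFeatures = rows.map(lambda r: (r[1], (r[0], r[2]))).groupByKey().mapValues(list)

print(testFeatures.collect())
# [(11720, [(u'I50800', 0.08229813664596274), (u'I50801', 0.03076923076923077)])]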

Out of the tuple = (code, value), I would like to get the following:

Create a sparseVector from it so that it can be used for an SVM model.

result.take(1)

Here is one way of doing it:

import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes

sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)

# (key, (code, value)) pairs, as in the question
data = [(11720, (u'I50800', 0.08229813664596274)),
        (11720, (u'I50801', 0.03076923076923077))]
rdd = sc.parallelize(data)

# the inner (code, value) tuple ends up in a single 'tuple' column
df = sqlc.createDataFrame(rdd, ['idx', 'tuple'])
df.show()

which gives,

+-----+--------------------+
|  idx|               tuple|
+-----+--------------------+
|11720|[I50800,0.0822981...|
|11720|[I50801,0.0307692...|
+-----+--------------------+

Now define pyspark user defined functions:

# UDFs that pull the code (element 0) and the value (element 1) out of the tuple column
extract_tuple_0 = sf.udf(lambda x: x[0], returnType=sparktypes.StringType())
extract_tuple_1 = sf.udf(lambda x: x[1], returnType=sparktypes.FloatType())

df = df.withColumn('tup0', extract_tuple_0(sf.col('tuple')))
df = df.withColumn('tup1', extract_tuple_1(sf.col('tuple')))
df.show()

which gives:

+-----+--------------------+----------+------+
|  idx|               tuple|      tup1|  tup0|
+-----+--------------------+----------+------+
|11720|[I50800,0.0822981...|0.08229814|I50800|
|11720|[I50801,0.0307692...|0.03076923|I50801|
+-----+--------------------+----------+------+
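A possible shortcut, assuming the schema inferred from the Python tuples names the struct fields _1 and _2 (worth verifying with df.printSchema()): the two fields can then be selected directly, without UDFs:

df.printSchema()   # 'tuple' should appear as a struct<_1:string,_2:double>
df.select('idx',
          sf.col('tuple._1').alias('tup0'),
          sf.col('tuple._2').alias('tup1')).show()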

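To go from here to the sparse vector the question asks about, one possible sketch (the code-to-index mapping and the use of pyspark.mllib.linalg.SparseVector are assumptions, not something the question fixes; result just mirrors the question's result.take(1)):

from pyspark.mllib.linalg import SparseVector

# map every distinct code to a stable column index (illustrative helper)
codes = sorted(rdd.map(lambda kv: kv[1][0]).distinct().collect())
code_index = {c: i for i, c in enumerate(codes)}
num_codes = len(codes)

def to_sparse(pairs):
    # pairs: list of (code, value) tuples for one key; SparseVector needs sorted indices
    idx_val = sorted((code_index[c], v) for c, v in pairs)
    return SparseVector(num_codes, [i for i, _ in idx_val], [v for _, v in idx_val])

# reuse the rdd built above: (key, (code, value)) -> (key, SparseVector)
result = rdd.groupByKey().mapValues(lambda pairs: to_sparse(list(pairs)))
print(result.take(1))   # one (key, SparseVector) pair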
