[英]pySpark how to access the values in a tuple in a (key,tuple) RDD (python)
I am trying to access the values contained in a PipelinedRDD. This is what I start with:
1. An RDD of (key, code, value):
data = [(11720, (u'I50800', 0.08229813664596274)), (11720, (u'I50801', 0.03076923076923077))]
2. I need it grouped by the first value, turning it into (key, tuple) where tuple = (code, value):
testFeatures = labEvents.select('ITEMID', 'SUBJECT_ID', 'NORM_ITEM_CNT') \
    .orderBy('SUBJECT_ID', 'ITEMID') \
    .rdd.map(lambda row: (row.SUBJECT_ID, (row.ITEMID, row.NORM_ITEM_CNT))) \
    .groupByKey()
testFeatures = [(11720, [(u'I50800', 0.08229813664596274), (u'I50801', 0.03076923076923077)])]
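Note that groupByKey() returns each key's values as a pyspark ResultIterable, so mapValues(list) is needed to display them as lists. The grouping itself is equivalent to this plain-Python sketch (using the sample data above):

```python
from itertools import groupby
from operator import itemgetter

# Sample (key, (code, value)) pairs from above
data = [(11720, (u'I50800', 0.08229813664596274)),
        (11720, (u'I50801', 0.03076923076923077))]

# Plain-Python equivalent of rdd.groupByKey().mapValues(list):
# sort by key, then collect each key's (code, value) pairs into a list
grouped = [(key, [pair for _, pair in rows])
           for key, rows in groupby(sorted(data, key=itemgetter(0)),
                                    key=itemgetter(0))]
print(grouped)
# [(11720, [(u'I50800', 0.08229813664596274), (u'I50801', 0.03076923076923077)])]
```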
From the tuple = (code, value) part I then want to:
create a SparseVector out of it so I can use it for an SVM model.
result.take(1)
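SparseVector indices must be integers, so the string codes need a code-to-index mapping first. Here is a minimal pure-Python sketch of that step (the `code_to_index` dictionary is illustrative; in practice it would be built from all distinct codes in the dataset):

```python
# Grouped (code, value) pairs for one key, as produced by groupByKey()
pairs = [(u'I50800', 0.08229813664596274), (u'I50801', 0.03076923076923077)]

# Hypothetical mapping from code string to vector index
code_to_index = {u'I50800': 0, u'I50801': 1}

# SparseVector expects (index, value) pairs with increasing indices
indexed = sorted((code_to_index[code], value) for code, value in pairs)
size = len(code_to_index)

# With pyspark available this would then become:
#   from pyspark.mllib.linalg import SparseVector
#   vec = SparseVector(size, indexed)
print(indexed)  # [(0, 0.08229813664596274), (1, 0.03076923076923077)]
```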
Here is one way to do it:
import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)
data = [(11720, (u'I50800', 0.08229813664596274)),
(11720, (u'I50801', 0.03076923076923077))]
rdd = sc.parallelize(data)
df = sqlc.createDataFrame(rdd, ['idx', 'tuple'])
df.show()
which gives:
+-----+--------------------+
| idx| tuple|
+-----+--------------------+
|11720|[I50800,0.0822981...|
|11720|[I50801,0.0307692...|
+-----+--------------------+
Now define pyspark user-defined functions:
extract_tuple_0 = sf.udf(lambda x: x[0], returnType=sparktypes.StringType())
extract_tuple_1 = sf.udf(lambda x: x[1], returnType=sparktypes.FloatType())
df = df.withColumn('tup0', extract_tuple_0(sf.col('tuple')))
df = df.withColumn('tup1', extract_tuple_1(sf.col('tuple')))
df.show()
to get:
+-----+--------------------+------+----------+
|  idx|               tuple|  tup0|      tup1|
+-----+--------------------+------+----------+
|11720|[I50800,0.0822981...|I50800|0.08229814|
|11720|[I50801,0.0307692...|I50801|0.03076923|
+-----+--------------------+------+----------+
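One caveat with the UDF above: tup1 shows 0.08229814 rather than the original double because FloatType() is a 32-bit float; using sparktypes.DoubleType() as the return type would keep full precision. The truncation can be reproduced in plain Python:

```python
import struct

x = 0.08229813664596274
# Round-trip through a 32-bit float, which is what FloatType() stores
x32 = struct.unpack('f', struct.pack('f', x))[0]

print(round(x32, 8))  # 0.08229814, matching the truncated value in df.show()
```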