
PySpark: how to access the values in a tuple in a (key, tuple) RDD (Python)

I am trying to access the values contained in a PipelinedRDD. Here is what I started with:

1. RDD = (key,code,value)

data = [(11720, (u'I50800', 0.08229813664596274)), (11720, (u'I50801', 0.03076923076923077))]

2. I needed to group by the first value and turn it into (key, tuple), where tuple = (code, value):

testFeatures = lab_FeatureTuples = labEvents.select('ITEMID', 'SUBJECT_ID', 'NORM_ITEM_CNT') \
    .orderBy('SUBJECT_ID', 'ITEMID') \
    .rdd.map(lambda (ITEMID, SUBJECT_ID, NORM_ITEM_CNT): (SUBJECT_ID, (ITEMID, NORM_ITEM_CNT))) \
    .groupByKey()
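Note that a lambda with a tuple parameter, as used in the map step above, is Python 2 syntax; Python 3 removed tuple parameters (PEP 3113). A minimal sketch of an equivalent mapper that works in both versions, assuming each row arrives as a 3-element sequence in the selected column order. The name `to_key_pair` is hypothetical, and the plain list below only stands in for the RDD:

```python
def to_key_pair(row):
    """Turn (ITEMID, SUBJECT_ID, NORM_ITEM_CNT) into (SUBJECT_ID, (ITEMID, NORM_ITEM_CNT))."""
    item_id, subject_id, norm_item_cnt = row  # unpack inside the body, not in the signature
    return (subject_id, (item_id, norm_item_cnt))

# Plain-Python stand-in for the rows produced by select(...).rdd
rows = [
    (u'I50800', 11720, 0.08229813664596274),
    (u'I50801', 11720, 0.03076923076923077),
]
pairs = [to_key_pair(r) for r in rows]
print(pairs)
```

In the Spark pipeline this would presumably be passed as `.rdd.map(to_key_pair)` in place of the lambda, followed by `.groupByKey()` as before.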

testFeatures = [(11720, [(u'I50800', 0.08229813664596274), (u'I50801', 0.03076923076923077)])]

From each tuple = (code, value), I want to do the following:

Create a SparseVector out of it so I can use it for the SVM model.

result.take(1)
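One possible way to get from the grouped (code, value) tuples to sparse-vector inputs is sketched below. It assumes every distinct code is mapped to an integer position, which the question does not specify; `build_code_index` and `to_sparse_args` are hypothetical helper names, and the plain list stands in for the grouped RDD. In a real job the code-to-index mapping would have to be built from all codes in the data set, not just one group.

```python
def build_code_index(grouped):
    """Assign a stable integer index to every distinct code (assumed scheme)."""
    codes = sorted({code for _, pairs in grouped for code, _ in pairs})
    return {code: i for i, code in enumerate(codes)}

def to_sparse_args(pairs, code_index):
    """Return (size, indices, values), with indices in increasing order."""
    entries = sorted((code_index[code], value) for code, value in pairs)
    indices = [i for i, _ in entries]
    values = [v for _, v in entries]
    return len(code_index), indices, values

# Plain-Python stand-in for the output of groupByKey()
grouped = [(11720, [(u'I50800', 0.08229813664596274),
                    (u'I50801', 0.03076923076923077)])]

code_index = build_code_index(grouped)
features = [(key, to_sparse_args(pairs, code_index)) for key, pairs in grouped]
print(features)

# With Spark available, the same helper could presumably be applied per key:
#   from pyspark.mllib.linalg import SparseVector
#   testFeatures.mapValues(lambda p: SparseVector(*to_sparse_args(p, code_index)))
```

The indices are sorted before use because a sparse vector expects them in strictly increasing order.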

Here is one way to do it:

import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
sc = pyspark.SparkContext()
sqlc = pyspark.SQLContext(sc)

data = [(11720, (u'I50800', 0.08229813664596274)), 
        (11720, (u'I50801', 0.03076923076923077))]
rdd = sc.parallelize(data)

df = sqlc.createDataFrame(rdd,  ['idx', 'tuple'])
df.show()

This gives:

+-----+--------------------+
|  idx|               tuple|
+-----+--------------------+
|11720|[I50800,0.0822981...|
|11720|[I50801,0.0307692...|
+-----+--------------------+

Now define PySpark user-defined functions (UDFs) to extract each element of the tuple:

extract_tuple_0 = sf.udf(lambda x: x[0], returnType=sparktypes.StringType())
extract_tuple_1 = sf.udf(lambda x: x[1], returnType=sparktypes.FloatType())
df = df.withColumn('tup0', extract_tuple_0(sf.col('tuple')))

df = df.withColumn('tup1', extract_tuple_1(sf.col('tuple')))
df.show()

gives:

+-----+--------------------+------+----------+
|  idx|               tuple|  tup0|      tup1|
+-----+--------------------+------+----------+
|11720|[I50800,0.0822981...|I50800|0.08229814|
|11720|[I50801,0.0307692...|I50801|0.03076923|
+-----+--------------------+------+----------+
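The tup1 column shows fewer digits than the input because FloatType is a 32-bit float, while Python floats are 64-bit. The sketch below illustrates that narrowing in plain Python using the standard `struct` module; `as_float32` is a hypothetical helper and no Spark is involved:

```python
import struct

def as_float32(x):
    """Round-trip a 64-bit Python float through a 32-bit float."""
    return struct.unpack('f', struct.pack('f', x))[0]

original = 0.08229813664596274
narrowed = as_float32(original)
lost = abs(original - narrowed)  # precision lost by storing the value in 32 bits

print(original, narrowed, lost)
```

If the full precision matters for the model, declaring the second UDF with `returnType=sparktypes.DoubleType()` instead of `FloatType()` should avoid the narrowing.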
