
Getting null values when converting pyspark.rdd.PipelinedRDD object into Pyspark dataframe

My dataset has one column called 'eventAction'.

It has values like 'conversion', 'purchase', 'check-out', etc. I want to transform this column so that 'conversion' maps to 1 and every other category maps to 0.
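The intended mapping can be written as a small standalone helper first (the name `to_conversion_flag` is mine, for illustration):

```python
def to_conversion_flag(event_action):
    # 1 for 'conversion', 0 for every other category
    return 1 if event_action == 'conversion' else 0
```

This is exactly the expression used inside the lambda below.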

I used a lambda function like this:

e1 = event1.rdd.map(lambda x: 1 if x.eventAction == 'conversion' else 0)

where event1 is my Spark dataframe.

When printing e1 I get this:

print(e1.take(5))
[0, 0, 0, 0, 0]

So the lambda function seems to have worked. But when I convert it to a PySpark dataframe, I get null values:

from pyspark.sql.types import StructType, StructField, IntegerType

schema1 = StructType([StructField('conversion', IntegerType(), True)])
df = spark.createDataFrame(data=[e1], schema=schema1)
df.printSchema()
df.show()


It would be great if you could help me with this.

Thanks!

spark.createDataFrame expects an RDD of Row objects, not an RDD of integers, so you need to map the RDD to Rows before converting it to a dataframe. Note also that there is no need to wrap e1 in square brackets.

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType

e1 = event1.rdd.map(lambda x: 1 if x.eventAction == 'conversion' else 0).map(lambda x: Row(x))
schema1 = StructType([StructField('conversion', IntegerType(), True)])
df = spark.createDataFrame(data=e1, schema=schema1)

That said, what you're trying to do can easily be done with the Spark SQL `when` function; there is no need to drop to the RDD API with a custom lambda. For example:

import pyspark.sql.functions as F

df = events.select(F.when(F.col('eventAction') == 'conversion', 1).otherwise(0).alias('conversion'))
