
Getting null values when converting pyspark.rdd.PipelinedRDD object into Pyspark dataframe

My dataset has one column called 'eventAction'.

It has values like 'conversion', 'purchase', 'check-out', etc. I want to transform this column so that 'conversion' maps to 1 and every other category maps to 0.
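The intended mapping can be written as a small standalone helper first (the name `to_conversion_flag` is mine, for illustration):

```python
def to_conversion_flag(event_action):
    # 1 for 'conversion', 0 for every other category
    return 1 if event_action == 'conversion' else 0
```

This is exactly the expression used inside the lambda below.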

I used a lambda function like this:

e1 = event1.rdd.map(lambda x: 1 if x.eventAction == 'conversion' else 0)

where event1 is my Spark dataframe.

When printing e1 I get this:

print(e1.take(5))
[0, 0, 0, 0, 0]

So the lambda function seems to have worked. But when I convert it to a PySpark dataframe, I get null values:

from pyspark.sql.types import StructType, StructField, IntegerType

schema1 = StructType([StructField('conversion', IntegerType(), True)])
df = spark.createDataFrame(data=[e1], schema=schema1)
df.printSchema()
df.show()


It would be great if you could help me with this.

Thanks!

spark.createDataFrame expects an RDD of Row objects, not an RDD of integers, so you need to map the RDD to Rows before converting it to a dataframe. Note also that there is no need to wrap e1 in square brackets.

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType

e1 = event1.rdd.map(lambda x: 1 if x.eventAction == 'conversion' else 0).map(lambda x: Row(x))
schema1 = StructType([StructField('conversion', IntegerType(), True)])
df = spark.createDataFrame(data=e1, schema=schema1)

That said, what you're trying to do can easily be done with the Spark SQL `when` function; there is no need to drop to the RDD API with a custom lambda. For example:

import pyspark.sql.functions as F

df = events.select(F.when(F.col('eventAction') == 'conversion', 1).otherwise(0).alias('conversion'))
