My dataset has one column called 'eventAction'.
It has values like 'conversion', 'purchase', 'check-out', etc. I want to convert this column so that 'conversion' maps to 1 and all other categories map to 0.
I used a lambda function like this:
e1 = event1.rdd.map(lambda x: 1 if x.eventAction == 'conversion' else 0)
where event1 is the name of my Spark dataframe.
When printing e1, I get this:
print(e1.take(5))
[0, 0, 0, 0, 0]
So the lambda function seems to have worked properly. But when I convert e1 to a PySpark dataframe, I get null values, as shown:
schema1 = StructType([StructField('conversion',IntegerType(),True)])
df = spark.createDataFrame(data=[e1],schema=schema1)
df.printSchema()
df.show()
It will be great if you can help me with this.
Thanks!
spark.createDataFrame expects an RDD of Row objects, not an RDD of integers. You need to map the RDD to Row objects before converting it to a dataframe. Note that there is no need to add square brackets around e1.
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType

e1 = event1.rdd.map(lambda x: 1 if x.eventAction == 'conversion' else 0).map(lambda x: Row(x))
schema1 = StructType([StructField('conversion', IntegerType(), True)])
df = spark.createDataFrame(data=e1, schema=schema1)
That said, what you're trying to do can be done more easily with Spark SQL's when function. There is no need to drop to the RDD API with a custom lambda, e.g.:
import pyspark.sql.functions as F
df = events.select(F.when(F.col('eventAction') == 'conversion', 1).otherwise(0).alias('conversion'))
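For clarity, the conditional mapping that both the RDD lambda and the when/otherwise expression implement is just this rule, shown here as plain Python (a minimal sketch; the helper name to_conversion_flag is hypothetical, not part of any API):

```python
def to_conversion_flag(event_action):
    # Mirrors F.when(F.col('eventAction') == 'conversion', 1).otherwise(0):
    # 'conversion' becomes 1, every other category becomes 0.
    return 1 if event_action == 'conversion' else 0

# The sample categories from the question map like so:
flags = [to_conversion_flag(a) for a in ['conversion', 'purchase', 'check-out']]
# flags is [1, 0, 0]
```

Spark evaluates the same per-row logic, but as a column expression inside the JVM rather than a Python lambda, which is why the Spark SQL version is usually faster.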