I am working on extracting some data from XML. My overall workflow, which might be inefficient, is:
I created the RDD like below:
rddd = df_individual.rdd.map(tuple)
'df_individual' is the original dataframe into which I read the XML.
Below is the schema:
schema = types.StructType([
types.StructField('applicaion_id', types.StringType()),
types.StructField('cd_type', types.StringType()),
types.StructField('cd_title', types.StringType()),
types.StructField('firstname', types.StringType()),
types.StructField('middlename', types.StringType()),
types.StructField('nm_surname', types.StringType()),
types.StructField('dt_dob', types.DateType()),
types.StructField('cd_gender', types.StringType()),
types.StructField('cd_citizenship', types.StringType())
])
It fails on
df_result = spark.createDataFrame(rddd, schema)
The error is
TypeError: field dt_dob: DateType can not accept object '1973-02-19' in type <class 'str'>
The main purpose of creating the 'df_result' dataframe is to have a predefined schema and to implicitly cast every column whose type differs between the RDD and the dataframe. This is my first time working with RDDs and I couldn't find a straightforward casting mechanism for this case.
If you can help with solving the casting error or share a better workflow, that would be great.
Thanks
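createDataFrame checks each value against the schema but does not convert it, which is why the string '1973-02-19' is rejected for a DateType field. If your aim is only to get your data into the right schema and turn some string columns into date columns, I would skip the RDD and use a select combined with to_date.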
from pyspark.sql import functions as F

# df is the dataframe read from the XML (df_individual in the question).
# Every column is kept as-is except dt_dob, which is parsed from string to date.
df.select('applicaion_id', 'cd_type', 'cd_title', 'firstname', 'middlename', 'nm_surname', \
F.to_date('dt_dob').alias('dt_dob'), \
'cd_gender', 'cd_citizenship') \
.printSchema()
prints
root
|-- applicaion_id: string (nullable = true)
|-- cd_type: string (nullable = true)
|-- cd_title: string (nullable = true)
|-- firstname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- nm_surname: string (nullable = true)
|-- dt_dob: date (nullable = true)
|-- cd_gender: string (nullable = true)
|-- cd_citizenship: string (nullable = true)
with the column dt_dob having a date datatype.
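When more columns than the date of birth need converting, the same idea can be driven by a list of target types instead of naming each column by hand. The following is only a sketch under some assumptions: cast_exprs is a hypothetical helper (it is not part of PySpark), the column names in the dataframe are assumed to match the schema from the question, and the date strings are assumed to be in the yyyy-MM-dd form shown in the error message. It builds Spark SQL CAST expressions that can be handed to selectExpr.

```python
def cast_exprs(columns):
    """Build one 'CAST(col AS type) AS col' expression per (name, type) pair.

    `columns` is a list of (column_name, spark_sql_type_name) tuples,
    for example ('dt_dob', 'date'). This helper is hypothetical and only
    produces strings; it does not touch Spark itself.
    """
    return [
        f"CAST(`{name}` AS {type_name}) AS `{name}`"
        for name, type_name in columns
    ]


# Target types, mirroring the schema in the question.
target_columns = [
    ('applicaion_id', 'string'),
    ('cd_type', 'string'),
    ('cd_title', 'string'),
    ('firstname', 'string'),
    ('middlename', 'string'),
    ('nm_surname', 'string'),
    ('dt_dob', 'date'),
    ('cd_gender', 'string'),
    ('cd_citizenship', 'string'),
]

exprs = cast_exprs(target_columns)

# With a live Spark session this would be used as (not executed here):
# df_result = df_individual.selectExpr(*exprs)
```

If the StructType from the question is already defined, the pairs could presumably be derived from it with [(f.name, f.dataType.simpleString()) for f in schema.fields] rather than typed out again.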
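If you would rather keep the RDD and createDataFrame workflow from the question, the values have to be real datetime.date objects before they reach createDataFrame. Below is a minimal sketch of that conversion, assuming the tuples follow the column order of the schema (so dt_dob sits at position 6) and the strings look like '1973-02-19'. parse_dob is a hypothetical helper and the sample row is made up for illustration, apart from the date taken from the error message.

```python
from datetime import datetime

DOB_INDEX = 6  # position of dt_dob in the schema from the question (assumed order)


def parse_dob(row, index=DOB_INDEX, fmt="%Y-%m-%d"):
    """Return a copy of `row` with the string at `index` parsed into a date.

    None is passed through unchanged so a nullable field stays null.
    """
    values = list(row)
    if values[index] is not None:
        values[index] = datetime.strptime(values[index], fmt).date()
    return tuple(values)


# Made-up sample row; only the date string comes from the error message.
sample = ('A1', 'IND', 'Mr', 'John', None, 'Smith', '1973-02-19', 'M', 'AU')
typed = parse_dob(sample)

# With Spark available, the function would slot into the original workflow
# like this (not executed here):
# rddd = df_individual.rdd.map(tuple).map(parse_dob)
# df_result = spark.createDataFrame(rddd, schema)
```

This keeps the predefined schema, at the cost of one Python-side conversion per date column; the select with to_date stays inside Spark and is usually the simpler choice.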