Converting RDD to dataframe fails on string to date conversion

I am extracting some data from XML. My overall workflow, which might be inefficient, is:

  1. Read the XML into a dataframe ('df_individual')
  2. Filter out unwanted columns
  3. Define the target schema (shared below)
  4. Convert the dataframe to an RDD
  5. Create a dataframe from the schema and RDD from steps 3 and 4

I created the RDD like this:

rddd = df_individual.rdd.map(tuple)

'df_individual' is the original dataframe that the XML was read into.

Below is the schema:

schema = types.StructType([
        types.StructField('applicaion_id', types.StringType()),
        types.StructField('cd_type', types.StringType()),
        types.StructField('cd_title', types.StringType()),
        types.StructField('firstname', types.StringType()),
        types.StructField('middlename', types.StringType()),
        types.StructField('nm_surname', types.StringType()),
        types.StructField('dt_dob', types.DateType()),
        types.StructField('cd_gender', types.StringType()),
        types.StructField('cd_citizenship', types.StringType())
    ])

It fails on

df_result = spark.createDataFrame(rddd, schema)

The error is

TypeError: field dt_dob: DateType can not accept object '1973-02-19' in type <class 'str'>

The main purpose of creating the 'df_result' dataframe is to have a predefined schema and to implicitly cast every column whose type differs between the RDD and the dataframe. This is my first time working with RDDs, and I couldn't find a straightforward casting mechanism for this case.

If you can help solve the casting error or share a better workflow, that would be great.

Thanks

If your aim is only to get your data into the right schema and turn some string columns into date columns, I would use select combined with to_date.

from pyspark.sql import functions as F

# df is the dataframe read from the XML (df_individual in the question)
# to_date parses the string column into a date; the alias keeps the target column name
df.select('applicaion_id', 'cd_type', 'cd_title', 'firstname', 'middlename', 'nm_surname', \
          F.to_date('dt_dob').alias('dt_dob'), \
          'cd_gender', 'cd_citizenship') \
  .printSchema()

prints

root
 |-- applicaion_id: string (nullable = true)
 |-- cd_type: string (nullable = true)
 |-- cd_title: string (nullable = true)
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- nm_surname: string (nullable = true)
 |-- dt_dob: date (nullable = true)
 |-- cd_gender: string (nullable = true)
 |-- cd_citizenship: string (nullable = true)

with the column dt_dob having a date datatype.
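
If you would rather keep the RDD step, one option is to parse the string yourself before calling createDataFrame. The TypeError in the question arises because createDataFrame checks each value against the schema rather than casting it, and DateType expects a Python datetime.date. The sketch below is illustrative only: parse_dob and DOB_INDEX are hypothetical names, the sample row is made up, and it assumes dt_dob sits at position 6 of each tuple and is formatted as yyyy-MM-dd like the value in the error message.

```python
from datetime import date, datetime

# Assumed position of dt_dob in each tuple (same order as the fields in the schema).
DOB_INDEX = 6


def parse_dob(row, index=DOB_INDEX, fmt="%Y-%m-%d"):
    """Return a copy of `row` with the string at `index` parsed into a datetime.date."""
    values = list(row)
    raw = values[index]
    # Missing or empty values become None, which a nullable DateType field accepts.
    values[index] = datetime.strptime(raw, fmt).date() if raw else None
    return tuple(values)


# Made-up row, used only to show the conversion without a Spark session.
sample = ("A1", "IND", "Mr", "John", None, "Doe", "1973-02-19", "M", "AU")
converted = parse_dob(sample)
print(type(converted[DOB_INDEX]).__name__, converted[DOB_INDEX])

# Possible use with the names from the question (needs a running Spark session):
# rddd = df_individual.rdd.map(tuple).map(parse_dob)
# df_result = spark.createDataFrame(rddd, schema)
```

This keeps the conversion in plain Python, so it runs row by row; the select with to_date shown in the answer stays inside Spark's own column expressions and is usually the simpler choice when the RDD step is not otherwise needed.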
