
PySpark read multi parquet with same schema but multi type in columns

My problem is pretty simple, but I cannot find a good solution. I have a lot of order parquet files. I read all of them with this:

df = spark.read.option(
    'mergeSchema',
    True).parquet(*list_order).select(
        'at',
        'order_id',
        'items')

This works great if the schemas are the same. But in my new data, the type of one of the columns, "quantity", was changed from String to Float. This means that this read code does not work anymore. It creates an error like this:

Caused by: org.apache.spark.SparkException: Failed to merge fields 'quantity' and 'quantity'. Failed to merge incompatible data types string and double

Do you know how I can merge this multi-type column? I'd prefer not to regenerate my 4-year history of parquet files (it would take too much time on prod).

Thank you all.

You can try providing the schema manually using the .schema(your_schema) method:

old_df = spark.read.parquet('old_files')
old_schema = old_df.schema

# Make changes in old_schema and declare it as new_schema

df = spark.read.schema(new_schema).parquet(your_files)
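As a minimal sketch of that middle step (assuming quantity is a top-level column and you want the new double type everywhere), new_schema could be derived from old_schema like this:

from pyspark.sql.types import StructType, StructField, DoubleType

# Sketch (assumed approach): rebuild old_schema, but declare 'quantity' as double
new_schema = StructType([
    StructField(f.name, DoubleType(), f.nullable) if f.name == 'quantity' else f
    for f in old_schema.fields
])

df = spark.read.schema(new_schema).parquet(*list_order)

Note that Spark's Parquet reader only applies a limited set of type promotions, so files whose physically stored type is incompatible with the declared one may still fail at read time; in that case the two-DataFrame approach below is safer.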

The idea is that you are reading your data with a SCHEMA_ON_READ approach instead of the conventional SCHEMA_ON_WRITE approach used in databases. So if you tell Spark to read data with a given schema, it will stick to it.

Or you can always read your data as two DataFrames, cast the columns, and union them.
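A sketch of that approach, assuming you can split your file list into old_files (string quantity) and new_files (double quantity) — both hypothetical names, not from the question:

from pyspark.sql import functions as F

# Hypothetical split: files where 'quantity' is still a string vs. already a double
old_df = spark.read.parquet(*old_files)
new_df = spark.read.parquet(*new_files)

# Cast the old string column to double so both frames share one schema,
# then union by column name (unionByName avoids relying on column order)
df = (old_df
      .withColumn('quantity', F.col('quantity').cast('double'))
      .unionByName(new_df))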
