
Pyspark - How to set the schema when reading parquet file from another DF?

I have DF1 with this schema:

df1 = spark.read.parquet(load_path1)
df1.printSchema()

root
 |-- PRODUCT_OFFERING_ID: string (nullable = true)
 |-- CREATED_BY: string (nullable = true)
 |-- CREATION_DATE: string (nullable = true)

And DF2:

df2 = spark.read.parquet(load_path2)
df2.printSchema()

root
 |-- PRODUCT_OFFERING_ID: decimal(38,10) (nullable = true)
 |-- CREATED_BY: decimal(38,10) (nullable = true)
 |-- CREATION_DATE: timestamp (nullable = true)

Now I want to union these two DataFrames. Sometimes the union fails because the schemas differ.

How can I make DF2 have exactly the same schema as DF1 (at load time)?

I tried:

df2 = spark.read.parquet(load_path2).schema(df1.schema)

which raises:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'StructType' object is not callable

Or should I CAST the columns instead (once DF2 has been read)?

Thanks.

Move the `.schema()` call before `.parquet()`, and Spark will read the parquet file with the specified schema. (Your attempt fails because `df2.schema` is a property that returns a `StructType`, not a method you can call on an already-loaded DataFrame.)

df2 = spark.read.schema(df1.schema).parquet(load_path2)
