What is the point of providing schema in case of reading a parquet using pyspark?
When I use CSV, I can supply a custom schema while reading the file and I get clear benefits from doing so. With parquet, by contrast, the types are already stored in the file, and supplying a mismatched schema simply fails at read time with an error such as:

Expected: IntegerType, found: binary

(an exception thrown when the supplied schema says integer but the parquet column is stored as a string). Given these two points, what is the benefit of providing a schema in the case of parquet files?
Besides the obvious benefit of type safety (which you already mentioned ;)), providing a schema at read time allows the parquet data source to perform schema merging, e.g.:
scala> val df2 = spark.range(1,6).map(i => (2,i,i*i)).toDF("version","value","square")
df2: org.apache.spark.sql.DataFrame = [version: int, value: bigint ... 1 more field]
scala> df2.write.parquet("/tmp/reconciled")
scala> val df3 = spark.range(1,6).map(i => (3,i,i*i*i)).toDF("version","value","cube")
df3: org.apache.spark.sql.DataFrame = [version: int, value: bigint ... 1 more field]
scala> df3.write.mode("append").parquet("/tmp/reconciled")
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val sch = StructType(Array(StructField("version",IntegerType,false),
     |   StructField("value",LongType,true),
     |   StructField("square",LongType,false),
     |   StructField("cube",LongType,false)))
sch: org.apache.spark.sql.types.StructType = StructType(StructField(version,IntegerType,false), StructField(value,LongType,true), StructField(square,LongType,false), StructField(cube,LongType,false))
scala> val dfn = spark.read.schema(sch).parquet("/tmp/reconciled")
dfn: org.apache.spark.sql.DataFrame = [version: int, value: bigint ... 2 more fields]
scala> dfn.show(false)
+-------+-----+------+----+
|version|value|square|cube|
+-------+-----+------+----+
|2 |1 |1 |null|
|2 |2 |4 |null|
|2 |3 |9 |null|
|2 |4 |16 |null|
|2 |5 |25 |null|
|3 |1 |null |1 |
|3 |2 |null |8 |
|3 |3 |null |27 |
|3 |4 |null |64 |
|3 |5 |null |125 |
+-------+-----+------+----+
...so that you can process a series of similar-but-different parquet files as if they all shared the same schema.