What is the point of providing schema in case of reading a parquet using pyspark?

When I am working with CSV, I can provide a custom schema while reading the file, and the benefits I receive are as follows (along with the contrast with the parquet case):

  1. All the rows do not have to be scanned to infer the schema. (Parquet: since parquet stores the schema in the file footer, inferring the schema is as simple as reading it from there.)
  2. I can convert the schema of the file on the fly. For example, if I have stored integers as strings, I can provide a schema that reads that column as an integer, and it does (see the sketch after this list). (This is impossible with parquet, since it throws an Expected: IntegerType, found: binary exception when the provided schema declares an integer but the parquet column is stored as a string.)
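To illustrate point 2, a minimal spark-shell sketch (the path /tmp/numbers.csv, the column names, and the sample rows are hypothetical): a CSV column whose values are written out as plain text is read straight back as an integer simply because the supplied schema says so, with no row scan for inference:

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> Seq("id,amount", "1,42", "2,7").toDS.write.text("/tmp/numbers.csv")   // numbers stored as text

scala> val csvSchema = StructType(Array(StructField("id",IntegerType,true),
     | StructField("amount",IntegerType,true)))

scala> val dfc = spark.read.option("header","true").schema(csvSchema).csv("/tmp/numbers.csv")
dfc: org.apache.spark.sql.DataFrame = [id: int, amount: int]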

Given these two points, what are the benefits of providing a schema when reading parquet files?

Apart from the obvious benefit of type safety (which you've already mentioned), providing a schema on read allows the parquet data source to perform schema merging, for one:

scala> val df2 = spark.range(1,6).map(i => (2,i,i*i)).toDF("version","value","square")
df2: org.apache.spark.sql.DataFrame = [version: int, value: bigint ... 1 more field]

scala> df2.write.parquet("/tmp/reconciled")

scala> val df3 = spark.range(1,6).map(i => (3,i,i*i*i)).toDF("version","value","cube")
df3: org.apache.spark.sql.DataFrame = [version: int, value: bigint ... 1 more field]

scala> df3.write.mode("append").parquet("/tmp/reconciled")

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val sch = StructType(Array(StructField("version",IntegerType,false),
     | StructField("value",LongType,true),
     | StructField("square",LongType,false),
     | StructField("cube",LongType,false)))
sch: org.apache.spark.sql.types.StructType = StructType(StructField(version,IntegerType,false), StructField(value,LongType,true), StructField(square,LongType,false), StructField(cube,LongType,false))

scala> val dfn = spark.read.schema(sch).parquet("/tmp/reconciled")
dfn: org.apache.spark.sql.DataFrame = [version: int, value: bigint ... 2 more fields]

scala> dfn.show(false)
+-------+-----+------+----+                                                     
|version|value|square|cube|
+-------+-----+------+----+
|2      |1    |1     |null|
|2      |2    |4     |null|
|2      |3    |9     |null|
|2      |4    |16    |null|
|2      |5    |25    |null|
|3      |1    |null  |1   |
|3      |2    |null  |8   |
|3      |3    |null  |27  |
|3      |4    |null  |64  |
|3      |5    |null  |125 |
+-------+-----+------+----+

...so that you can process a range of similar but different parquet files as if they all had the same schema.
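As an aside, for this particular layout Spark can also derive the merged schema by itself via the parquet source's mergeSchema option, at the cost of reading the footers of every part file, so the explicit schema mainly buys you the type pinning. A sketch, reusing the /tmp/reconciled path written above (output as one would expect given those files):

scala> val dfm = spark.read.option("mergeSchema", "true").parquet("/tmp/reconciled")
dfm: org.apache.spark.sql.DataFrame = [version: int, value: bigint ... 2 more fields]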
