What is the point of providing a schema when reading a parquet file using pyspark?
When I am working with CSV, I can provide a custom schema while reading a file, and the benefits I receive are as follows (along with the contrast with the parquet case):
(This is not possible in the case of parquet: it throws an "Expected: IntegerType, found: binary" exception when the provided schema is integer and the parquet column is stored as a string.)
Given these two points, what are the benefits of providing a schema in the case of parquet files?
Apart from the obvious benefit of type safety (which you've already mentioned ;)), providing a schema on read allows the parquet data source to perform schema merging, for one:
scala> val df2 = spark.range(1,6).map(i => (2,i,i*i)).toDF("version","value","square")
df2: org.apache.spark.sql.DataFrame = [version: int, value: bigint ... 1 more field]
scala> df2.write.parquet("/tmp/reconciled")
scala> val df3 = spark.range(1,6).map(i => (3,i,i*i*i)).toDF("version","value","cube")
df3: org.apache.spark.sql.DataFrame = [version: int, value: bigint ... 1 more field]
scala> df3.write.mode("append").parquet("/tmp/reconciled")
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val sch = StructType(Array(StructField("version",IntegerType,false),
     | StructField("value",LongType,true),
     | StructField("square",LongType,false),
     | StructField("cube",LongType,false)))
sch: org.apache.spark.sql.types.StructType = StructType(StructField(version,IntegerType,false), StructField(value,LongType,true), StructField(square,LongType,false), StructField(cube,LongType,false))
scala> val dfn = spark.read.schema(sch).parquet("/tmp/reconciled")
dfn: org.apache.spark.sql.DataFrame = [version: int, value: bigint ... 2 more fields]
scala> dfn.show(false)
+-------+-----+------+----+
|version|value|square|cube|
+-------+-----+------+----+
|2      |1    |1     |null|
|2      |2    |4     |null|
|2      |3    |9     |null|
|2      |4    |16    |null|
|2      |5    |25    |null|
|3      |1    |null  |1   |
|3      |2    |null  |8   |
|3      |3    |null  |27  |
|3      |4    |null  |64  |
|3      |5    |null  |125 |
+-------+-----+------+----+
...so that you can process a range of similar but different parquet files as if they all had the same schema.