What is the point of providing schema in case of reading a parquet using pyspark?

When I am working with CSV, I can provide a custom schema while reading the file, and the benefits I receive are as follows (along with the contrast with the parquet case):

  1. All the rows do not have to be scanned to infer the schema. (Parquet: since parquet stores the schema in the file footer, inferring the schema is as simple as reading it from there.)
  2. I can convert the schema of the file on the fly. For example, if I have stored integers as strings, I can provide a schema that reads that column as an integer, and it does (see the sketch after this list). (This is impossible with parquet, since it throws an Expected: IntegerType, found: binary exception when the provided schema declares an integer but the parquet column is stored as a string.)
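To illustrate point 2, a minimal spark-shell sketch (the path /tmp/numbers.csv, the column names, and the sample rows are hypothetical): a CSV column whose values are written out as plain text is read straight back as an integer simply because the supplied schema says so, with no row scan for inference:

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> Seq("id,amount", "1,42", "2,7").toDS.write.text("/tmp/numbers.csv")   // numbers stored as text

scala> val csvSchema = StructType(Array(StructField("id",IntegerType,true),
     | StructField("amount",IntegerType,true)))

scala> val dfc = spark.read.option("header","true").schema(csvSchema).csv("/tmp/numbers.csv")
dfc: org.apache.spark.sql.DataFrame = [id: int, amount: int]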

Given these two points, what are the benefits of providing a schema when reading parquet files?

Apart from the obvious benefit of type safety (which you've already mentioned), providing a schema on read allows the parquet data source to perform schema merging, for one:

scala> val df2 = spark.range(1,6).map(i => (2,i,i*i)).toDF("version","value","square")
df2: org.apache.spark.sql.DataFrame = [version: int, value: bigint ... 1 more field]

scala> df2.write.parquet("/tmp/reconciled")

scala> val df3 = spark.range(1,6).map(i => (3,i,i*i*i)).toDF("version","value","cube")
df3: org.apache.spark.sql.DataFrame = [version: int, value: bigint ... 1 more field]

scala> df3.write.mode("append").parquet("/tmp/reconciled")

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val sch = StructType(Array(StructField("version",IntegerType,false),
     | StructField("value",LongType,true),
     | StructField("square",LongType,false),
     | StructField("cube",LongType,false)))
sch: org.apache.spark.sql.types.StructType = StructType(StructField(version,IntegerType,false), StructField(value,LongType,true), StructField(square,LongType,false), StructField(cube,LongType,false))

scala> val dfn = spark.read.schema(sch).parquet("/tmp/reconciled")
dfn: org.apache.spark.sql.DataFrame = [version: int, value: bigint ... 2 more fields]

scala> dfn.show(false)
+-------+-----+------+----+                                                     
|version|value|square|cube|
+-------+-----+------+----+
|2      |1    |1     |null|
|2      |2    |4     |null|
|2      |3    |9     |null|
|2      |4    |16    |null|
|2      |5    |25    |null|
|3      |1    |null  |1   |
|3      |2    |null  |8   |
|3      |3    |null  |27  |
|3      |4    |null  |64  |
|3      |5    |null  |125 |
+-------+-----+------+----+

...so that you can process a range of similar but different parquet files as if they all had the same schema.
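As an aside, for this particular layout Spark can also derive the merged schema by itself via the parquet source's mergeSchema option, at the cost of reading the footers of every part file, so the explicit schema mainly buys you the type pinning. A sketch, reusing the /tmp/reconciled path written above (output as one would expect given those files):

scala> val dfm = spark.read.option("mergeSchema", "true").parquet("/tmp/reconciled")
dfm: org.apache.spark.sql.DataFrame = [version: int, value: bigint ... 2 more fields]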
