
Case sensitive parquet schema merge in Spark

I'm trying to load and analyze some parquet files with Spark. I'm using the mergeSchema option to load the files, since newer files have some extra columns. Also, some files have their column names in lower case and others in upper case.

For example,

file1.parquet has a schema like:

column1 integer,
column2 integer

and file2.parquet has something like:

Column1 integer,
Column2 integer,
Column3 integer

I'm running into an issue with the inferSchema method of the ParquetFileFormat class. Schema merging is delegated to the StructType merge method of Spark SQL. From what I can tell, that method only works in a case-sensitive way: internally it uses a map to look up fields by name, and if the cases don't match it interprets the field as a new one. Later, when the schema is checked for duplicate columns, the case-sensitivity configuration is respected, so we end up with duplicates. This results in:

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema
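
For reference, here is a minimal way to reproduce it (a sketch assuming a spark-shell session; the paths are illustrative):

import spark.implicits._

// Two files whose column names differ only in case, matching the example above
Seq((1, 2)).toDF("column1", "column2").write.parquet("file1.parquet")
Seq((1, 2, 3)).toDF("Column1", "Column2", "Column3").write.parquet("file2.parquet")

// With the default spark.sql.caseSensitive=false, this throws the
// AnalysisException above
spark.read.option("mergeSchema", "true").parquet("file1.parquet", "file2.parquet")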

Is there any way to make the schema merge case-insensitive?

I was expecting to get something like this as the resulting schema:

column1 integer,
column2 integer,
Column3 integer

You can set spark.sql.caseSensitive=true in your configuration to make Spark SQL schemas case-sensitive. This also affects schema merging:

scala> spark.conf.set("spark.sql.caseSensitive","true")

scala> val df = sc.parallelize(1 to 1000).toDF()
df: org.apache.spark.sql.DataFrame = [value: int]

scala> df.withColumnRenamed("value","VALUE").write.parquet("test_uc")

scala> df.write.parquet("test_lc")

scala> val df2=spark.read.option("mergeSchema","true").parquet("test_*")
df2: org.apache.spark.sql.DataFrame = [value: int, VALUE: int]

scala> val merged = df2.columns.groupBy(_.toLowerCase)
                   .map(t => coalesce(t._2.map(col):_*).as(t._1))
                   .toArray
merged: Array[org.apache.spark.sql.Column] = Array(coalesce(value, VALUE) AS `value`)

scala> df2.select(merged:_*)
res2: org.apache.spark.sql.DataFrame = [value: int]

scala> spark.conf.set("spark.sql.caseSensitive","false")

// process your dataframe
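
If you need this in several places, the whole sequence (enable case sensitivity, merge, coalesce the case variants, restore the setting) can be wrapped in a helper. A sketch, with a function name of my own choosing:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col}

def readMergedIgnoreCase(spark: SparkSession, paths: String*): DataFrame = {
  val previous = spark.conf.get("spark.sql.caseSensitive")
  spark.conf.set("spark.sql.caseSensitive", "true")
  try {
    val df = spark.read.option("mergeSchema", "true").parquet(paths: _*)
    // Collapse columns that differ only in case into one lower-cased column each
    val merged = df.columns.groupBy(_.toLowerCase)
      .map { case (name, variants) => coalesce(variants.map(col): _*).as(name) }
      .toSeq
    df.select(merged: _*)
  } finally {
    // Restore the original setting so the rest of the session is unaffected
    spark.conf.set("spark.sql.caseSensitive", previous)
  }
}

Note that this lower-cases every output column name, so the question's expected schema would come back as column1, column2, column3 rather than keeping Column3 as-is.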
