Case-sensitive Parquet schema merge in Spark
I'm trying to load and analyze some Parquet files with Spark. I'm using mergeSchema to load the files, since newer files have some extra columns. Also, some files have their column names in lower case while others have them in upper case.
For example, file1.parquet has a schema like:
column1 integer,
column2 integer
and file2.parquet has something like:
Column1 integer,
Column2 integer,
Column3 integer
I'm running into an issue with the inferSchema method of the ParquetFileFormat class. Schema merging is delegated to the merge method of Spark SQL's StructType. From what I can tell, that method can only work in a case-sensitive way. Internally it uses a map to look up fields by name, and if the cases don't match it interprets that as a new field. Later, when the schema is checked for duplicates, the case-sensitivity configuration is respected and we end up with duplicate columns. This results in
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema
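The case-sensitive lookup described above can be sketched in plain Scala. This is an illustrative toy, not Spark's actual StructType.merge implementation; the two sequences simply stand in for the schemas of file1.parquet and file2.parquet:

```scala
// Sketch of why a name-keyed lookup makes the merge case sensitive.
val left  = Seq("column1" -> "integer", "column2" -> "integer")
val right = Seq("Column1" -> "integer", "Column2" -> "integer", "Column3" -> "integer")

// Fields are matched by their exact name...
val leftNames = left.map(_._1).toSet

// ...so "Column1" does not match "column1" and every right-hand field
// is appended as if it were new.
val merged = left ++ right.filterNot { case (name, _) => leftNames.contains(name) }

println(merged.map(_._1).mkString(", "))
// column1, column2, Column1, Column2, Column3
```

The merged result carries both "column1" and "Column1", which is exactly the duplicate-column condition that the later case-insensitive duplicate check rejects.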
Is there any way to make the schema merge case insensitive?
I was expecting to get something like this as the resulting schema:
column1 integer,
column2 integer,
Column3 integer
You can set spark.sql.caseSensitive=true in your configuration to make Spark SQL schemas case-sensitive. It also affects schema merging.
scala> spark.conf.set("spark.sql.caseSensitive","true")
scala> val df = sc.parallelize(1 to 1000).toDF()
df: org.apache.spark.sql.DataFrame = [value: int]
scala> df.withColumnRenamed("value","VALUE").write.parquet("test_uc")
scala> df.write.parquet("test_lc")
scala> val df2=spark.read.option("mergeSchema","true").parquet("test_*")
df2: org.apache.spark.sql.DataFrame = [value: int, VALUE: int]
scala> val merged = df2.columns.groupBy(_.toLowerCase)
.map(t => coalesce(t._2.map(col):_*).as(t._1))
.toArray
merged: Array[org.apache.spark.sql.Column] = Array(coalesce(value, VALUE) AS `value`)
scala> df2.select(merged:_*)
res2: org.apache.spark.sql.DataFrame = [value: int]
scala> spark.conf.set("spark.sql.caseSensitive","false")
// process your dataframe with the deduplicated schema