
Validate CSV file columns with Spark

I am trying to read a CSV file (which is supposed to have a header) in Spark and load the data into an existing table (with predefined columns and data types). The CSV file can be very large, so it would be great if I could avoid loading it when the column header from the CSV is not "valid".

When I'm currently reading the file, I'm specifying a StructType as the schema, but this does not validate that the header contains the right columns in the right order. This is what I have so far (I'm building the "schema" StructType elsewhere):

sqlContext
  .read()
  .format("csv")
  .schema(schema)
  .load("pathToFile");

If I add the .option("header", "true") line, it will skip over the first line of the CSV file and use the names I'm passing in the StructType's add method (e.g. if I build the StructType with "id" and "name" and the first row in the CSV is "idzzz,name", the resulting dataframe will have columns "id" and "name"). I want to be able to validate that the CSV header has the same column names as the table I'm planning to load the CSV into.

I tried reading the file with .head() and doing some checks on that first row, but that downloads the whole file.

Any suggestion is more than welcome.

From what I understand, you want to validate the schema of the CSV you read. The problem with the schema option is that its goal is to tell Spark that this is the schema of your data, not to check that it is.

There is, however, an option that infers said schema when reading a CSV, and that could be very useful in your situation (inferSchema). Then, you can either compare that inferred schema with the one you expect using equals, or use the small workaround I will introduce to be a little more permissive.

Let's see how it works with the following file:

a,b
1,abcd
2,efgh

Then, let's read the data. I used the Scala REPL, but you should be able to translate all of this to Java very easily.

val df = spark.read
    .option("header", true) // reading the header
    .option("inferSchema", true) // inferring the schema
    .csv(".../file.csv")
// then let's define the schema you would expect
val schema = StructType(Array(StructField("a", IntegerType),
                              StructField("b", StringType)))

// And we can check that the schema spark inferred is the same as the one
// we expect:
schema.equals(df.schema)
// res14: Boolean = true

Going further

That's in a perfect world. Indeed, if your schema contains non-nullable columns, for instance, or other small differences, this solution based on strict object equality will not work.

val schema2 = StructType(Array(StructField("a", IntegerType, false),
                               StructField("b", StringType, true)))
// the first column is non-nullable; it does not work because all the columns
// are nullable when inferred by Spark:
schema2.equals(df.schema)
// res15: Boolean = false

In that case you may need to implement a schema comparison method that suits your needs, for example:

def equalSchemas(s1: StructType, s2: StructType): Boolean = {
  // guard against different column counts before indexing into both schemas
  s1.size == s2.size &&
    s1.indices.forall(i => s1(i).name.equalsIgnoreCase(s2(i).name) &&
                           s1(i).dataType == s2(i).dataType)
}
equalSchemas(schema2, df.schema)
// res23: Boolean = true

Here I am checking that the names and types of the columns match and that the order is the same. You may need to implement different logic depending on what you want.
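For instance, if column order should not matter, the comparison can be made order-insensitive. Here is a minimal, Spark-free sketch in Java (the question's language): the (name, type) string pairs stand in for what you would extract from df.schema().fields() in real Spark code, so the class and pairs here are illustrative, not Spark API.

```java
import java.util.Arrays;
import java.util.List;

public class SchemaCheck {
    // Order-insensitive, case-insensitive comparison of (name, type) column pairs.
    // Each String[] holds {columnName, typeName}.
    static boolean sameColumns(List<String[]> expected, List<String[]> actual) {
        if (expected.size() != actual.size()) {
            return false;
        }
        for (String[] col : expected) {
            boolean found = actual.stream().anyMatch(a ->
                a[0].equalsIgnoreCase(col[0]) && a[1].equals(col[1]));
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String[]> expected = Arrays.asList(
            new String[]{"a", "int"}, new String[]{"b", "string"});
        // same columns, different order and case
        List<String[]> inferred = Arrays.asList(
            new String[]{"B", "string"}, new String[]{"A", "int"});
        System.out.println(sameColumns(expected, inferred)); // prints true
    }
}
```

In actual Spark code you would build these pairs from each StructField's name and data type, then apply whatever matching policy (ordered, unordered, case-sensitive or not) your table requires.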
