
Spark CSV with various delimiters into DataSet

I have two CSV files that I am aggregating using Spark with Java. These files have different delimiters.

file1.dat:

011!345!Ireland

file2.dat:

022Ç486ÇBrazil

Code I use:

Dataset<Row> people = spark.read().format("csv")
  .option("header", "false")
  .option("delimiter", "\u00C7")
  .option("delimiter", "!")   // second call overwrites the first
  .load(logFile);

Output:

Error:Cannot resolve column name

If I remove one delimiter:

Dataset<Row> people = spark.read().format("csv")
  .option("header", "false")
  .option("delimiter", "\u00C7")
  .load(logFile);

Output:

Sno|code|Country
null|null|null 
022|486|Brazil

Is there a way to do this? Can both of these files be aggregated in the same Spark code?

You can't use multiple delimiters in a single read: calling option("delimiter", ...) twice simply overwrites the first value. Also note the option key is spelled "delimiter", not "delimeter".

Instead, read the files separately and use union to merge them together. For example:

Dataset<Row> people1 = spark.read()
  .option("header", "false")
  .option("delimiter", "!")
  .csv(logFile1);
Dataset<Row> people2 = spark.read()
  .option("header", "false")
  .option("delimiter", "\u00C7")
  .csv(logFile2);

Dataset<Row> people = people1.union(people2);
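For completeness, here is a minimal self-contained sketch of the approach. The class name, helper method, and file paths are hypothetical. Since union matches columns by position, it also names the columns explicitly with toDF so later lookups like "Country" resolve (avoiding the "Cannot resolve column name" error):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical helper class illustrating the union approach.
public class MergeCsv {

    // Read each file with its own delimiter, give both Datasets the same
    // column names (union matches columns by position), then merge them.
    public static Dataset<Row> merge(SparkSession spark, String path1, String path2) {
        Dataset<Row> people1 = spark.read()
            .option("header", "false")
            .option("delimiter", "!")        // file1.dat uses '!'
            .csv(path1)
            .toDF("Sno", "code", "Country");

        Dataset<Row> people2 = spark.read()
            .option("header", "false")
            .option("delimiter", "\u00C7")   // file2.dat uses 'Ç'
            .csv(path2)
            .toDF("Sno", "code", "Country");

        return people1.union(people2);
    }

    public static void main(String[] args) {
        // local[*] master for a standalone run; adjust for your cluster
        SparkSession spark = SparkSession.builder()
            .appName("merge-csv")
            .master("local[*]")
            .getOrCreate();

        merge(spark, "file1.dat", "file2.dat").show();
        spark.stop();
    }
}
```

If the two files ever end up with differently ordered columns, unionByName is the safer choice, since it matches columns by name rather than position.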
