
Spark CSV with various delimiters into DataSet

I have two CSV files that I am aggregating using Spark with Java. These files have different delimiters.

file1.dat:

011!345!Ireland

file2.dat:

022Ç486ÇBrazil

Code I use:

Dataset<Row> people = spark.read().format("csv")
  .option("header", "false")
  .option("delimiter", "\u00C7")
  .option("delimiter", "!")   // second call overwrites the first
  .load(logFile);

Output:

Error:Cannot resolve column name

If I remove one delimiter:

Dataset<Row> people = spark.read().format("csv")
  .option("header", "false")
  .option("delimiter", "\u00C7")
  .load(logFile);

Output:

Sno|code|Country
null|null|null 
022|486|Brazil

Is there a way to do this? Can both of these files be aggregated in the same Spark code?

You can't use multiple delimiters in a single read: calling option("delimiter", ...) twice simply overwrites the first value. Also note the option key is spelled "delimiter", not "delimeter".

Instead, read the files separately and use union to merge them together. For example:

Dataset<Row> people1 = spark.read()
  .option("header", "false")
  .option("delimiter", "!")
  .csv(logFile1);
Dataset<Row> people2 = spark.read()
  .option("header", "false")
  .option("delimiter", "\u00C7")
  .csv(logFile2);

Dataset<Row> people = people1.union(people2);
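For completeness, here is a minimal self-contained sketch of the approach. The class name, helper method, and file paths are hypothetical. Since union matches columns by position, it also names the columns explicitly with toDF so later lookups like "Country" resolve (avoiding the "Cannot resolve column name" error):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical helper class illustrating the union approach.
public class MergeCsv {

    // Read each file with its own delimiter, give both Datasets the same
    // column names (union matches columns by position), then merge them.
    public static Dataset<Row> merge(SparkSession spark, String path1, String path2) {
        Dataset<Row> people1 = spark.read()
            .option("header", "false")
            .option("delimiter", "!")        // file1.dat uses '!'
            .csv(path1)
            .toDF("Sno", "code", "Country");

        Dataset<Row> people2 = spark.read()
            .option("header", "false")
            .option("delimiter", "\u00C7")   // file2.dat uses 'Ç'
            .csv(path2)
            .toDF("Sno", "code", "Country");

        return people1.union(people2);
    }

    public static void main(String[] args) {
        // local[*] master for a standalone run; adjust for your cluster
        SparkSession spark = SparkSession.builder()
            .appName("merge-csv")
            .master("local[*]")
            .getOrCreate();

        merge(spark, "file1.dat", "file2.dat").show();
        spark.stop();
    }
}
```

If the two files ever end up with differently ordered columns, unionByName is the safer choice, since it matches columns by name rather than position.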
