Joining two datasets spark scala
I have two CSV files (datasets), file1 and file2.
File1 contains the following columns:
Orders | Requests | Book1  | Book2
Varchar| Integer  | Integer| Integer
File2 contains the following columns:
Book3 | Book4 | Book5  | Orders
String| String| Varchar| Varchar
How can I join the data from these two CSV files in Scala?
You can join the two CSVs via pair RDDs.
// LineParser and job are application-specific helpers (not shown here):
// LineParser splits a delimited line and exposes getKey()/getValue().
val rightFile = job.patch.get.file
val rightFileByKeys = sc.textFile(rightFile).map { line =>
  new LineParser(line, job.patch.get.patchKeyIndex, job.delimRegex, Some(job.patch.get.patchValueIndex))
}.keyBy(_.getKey())

val leftFileByKeys = sc.textFile(leftFile).map { line =>
  new LineParser(line, job.patch.get.fileKeyIndex, job.delimRegex)
}.keyBy(_.getKey())

// Join the two keyed RDDs and concatenate the matching lines
leftFileByKeys.join(rightFileByKeys).map { case (key, (left, right)) =>
  (job, left.line + job.delim + right.getValue())
}