
Set operation diff on two files

I want to get the diff of two flat/CSV files, source and target, which have the same schema.

Let's say,

source.txt:

EmpId|RegionId|Sales
001|R01|$10000
002|R02|$20000
003|R03|$30000

target.txt:

EmpId|RegionId|Sales
001|R01|$10000
002|R02|$10000
004|R04|$40000

Result should be:

EmpId1|RegionId1|Sales1|EmpId2|RegionId2|Sales2|Result_Status
001|R01|$10000|001|R01|$10000|matched
002|R02|$20000|002|R02|$10000|unmatched
003|R03|$30000|NULL|NULL|NULL|unmatched
NULL|NULL|NULL|004|R04|$40000|unmatched

Any help would be appreciated!

Edited:

The problem may look simple, but the two given files are huge, and I am trying to find the best way of doing it. Performance is the major criterion here; the technology can be anything, even Hadoop MapReduce. I tried using Hive, but it was a bit slow.
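For reference, the expected output amounts to a full outer join on EmpId with a match flag. Below is a minimal single-machine sketch in Python (the file names are taken from the example above); it is only a baseline to pin down the semantics and will not scale to huge files:

# diff_small.py - baseline full outer join on EmpId for files that fit in memory
def load(path):
    rows = {}
    with open(path) as f:
        next(f)                       # skip the header line
        for line in f:
            emp, region, sales = line.rstrip("\n").split("|")
            rows[emp] = (emp, region, sales)
    return rows

source = load("source.txt")
target = load("target.txt")
null = ("NULL", "NULL", "NULL")

print("EmpId1|RegionId1|Sales1|EmpId2|RegionId2|Sales2|Result_Status")
for emp in sorted(set(source) | set(target)):
    s = source.get(emp, null)
    t = target.get(emp, null)
    status = "matched" if s == t else "unmatched"
    print("|".join(s + t + (status,)))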

Here is a map-reduce approach to solve it (in high-level pseudo code):

map(source):
   for each line x|y|z:
     emitIntermediate(x,(1,y|z))
map(target):
   for each line x|y|z:
     emitIntermediate(x,(2,y|z))

//make sure each list is sorted (or sort it yourself): tag 1 comes before tag 2 if both exist
reduce(x, list):
   if list.size() == 1:
      (idx,y|z) <- list.first() //the single element in the list, tagged with its origin (1 = source, 2 = target)
      if idx == 1:
            emit(x|y|z|NULL|NULL|NULL|unmatched)
      else:
            emit(NULL|NULL|NULL|x|y|z|unmatched)
   else:
       (1,y1|z1) <- list.first()
       (2,y2|z2) <- list.last()
       m = (y1|z1 matches y2|z2 ? "matched" : "unmatched")
       emit(x|y1|z1|x|y2|z2|m)

The idea is to partition the data by EmpId in the map phase, so that all records with the same ID reach the same reducer, and to let the reducers check whether the region and sales match.
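A sketch of this pseudo code as Hadoop Streaming scripts in Python follows (the script names, the header handling, and the use of the mapreduce_map_input_file environment variable and a "source"/"target" file-name convention to tag records are assumptions, not part of the original answer):

# mapper.py - tag every record with 1 (source) or 2 (target), keyed by EmpId
import os
import sys

# Hadoop Streaming exposes the current split's file name as an environment
# variable (mapreduce_map_input_file in Hadoop 2+, map_input_file in older
# releases); it is used here to tell source records from target records.
input_file = os.environ.get("mapreduce_map_input_file",
                            os.environ.get("map_input_file", ""))
tag = "1" if "source" in input_file else "2"

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line or line.startswith("EmpId"):   # skip header and blank lines
        continue
    emp, region, sales = line.split("|")
    # key <TAB> tag <TAB> rest-of-record
    print("%s\t%s\t%s|%s" % (emp, tag, region, sales))

# reducer.py - records arrive grouped by EmpId; compare source vs. target
import sys

NULL = "NULL|NULL|NULL"

def flush(emp, recs):
    recs.sort()                          # tag 1 (source) before tag 2 (target)
    if len(recs) == 1:
        tag, rest = recs[0]
        if tag == "1":
            print("%s|%s|%s|unmatched" % (emp, rest, NULL))
        else:
            print("%s|%s|%s|unmatched" % (NULL, emp, rest))
    else:
        (_, rest1), (_, rest2) = recs[0], recs[-1]
        status = "matched" if rest1 == rest2 else "unmatched"
        print("%s|%s|%s|%s|%s" % (emp, rest1, emp, rest2, status))

current, records = None, []
for line in sys.stdin:
    emp, tag, rest = line.rstrip("\n").split("\t")
    if emp != current and current is not None:
        flush(current, records)
        records = []
    current = emp
    records.append((tag, rest))
if current is not None:
    flush(current, records)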

Implementing it over a large cluster (with the data on a distributed file system) can improve performance significantly, since the work is spread across the cluster by the map-reduce framework.

You can use Hadoop as an implementing framework, for example.
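With the two scripts sketched above, a typical Hadoop Streaming invocation would look roughly like this (the jar location and HDFS paths are assumptions and depend on your installation):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -input /data/source.txt \
    -input /data/target.txt \
    -output /data/diff_result \
    -mapper "python mapper.py" \
    -reducer "python reducer.py"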
