How to do a Reduce Side Join as a Map Reduce Job with mrjob in Python
I have two datasets that I am trying to combine, namely the transactions dataset and the contract dataset. I want to use address and to_address, respectively, as the join attribute, and the value attribute for the value.
contract dataset fields:
address, is_erc20, is_erc721, block_number, block_timestamp
transactions dataset fields:
block_number, from_address, to_address, value, gas, gas_price, timestamp
So what I'm trying to do is a join whose output is: address, value
Example:
transactions dataset:
to_address value
0x412270b1f0f3884 240648550000000000
0x8d5a0a7c555602f 984699000000000000
contract dataset:
address
0x412270b1f0f3884
the output should be:
to_address value
0x412270b1f0f3884 240648550000000000
as 0x8d5a0a7c555602f is not present in the contract dataset.
Below is the code I have, and I'm not sure what I'm doing wrong. Any help?
from mrjob.job import MRJob
class repartition_join(MRJob):
    def mapper(self, _, line):
        try:
            if(len(line.split(','))==5): #contracts dataset
                fields=line.split(',')
                join_key=fields[0] #key is address
                yield (join_key, 1) #yield join key given id 1?
            elif(len(line.split(','))==7): #transactions dataset
                fields=line.split(',')
                join_key=fields[2] #to_address, which is the key
                join_value=int(fields[3]) #[3] = value
                yield (join_key,(join_value,2)) #gives key with value
        except:
            pass
    def reducer(self, key, values):
        val = None
        for value in values:
            if value[1] == 2:
                val = (value[0])
        yield(key, val)
if __name__=='__main__':
    repartition_join.run()
Think about your map-reduce pipeline for the Reduce Side Join again. It looks like you have difficulties in understanding it.
In order to distinguish the key-value pairs of your two relations, you have to add a relation symbol to the value your mapper is yielding. Assuming you want to do an inner join, you have to yield a tuple in the reducer only if there is a tuple for that key in both your contracts and your transactions dataset. Thus, you have to hold the tuples of those relations in separate lists and identify a tuple by its relation symbol. This can easily be adjusted for other joins, e.g. (Left/Right/Full) Outer Join or Semi/Anti-Join.
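To make the "separate lists" idea concrete, here is a minimal sketch of the decision a reducer could take for a single join key once the tuples are split by relation. The function name join_for_key and the how parameter are made up for this illustration and are not part of mrjob. It also assumes that contracts is treated as the left relation and that None stands in for a missing value.

```python
def join_for_key(address, contract_tuples, transaction_tuples, how="inner"):
    """Return the output pairs for one join key (hypothetical helper).

    contract_tuples / transaction_tuples are the per-relation lists the
    reducer collected for this key. contracts is assumed to be the left side.
    """
    has_contract = len(contract_tuples) > 0
    has_transaction = len(transaction_tuples) > 0

    if how == "inner":
        # Emit only when the key occurs in both relations.
        if has_contract and has_transaction:
            return [(address, value) for value in transaction_tuples]
        return []
    if how == "left_outer":
        # Keep every contract; use None when it has no transaction.
        if not has_contract:
            return []
        if has_transaction:
            return [(address, value) for value in transaction_tuples]
        return [(address, None)]
    if how == "semi":
        # Contracts that have at least one transaction, emitted once.
        return [(address, None)] if has_contract and has_transaction else []
    if how == "anti":
        # Contracts that have no transaction at all.
        return [(address, None)] if has_contract and not has_transaction else []
    raise ValueError("unknown join type: " + how)


print(join_for_key("0x412270b1f0f3884", [1], [240648550000000000]))
print(join_for_key("0x8d5a0a7c555602f", [], [984699000000000000]))
```

Only the final condition changes between join types; the mapper and the splitting into two lists stay the same.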
In the following example, I used the relation symbol 'C' for the contracts and 'T' for the transactions dataset. I cannot try it out myself because I do not have the dataset, but it should work like this. If you have any trouble, let me know with a comment.
I suggest you have a look at the book "MapReduce Design Patterns" by Donald Miner and Adam Shook, because it also explains common join algorithms for MapReduce tasks. Also check out the latest mrjob documentation.
from mrjob.job import MRJob
from mrjob.step import MRStep
class repartition_join(MRJob):
    def mapper(self, _, line):
        fields = line.split(',')
        if len(fields) == 5: # contracts dataset
            join_key = fields[0] # key is in attribute address
            yield (join_key, ('C', 1)) # yield join key, value not used
        elif len(fields) == 7: # transactions dataset
            join_key = fields[2] # key is in attribute to_address
            join_value = int(fields[3]) # value is in attribute value
            yield (join_key, ('T', join_value)) # yields join key with value
        else:
            pass # TODO handle error
    def reducer(self, key, values):
        address = key # the join key
        contracts_tuples = []
        transactions_tuples = []
        for value in values:
            relation_symbol = value[0] # either 'T' or 'C'
            if relation_symbol == 'C': # contracts dataset
                contracts_tuples.append(value[1]) # always 1 - just to know that there is a tuple in contracts
            elif relation_symbol == 'T': # transactions dataset
                transactions_tuples.append(value[1]) # append the value inside value attribute
            else:
                pass # TODO handle error
        # inner join contract and transaction, generalize if needed
        if len(contracts_tuples) > 0 and len(transactions_tuples) > 0:
            for value in transactions_tuples:
                yield (address, value)
    def steps(self):
        return [MRStep(
            mapper=self.mapper,
            reducer=self.reducer)
        ]
if __name__=='__main__':
    repartition_join.run()
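Since the job above was not run against the real data, the same map, shuffle and reduce logic can be checked without mrjob or Hadoop. The following is a minimal plain-Python sketch of that idea, not how mrjob itself executes a job. The function run_locally is a made-up helper. The addresses and values in the sample rows come from the example in the question, while all other fields are placeholder values invented only to get the 5 and 7 column layouts.

```python
from collections import defaultdict


def mapper(line):
    """Tag each record with its relation symbol, keyed by the join attribute."""
    fields = line.split(',')
    if len(fields) == 5:        # contract record: address is column 0
        yield fields[0], ('C', 1)
    elif len(fields) == 7:      # transactions record: to_address, value
        yield fields[2], ('T', int(fields[3]))


def reducer(key, values):
    """Inner join: emit transaction values only if the key is also a contract."""
    contracts, transactions = [], []
    for symbol, payload in values:
        if symbol == 'C':
            contracts.append(payload)
        elif symbol == 'T':
            transactions.append(payload)
    if contracts and transactions:
        for value in transactions:
            yield key, value


def run_locally(lines):
    """Hypothetical helper: group mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results


sample_lines = [
    # contract: address, is_erc20, is_erc721, block_number, block_timestamp
    "0x412270b1f0f3884,false,false,100,1500000000",
    # transactions: block_number, from_address, to_address, value, gas, gas_price, timestamp
    "101,0xaaa,0x412270b1f0f3884,240648550000000000,21000,5,1500000100",
    "102,0xbbb,0x8d5a0a7c555602f,984699000000000000,21000,5,1500000200",
]

joined = run_locally(sample_lines)
print(joined)  # [('0x412270b1f0f3884', 240648550000000000)]
```

The second transaction is dropped because its to_address has no matching contract, which is the output the question asks for. With the real mrjob job, both input files are passed as arguments on the command line, so that the mapper sees lines from both datasets.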