How to do a Reduce Side Join as a Map Reduce Job with mrjob in Python
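I have 2 datasets that I want to join, namely the transactions dataset and the contract dataset. I want to use address (contract) and to_address (transactions) as the join attributes and output the value attribute.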
contract dataset fields:
address, is_erc20, is_erc721, block_number, block_timestamp
transactions dataset fields:
block_number, from_address, to_address, value, gas, gas_price, timestamp
So what I am trying to do is a join with the following output: address, value
Example:
transactions dataset:
to_address value
0x412270b1f0f3884 240648550000000000
0x8d5a0a7c555602f 984699000000000000
contract dataset:
address
0x412270b1f0f3884
the output should be:
to_address value
0x412270b1f0f3884 240648550000000000
as 0x8d5a0a7c555602f is not present in the contract dataset.
Below is my code, and I am not sure what I am doing wrong. Any help?
from mrjob.job import MRJob

class repartition_join(MRJob):

    def mapper(self, _, line):
        try:
            if(len(line.split(','))==5): #contracts dataset
                fields=line.split(',')
                join_key=fields[0] #key is address
                yield (join_key, 1) #yield join key given id 1?
            elif(len(line.split(','))==7): #transactions dataset
                fields=line.split(',')
                join_key=fields[2] #to_address, which is the key
                join_value=int(fields[3]) #[3] = value
                yield (join_key,(join_value,2)) #gives key with value
        except:
            pass

    def reducer(self, key, values):
        val = None
        for value in values:
            if value[1] == 2:
                val = (value[0])
        yield(key, val)

if __name__=='__main__':
    repartition_join.run()
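Think about the map-reduce pipeline for a reduce-side join again. It seems you are having trouble understanding it.
To distinguish the key-value pairs coming from your two relations, you have to add a relation symbol to the values yielded by your mapper. Assuming you want an inner join, the reducer must yield a tuple for a join key only if that key has a tuple in both your contracts and your transactions dataset. Therefore you have to keep the tuples of the two relations in separate lists and identify each tuple by its relation symbol. This can easily be adapted to other joins, e.g. (left/right/full) outer joins and semi/anti joins.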
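To illustrate how the per-key decision changes between join types, here is a minimal sketch in plain Python. The helper `join_group` and its `how` argument are hypothetical names invented for this illustration (they are not part of mrjob); the function assumes the reducer has already split the values for one key into a contracts list and a transactions list, and it treats transactions as the left relation.

```python
def join_group(key, contracts, transactions, how="inner"):
    """Yield (key, value) pairs for a single join key.

    `contracts` and `transactions` are the per-relation lists collected
    in the reducer. `how` is a hypothetical switch, not an mrjob option.
    """
    if how == "inner":
        # Emit transactions only if the address also appears in contracts.
        if contracts:
            for value in transactions:
                yield key, value
    elif how == "left_outer":
        # Keep every transaction and flag whether a contract matched.
        for value in transactions:
            yield key, (value, bool(contracts))
    elif how == "anti":
        # Keep transactions whose to_address is NOT a contract.
        if not contracts:
            for value in transactions:
                yield key, value
    else:
        raise ValueError("unknown join type: %s" % how)
```

Because the contract tuples carry no payload here, the inner case only checks that at least one contract tuple exists for the key.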
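In the following example I use the relation symbol 'C' for the contracts dataset and 'T' for the transactions dataset. I could not try it myself because I do not have the datasets, but it should work like this. If you have any problems, let me know in the comments.
I recommend reading the book "MapReduce Design Patterns" by Donald Miner and Adam Shook, as it also explains the common join algorithms for map-reduce tasks. Also check out the latest mrjob documentation.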
from mrjob.job import MRJob
from mrjob.step import MRStep

class repartition_join(MRJob):

    def mapper(self, _, line):
        fields = line.split(',')
        if len(fields) == 5: # contracts dataset
            join_key = fields[0] # key is in attribute address
            yield (join_key, ('C', 1)) # yield join key, value not used
        elif len(fields) == 7: # transactions dataset
            join_key = fields[2] # key is in attribute to_address
            join_value = int(fields[3]) # value is in attribute value
            yield (join_key, ('T', join_value)) # yields join key with value
        else:
            pass # TODO handle error

    def reducer(self, key, values):
        address = key # the join key
        contracts_tuples = []
        transactions_tuples = []
        for value in values:
            relation_symbol = value[0] # either 'T' or 'C'
            if relation_symbol == 'C': # contracts dataset
                contracts_tuples.append(value[1]) # always 1 - just to know that there is a tuple in contracts
            elif relation_symbol == 'T': # transactions dataset
                transactions_tuples.append(value[1]) # append the value inside value attribute
            else:
                pass # TODO handle error
        # inner join contract and transaction, generalize if needed
        if len(contracts_tuples) > 0 and len(transactions_tuples) > 0:
            for value in transactions_tuples:
                yield (address, value)

    def steps(self):
        return [MRStep(
            mapper=self.mapper,
            reducer=self.reducer)
        ]

if __name__=='__main__':
    repartition_join.run()
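Since the job above was not run against the real data, the same mapper and reducer logic can be checked locally without mrjob or Hadoop. The following is only a sketch: `map_line`, `reduce_group` and `run_local` are hypothetical helper names, the shuffle phase is imitated with a dictionary, and the input rows reuse the addresses and values from the question while every other field is a made-up placeholder.

```python
from collections import defaultdict

def map_line(line):
    """Tag each record with a relation symbol, as in the mapper above."""
    fields = line.split(',')
    if len(fields) == 5:        # contracts record
        yield fields[0], ('C', 1)
    elif len(fields) == 7:      # transactions record
        yield fields[2], ('T', int(fields[3]))

def reduce_group(key, values):
    """Inner join for one key: emit values only if a contract exists."""
    has_contract = False
    transaction_values = []
    for symbol, payload in values:
        if symbol == 'C':
            has_contract = True
        elif symbol == 'T':
            transaction_values.append(payload)
    if has_contract:
        for value in transaction_values:
            yield key, value

def run_local(lines):
    """Imitate map -> shuffle (group by key) -> reduce in memory."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_group(key, groups[key]))
    return output

# Addresses and values come from the question; other fields are placeholders.
contracts = ["0x412270b1f0f3884,false,false,0,0"]
transactions = [
    "0,0xfrom,0x412270b1f0f3884,240648550000000000,0,0,0",
    "0,0xfrom,0x8d5a0a7c555602f,984699000000000000,0,0,0",
]
result = run_local(contracts + transactions)
print(result)  # [('0x412270b1f0f3884', 240648550000000000)]
```

Only the first transaction survives, because 0x8d5a0a7c555602f has no tuple in the contracts input.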