Left merging a list of dictionaries in Python with multiple keys, filling in with zeros
I am essentially trying to replace the pd.merge() functionality in my script without using pandas.
If I have 2 lists of dictionaries (as below):
l1 = [{'key1': '2017', 'key2': '20-30', 'val1': 11},
{'key1': '2017', 'key2': '30-40', 'val1': 22},
{'key1': '2017', 'key2': '40-50', 'val1': 33},
{'key1': '2017', 'key2': '50+', 'val1': 44},
{'key1': '2018', 'key2': '20-30', 'val1': 55},
{'key1': '2018', 'key2': '30-40', 'val1': 66},
{'key1': '2018', 'key2': '40-50', 'val1': 77},
{'key1': '2018', 'key2': '50+', 'val1': 88}]
l2 = [{'key1': '2017', 'key2': '20-30', 'val2': 1000},
{'key1': '2017', 'key2': '40-50', 'val3': 2000},
{'key1': '2018', 'key2': '50+', 'val3': 3000}]
I want to "left merge" them on multiple keys to produce the following result:
output = [{'key1': '2017', 'key2': '20-30', 'val1': 11, 'val2':1000, 'val3':0},
{'key1': '2017', 'key2': '30-40', 'val1': 22, 'val2':0, 'val3':0},
{'key1': '2017', 'key2': '40-50', 'val1': 33, 'val2':0, 'val3':2000},
{'key1': '2017', 'key2': '50+', 'val1': 44, 'val2':0, 'val3':0},
{'key1': '2018', 'key2': '20-30', 'val1': 55, 'val2':0, 'val3':0},
{'key1': '2018', 'key2': '30-40', 'val1': 66, 'val2':0, 'val3':0},
{'key1': '2018', 'key2': '40-50', 'val1': 77, 'val2':0, 'val3':0},
{'key1': '2018', 'key2': '50+', 'val1': 88, 'val2':0, 'val3':3000}]
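The closest I have gotten is the code below, written using this as a reference, but I am not sure how to get it exactly right (including the zeros).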
l1 = {(d['key1'], d['key2']):d for d in l1}
all = [dict(d, **l1.get((d['key1'], d['key2']), {})) for d in l2]
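When working with pandas DataFrames, pandas usually knows the number of columns and their data types in advance.
If all elements within each list share the same structure (the keys in l1 may differ from those in l2, but all elements of l1 have the same keys and all elements of l2 have the same keys), then discovering the default types and the full set of keys for each output dictionary is an O(1) operation. Here, however, the elements of l2 have different keys, so you have to scan the list, an O(n) operation, to find the full set of columns/keys in l2.
See the left_merge function in the code below. It is more verbose, but it explains what is happening.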
l1 = [{'key1': '2017', 'key2': '20-30', 'val1': 11},
{'key1': '2017', 'key2': '30-40', 'val1': 22},
{'key1': '2017', 'key2': '40-50', 'val1': 33},
{'key1': '2017', 'key2': '50+', 'val1': 44},
{'key1': '2018', 'key2': '20-30', 'val1': 55},
{'key1': '2018', 'key2': '30-40', 'val1': 66},
{'key1': '2018', 'key2': '40-50', 'val1': 77},
{'key1': '2018', 'key2': '50+', 'val1': 88}]
l2 = [{'key1': '2017', 'key2': '20-30', 'val2': 1000},
{'key1': '2017', 'key2': '40-50', 'val3': 2000},
{'key1': '2018', 'key2': '50+', 'val3': 3000}]
op_output = [{'key1': '2017', 'key2': '20-30', 'val1': 11, 'val2':1000, 'val3': 0},
{'key1': '2017', 'key2': '30-40', 'val1': 22, 'val2':0, 'val3':0},
{'key1': '2017', 'key2': '40-50', 'val1': 33, 'val2':0, 'val3':2000},
{'key1': '2017', 'key2': '50+', 'val1': 44, 'val2':0, 'val3':0},
{'key1': '2018', 'key2': '20-30', 'val1': 55, 'val2':0, 'val3':0},
{'key1': '2018', 'key2': '30-40', 'val1': 66, 'val2':0, 'val3':0},
{'key1': '2018', 'key2': '40-50', 'val1': 77, 'val2':0, 'val3':0},
{'key1': '2018', 'key2': '50+', 'val1': 88, 'val2':0, 'val3':3000}]
def left_merge(a, b, key):
    # a and b are lists of dictionaries
    # key is a callable
    # TODO: bounds checking if a is empty or b is empty
    b_index = {key(i): i for i in b}
    output = []
    # pick one element from a and b so we know the final columns
    merged_item_columns = set()
    merged_item_columns.update(a[0].keys())
    merged_item_columns.update(b[0].keys())
    # UPDATE: Above assumption of picking one element from list a and b
    # does not hold true
    # In OP's question: l2 has some records with val2, some with val3.
    # So it isn't like a dataframe where all columns are known in advance.
    # Discovery requires scanning all elements (sigh)
    # This can be done when creating the index for b at least.
    b_index = {}  # replaces the original b_index computation at the beginning.
    # NOTE: if l1 also has similar characteristics, it will also require a similar scan.
    for i in b:
        b_index[key(i)] = i
        merged_item_columns.update(i.keys())
    # TODO: determine type for each column and choose correct defaults
    # using 0 as default for now.
    merged_item_template = {k: 0 for k in merged_item_columns}
    for a_item in a:
        merged_item = merged_item_template.copy()
        merged_item.update(a_item)
        b_item = b_index.get(key(a_item))
        if b_item is not None:
            merged_item.update(b_item)
        output.append(merged_item)
    return output
output = left_merge(l1, l2, key=lambda x:(x['key1'], x['key2']))
print(output)
print(op_output == output)
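The NOTE and the first TODO in left_merge suggest a variant that scans both lists for columns and does not index a[0] or b[0]. The following is only a sketch of that idea, not part of the original answer: the name left_merge_scanning, the default parameter, and the sample rows are assumptions made for illustration. As in left_merge, if several rows of b share a key, the last one wins.

```python
def left_merge_scanning(a, b, key, default=0):
    """Left-merge list-of-dicts b onto a, filling missing columns with default.

    Hypothetical variant of left_merge: it scans every row of both lists
    to discover columns, so it also works when either list is empty.
    """
    columns = {}  # a dict used as an ordered set of column names
    for item in a:
        columns.update(dict.fromkeys(item))
    b_index = {}
    for item in b:
        b_index[key(item)] = item  # a later duplicate key overwrites an earlier one
        columns.update(dict.fromkeys(item))
    template = dict.fromkeys(columns, default)
    # Later mappings win: defaults first, then the row from a, then the match from b.
    return [{**template, **a_item, **b_index.get(key(a_item), {})} for a_item in a]


# Illustrative data (assumed, not from the question).
left = [{'k': 1, 'x': 10}, {'k': 2, 'x': 20, 'y': 5}]
right = [{'k': 1, 'z': 7}]
merged = left_merge_scanning(left, right, key=lambda d: d['k'])
print(merged)
```

Because an empty a simply produces an empty result, no separate bounds check is needed in this version.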
You can first assign zeros to val2 and val3 and then apply the code you already have:
l2 = {(d['key1'], d['key2']): d for d in l2}
output = [{**d, **{'val2': 0, 'val3': 0}} for d in l1] # zeros for val2 and val3
output = [{**d, **l2.get((d['key1'], d['key2']), {})} for d in output] # update
Output:
[{'key1': '2017', 'key2': '20-30', 'val1': 11, 'val2': 1000, 'val3': 0},
{'key1': '2017', 'key2': '30-40', 'val1': 22, 'val2': 0, 'val3': 0},
{'key1': '2017', 'key2': '40-50', 'val1': 33, 'val2': 0, 'val3': 2000},
{'key1': '2017', 'key2': '50+', 'val1': 44, 'val2': 0, 'val3': 0},
{'key1': '2018', 'key2': '20-30', 'val1': 55, 'val2': 0, 'val3': 0},
{'key1': '2018', 'key2': '30-40', 'val1': 66, 'val2': 0, 'val3': 0},
{'key1': '2018', 'key2': '40-50', 'val1': 77, 'val2': 0, 'val3': 0},
{'key1': '2018', 'key2': '50+', 'val1': 88, 'val2': 0, 'val3': 3000}]
By the way, for Python 3.9+, you can use the | operator instead to simplify the code:
l2 = {(d['key1'], d['key2']): d for d in l2}
output = [d | {'val2': 0, 'val3': 0} for d in l1]
output = [d | l2.get((d['key1'], d['key2']), {}) for d in output]
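The snippets above hardcode 'val2' and 'val3' as the columns to zero-fill. If those names are not known in advance, they could be collected from the right-hand list first. The sketch below assumes exactly that: left_merge_fill, its on and fill parameters, and the sample rows are hypothetical names introduced here, not part of the original answer. It uses {**...} unpacking so it also runs on Python versions before 3.9.

```python
def left_merge_fill(left, right, on, fill=0):
    """Left-merge right onto left using the key columns named in on.

    Hypothetical helper: the columns to fill are discovered from right
    instead of being hardcoded.
    """
    index = {tuple(d[k] for k in on): d for d in right}
    # Every non-key column seen anywhere in right, in first-seen order.
    extra = dict.fromkeys(k for d in right for k in d if k not in on)
    merged = []
    for d in left:
        # Only fill columns the left row lacks, so existing values are kept.
        defaults = {k: fill for k in extra if k not in d}
        match = index.get(tuple(d[k] for k in on), {})
        merged.append({**d, **defaults, **match})
    return merged


# Illustrative subset of the question's data.
left_rows = [{'key1': '2017', 'key2': '20-30', 'val1': 11},
             {'key1': '2017', 'key2': '30-40', 'val1': 22}]
right_rows = [{'key1': '2017', 'key2': '20-30', 'val2': 1000},
              {'key1': '2018', 'key2': '50+', 'val3': 3000}]
result = left_merge_fill(left_rows, right_rows, on=('key1', 'key2'))
print(result)
```

Rows of right that match no row of left (the 2018 row here) still contribute their column names, but not their values, which matches left-join behaviour.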