Itertools groupby to organize list of dictionaries by two values
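I'm trying to organize the values by state of birth and by whether the person has 0 money. The itertools groupby function looks like the easiest way to do this, but I'm struggling to implement it. I'm open to other options as well.
If I have a list of dictionaries that looks like this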
users = [
{"name": "John", "state_of_birth": "CA", "money": 0},
{"name": "Andrew", "state_of_birth": "CA", "money": 300},
{"name": "Scott", "state_of_birth": "OR", "money": 20},
{"name": "Travis", "state_of_birth": "NY", "money": 0},
{"name": "Bill", "state_of_birth": "CA", "money": 0},
{"name": "Mike", "state_of_birth": "NY", "money": 0}
]
I'm trying to get this output
desired_output = [
[{"name": "John", "state_of_birth": "CA", "money": 0}, {"name": "Bill", "state_of_birth": "CA", "money": 0}],
[{"name": "Andrew", "state_of_birth": "CA", "money": 300}],
[{"name": "Scott", "state_of_birth": "OR", "money": 20}],
[{"name": "Travis", "state_of_birth": "NY", "money": 0},{"name": "Mike", "state_of_birth": "NY", "money": 0}]
]
You can use itertools like this:
import itertools

def func(x):
    # Group key: state of birth plus whether the user has any money
    return (x['state_of_birth'], x['money'] != 0)

desired_output = [list(v) for _, v in itertools.groupby(sorted(users, key=func), func)]
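The groupby function returns an iterator that yields pairs of a key and a value, where the value is itself an iterator over the items in that group. The key is derived from the key function we pass to itertools.groupby(). In your case the keys are not important, which is why they are ignored in for _, v.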
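To make the discarded keys visible, here is a minimal sketch that records each key next to the names in its group. The shortened `users` list and the names `key_func` and `groups` are illustrative assumptions, not part of the answer above.

```python
import itertools

# Shortened sample data (assumption: three of the users from the question)
users = [
    {"name": "John", "state_of_birth": "CA", "money": 0},
    {"name": "Andrew", "state_of_birth": "CA", "money": 300},
    {"name": "Bill", "state_of_birth": "CA", "money": 0},
]

def key_func(user):
    # Same idea as the answer's key: (state, has non-zero money)
    return (user["state_of_birth"], user["money"] != 0)

groups = {}
for key, group in itertools.groupby(sorted(users, key=key_func), key_func):
    # `group` is a one-shot iterator, so consume it right away
    groups[key] = [u["name"] for u in group]

print(groups)
# {('CA', False): ['John', 'Bill'], ('CA', True): ['Andrew']}
```

Each key is the tuple returned by the key function, so keeping it (instead of binding it to `_`) is an easy way to label the groups if you need that later.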
Output:
[{'name': 'John', 'state_of_birth': 'CA', 'money': 0}, {'name': 'Bill', 'state_of_birth': 'CA', 'money': 0}]
[{'name': 'Andrew', 'state_of_birth': 'CA', 'money': 300}]
[{'name': 'Travis', 'state_of_birth': 'NY', 'money': 0}, {'name': 'Mike', 'state_of_birth': 'NY', 'money': 0}]
[{'name': 'Scott', 'state_of_birth': 'OR', 'money': 20}]
Code:
users = [
{"name": "John", "state_of_birth": "CA", "money": 0},
{"name": "Andrew", "state_of_birth": "CA", "money": 300},
{"name": "Scott", "state_of_birth": "OR", "money": 20},
{"name": "Travis", "state_of_birth": "NY", "money": 0},
{"name": "Bill", "state_of_birth": "CA", "money": 0},
{"name": "Mike", "state_of_birth": "NY", "money": 0}
]
result = {}
for user in users:
    # Note: this key uses the exact money value, not just "zero vs. non-zero"
    key = (user["state_of_birth"], user["money"])
    if key in result:
        result[key].append(user)
    else:
        result[key] = [user]
for _, v in result.items():
    print(v)
Result:
[{'name': 'John', 'state_of_birth': 'CA', 'money': 0}, {'name': 'Bill', 'state_of_birth': 'CA', 'money': 0}]
[{'name': 'Andrew', 'state_of_birth': 'CA', 'money': 300}]
[{'name': 'Scott', 'state_of_birth': 'OR', 'money': 20}]
[{'name': 'Travis', 'state_of_birth': 'NY', 'money': 0}, {'name': 'Mike', 'state_of_birth': 'NY', 'money': 0}]
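If I understand the question correctly, you have a structure that is a List[Dict] and you want a List[List[Dict]], where each inner list contains the dictionaries that share the same state_of_birth and the same money > 0 boolean.
I would say the simplest solution is actually to use pandas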
import pandas as pd
users = [
{"name": "John", "state_of_birth": "CA", "money": 0},
{"name": "Andrew", "state_of_birth": "CA", "money": 300},
{"name": "Scott", "state_of_birth": "OR", "money": 20},
{"name": "Travis", "state_of_birth": "NY", "money": 0},
{"name": "Bill", "state_of_birth": "CA", "money": 0},
{"name": "Mike", "state_of_birth": "NY", "money": 0}
]
df = pd.DataFrame.from_records(users)
# we need a column to indicate if money > 0
df["money_bool"] = df["money"] > 0
# groupby gives you an iterator of Tuple[key, sub-dataframe]
# dfs now holds a list of your grouped dataframes
dfs = [tup[1] for tup in df.groupby(["state_of_birth", "money_bool"])]
# you can now drop the money_bool column if you want
dfs = [df.drop("money_bool", axis=1) for df in dfs]
desired_output = [df.to_dict("records") for df in dfs]
Depending on the context of your problem, you may be better off keeping the data in dataframe/table format.
You need to make sure the input to the groupby function is sorted. You can sort with the same key function that you use for grouping:
users = [
{"name": "John", "state_of_birth": "CA", "money": 0},
{"name": "Andrew", "state_of_birth": "CA", "money": 300},
{"name": "Scott", "state_of_birth": "OR", "money": 20},
{"name": "Travis", "state_of_birth": "NY", "money": 0},
{"name": "Bill", "state_of_birth": "CA", "money": 0},
{"name": "Mike", "state_of_birth": "NY", "money": 0}
]
from itertools import groupby

def selector(item):
    # Key: (state of birth, whether money is non-zero)
    return (item.get('state_of_birth'), item.get('money') != 0)

sorted_users = sorted(users, key=selector)
result = [list(group) for _, group in groupby(sorted_users, selector)]
Output:
[
[{'name': 'John', 'state_of_birth': 'CA', 'money': 0}, {'name': 'Bill', 'state_of_birth': 'CA', 'money': 0}],
[{'name': 'Andrew', 'state_of_birth': 'CA', 'money': 300}],
[{'name': 'Travis', 'state_of_birth': 'NY', 'money': 0}, {'name': 'Mike', 'state_of_birth': 'NY', 'money': 0}],
[{'name': 'Scott', 'state_of_birth': 'OR', 'money': 20}]
]
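To illustrate why the sort matters, the following sketch runs groupby on the same data with and without sorting. groupby only merges consecutive items with equal keys, so unsorted input produces fragmented groups. The variable names `unsorted_groups` and `sorted_groups` are illustrative assumptions.

```python
from itertools import groupby

users = [
    {"name": "John", "state_of_birth": "CA", "money": 0},
    {"name": "Andrew", "state_of_birth": "CA", "money": 300},
    {"name": "Scott", "state_of_birth": "OR", "money": 20},
    {"name": "Travis", "state_of_birth": "NY", "money": 0},
    {"name": "Bill", "state_of_birth": "CA", "money": 0},
    {"name": "Mike", "state_of_birth": "NY", "money": 0},
]

def selector(item):
    return (item["state_of_birth"], item["money"] != 0)

# Without sorting: no two neighbours share a key, so every user is alone
unsorted_groups = [[u["name"] for u in g] for _, g in groupby(users, selector)]

# With sorting: equal keys become adjacent and are merged into one group
sorted_groups = [
    [u["name"] for u in g]
    for _, g in groupby(sorted(users, key=selector), selector)
]

print(unsorted_groups)
# [['John'], ['Andrew'], ['Scott'], ['Travis'], ['Bill'], ['Mike']]
print(sorted_groups)
# [['John', 'Bill'], ['Andrew'], ['Travis', 'Mike'], ['Scott']]
```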
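Although its name makes it look like the way to go, itertools.groupby is not the right function to use here, because it requires the data to be pre-sorted. For an algorithm that should be O(n), sorting raises your time complexity to O(n log(n)).
To put that in perspective: if you have a million records, then by using groupby with a sort rather than a loop and a dictionary, you go from about a million iterations to roughly 20 million. That's a substantial performance penalty.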
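The figure of 20 million can be checked with a quick back-of-the-envelope calculation. This sketch assumes a base-2 logarithm and ignores constant factors, so it is only a rough estimate of the work involved, not a benchmark.

```python
import math

n = 1_000_000

# One pass with a loop and a dictionary: about n steps
linear_steps = n

# Comparison sort: on the order of n * log2(n) steps (constants ignored)
sort_steps = n * math.log2(n)

ratio = sort_steps / linear_steps
print(f"linear: {linear_steps:,}  sort: {sort_steps:,.0f}  ratio: {ratio:.1f}x")
```

In practice Python's built-in sort is heavily optimized, so the measured gap on real data may be smaller than this ratio suggests; timing both approaches on your own data is the only reliable check.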
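If groupby were cleaner to write or saved an import, it might be justifiable, but it is less readable than the simpler approach of a plain loop and a dictionary.
Pandas is fine, but unless you are already using it, there is really no reason to bring it in. That would be like bringing a space shuttle to grill zucchini.
You can use defaultdict and a loop: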
from collections import defaultdict
from pprint import pprint
users = [
{"name": "John", "state_of_birth": "CA", "money": 0},
{"name": "Andrew", "state_of_birth": "CA", "money": 300},
{"name": "Scott", "state_of_birth": "OR", "money": 20},
{"name": "Travis", "state_of_birth": "NY", "money": 0},
{"name": "Bill", "state_of_birth": "CA", "money": 0},
{"name": "Mike", "state_of_birth": "NY", "money": 0},
]
grouped = defaultdict(list)
groupby = "state_of_birth", "money"
for user in users:
    # Build the key from the fields listed in `groupby`
    grouped[tuple(user[k] for k in groupby)].append(user)
pprint([*grouped.values()])
If you want "money is not zero" rather than just the "money" value itself, you can use a custom grouping function:
grouped = defaultdict(list)
def group_by(x):
    return x["state_of_birth"], x["money"] != 0
for user in users:
    grouped[group_by(user)].append(user)
result = [*grouped.values()]
Or inline the logic:
grouped = defaultdict(list)
for user in users:
    grouped[user["state_of_birth"], user["money"] != 0].append(user)
result = [*grouped.values()]
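If you need this in more than one place, the same loop can be wrapped in a small helper. This is only a sketch: the function name `group_by_key` and its signature are hypothetical and do not come from any library.

```python
from collections import defaultdict

def group_by_key(items, key):
    """Group items by key(item) in one pass, keeping first-seen group order."""
    grouped = defaultdict(list)
    for item in items:
        grouped[key(item)].append(item)
    return list(grouped.values())

users = [
    {"name": "John", "state_of_birth": "CA", "money": 0},
    {"name": "Andrew", "state_of_birth": "CA", "money": 300},
    {"name": "Scott", "state_of_birth": "OR", "money": 20},
    {"name": "Travis", "state_of_birth": "NY", "money": 0},
    {"name": "Bill", "state_of_birth": "CA", "money": 0},
    {"name": "Mike", "state_of_birth": "NY", "money": 0},
]

result = group_by_key(users, lambda u: (u["state_of_birth"], u["money"] != 0))
names = [[u["name"] for u in group] for group in result]
print(names)
# [['John', 'Bill'], ['Andrew'], ['Scott'], ['Travis', 'Mike']]
```

Because dictionaries preserve insertion order, the groups come out in the order their keys first appear, which matches the ordering of the desired output in the question without any sorting.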