Itertools groupby to organize list of dictionaries by two values
I'm attempting to organize values by state of birth as well as by whether they have 0 money or not. The itertools groupby function looks like the simplest way to do so, but I'm struggling to implement it. Open to other options as well.
If I have a list of dictionaries that looks like this:
users = [
{"name": "John", "state_of_birth": "CA", "money": 0},
{"name": "Andrew", "state_of_birth": "CA", "money": 300},
{"name": "Scott", "state_of_birth": "OR", "money": 20},
{"name": "Travis", "state_of_birth": "NY", "money": 0},
{"name": "Bill", "state_of_birth": "CA", "money": 0},
{"name": "Mike", "state_of_birth": "NY", "money": 0}
]
I'm attempting to get this output:
desired_output = [
[{"name": "John", "state_of_birth": "CA", "money": 0}, {"name": "Bill", "state_of_birth": "CA", "money": 0}],
[{"name": "Andrew", "state_of_birth": "CA", "money": 300}],
[{"name": "Scott", "state_of_birth": "OR", "money": 20}],
[{"name": "Travis", "state_of_birth": "NY", "money": 0},{"name": "Mike", "state_of_birth": "NY", "money": 0}]
]
You can use itertools like this:
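import itertools

def func(x):
    # Group key: (state, whether money is nonzero)
    return (x['state_of_birth'], x['money'] != 0)

# groupby only merges adjacent items, so sort by the same key first
desired_output = [list(v) for _, v in itertools.groupby(sorted(users, key=func), func)]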
itertools.groupby() returns an iterator that yields (key, group) pairs. The key is derived from the key function that we pass to itertools.groupby(). In your case, the keys are not important, which is why they are ignored in for _, v.
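To see what those keys look like, here is a small sketch that keeps them instead of discarding them. It assumes the users list from the question; the name keyed is only for illustration.

```python
import itertools

users = [
    {"name": "John", "state_of_birth": "CA", "money": 0},
    {"name": "Andrew", "state_of_birth": "CA", "money": 300},
    {"name": "Scott", "state_of_birth": "OR", "money": 20},
    {"name": "Travis", "state_of_birth": "NY", "money": 0},
    {"name": "Bill", "state_of_birth": "CA", "money": 0},
    {"name": "Mike", "state_of_birth": "NY", "money": 0},
]

def func(x):
    # Group key: (state, whether money is nonzero)
    return (x["state_of_birth"], x["money"] != 0)

# Map each key to the names in its group (names only, to keep the output short)
keyed = {
    key: [u["name"] for u in group]
    for key, group in itertools.groupby(sorted(users, key=func), func)
}

for key, names in keyed.items():
    print(key, names)
```

Each key is a tuple such as ('CA', False), meaning "born in CA, no money".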
Output:
[{'name': 'John', 'state_of_birth': 'CA', 'money': 0}, {'name': 'Bill', 'state_of_birth': 'CA', 'money': 0}]
[{'name': 'Andrew', 'state_of_birth': 'CA', 'money': 300}]
[{'name': 'Travis', 'state_of_birth': 'NY', 'money': 0}, {'name': 'Mike', 'state_of_birth': 'NY', 'money': 0}]
[{'name': 'Scott', 'state_of_birth': 'OR', 'money': 20}]
Code:
users = [
{"name": "John", "state_of_birth": "CA", "money": 0},
{"name": "Andrew", "state_of_birth": "CA", "money": 300},
{"name": "Scott", "state_of_birth": "OR", "money": 20},
{"name": "Travis", "state_of_birth": "NY", "money": 0},
{"name": "Bill", "state_of_birth": "CA", "money": 0},
{"name": "Mike", "state_of_birth": "NY", "money": 0}
]
result = {}
for user in users:
    # Key on (state, exact money value)
    key = (user["state_of_birth"], user["money"])
    if key in result:
        result[key].append(user)
    else:
        result[key] = [user]
for _, v in result.items():
    print(v)
Result:
[{'name': 'John', 'state_of_birth': 'CA', 'money': 0}, {'name': 'Bill', 'state_of_birth': 'CA', 'money': 0}]
[{'name': 'Andrew', 'state_of_birth': 'CA', 'money': 300}]
[{'name': 'Scott', 'state_of_birth': 'OR', 'money': 20}]
[{'name': 'Travis', 'state_of_birth': 'NY', 'money': 0}, {'name': 'Mike', 'state_of_birth': 'NY', 'money': 0}]
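Note that the loop above keys on the exact money value, so two users from the same state with different nonzero amounts would land in separate groups; the sample data happens not to expose this. A minimal sketch comparing both keys, using a shortened list with a hypothetical extra user (Dana) that is not in the question:

```python
users = [
    {"name": "Andrew", "state_of_birth": "CA", "money": 300},
    {"name": "Dana", "state_of_birth": "CA", "money": 50},  # hypothetical extra record
    {"name": "John", "state_of_birth": "CA", "money": 0},
]

by_amount = {}   # keyed on (state, exact money value)
by_nonzero = {}  # keyed on (state, whether money is nonzero)
for user in users:
    state, money = user["state_of_birth"], user["money"]
    by_amount.setdefault((state, money), []).append(user["name"])
    by_nonzero.setdefault((state, money != 0), []).append(user["name"])

print(list(by_amount.values()))   # [['Andrew'], ['Dana'], ['John']]
print(list(by_nonzero.values()))  # [['Andrew', 'Dana'], ['John']]
```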
If I understand the question right, you have a structure that is List[Dict] and you want to get a List[List[Dict]], where each inner list contains the dictionaries that share the same state_of_birth and the same money > 0 boolean.
I would say the easiest solution is actually to use pandas:
import pandas as pd
users = [
{"name": "John", "state_of_birth": "CA", "money": 0},
{"name": "Andrew", "state_of_birth": "CA", "money": 300},
{"name": "Scott", "state_of_birth": "OR", "money": 20},
{"name": "Travis", "state_of_birth": "NY", "money": 0},
{"name": "Bill", "state_of_birth": "CA", "money": 0},
{"name": "Mike", "state_of_birth": "NY", "money": 0}
]
df = pd.DataFrame.from_records(users)
# we need a column to indicate if money > 0
df["money_bool"] = df["money"] > 0
# groupby gives you an iterator of Tuple[key, sub-dataframe]
# dfs now holds a list of your grouped dataframes
dfs = [tup[1] for tup in df.groupby(["state_of_birth", "money_bool"])]
# you can now drop the money_bool column if you want
dfs = [df.drop("money_bool", axis=1) for df in dfs]
desired_output = [df.to_dict("records") for df in dfs]
Depending on the context of the problem, you may be better off staying in dataframe/tabular format.
You need to make sure that the input to the groupby function is sorted. You can use the same key function as for grouping:
users = [
{"name": "John", "state_of_birth": "CA", "money": 0},
{"name": "Andrew", "state_of_birth": "CA", "money": 300},
{"name": "Scott", "state_of_birth": "OR", "money": 20},
{"name": "Travis", "state_of_birth": "NY", "money": 0},
{"name": "Bill", "state_of_birth": "CA", "money": 0},
{"name": "Mike", "state_of_birth": "NY", "money": 0}
]
from itertools import groupby

def selector(item):
    # Group key: (state, whether money is nonzero)
    return (item.get('state_of_birth'), item.get('money') != 0)

sorted_users = sorted(users, key=selector)
result = [list(group) for _, group in groupby(sorted_users, selector)]
Output:
[
[{'name': 'John', 'state_of_birth': 'CA', 'money': 0}, {'name': 'Bill', 'state_of_birth': 'CA', 'money': 0}],
[{'name': 'Andrew', 'state_of_birth': 'CA', 'money': 300}],
[{'name': 'Travis', 'state_of_birth': 'NY', 'money': 0}, {'name': 'Mike', 'state_of_birth': 'NY', 'money': 0}],
[{'name': 'Scott', 'state_of_birth': 'OR', 'money': 20}]
]
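To illustrate why the sort matters: groupby only merges consecutive items with equal keys, so on the unsorted list the CA and NY users with no money end up split across separate groups. A small sketch with the same users and selector (the names unsorted_groups and sorted_groups are just for illustration):

```python
from itertools import groupby

users = [
    {"name": "John", "state_of_birth": "CA", "money": 0},
    {"name": "Andrew", "state_of_birth": "CA", "money": 300},
    {"name": "Scott", "state_of_birth": "OR", "money": 20},
    {"name": "Travis", "state_of_birth": "NY", "money": 0},
    {"name": "Bill", "state_of_birth": "CA", "money": 0},
    {"name": "Mike", "state_of_birth": "NY", "money": 0},
]

def selector(item):
    return (item.get("state_of_birth"), item.get("money") != 0)

# Without sorting: no two neighbours share a key, so every user is its own group
unsorted_groups = [[u["name"] for u in g] for _, g in groupby(users, selector)]

# With sorting: equal keys become adjacent and are merged
sorted_groups = [
    [u["name"] for u in g]
    for _, g in groupby(sorted(users, key=selector), selector)
]

print(len(unsorted_groups))  # 6
print(len(sorted_groups))    # 4
```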
Although its name makes it seem like the way to go, itertools.groupby is not the correct function to use because it requires the data to be pre-sorted. Sorting brings your time complexity to O(n log(n)) for an algorithm that should be O(n).
To put that in perspective, if you have a million records, a loop and dict needs about a million iterations, while sorting first for groupby costs on the order of 20 million comparisons (n log2(n), with log2 of a million being roughly 20). That's a pretty significant performance penalty.
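The 20 million figure can be checked with a quick back-of-the-envelope calculation. This is only a rough estimate of comparison counts, assuming base-2 logarithms and ignoring constant factors, not a benchmark:

```python
import math

n = 1_000_000

linear_steps = n                   # one pass with a loop and dict
sort_steps = n * math.log2(n)      # rough n log2(n) estimate for sorting

print(round(math.log2(n), 2))      # 19.93
print(round(sort_steps / linear_steps, 2))
```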
If groupby were cleaner to write or didn't need an import, it might be justifiable, but it's less readable than a simpler approach using a plain loop and dictionary.
Pandas is fine, but there's really no reason to use it unless you're already doing so. It's like bringing in a space shuttle to grill a zucchini.
You can use defaultdict and a loop:
from collections import defaultdict
from pprint import pprint
users = [
{"name": "John", "state_of_birth": "CA", "money": 0},
{"name": "Andrew", "state_of_birth": "CA", "money": 300},
{"name": "Scott", "state_of_birth": "OR", "money": 20},
{"name": "Travis", "state_of_birth": "NY", "money": 0},
{"name": "Bill", "state_of_birth": "CA", "money": 0},
{"name": "Mike", "state_of_birth": "NY", "money": 0},
]
grouped = defaultdict(list)
groupby = "state_of_birth", "money"
for user in users:
    grouped[tuple(user[k] for k in groupby)].append(user)
pprint([*grouped.values()])
If you want "money is nonzero" rather than just the "money" value itself, you can use a custom grouping function:
grouped = defaultdict(list)
def group_by(x):
    return x["state_of_birth"], x["money"] != 0

for user in users:
    grouped[group_by(user)].append(user)
result = [*grouped.values()]
Or inline the logic:
grouped = defaultdict(list)
for user in users:
    grouped[user["state_of_birth"], user["money"] != 0].append(user)
result = [*grouped.values()]