
Itertools groupby to organize list of dictionaries by two values

I'm attempting to organize users by state of birth and by whether or not they have 0 money. The itertools groupby function looks like the simplest way to do so, but I'm struggling to implement it. I'm open to other options as well.

If I have a list of dictionaries that looks like this:

users = [
            {"name": "John", "state_of_birth": "CA", "money": 0},
            {"name": "Andrew", "state_of_birth": "CA", "money": 300},
            {"name": "Scott", "state_of_birth": "OR", "money": 20},
            {"name": "Travis", "state_of_birth": "NY", "money": 0},
            {"name": "Bill", "state_of_birth": "CA", "money": 0},
            {"name": "Mike", "state_of_birth": "NY", "money": 0}
        ]

I'm attempting to get this output:

desired_output = [
            [{"name": "John", "state_of_birth": "CA", "money": 0}, {"name": "Bill", "state_of_birth": "CA", "money": 0}],
            [{"name": "Andrew", "state_of_birth": "CA", "money": 300}],
            [{"name": "Scott", "state_of_birth": "OR", "money": 20}],
            [{"name": "Travis", "state_of_birth": "NY", "money": 0},{"name": "Mike", "state_of_birth": "NY", "money": 0}]
            ]

You can use itertools like this:

import itertools

def func(x):
    return tuple([x['state_of_birth'], x['money'] != 0])

desired_output = list(list(v) for _,v in itertools.groupby(sorted(users, key=func), func))

itertools.groupby returns an iterator that yields (key, group) pairs, where each group is itself an iterator over the items that share that key. The key is computed by the key function passed to itertools.groupby(). In your case the keys are not important, which is why they are ignored in for _, v.
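To make the key/group pairing concrete, here is a minimal sketch that keeps the keys instead of discarding them. The helper name birth_and_has_money and the shortened users list are illustrative assumptions, not part of the answer above.

```python
import itertools

# Shortened sample data (an assumption for illustration)
users = [
    {"name": "John", "state_of_birth": "CA", "money": 0},
    {"name": "Andrew", "state_of_birth": "CA", "money": 300},
    {"name": "Bill", "state_of_birth": "CA", "money": 0},
]

def birth_and_has_money(user):
    # Hypothetical key function: (state, True if money is nonzero)
    return user["state_of_birth"], user["money"] != 0

# groupby only merges adjacent items, so sort by the same key first
ordered = sorted(users, key=birth_and_has_money)

# Keep each key and collect the names found in its group
names_by_key = {
    key: [user["name"] for user in group]
    for key, group in itertools.groupby(ordered, key=birth_and_has_money)
}
print(names_by_key)
# {('CA', False): ['John', 'Bill'], ('CA', True): ['Andrew']}
```

Each group is consumed inside the comprehension because a group iterator is no longer usable once groupby advances to the next key.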

Output:

[{'name': 'John', 'state_of_birth': 'CA', 'money': 0}, {'name': 'Bill', 'state_of_birth': 'CA', 'money': 0}]
[{'name': 'Andrew', 'state_of_birth': 'CA', 'money': 300}]
[{'name': 'Travis', 'state_of_birth': 'NY', 'money': 0}, {'name': 'Mike', 'state_of_birth': 'NY', 'money': 0}]
[{'name': 'Scott', 'state_of_birth': 'OR', 'money': 20}]

Code:

users = [
            {"name": "John", "state_of_birth": "CA", "money": 0},
            {"name": "Andrew", "state_of_birth": "CA", "money": 300},
            {"name": "Scott", "state_of_birth": "OR", "money": 20},
            {"name": "Travis", "state_of_birth": "NY", "money": 0},
            {"name": "Bill", "state_of_birth": "CA", "money": 0},
            {"name": "Mike", "state_of_birth": "NY", "money": 0}
        ]

result = {}
for user in users:
    # Group by state and by whether money is nonzero, as the question asks
    key = (user["state_of_birth"], user["money"] != 0)
    if key in result:
        result[key].append(user)
    else:
        result[key] = [user]
for group in result.values():
    print(group)

Result:

[{'name': 'John', 'state_of_birth': 'CA', 'money': 0}, {'name': 'Bill', 'state_of_birth': 'CA', 'money': 0}]
[{'name': 'Andrew', 'state_of_birth': 'CA', 'money': 300}]
[{'name': 'Scott', 'state_of_birth': 'OR', 'money': 20}]
[{'name': 'Travis', 'state_of_birth': 'NY', 'money': 0}, {'name': 'Mike', 'state_of_birth': 'NY', 'money': 0}]

If I understand the question correctly, you have a structure that is a List[Dict] and you want a List[List[Dict]], where each inner list contains the dictionaries that share the same state_of_birth and the same money > 0 boolean.

I would say the easiest solution is actually to use pandas:

import pandas as pd

users = [
            {"name": "John", "state_of_birth": "CA", "money": 0},
            {"name": "Andrew", "state_of_birth": "CA", "money": 300},
            {"name": "Scott", "state_of_birth": "OR", "money": 20},
            {"name": "Travis", "state_of_birth": "NY", "money": 0},
            {"name": "Bill", "state_of_birth": "CA", "money": 0},
            {"name": "Mike", "state_of_birth": "NY", "money": 0}
        ]

df = pd.DataFrame.from_records(users)

# we need a column to indicate if money > 0
df["money_bool"] = df["money"] > 0

# groupby gives you an iterator of Tuple[key, sub-dataframe]
# dfs now holds a list of your grouped dataframes
dfs = [tup[1] for tup in df.groupby(["state_of_birth", "money_bool"])]

# you can now drop the money_bool column if you want
dfs = [df.drop("money_bool", axis=1) for df in dfs]

desired_output = [df.to_dict("records") for df in dfs]

Depending on the context of the problem, you may be better off staying in dataframe/tabular format.

You need to make sure that the input to the groupby function is sorted. You can use the same key function for sorting as for grouping:

users = [
            {"name": "John", "state_of_birth": "CA", "money": 0},
            {"name": "Andrew", "state_of_birth": "CA", "money": 300},
            {"name": "Scott", "state_of_birth": "OR", "money": 20},
            {"name": "Travis", "state_of_birth": "NY", "money": 0},
            {"name": "Bill", "state_of_birth": "CA", "money": 0},
            {"name": "Mike", "state_of_birth": "NY", "money": 0}
        ]

from itertools import groupby

def selector(item): return (item.get('state_of_birth'), item.get('money') != 0)
sorted_users = sorted(users, key=selector)  # groupby only merges adjacent items
result = [list(group) for _, group in groupby(sorted_users, selector)]

Output:

[
    [{'name': 'John', 'state_of_birth': 'CA', 'money': 0}, {'name': 'Bill', 'state_of_birth': 'CA', 'money': 0}],
    [{'name': 'Andrew', 'state_of_birth': 'CA', 'money': 300}], 
    [{'name': 'Travis', 'state_of_birth': 'NY', 'money': 0}, {'name': 'Mike', 'state_of_birth': 'NY', 'money': 0}],
    [{'name': 'Scott', 'state_of_birth': 'OR', 'money': 20}]
]
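To see why the sort matters, the following sketch runs groupby on the same data with and without sorting. It reuses the selector idea from above; the variable names unsorted_groups and sorted_groups are only illustrative.

```python
from itertools import groupby

users = [
    {"name": "John", "state_of_birth": "CA", "money": 0},
    {"name": "Andrew", "state_of_birth": "CA", "money": 300},
    {"name": "Scott", "state_of_birth": "OR", "money": 20},
    {"name": "Travis", "state_of_birth": "NY", "money": 0},
    {"name": "Bill", "state_of_birth": "CA", "money": 0},
    {"name": "Mike", "state_of_birth": "NY", "money": 0},
]

def selector(item):
    return item["state_of_birth"], item["money"] != 0

# Without sorting: groupby starts a new group every time the key changes,
# so equal keys that are not adjacent end up in separate groups.
unsorted_groups = [
    [user["name"] for user in group] for _, group in groupby(users, selector)
]

# With sorting: equal keys are adjacent, so each key yields exactly one group.
sorted_groups = [
    [user["name"] for user in group]
    for _, group in groupby(sorted(users, key=selector), selector)
]

print(unsorted_groups)
# [['John'], ['Andrew'], ['Scott'], ['Travis'], ['Bill'], ['Mike']]
print(sorted_groups)
# [['John', 'Bill'], ['Andrew'], ['Travis', 'Mike'], ['Scott']]
```

In the unsorted run no two neighbouring users share a key, so every user lands in a group of its own.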

Although its name suggests it should be the way to go, itertools.groupby is not the right function to use here because it requires the data to be pre-sorted. Sorting brings your time complexity to O(n log(n)) for an algorithm that should be O(n).

To put that in perspective, if you have a million records, grouping with a loop and dict takes about a million iterations, whereas sorting first for groupby takes on the order of 20 million comparisons (log2 of a million is about 20). That's a pretty significant performance penalty.
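As a rough illustration of that gap, the sketch below counts the comparisons made while sorting random integers and compares them with the number of steps in a single-pass dict grouping. The input size, the random data, and the v % 10 grouping key are arbitrary assumptions, and comparison counts are only a proxy: they say nothing definite about wall-clock time, since Python's built-in sort runs in optimized C.

```python
import random
from collections import defaultdict
from functools import cmp_to_key

random.seed(0)
n = 10_000  # arbitrary sample size (assumption)
values = [random.randrange(1_000_000) for _ in range(n)]

counter = {"comparisons": 0}

def counting_cmp(a, b):
    # Count every comparison the sort performs
    counter["comparisons"] += 1
    return (a > b) - (a < b)

sorted(values, key=cmp_to_key(counting_cmp))

# Single-pass grouping: exactly one dict operation per item
buckets = defaultdict(list)
dict_steps = 0
for v in values:
    buckets[v % 10].append(v)
    dict_steps += 1

print("sort comparisons:", counter["comparisons"])
print("dict grouping steps:", dict_steps)
```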

If groupby were cleaner to write or didn't need an import, it might be justifiable, but it's less readable than a simpler approach using a plain loop and dictionary.

Pandas is fine, but there's really no reason to use it unless you're already doing so. It's like bringing in a space shuttle to grill a zucchini.

You can use defaultdict and a loop:

from collections import defaultdict
from pprint import pprint

users = [
    {"name": "John", "state_of_birth": "CA", "money": 0},
    {"name": "Andrew", "state_of_birth": "CA", "money": 300},
    {"name": "Scott", "state_of_birth": "OR", "money": 20},
    {"name": "Travis", "state_of_birth": "NY", "money": 0},
    {"name": "Bill", "state_of_birth": "CA", "money": 0},
    {"name": "Mike", "state_of_birth": "NY", "money": 0},
]

grouped = defaultdict(list)
groupby = "state_of_birth", "money"

for user in users:
    grouped[tuple([user[k] for k in groupby])].append(user)

pprint([*grouped.values()])

If you want "money is nonzero" rather than just the "money" value itself, you can use a custom grouping function:

grouped = defaultdict(list)

def group_by(x):
    return x["state_of_birth"], x["money"] != 0

for user in users:
    grouped[group_by(user)].append(user)

result = [*grouped.values()]

Or inline the logic:

grouped = defaultdict(list)

for user in users:
    grouped[user["state_of_birth"], user["money"] != 0].append(user)

result = [*grouped.values()]
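If the same pattern is needed in several places, it can be wrapped in a small helper. This is only a sketch: group_items is a hypothetical name, not a standard-library function. Because dicts preserve insertion order (guaranteed since Python 3.7), the groups come out in order of first appearance, which matches the order in the question's desired output.

```python
from collections import defaultdict

def group_items(items, key):
    """Hypothetical helper: group items by key(item) in a single pass."""
    grouped = defaultdict(list)
    for item in items:
        grouped[key(item)].append(item)
    return list(grouped.values())

users = [
    {"name": "John", "state_of_birth": "CA", "money": 0},
    {"name": "Andrew", "state_of_birth": "CA", "money": 300},
    {"name": "Scott", "state_of_birth": "OR", "money": 20},
    {"name": "Travis", "state_of_birth": "NY", "money": 0},
    {"name": "Bill", "state_of_birth": "CA", "money": 0},
    {"name": "Mike", "state_of_birth": "NY", "money": 0},
]

result = group_items(users, key=lambda u: (u["state_of_birth"], u["money"] != 0))

# Show just the names to make the group order easy to read
names = [[user["name"] for user in group] for group in result]
print(names)
# [['John', 'Bill'], ['Andrew'], ['Scott'], ['Travis', 'Mike']]
```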

