
Pack a list of dicts in Python

I have a list of dicts structured like this:

[
    {'state': '1', 'city': 'a'},
    {'state': '1', 'city': 'b'},
    {'state': '2', 'city': 'c'},
    {'state': '2', 'city': 'd'},
    {'state': '3', 'city': 'e'}
]

And I want to pack it this way:

[
    {'state': '1', 'cities': ['a', 'b']},
    {'state': '2', 'cities': ['c', 'd']},
    {'state': '3', 'cities': ['e']}
]

I have a two-step approach that works but is very slow (my list has more than 10,000 items and my dicts are complex):

def pack(iterable):

    # step 1: lists -> super slow ! contains duplicates
    listed = [{'state': i['state'],
              'cities': [c['city'] for c in iterable if c['state']==i['state']]}
              for i in iterable]

    # step 2: remove duplicates
    packed = [l for n, l in enumerate(listed) if not l in listed[n+1:]]

    return packed

Any advice on optimizing it?

PS: suggestions for the title of the thread are welcome.

Edit of 2014/09/26: I just discovered pandas, a non-standard library that is helpful in this case.

More examples in my self-answer below.

state_merged = {}
for s in states:
    state_merged.setdefault(s['state'], []).append(s['city'])

states = [{'state':k, 'cities':v} for k, v in state_merged.iteritems()]

If you are using Python 3, use state_merged.items() instead of state_merged.iteritems().

The following does not require a pre-sorted iterable and runs in O(n) time. However, it assumes an asymmetry between state and the other dictionary keys (which, given your example, seems to be a correct assumption).

import collections
def pack(iterable):
    out = collections.defaultdict(list)  # or defaultdict(set), with .add() instead of .append(), to drop duplicates
    for d in iterable:
        out[d['state']].append(d['city'])
    return out

it = [
    {'state': '1', 'city': 'a'},
    {'state': '1', 'city': 'b'},
    {'state': '2', 'city': 'c'},
    {'state': '2', 'city': 'd'},
    {'state': '3', 'city': 'e'}
]

pack(it) == {'1': ['a', 'b'],
             '2': ['c', 'd'],
             '3': ['e']}

If you need to return an iterable in the same format as requested, you could convert out into a list.

def convert(out):
    final = []
    for state, cities in out.items():  # on Python 2, .iteritems() avoids building a list
        final.append({'state': state, 'cities': cities})
    return final

convert(pack(it)) == [
    {'state': '1', 'cities': ['a', 'b']},
    {'state': '2', 'cities': ['c', 'd']},
    {'state': '3', 'cities': ['e']}
]

If you have more than just 2 keys in your input, you would need to make the following changes:

it = [{'state': 'WA', 'city': 'Seattle', 'zipcode': 98101, 'city_population': 9426},
      {'state': 'OR', 'city': 'Portland', 'zipcode': 97225, 'city_population': 24749},
      {'state': 'WA', 'city': 'Spokane', 'zipcode': 99201, 'city_population': 12523}]


import collections

def citydata():
    # factory for the per-state accumulator
    return {'city': [], 'zipcode': [], 'state_population': 0} #or use a namedtuple('Location', 'city zipcode state_population')

def pack(iterable):
    out = collections.defaultdict(citydata)
    for d in iterable:
        out[d['state']]['city'].append(d['city'])
        out[d['state']]['zipcode'].append(d['zipcode'])
        out[d['state']]['state_population'] += d['city_population']
    return out

pack(it) == {
   'WA':
       {'city': ['Seattle', 'Spokane'], 'zipcode': [98101, 99201], 'state_population': 21949},
   'OR':
       {'city': ['Portland'], 'zipcode': [97225], 'state_population': 24749}
}

The convert function would need to be adjusted accordingly.

convert(pack(it)) == [
       {'state': 'WA', 'city': ['Seattle', 'Spokane'], 'zipcode': [98101, 99201], 'state_population': 21949},
       {'state': 'OR', 'city': ['Portland'], 'zipcode': [97225], 'state_population': 24749}
]
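
The adjusted convert is not shown above. One possible version is sketched below; it is an assumption about what the adjustment could look like, not the answerer's own code. It copies every accumulated field into the output row next to the state key, and repeats citydata and pack so that the snippet runs on its own.

```python
import collections


def citydata():
    # Per-state accumulator, same shape as in the answer above.
    return {'city': [], 'zipcode': [], 'state_population': 0}


def pack(iterable):
    out = collections.defaultdict(citydata)
    for d in iterable:
        entry = out[d['state']]
        entry['city'].append(d['city'])
        entry['zipcode'].append(d['zipcode'])
        entry['state_population'] += d['city_population']
    return out


def convert(out):
    # Hypothetical adjustment: flatten each state's accumulator into one row.
    final = []
    for state, data in out.items():
        row = {'state': state}
        row.update(data)  # copies 'city', 'zipcode' and 'state_population'
        final.append(row)
    return final


it = [{'state': 'WA', 'city': 'Seattle', 'zipcode': 98101, 'city_population': 9426},
      {'state': 'OR', 'city': 'Portland', 'zipcode': 97225, 'city_population': 24749},
      {'state': 'WA', 'city': 'Spokane', 'zipcode': 99201, 'city_population': 12523}]

result = convert(pack(it))
```

Because row.update(data) copies whatever keys citydata defines, adding another accumulated field would not require touching convert again.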

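To maintain the order of the original iterable, use an ordered defaultdict (for example, an OrderedDefaultdict class built on collections.OrderedDict; the standard library does not ship one) instead of a defaultdict. On Python 3.7 and later, defaultdict already preserves insertion order because it is a dict subclass.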
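
If writing a custom class feels like too much, a rough alternative is collections.OrderedDict combined with setdefault, which keeps states in the order they first appear. This is a minimal sketch under that assumption; pack_ordered is a name made up for this illustration.

```python
from collections import OrderedDict


def pack_ordered(iterable):
    # OrderedDict remembers the order in which each state is first seen.
    grouped = OrderedDict()
    for d in iterable:
        grouped.setdefault(d['state'], []).append(d['city'])
    return [{'state': state, 'cities': cities}
            for state, cities in grouped.items()]


records = [
    {'state': '2', 'city': 'c'},
    {'state': '1', 'city': 'a'},
    {'state': '2', 'city': 'd'},
    {'state': '1', 'city': 'b'},
]
packed = pack_ordered(records)
```

Here state '2' comes first in the result because it is the first state encountered, even though its rows are interleaved with those of state '1'.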

Here's a more functional approach that's a lot faster:

import itertools
def pack(original):
    return [
        {'state': state, 'cities': [element['city'] for element in group]} 
        for state, group 
        in itertools.groupby(original, lambda e: e['state'])
    ]

This assumes that each state has all its members listed consecutively in the original list.

The reason your current approach is so much slower is that it has to iterate over the entire list for every state id found. That is known as an O(n^2) approach. The groupby approach needs to iterate over the source list only once, so it is O(n).
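
When the consecutive-members assumption of the groupby answer does not hold, one common workaround is to sort by state first. The sketch below assumes that is acceptable for your data; pack_sorted is a hypothetical name, and the sort raises the cost from O(n) to O(n log n).

```python
import itertools
from operator import itemgetter


def pack_sorted(original):
    by_state = itemgetter('state')
    # sorted() is stable, so cities keep their relative order within a state.
    ordered = sorted(original, key=by_state)
    return [
        {'state': state, 'cities': [e['city'] for e in group]}
        for state, group in itertools.groupby(ordered, key=by_state)
    ]


shuffled = [
    {'state': '2', 'city': 'c'},
    {'state': '1', 'city': 'a'},
    {'state': '2', 'city': 'd'},
    {'state': '1', 'city': 'b'},
]
packed = pack_sorted(shuffled)
```

Without the sort, groupby would emit a separate group each time the state changes, so the interleaved input above would yield four groups instead of two.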

I just discovered the pandas library (which is non-standard) after some trouble installing it on my Windows Python 2.6.5 (exe here: http://www.lfd.uci.edu/~gohlke/pythonlibs/#pandas ).

Website: http://pandas.pydata.org/pandas-docs/stable/

General presentation:

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.

Pandas will be familiar to those already using numpy and R.

Here is how to solve my problem with pandas:

>>> import pandas as pd

>>> raw = [{'state': '1', 'city': 'a'},
           {'state': '1', 'city': 'b'},
           {'state': '2', 'city': 'c'},
           {'state': '2', 'city': 'd'},
           {'state': '3', 'city': 'e'}]

>>> df = pd.DataFrame(raw) # magic !

>>> df
  city state
0    a     1
1    b     1
2    c     2
3    d     2
4    e     3

>>> grouped = df.groupby('state')['city']
>>> grouped
<pandas.core.groupby.SeriesGroupBy object at 0x05F22110>

>>> listed = grouped.apply(list)
>>> listed
state
1        [a, b]
2        [c, d]
3           [e]
Name: city, dtype: object

>>> listed.to_dict() # magic again !
{'1': ['a', 'b'], '3': ['e'], '2': ['c', 'd']}

More complex examples, including grouped.apply(custom_fct), here:

Pandas groupby: How to get a union of strings
