Packing a list of dictionaries in Python

Question

I have a list of dictionaries structured like this:

[
    {'state': '1', 'city': 'a'},
    {'state': '1', 'city': 'b'},
    {'state': '2', 'city': 'c'},
    {'state': '2', 'city': 'd'},
    {'state': '3', 'city': 'e'}
]

I would like to pack it this way:

[
    {'state': '1', 'cities': ['a', 'b']},
    {'state': '2', 'cities': ['c', 'd']},
    {'state': '3', 'cities': ['e']}
]

I have a two-step approach that works, but it is very slow (my list has more than 10,000 items and my dictionaries are complex):

def pack(iterable):

    # step 1: lists -> super slow ! contains duplicates
    listed = [{'state': i['state'],
              'cities': [c['city'] for c in iterable if c['state']==i['state']]}
              for i in iterable]

    # step 2: remove duplicates
    packed = [l for n, l in enumerate(listed) if not l in listed[n+1:]]

    return packed

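Any suggestions for optimizing this?

P.S. Suggestions for a better thread title are welcome.

Edit 2014/09/26: I just discovered the non-standard pandas library, which is useful in this case.

More examples are in my self-answer below.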

Answer 1

state_merged = {}
for s in states:
    state_merged.setdefault(s['state'], []).append(s['city'])

states = [{'state':k, 'cities':v} for k, v in state_merged.iteritems()]

If you are using Python 3, use state_merged.items() instead of state_merged.iteritems().
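For reference, here is a minimal Python 3 sketch of the same setdefault idea, wrapped in a function. The name pack_setdefault and the sample data are illustrative assumptions, not part of the answer above. The output order matches first appearance only on Python 3.7+, where dicts keep insertion order.

```python
def pack_setdefault(states):
    """Group cities by state in a single pass using dict.setdefault."""
    merged = {}
    for s in states:
        # setdefault returns the list already stored for this state,
        # or stores a new empty list and returns it
        merged.setdefault(s['state'], []).append(s['city'])
    # items() replaces the Python 2 only iteritems()
    return [{'state': k, 'cities': v} for k, v in merged.items()]


# hypothetical sample input, copied from the question
sample = [
    {'state': '1', 'city': 'a'},
    {'state': '1', 'city': 'b'},
    {'state': '2', 'city': 'c'},
    {'state': '2', 'city': 'd'},
    {'state': '3', 'city': 'e'},
]
packed = pack_setdefault(sample)
```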

Answer 2

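The following code does not require a pre-sorted iterable and runs in O(n) time. It does, however, assume an asymmetry between the state key and the other dictionary keys (which, given your example, seems to be a correct assumption).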

import collections
def pack(iterable):
    out = collections.defaultdict(list) #or use defaultdict(set)
    for d in iterable:
        out[d['state']].append(d['city'])
    return out

it = [
    {'state': '1', 'city': 'a'},
    {'state': '1', 'city': 'b'},
    {'state': '2', 'city': 'c'},
    {'state': '2', 'city': 'd'},
    {'state': '3', 'city': 'e'}
]

pack(it) == {'1': ['a', 'b'],
             '2': ['c', 'd'],
             '3': ['e']}

If the result must be returned as an iterable in the same format as requested, you can convert out into a list.

def convert(out):
    final = []
    for state, city in out.iteritems(): #Python 3.0+ use .items()
        final.append({'state': state, 'city': city})
    return final

convert(pack(it)) == [
    {'state': '1', 'city': ['a', 'b']},
    {'state': '2', 'city': ['c', 'd']},
    {'state': '3', 'city': ['e']}
]

If your input has more than just two keys, the following changes are needed:

it = [{'state': 'WA', 'city': 'Seattle', 'zipcode': 98101, 'city_population': 9426},
      {'state': 'OR', 'city': 'Portland', 'zipcode': 97225, 'city_population': 24749},
      {'state': 'WA', 'city': 'Spokane', 'zipcode': 99201, 'city_population': 12523}]


def citydata():
    return {'city': [], 'zipcode': [], 'state_population': 0} #or use a namedtuple('Location', 'city zipcode state_population')

def pack(iterable):
    out = collections.defaultdict(citydata)  # only `import collections` is in scope above
    for d in iterable:
        out[d['state']]['city'].append(d['city'])
        out[d['state']]['zipcode'].append(d['zipcode'])
        out[d['state']]['state_population'] += d['city_population']
    return out

pack(it) == {
   'WA':
       {'city': ['Seattle', 'Spokane'], 'zipcode': [98101, 99201], 'state_population': 21949},
   'OR':
       {'city': ['Portland'], 'zipcode': [97225], 'state_population': 24749}
}

The convert function needs to be adjusted accordingly.

convert(pack(it)) == [
       {'state': 'WA', 'city': ['Seattle', 'Spokane'], 'zipcode': [98101, 99201], 'state_population': 21949},
       {'state': 'OR', 'city': ['Portland'], 'zipcode': [97225], 'state_population': 24749}
]

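To preserve the order of the original iterable, use an OrderedDefaultdict instead of defaultdict (OrderedDefaultdict is not part of the standard library; it has to be written or taken from a recipe).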
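On Python 3.7 and later, plain dicts keep insertion order, and collections.defaultdict is a dict subclass, so a separate ordered variant may not be needed there. A small sketch under that assumption; pack_ordered and the sample input are hypothetical names used only for illustration.

```python
import collections


def pack_ordered(iterable):
    """Group cities by state, keeping states in order of first appearance.

    Relies on dict insertion order, which is guaranteed from Python 3.7 on.
    """
    out = collections.defaultdict(list)
    for d in iterable:
        out[d['state']].append(d['city'])
    return [{'state': state, 'cities': cities} for state, cities in out.items()]


# hypothetical input where the states are not listed consecutively
unsorted_input = [
    {'state': 'WA', 'city': 'Seattle'},
    {'state': 'OR', 'city': 'Portland'},
    {'state': 'WA', 'city': 'Spokane'},
]
result = pack_ordered(unsorted_input)
```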

Answer 3

Here is a more functional approach that is much faster:

import itertools
def pack(original):
    return [
        {'state': state, 'cities': [element['city'] for element in group]} 
        for state, group 
        in itertools.groupby(original, lambda e: e['state'])
    ]

This assumes that each state has all of its members listed consecutively in the original list.
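If that assumption does not hold, one option is to sort by state before grouping, at the cost of O(n log n) for the sort. A sketch of that variant; pack_sorted and the sample data are hypothetical and not part of the answer above.

```python
import itertools
from operator import itemgetter


def pack_sorted(original):
    """Sort by state, then group, so scattered states are merged correctly."""
    by_state = itemgetter('state')
    # sorted() is stable, so cities keep their relative order within a state
    ordered = sorted(original, key=by_state)
    return [
        {'state': state, 'cities': [element['city'] for element in group]}
        for state, group in itertools.groupby(ordered, key=by_state)
    ]


# hypothetical input where members of a state are not adjacent
scattered = [
    {'state': '2', 'city': 'c'},
    {'state': '1', 'city': 'a'},
    {'state': '2', 'city': 'd'},
    {'state': '1', 'city': 'b'},
]
packed_scattered = pack_sorted(scattered)
```

Note that the result is ordered by state key rather than by first appearance in the input.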

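The reason your current approach is so much slower is that it has to traverse the entire list for every state ID it finds. This is known as an O(n^2) approach. The groupby approach only needs to iterate over the source list once, so it is O(n).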
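To see the difference on your own machine, a rough timing harness could look like the sketch below. It restates the question's two-step function and the groupby version side by side; make_data, compare and the chosen sizes are hypothetical helpers, and no timings are claimed here.

```python
import itertools
import timeit


def pack_quadratic(iterable):
    # the question's approach: rescans the whole list for every item
    listed = [{'state': i['state'],
               'cities': [c['city'] for c in iterable if c['state'] == i['state']]}
              for i in iterable]
    return [l for n, l in enumerate(listed) if l not in listed[n + 1:]]


def pack_groupby(original):
    # single pass, assuming each state's items are consecutive
    return [
        {'state': state, 'cities': [e['city'] for e in group]}
        for state, group in itertools.groupby(original, lambda e: e['state'])
    ]


def make_data(n_states, cities_per_state):
    # synthetic input with each state's cities listed consecutively
    return [{'state': str(s), 'city': '%d-%d' % (s, c)}
            for s in range(n_states) for c in range(cities_per_state)]


def compare(n_states=100, cities_per_state=5, repeat=3):
    # returns (seconds for quadratic version, seconds for groupby version)
    data = make_data(n_states, cities_per_state)
    slow = timeit.timeit(lambda: pack_quadratic(data), number=repeat)
    fast = timeit.timeit(lambda: pack_groupby(data), number=repeat)
    return slow, fast


# sanity check that both versions agree on a small input
small = make_data(3, 2)
same_result = pack_quadratic(small) == pack_groupby(small)
```

Calling compare() with growing n_states should show the first number rising much faster than the second.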

Answer 4

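I discovered the pandas library (which is non-standard) after some trouble installing it on Windows Python 2.6.5 (exe here: http://www.lfd.uci.edu/~gohlke/pythonlibs/#pandas ).

Website: http://pandas.pydata.org/pandas-docs/stable/

General presentation:

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.

pandas will feel familiar to users who already know numpy and R.

Here is the solution to my problem with pandas: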

>>> import pandas as pd

>>> raw = [{'state': '1', 'city': 'a'},
           {'state': '1', 'city': 'b'},
           {'state': '2', 'city': 'c'},
           {'state': '2', 'city': 'd'},
           {'state': '3', 'city': 'e'}]

>>> df = pd.DataFrame(raw) # magic !

>>> df
  city state
0    a     1
1    b     1
2    c     2
3    d     2
4    e     3

>>> grouped = df.groupby('state')['city']
>>> grouped
<pandas.core.groupby.SeriesGroupBy object at 0x05F22110>

>>> listed = grouped.apply(list)
>>> listed
state
1        [a, b]
2        [c, d]
3           [e]
Name: city, dtype: object

>>> listed.to_dict() # magic again !
{'1': ['a', 'b'], '3': ['e'], '2': ['c', 'd']}
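The dictionary above is keyed by state, and its key order is not sorted in that output. A small sketch of the last step, with no pandas required, that turns such a dictionary into the list format asked for in the question. The helper name to_records is hypothetical, and the literal below simply copies the output shown above.

```python
def to_records(state_to_cities):
    """Turn {'state': [cities]} into a list of dicts, sorted by state key."""
    return [{'state': state, 'cities': list(cities)}
            for state, cities in sorted(state_to_cities.items())]


# copied from the listed.to_dict() output above
as_dict = {'1': ['a', 'b'], '3': ['e'], '2': ['c', 'd']}
records = to_records(as_dict)
```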

More complex examples, including grouped.apply(custom_fct), can be found here:

Pandas groupby: How to get a union of strings

Answer 1: 2, 2014-07-31 16:28:18
Answer 2: 2, 2014-07-31 17:10:29
Answer 3: 1, 2014-07-31 16:32:49
Answer 4: 0, accepted, 2014-09-26 08:11:07
