I have a list of dicts structured like this:
[
{'state': '1', 'city': 'a'},
{'state': '1', 'city': 'b'},
{'state': '2', 'city': 'c'},
{'state': '2', 'city': 'd'},
{'state': '3', 'city': 'e'}
]
I want to pack it like this:
[
{'state': '1', 'cities': ['a', 'b']},
{'state': '2', 'cities': ['c', 'd']},
{'state': '3', 'cities': ['e']}
]
I have a two-step approach that works but is very slow (my list has more than 10,000 items and my dicts are complex):
def pack(iterable):
    # step 1: lists -> super slow! contains duplicates
    listed = [{'state': i['state'],
               'cities': [c['city'] for c in iterable if c['state'] == i['state']]}
              for i in iterable]
    # step 2: remove duplicates
    packed = [l for n, l in enumerate(listed) if l not in listed[n+1:]]
    return packed
Any advice on how to optimize it?
PS: Suggestions for the title of the thread are welcome.
Edit of 2014/09/26: I just discovered pandas, a non-standard library that is helpful in this case.
There are more examples in my self-answer below.
state_merged = {}
for s in states:
    state_merged.setdefault(s['state'], []).append(s['city'])
states = [{'state': k, 'cities': v} for k, v in state_merged.iteritems()]
If you are using Python 3, use state_merged.items() instead of state_merged.iteritems().
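For reference, here is the same idea as a self-contained Python 3 sketch. The function name pack_with_setdefault is only illustrative, and the order of the result relies on dicts preserving insertion order (guaranteed from Python 3.7 onwards).

```python
def pack_with_setdefault(states):
    """Group city names by state in one pass (hypothetical helper name)."""
    state_merged = {}
    for s in states:
        # setdefault creates the empty list the first time a state is seen
        state_merged.setdefault(s['state'], []).append(s['city'])
    # Python 3: dict.items() replaces dict.iteritems()
    return [{'state': k, 'cities': v} for k, v in state_merged.items()]


states = [
    {'state': '1', 'city': 'a'},
    {'state': '1', 'city': 'b'},
    {'state': '2', 'city': 'c'},
    {'state': '2', 'city': 'd'},
    {'state': '3', 'city': 'e'},
]
packed = pack_with_setdefault(states)
print(packed)
```

Each input dict is visited exactly once, so the cost grows linearly with the length of the list.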
The following does not require a pre-sorted iterable and runs in O(n) time. However, it assumes an asymmetry between state and the other dictionary keys (which, given your example, seems to be a correct assumption).
import collections

def pack(iterable):
    out = collections.defaultdict(list)  # or use defaultdict(set)
    for d in iterable:
        out[d['state']].append(d['city'])
    return out
it = [
{'state': '1', 'city': 'a'},
{'state': '1', 'city': 'b'},
{'state': '2', 'city': 'c'},
{'state': '2', 'city': 'd'},
{'state': '3', 'city': 'e'}
]
pack(it) == {'1': ['a', 'b'],
'2': ['c', 'd'],
'3': ['e']}
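The defaultdict(set) variant mentioned in the comment could look roughly like the sketch below. The name pack_unique and the sample rows are assumptions made for illustration. Sets drop duplicate cities but lose ordering, so the lists are sorted here to keep the output deterministic.

```python
import collections


def pack_unique(iterable):
    """Like pack(), but keeps each city once per state (illustrative name)."""
    out = collections.defaultdict(set)
    for d in iterable:
        out[d['state']].add(d['city'])  # set.add ignores repeated cities
    # Sets are unordered, so sort to get a predictable list per state
    return {state: sorted(cities) for state, cities in out.items()}


rows = [
    {'state': '1', 'city': 'b'},
    {'state': '1', 'city': 'a'},
    {'state': '1', 'city': 'b'},  # duplicate entry
    {'state': '2', 'city': 'c'},
]
unique = pack_unique(rows)
print(unique)
```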
If you need to return an iterable in the same format as requested, you could convert out into a list.
def convert(out):
    final = []
    for state, cities in out.iteritems():  # Python 3: use .items()
        final.append({'state': state, 'cities': cities})
    return final

convert(pack(it)) == [
{'state': '1', 'cities': ['a', 'b']},
{'state': '2', 'cities': ['c', 'd']},
{'state': '3', 'cities': ['e']}
]
If you have more than just 2 keys in your input, you would need to make the following changes:
it = [{'state': 'WA', 'city': 'Seattle', 'zipcode': 98101, 'city_population': 9426},
{'state': 'OR', 'city': 'Portland', 'zipcode': 97225, 'city_population': 24749},
{'state': 'WA', 'city': 'Spokane', 'zipcode': 99201, 'city_population': 12523}]
def citydata():
    # or use a namedtuple('Location', 'city zipcode state_population')
    return {'city': [], 'zipcode': [], 'state_population': 0}

def pack(iterable):
    out = collections.defaultdict(citydata)
    for d in iterable:
        out[d['state']]['city'].append(d['city'])
        out[d['state']]['zipcode'].append(d['zipcode'])
        out[d['state']]['state_population'] += d['city_population']
    return out
pack(it) == {
'WA':
{'city': ['Seattle', 'Spokane'], 'zipcode': [98101, 99201], 'state_population': 21949},
'OR':
{'city': ['Portland'], 'zipcode': [97225], 'state_population': 24749}
}
The convert function would need to be adjusted accordingly.
convert(pack(it)) == [
{'state': 'WA', 'city': ['Seattle', 'Spokane'], 'zipcode': [98101, 99201], 'state_population': 21949},
{'state': 'OR', 'city': ['Portland'], 'zipcode': [97225], 'state_population': 24749}
]
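One possible way to adjust convert for the multi-key case is sketched below in Python 3. It simply copies every aggregated field next to the state key. This is one reading of "adjusted accordingly", not necessarily the only one.

```python
import collections


def citydata():
    return {'city': [], 'zipcode': [], 'state_population': 0}


def pack(iterable):
    out = collections.defaultdict(citydata)
    for d in iterable:
        entry = out[d['state']]
        entry['city'].append(d['city'])
        entry['zipcode'].append(d['zipcode'])
        entry['state_population'] += d['city_population']
    return out


def convert(out):
    """Flatten {state: data} into a list with one dict per state."""
    final = []
    for state, data in out.items():
        row = {'state': state}
        row.update(data)  # copy the aggregated fields next to the state key
        final.append(row)
    return final


it = [
    {'state': 'WA', 'city': 'Seattle', 'zipcode': 98101, 'city_population': 9426},
    {'state': 'OR', 'city': 'Portland', 'zipcode': 97225, 'city_population': 24749},
    {'state': 'WA', 'city': 'Spokane', 'zipcode': 99201, 'city_population': 12523},
]
result = convert(pack(it))
print(result)
```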
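To maintain the order of the original iterable, use an ordered defaultdict instead of a plain defaultdict. The standard library has no OrderedDefaultdict class, so this means either a custom class or collections.OrderedDict combined with setdefault.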
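As a minimal sketch of order-preserving grouping that sticks to the standard library, collections.OrderedDict with setdefault gives the same effect as an ordered defaultdict. The name pack_ordered and the sample rows are made up for this example. On Python 3.7 and later a plain dict or defaultdict already keeps insertion order, so this mainly matters on older interpreters.

```python
import collections


def pack_ordered(iterable):
    """Group cities by state, keeping states in first-seen order."""
    out = collections.OrderedDict()
    for d in iterable:
        # setdefault plays the role of the defaultdict factory here
        out.setdefault(d['state'], []).append(d['city'])
    return out


rows = [
    {'state': '2', 'city': 'c'},
    {'state': '1', 'city': 'a'},
    {'state': '2', 'city': 'd'},
    {'state': '3', 'city': 'e'},
]
ordered = pack_ordered(rows)
print(list(ordered.items()))
```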
Here's a more functional approach that's a lot faster:
import itertools

def pack(original):
    return [
        {'state': state, 'cities': [element['city'] for element in group]}
        for state, group
        in itertools.groupby(original, lambda e: e['state'])
    ]
This assumes that each state has all its members listed consecutively in the original list.
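If that assumption does not hold, a common workaround is to sort by state before grouping. The sketch below (pack_sorted is a hypothetical name) does this. Note that the sort makes it O(n log n) rather than O(n), and the result comes out in sorted state order rather than the original order.

```python
import itertools
import operator


def pack_sorted(original):
    """groupby-based packing that tolerates interleaved states."""
    get_state = operator.itemgetter('state')
    # sorted() is stable, so cities keep their relative order within a state
    ordered = sorted(original, key=get_state)
    return [
        {'state': state, 'cities': [e['city'] for e in group]}
        for state, group in itertools.groupby(ordered, key=get_state)
    ]


interleaved = [
    {'state': '2', 'city': 'c'},
    {'state': '1', 'city': 'a'},
    {'state': '2', 'city': 'd'},
    {'state': '1', 'city': 'b'},
]
packed_sorted = pack_sorted(interleaved)
print(packed_sorted)
```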
The reason your current approach is so much slower is that it has to iterate over the entire list for every state id found. That is known as an O(n^2) approach. This approach needs to iterate over the source list only once, so it is O(n).
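To see the difference on your own machine, a rough timing sketch like the following may help. The synthetic data (500 rows, 10 cities per state) and the function names are assumptions made for this example, and the actual timings will vary by machine and Python version.

```python
import itertools
import timeit


def pack_quadratic(iterable):
    # The two-step approach from the question: rescans the list for every item
    listed = [{'state': i['state'],
               'cities': [c['city'] for c in iterable if c['state'] == i['state']]}
              for i in iterable]
    return [l for n, l in enumerate(listed) if l not in listed[n + 1:]]


def pack_linear(original):
    # The groupby approach: a single pass over consecutive groups
    return [
        {'state': state, 'cities': [e['city'] for e in group]}
        for state, group in itertools.groupby(original, lambda e: e['state'])
    ]


# Synthetic input with each state's rows listed consecutively
data = [{'state': str(i // 10), 'city': 'city%d' % i} for i in range(500)]

same_result = pack_quadratic(data) == pack_linear(data)
slow = timeit.timeit(lambda: pack_quadratic(data), number=3)
fast = timeit.timeit(lambda: pack_linear(data), number=3)
print('same result:', same_result)
print('quadratic: %.4fs, linear: %.4fs' % (slow, fast))
```

Doubling the input size should roughly quadruple the first timing while only doubling the second.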
I just discovered the pandas library (which is not part of the standard library) after some trouble installing it on my Windows Python 2.6.5 (exe here: http://www.lfd.uci.edu/~gohlke/pythonlibs/#pandas ).
Website: http://pandas.pydata.org/pandas-docs/stable/
General presentation:
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.
Pandas will be familiar to those already using numpy and R.
Here is how to solve my problem with pandas:
>>> import pandas as pd
>>> raw = [{'state': '1', 'city': 'a'},
{'state': '1', 'city': 'b'},
{'state': '2', 'city': 'c'},
{'state': '2', 'city': 'd'},
{'state': '3', 'city': 'e'}]
>>> df = pd.DataFrame(raw) # magic !
>>> df
city state
0 a 1
1 b 1
2 c 2
3 d 2
4 e 3
>>> grouped = df.groupby('state')['city']
>>> grouped
<pandas.core.groupby.SeriesGroupBy object at 0x05F22110>
>>> listed = grouped.apply(list)
>>> listed
state
1 [a, b]
2 [c, d]
3 [e]
Name: city, dtype: object
>>> listed.to_dict() # magic again !
{'1': ['a', 'b'], '3': ['e'], '2': ['c', 'd']}
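To get back the list-of-dicts layout asked for in the question rather than a plain dict, something along these lines should work with a reasonably recent pandas. This is a sketch written as a script rather than a REPL session. The column name cities is passed to reset_index, and groupby sorts the states by default.

```python
import pandas as pd

raw = [
    {'state': '1', 'city': 'a'},
    {'state': '1', 'city': 'b'},
    {'state': '2', 'city': 'c'},
    {'state': '2', 'city': 'd'},
    {'state': '3', 'city': 'e'},
]

df = pd.DataFrame(raw)
packed = (
    df.groupby('state')['city']
      .apply(list)                  # one Python list of cities per state
      .reset_index(name='cities')   # Series -> DataFrame with state, cities
      .to_dict('records')           # one dict per row
)
print(packed)
```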
More complex examples, including grouped.apply(custom_fct), here: