简体   繁体   中英

Python - categorizing items in a list based on occurrence in a dictionary of lists

I have a dataset like this (simplified):

foods_dict = {}
foods_dict['fruit'] = ['apple', 'orange', 'plum']
foods_dict['veg'] = ['cabbage', 'potato', 'carrot']

And I have a list of items I want to categorize:

items = ['orange', 'potato', 'cabbage', 'plum', 'farmer', 'egg']

I want to be able to assign item from the items list into smaller lists based on their occurrence in the foods_dict . I think these sublists should actually be sets as I don't want any duplicates in there.

My first pass at the code was like this:

fruits = set()
veggies = set()
others = set()
for item in items:
    if item in foods_dict.get('fruit'):
        fruits.add(item)
    elif item in foods_dict.get('veg'):
        veggies.add(item)
    else:
        others.add(item)

But this seems really inefficient and needlessly verbose to me. My question is, how could this code be improved? I'm guessing list comprehension could be useful here, but I'm not sure with the number of lists.

For an efficient solution you want to avoid explicit loops as much as possible:

items = set(items)
fruits = set(foods_dict['fruit']) & items
veggies = set(foods_dict['veg']) & items
others = items - fruits - veggies

This will almost surely be faster than using explicit loops. In particular doing item in foods_dict['fruit'] is time consuming if the list of fruits is long.


A very simple benchmark between the solutions so far:

In [5]: %%timeit
   ...: items2 = set(items)
   ...: fruits = set(foods_dict['fruit']) & items2
   ...: veggies = set(foods_dict['veg']) & items2
   ...: others = items2 - fruits - veggies
   ...: 
1000000 loops, best of 3: 1.75 us per loop

In [6]: %%timeit
   ...: fruits = set()
   ...: veggies = set()
   ...: others = set()
   ...: for item in items:
   ...:     if item in foods_dict.get('fruit'):
   ...:         fruits.add(item)
   ...:     elif item in foods_dict.get('veg'):
   ...:         veggies.add(item)
   ...:     else:
   ...:         others.add(item)
   ...: 
100000 loops, best of 3: 2.57 us per loop

In [7]: %%timeit
   ...: veggies = set(elem for elem in items if elem in foods_dict['veg'])
   ...: fruits = set(elem for elem in items if elem in foods_dict['fruit'])
   ...: others = set(items) - veggies - fruits
   ...: 
100000 loops, best of 3: 3.34 us per loop

Surely, before choosing you should do some tests with "real inputs". I have no idea on the number of elements you have in your problem, and timings may change a lot with bigger inputs. Anyway my experience tells me that, at least in CPython, explicit loops tend to be slower than using only built-in operations.


Edit2: An example with bigger inputs:

In [9]: foods_dict = {}
   ...: foods_dict['fruit'] = list(range(0, 10000, 2))
   ...: foods_dict['veg'] = list(range(1, 10000, 2))

In [10]: items = list(range(5, 10000, 13))  #some odd some even

In [11]: %%timeit
    ...: fruits = set()
    ...: veggies = set()
    ...: others = set()
    ...: for item in items:
    ...:     if item in foods_dict.get('fruit'):
    ...:         fruits.add(item)
    ...:     elif item in foods_dict.get('veg'):
    ...:         veggies.add(item)
    ...:     else:
    ...:         others.add(item)
    ...: 
10 loops, best of 3: 68.8 ms per loop

In [12]: %%timeit
    ...: veggies = set(elem for elem in items if elem in foods_dict['veg'])
    ...: fruits = set(elem for elem in items if elem in foods_dict['fruit'])
    ...: others = set(items) - veggies - fruits
    ...: 
10 loops, best of 3: 99.9 ms per loop

In [13]: %%timeit
    ...: items2 = set(items)
    ...: fruits = set(foods_dict['fruit']) & items2
    ...: veggies = set(foods_dict['veg']) & items2
    ...: others = items2 - fruits - veggies
    ...: 
1000 loops, best of 3: 445 us per loop

As you can see using only built-ins is about 20x faster then explicit loops.

This might do what you are looking for (for eg the veggies case):

veggies = set(elem for elem in items if elem in foods_dict['veg'])

More fully:

veggies = set(elem for elem in items if elem in foods_dict['veg'])
fruits = set(elem for elem in items if elem in foods_dict['fruit'])
others = set(items) - veggies - fruits

How about something like this (avoiding list comprehensions using just set operations):

fruits = set(items).intersection(set(foods_dict['fruit']))
veggies = set(items).intersection(set(foods_dict['veg']))
others = set(items).difference(veggies.union(fruits))

And you can probably just start out with sets to avoid the set() conversion if you can help it.

Hope that helps!

EDIT: seems like you are concerned with either efficiency OR verbosity (and being "pythonic"). If you are concerned about efficiency remember that between the bytecode compiler and the interpreter you have no idea what optimizations (if any) are being implemented. It's generally difficult to optimize things in such a high level. Possible, but you need some benchmarks first. If you are concerned with being pythonic, I would try to be more high level (can I say declarative here? Or are we not there yet :) ).

In other words, instead of looping over and telling python exactly how it should decide which item goes where, I would try to be readable, clear and succinct. I think (because I wrote the above) that this style tells the reader exactly what you want to do with the list of items.

Hope this helps, all of this is just my opinion and should be taken with a grain of salt.

Here's a more general if you have more categories. (So you don't have a separate variable for each category.)

from collections import defaultdict

foods_dict = {}
foods_dict['fruit'] = set(['apple', 'orange', 'plum'])
foods_dict['veg']   = set(['cabbage', 'potato', 'carrot'])

items = set(['orange', 'potato', 'cabbage', 'plum', 'farmer', 'egg'])

dict_items = set.union(*foods_dict.values())

assignments = defaultdict(set)

assignments['other'] = dict_items.copy()
for key in foods_dict.keys():
    assignments[key] = foods_dict[key] & items
    assignments['other'] -= foods_dict[key]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM