
How to use itertools.groupby with a true/false lambda function

Suppose I have the following string:

data = """
Pakistan[country]
Karachi
lahore
islamabad
UAE[country]
dubai
sharjah
India[country]
goa
chennai
"""

How can I use itertools.groupby here to build a dict with the countries as keys and their corresponding cities as values? The closest I have come is:

from itertools import groupby

filtered = (line for line in data.split("\n") if line)
for key, values in groupby(filtered, lambda line: line.endswith('[country]')):
    print(key)
    print(list(values))

However, how do I group the result properly? I am not interested in other possible solutions (I have already written a generator function myself); I explicitly want to use and understand itertools.groupby.


My generator function yields:

{'Pakistan': ['Karachi', 'lahore', 'islamabad']}
{'UAE': ['dubai', 'sharjah']}
{'India': ['goa', 'chennai']}

I think groupby is the wrong tool for this, because it collects all successive items that give the same result when the key function is applied to them. From the problem description, however, it seems that you rather want to "split" your list whenever the function returns True.
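To make that behaviour concrete, here is a minimal sketch on a shortened version of the input (the `lines` list and the name `runs` are only for illustration). It shows that groupby yields alternating runs keyed by True and False rather than country/cities pairs:

```python
from itertools import groupby

# Shortened, illustrative input (not the full data from the question)
lines = ["Pakistan[country]", "Karachi", "lahore", "UAE[country]", "dubai"]

# Each run of consecutive lines with the same key result becomes one group
runs = [(key, list(group))
        for key, group in groupby(lines, key=lambda line: line.endswith("[country]"))]

for key, group in runs:
    print(key, group)
# True ['Pakistan[country]']
# False ['Karachi', 'lahore']
# True ['UAE[country]']
# False ['dubai']
```

The country line and its cities end up in separate groups, so some extra step is needed to stitch them back together.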


However, if you really want to (or must) do it with groupby, there are (conceptually) two approaches:

One possible way is to collect pairs from the groupby result: the group for which the key function returned True, together with the following group for which it returned False:

>>> filtered = (line for line in data.split("\n") if line)
>>> l = [list(g) for _, g in groupby(filtered, lambda line: line.endswith('[country]'))]
>>> d = {l[i*2][0].split('[')[0]: l[i*2+1] for i in range(len(l) // 2)}
>>> d
{'Pakistan': ['Karachi', 'lahore', 'islamabad'],
 'UAE': ['dubai', 'sharjah'],
 'India': ['goa', 'chennai']}

Or use some sort of stateful callable as the key function, which remembers what the "current country" is:

class KeepCountry:
    def __call__(self, item):
        if item.endswith('[country]'):
            self._last = item.split('[country]')[0]
        return self._last

>>> filtered = (line for line in data.split("\n") if line)
>>> {k: list(g)[1:] for k, g in groupby(filtered, KeepCountry())}
{'Pakistan': ['Karachi', 'lahore', 'islamabad'],
 'UAE': ['dubai', 'sharjah'],
 'India': ['goa', 'chennai']}

In case you want to use either of these, note that both solutions make quite a few assumptions:

  • the first encountered item will be a country
  • each country has at least one associated city
  • no country name is encountered more than once
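As a rough illustration of why the second assumption matters for the pairing approach, the sketch below uses a hypothetical input in which "Oman" has no cities. The adjacent country lines fall into the same group, so the pairs no longer line up:

```python
from itertools import groupby

# Hypothetical input: "Oman" has no cities listed
lines = ["Pakistan[country]", "Karachi", "Oman[country]", "UAE[country]", "dubai"]

def is_country(line):
    return line.endswith("[country]")

groups = [list(g) for _, g in groupby(lines, is_country)]
# The two adjacent country lines collapse into a single group:
# [['Pakistan[country]'], ['Karachi'], ['Oman[country]', 'UAE[country]'], ['dubai']]

paired = {groups[i * 2][0].split("[")[0]: groups[i * 2 + 1]
          for i in range(len(groups) // 2)}
print(paired)  # {'Pakistan': ['Karachi'], 'Oman': ['dubai']}
```

Here "UAE" is silently dropped and its city is attributed to "Oman", so it is worth validating the input before relying on the pairing.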

If a third-party package is acceptable, you could use iteration_utilities (my library), which provides a split function for iterables:

>>> from iteration_utilities import Iterable

>>> (Iterable(data.split('\n'))
...    .filter(bool)  # Removes empty lines
...    # Split by countries while keeping them
...    .split(lambda l: l.endswith('[country]'), keep_after=True)[1:]  
...    # Convert to a tuple containing the country as first and the cities as second element
...    .map(lambda l: (l[0][:-9], l[1:]))  
...    .as_dict())
{'Pakistan': ['Karachi', 'lahore', 'islamabad'],
 'UAE': ['dubai', 'sharjah'],
 'India': ['goa', 'chennai']}

I'm not sure about itertools, but why not use a plain loop with a defaultdict:

from collections import defaultdict

data = """
Pakistan[country]
Karachi
lahore
islamabad
UAE[country]
dubai
sharjah
India[country]
goa
chennai
"""

dct = defaultdict(list)

country = ''

for x in data.split('\n')[1:-1]:
    if '[country]' in x:
        country = x.replace('[country]', '')
    else:
        dct[country].append(x)

print(dct)

# {'Pakistan': ['Karachi', 'lahore', 'islamabad'], 'UAE': ['dubai', 'sharjah'], 'India': ['goa', 'chennai']}

With this key function, itertools.groupby() returns an alternating sequence of country groups and city groups. When it returns a country group, you save the country. When it returns a city group, you add an entry to the dictionary under the saved country.

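import itertools

filtered = (line for line in data.split("\n") if line)
result = {}
for is_country, values in itertools.groupby(filtered, key=lambda line: line.endswith("[country]")):
    if is_country:
        # Remember the country name, without the "[country]" suffix
        country = next(values).split("[")[0]
    else:
        # The group that follows holds this country's cities
        result[country] = list(values)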
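The loop above reads only the first line of each run of country lines, so a country without cities would be skipped. A possible variant that tolerates such input is sketched below; the function name `group_cities`, its `marker` parameter and the sample input are illustrative assumptions, not part of the original answer:

```python
from itertools import groupby

def group_cities(lines, marker="[country]"):
    """Map each country to its list of cities (hypothetical helper)."""
    result = {}
    country = None
    for is_country, values in groupby(lines, key=lambda line: line.endswith(marker)):
        if is_country:
            # A run may hold several countries in a row; register each one
            for header in values:
                country = header[:-len(marker)]
                result.setdefault(country, [])
        elif country is not None:
            # Cities belong to the most recently seen country
            result[country].extend(values)
    return result

sample = ["Pakistan[country]", "Karachi", "Oman[country]", "UAE[country]", "dubai"]
print(group_cities(sample))
# {'Pakistan': ['Karachi'], 'Oman': [], 'UAE': ['dubai']}
```

Lines that appear before the first country are ignored in this sketch, and a country name that occurs twice has its cities merged into one list.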
