Splitting a Python list based on a regular expression

Question

I have the following Python list:

['chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv', 'daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv', 'daman_and_diu_2002_aa.csv']

How do I separate it into 2 lists:

['chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv'] and ['daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv', 'daman_and_diu_2002_aa.csv']

The lists are split based on the words preceding the year (e.g. 2000).

I know I should use regex in Python, but I am not sure how to do it. Also, the solution needs to be extensible and not dependent on the actual names, e.g. chhattisgarh.

Answer 1

Here is one way to get a dictionary in which each "name" key maps to a list of the strings starting with that name, keeping the order of the original list. It does not use regex; in fact, it uses no modules at all. You can easily modify it to make a function, remove the trailing underscore from each name, check for various errors in the data list, get the resulting lists out of the dictionary, and so on.

If you allow other modules, or allow changes in the order, I'm sure there are other ways.

a = ['chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv',
     'daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv',
     'daman_and_diu_2002_aa.csv']

names_dict = {}
for item in a:
    # Find the index of the first numeric character in the item
    for i, c in enumerate(item):
        if c.isdigit():
            break
    else:
        # No digit found: use the whole item as the name
        i = len(item)
    # Store the string in the dictionary according to its preceding characters
    name = item[:i]
    names_dict.setdefault(name, []).append(item)

print(names_dict)

The result of this code (prettified) is shown below; the order of the keys may differ depending on your Python version:

{'daman_and_diu_': [
    'daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv',
    'daman_and_diu_2002_aa.csv'],
 'chhattisgarh_': [
    'chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv']
}
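The modifications mentioned above (wrapping the loop in a function, dropping the trailing underscore, and pulling the lists out of the dictionary) could look roughly like the sketch below. The function name group_by_prefix is made up for illustration, and the sketch assumes that a string with no digit should be grouped under its full text.

```python
def group_by_prefix(items):
    """Group strings by the text that precedes their first digit."""
    groups = {}
    for item in items:
        # Index of the first digit, or len(item) when there is none (assumption)
        cut = next((i for i, c in enumerate(item) if c.isdigit()), len(item))
        # Drop the trailing underscore so the key is just the name
        name = item[:cut].rstrip('_')
        groups.setdefault(name, []).append(item)
    return groups


a = ['chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv',
     'daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv',
     'daman_and_diu_2002_aa.csv']

groups = group_by_prefix(a)
# The separate lists asked for in the question
separate_lists = list(groups.values())
print(separate_lists)
```

Because nothing in the function refers to a specific name, it should keep working when new prefixes show up in the list.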

Answer 2

You can use itertools.groupby here:

import itertools
import re

files = ['chhattisgarh_2015_aa.csv', 'chhattisgarh_2016_aa.csv',
         'daman_and_diu_2000_aa.csv', 'daman_and_diu_2001_aa.csv',
         'daman_and_diu_2002_aa.csv']

# Raw string so that \d reaches the regex engine unchanged
grouped = itertools.groupby(sorted(files), lambda x: re.match(r'(.+)_\d{4}', x).group(1))

for key, values in grouped:
    print(key)
    print(list(values))

The regex (.+)_\d{4} matches a group of at least one character (which is what we group by) followed by an underscore and 4 digits. The list is sorted first because groupby only groups consecutive items that share the same key.
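To see why the order of the input matters for itertools.groupby, here is a small sketch with a deliberately shuffled list. The helper name prefix and the shuffled sample are made up for illustration. Sorting with the same key function that groupby uses is a slightly safer variant than a plain sorted(), since it guarantees that items with equal keys end up next to each other.

```python
import itertools
import re


def prefix(name):
    # Hypothetical helper: the text before "_<4 digits>"
    return re.match(r'(.+)_\d{4}', name).group(1)


shuffled = ['daman_and_diu_2000_aa.csv', 'chhattisgarh_2015_aa.csv',
            'daman_and_diu_2001_aa.csv']

# Without sorting, the same key shows up twice because its items are not adjacent
unsorted_keys = [key for key, _ in itertools.groupby(shuffled, prefix)]
print(unsorted_keys)

# Sorting by the grouping key first yields one list per prefix
sorted_groups = [list(values)
                 for _, values in itertools.groupby(sorted(shuffled, key=prefix), prefix)]
print(sorted_groups)
```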

Answer 3

Another option is to combine a regular expression with a dictionary:

import re
from collections import defaultdict

files = ["chhattisgarh_2015_aa.csv", "chhattisgarh_2016_aa.csv",
         "daman_and_diu_2000_aa.csv", "daman_and_diu_2001_aa.csv",
         "daman_and_diu_2002_aa.csv"]

groupedFiles = defaultdict(list)
for fileName in files:
    # Everything before the last run of four digits, e.g. "chhattisgarh_"
    prefix = re.findall(r"(.*)\d{4}", fileName)[0]
    groupedFiles[prefix].append(fileName)

print(dict(groupedFiles))

{'chhattisgarh_': ['chhattisgarh_2015_aa.csv',
                   'chhattisgarh_2016_aa.csv'],
 'daman_and_diu_': ['daman_and_diu_2000_aa.csv',
                    'daman_and_diu_2001_aa.csv',
                    'daman_and_diu_2002_aa.csv']}

Answer summary

Answer 1: 4, 2016-06-19 23:16:10
Answer 2: 4, accepted, 2016-06-19 23:16:44
Answer 3: 2, 2016-06-19 23:20:42
