
How to group similarly named elements in a list into tuples in python?

I have read the names of all the files in a directory into a Python list like this:

files = ['ch1.txt', 'ch2.txt', 'ch3_1.txt', 'ch4_2.txt', 'ch3_2.txt', 'ch4_1.txt'] 

What I want to do is group similar files as tuples in the list. The above example should look like

files_grouped = ['ch1.txt', 'ch2.txt', ('ch3_1.txt', 'ch3_2.txt'), ('ch4_1.txt', 'ch4_2.txt')]

One thing I have tried is to separate the elements that need grouping from the rest of the list, like so:

groups = tuple(file for file in files if '_' in file)
single = [file for file in files if '_' not in file]

I would then build a new list by combining both. But how do I create the groups as a list of tuples, one for ch3 and one for ch4, like [('ch3_1.txt', 'ch3_2.txt'), ('ch4_1.txt', 'ch4_2.txt')], instead of one big tuple?

None of the other answers gives a generic solution that works for arbitrary file names. If you need to account for that, I think you should use a regex.

import itertools
import re

# Raw strings keep \d from being treated as a string escape
sorted_files = sorted(files, key=lambda x: re.findall(r'(\d+)_(\d+)', x))
out = [list(g) for _, g in itertools.groupby(sorted_files,
                       key=lambda x: re.search(r'\d+', x).group())]

print(out)
[['ch1.txt'],
 ['ch2.txt'],
 ['ch3_1.txt', 'ch3_2.txt'],
 ['ch4_1.txt', 'ch4_2.txt']]

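Note that this is not tied to the ch prefix: it should work for other naming formats in which the first number in the name is the chapter, optionally followed by _N for the subsection. It does assume that every name contains at least one digit (otherwise re.search returns None), and a chapter that mixes suffixed and unsuffixed names, such as ch5.txt and ch5_1.txt, may end up in separate groups.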

If you want your output in the exact format described, you could do a little extra post-processing:

out = [o[0] if len(o) == 1 else tuple(o) for o in out]
print(out)
['ch1.txt', 'ch2.txt', ('ch3_1.txt', 'ch3_2.txt'), ('ch4_1.txt', 'ch4_2.txt')]
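
The sort key above compares digits as strings and gives every name without an underscore the same empty key, so a variant that sorts on integer (chapter, subsection) pairs may be safer for larger directories. This is only a sketch, assuming each name contains a chapter number with an optional _N subsection; chapter_key and group_files are hypothetical helper names, not part of the answer above.

```python
import itertools
import re

def chapter_key(name):
    # Assumption: the first digit run is the chapter, an optional "_N" is the subsection
    m = re.search(r'(\d+)(?:_(\d+))?', name)
    if m is None:
        raise ValueError(f'no chapter number in {name!r}')
    return int(m.group(1)), int(m.group(2) or 0)

def group_files(names):
    # Sort numerically so that chapter 10 comes after chapter 2
    ordered = sorted(names, key=chapter_key)
    groups = (list(g) for _, g in
              itertools.groupby(ordered, key=lambda n: chapter_key(n)[0]))
    # Unwrap single-file chapters, keep multi-file chapters as tuples
    return [g[0] if len(g) == 1 else tuple(g) for g in groups]

files = ['ch10_2.txt', 'ch2.txt', 'ch10_1.txt', 'ch1.txt']
print(group_files(files))
# ['ch1.txt', 'ch2.txt', ('ch10_1.txt', 'ch10_2.txt')]
```

Sorting and grouping on the same parsed chapter number keeps files of one chapter adjacent, which is what groupby relies on.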

Regex Details

The first regex supplies the sort key: it captures the chapter number and the subsection number.

(       # first group 
\d+     # 1 or more digits
)
_       # literal underscore
(       # second group
\d+     # 1 or more digits
)

The second regex, \d+, matches the first run of digits in the name (the chapter number) and is used as the grouping key, so all files belonging to the same chapter are grouped together.
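
To make the two patterns concrete, here is a small check of what each one extracts from names taken from the question. The function names sort_key and group_key are mine, used only for illustration.

```python
import re

def sort_key(name):
    # All (chapter, subsection) digit pairs; an empty list when there is no "N_M" part
    return re.findall(r'(\d+)_(\d+)', name)

def group_key(name):
    # First run of digits in the name, i.e. the chapter number as a string
    return re.search(r'\d+', name).group()

print(sort_key('ch3_2.txt'))   # [('3', '2')]
print(sort_key('ch1.txt'))     # []
print(group_key('ch3_2.txt'))  # 3
```

Because an empty list compares as smaller than any non-empty list, names without a subsection sort ahead of the others and keep their input order.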

You could use a dictionary (or, for simpler initialisation, a collections.defaultdict):

from collections import defaultdict
from pprint import pprint

files = ['ch1.txt', 'ch2.txt', 'ch3_1.txt', 'ch4_2.txt', 'ch3_2.txt', 'ch4_1.txt']

grouped = defaultdict(list)  # missing keys start out as an empty list

for f in files:
    key = f[:3]  # first three characters, e.g. 'ch3'; assumes single-digit chapters
    grouped[key].append(f)

pprint(grouped)

Result:

defaultdict(<class 'list'>,
            {'ch1': ['ch1.txt'],
             'ch2': ['ch2.txt'],
             'ch3': ['ch3_1.txt', 'ch3_2.txt'],
             'ch4': ['ch4_2.txt', 'ch4_1.txt']})

If you want a list of tuples, you can do:

grouped = [tuple(group) for group in grouped.values()]

which gives:

[('ch1.txt',),
 ('ch2.txt',),
 ('ch3_1.txt', 'ch3_2.txt'),
 ('ch4_2.txt', 'ch4_1.txt')]
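
The f[:3] key assumes every prefix is exactly three characters long, so a name like ch10_1.txt would land under 'ch1'. A possible variant, offered as a sketch rather than as the answerer's method, derives the key by dropping the extension and splitting on the underscore, and unwraps single-file groups to match the format asked for in the question. group_by_prefix is a hypothetical name.

```python
from collections import defaultdict
import os

def group_by_prefix(names):
    grouped = defaultdict(list)
    for name in sorted(names):
        stem = os.path.splitext(name)[0]          # 'ch3_1.txt' -> 'ch3_1'
        grouped[stem.split('_')[0]].append(name)  # key is 'ch3'
    # Unwrap single-file groups, keep the rest as tuples
    return [v[0] if len(v) == 1 else tuple(v) for v in grouped.values()]

files = ['ch1.txt', 'ch2.txt', 'ch3_1.txt', 'ch4_2.txt', 'ch3_2.txt', 'ch4_1.txt']
print(group_by_prefix(files))
# ['ch1.txt', 'ch2.txt', ('ch3_1.txt', 'ch3_2.txt'), ('ch4_1.txt', 'ch4_2.txt')]
```

Note that plain sorted() orders names lexicographically, so ch10 would appear before ch2 in the result; the grouping itself is unaffected.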

Alternatively, you can sort the list of file names and then use groupby():

For example:

from itertools import groupby

files = ['ch1.txt', 'ch2.txt', 'ch3_1.txt', 'ch4_2.txt', 'ch3_2.txt', 'ch4_1.txt']

# x[:-4] strips the '.txt' extension; the part before '_' is the grouping key
groups = [tuple(g) for _, g in groupby(sorted(files), key=lambda x: x[:-4].split("_")[0])]
print(groups)

Result:

[('ch1.txt',), ('ch2.txt',), ('ch3_1.txt', 'ch3_2.txt'), ('ch4_1.txt', 'ch4_2.txt')]
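
The sorted() call matters here because itertools.groupby only merges consecutive items that share a key. A short sketch comparing the two cases on the list from the question (prefix is a hypothetical helper that wraps the same key expression):

```python
from itertools import groupby

def prefix(name):
    # Strip the 4-character '.txt' extension, then take the part before '_'
    return name[:-4].split('_')[0]

files = ['ch1.txt', 'ch2.txt', 'ch3_1.txt', 'ch4_2.txt', 'ch3_2.txt', 'ch4_1.txt']

unsorted_groups = [tuple(g) for _, g in groupby(files, key=prefix)]
sorted_groups = [tuple(g) for _, g in groupby(sorted(files), key=prefix)]

print(len(unsorted_groups))  # 6, because the ch3 and ch4 files are interleaved
print(len(sorted_groups))    # 4
```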

Hope this helps.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 