How to group a list of matching strings by elements of another list of strings in Python 3

Question

I have 744 image files whose names follow this scheme: 'mission_code_coord_date1_date2_01_T1/2_Bnumber.TIF'. For example, as in this list:

files = [
'LM02_L1TP_028046_19760327_20180424_01_T2_B6.TIF', #--¬
'LM02_L1TP_028047_19760327_20180424_01_T2_B6.TIF', #---note match except in the 'coord' part
'LT05_L1TP_026046_19951010_20170106_01_T1_B5.TIF',
'LT05_L1TP_026047_19951010_20170107_01_T1_B5.TIF',
'LC08_L1TP_026047_20150713_20170226_01_T1_B1.TIF']
#---------^-----^
#         9    15

The objective is to group the files into sublists when their 'mission_code' and 'date1_date2_01_T1/2_Bnumber.TIF' parts match, so the output would be a list like this:

ord_files=[
    ['LM02_L1TP_028046_19760327_20180424_01_T2_B6.TIF','LM02_L1TP_028047_19760327_20180424_01_T2_B6.TIF'],
    ['LT05_L1TP_026046_19951010_20170106_01_T1_B5.TIF','LT05_L1TP_026047_19951010_20170107_01_T1_B5.TIF'],
    ['LC08_L1TP_026047_20150713_20170226_01_T1_B1.TIF','']]

Some files come in pairs or triplets, and some are alone. My idea was to strip the coord part from each string and store the result in a new list, mo_files, so that it would be easy to filter and then build the output list, ord_files, with a conditional.

With that in mind, so far I have tried things like:

for k in range(len(files)):
    mo_files[k][:] = files[k][9] + files[k][15]

But I only get errors like IndexError: list index out of range. Is there a simpler or better method?

Thanks.

Answer 1

You can use:

d = {} # you can also use collections.defaultdict

for f in files:
    d.setdefault(tuple(e for i, e in enumerate(f.split('_')) if i != 2), []).append(f)
list(d.values())

Output:

[['LM02_L1TP_028046_19760327_20180424_01_T2_B6.TIF',
  'LM02_L1TP_028047_19760327_20180424_01_T2_B6.TIF'],
 ['LT05_L1TP_026046_19951010_20170106_01_T1_B5.TIF'],
 ['LT05_L1TP_026047_19951010_20170107_01_T1_B5.TIF'],
 ['LC08_L1TP_026047_20150713_20170226_01_T1_B1.TIF']]
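
The same grouping can be written as a self-contained script. This is only a sketch: the helper names `group_key` and `group_files` are made up for illustration and do not come from the answer. It also makes it easier to see why the two LT05 files end up in separate groups: their date2 fields differ (20170106 vs 20170107), so their keys differ.

```python
files = [
    'LM02_L1TP_028046_19760327_20180424_01_T2_B6.TIF',
    'LM02_L1TP_028047_19760327_20180424_01_T2_B6.TIF',
    'LT05_L1TP_026046_19951010_20170106_01_T1_B5.TIF',
    'LT05_L1TP_026047_19951010_20170107_01_T1_B5.TIF',
    'LC08_L1TP_026047_20150713_20170226_01_T1_B1.TIF',
]


def group_key(filename):
    """Return every '_'-separated field except the third one (coord)."""
    parts = filename.split('_')
    return tuple(parts[:2] + parts[3:])


def group_files(filenames):
    """Group file names whose keys match, keeping first-seen order."""
    groups = {}
    for name in filenames:
        groups.setdefault(group_key(name), []).append(name)
    return list(groups.values())


ord_files = group_files(files)
for group in ord_files:
    print(group)
```

Files without a partner simply come back as one-element lists, so there is no need to pad them with an empty string.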

Or you can use:

from collections import defaultdict

d = defaultdict(list) 
for f in files:
    d[tuple(e for i, e in enumerate(f.split('_')) if i != 2)].append(f)

list(d.values())

This version is a bit faster.
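
If you want to check the speed difference on your own machine, a rough comparison with `timeit` might look like the sketch below. The function names and the repeat counts are arbitrary choices, not part of the answer, and the timings will vary between machines and Python versions.

```python
import timeit
from collections import defaultdict

files = [
    'LM02_L1TP_028046_19760327_20180424_01_T2_B6.TIF',
    'LM02_L1TP_028047_19760327_20180424_01_T2_B6.TIF',
    'LT05_L1TP_026046_19951010_20170106_01_T1_B5.TIF',
    'LT05_L1TP_026047_19951010_20170107_01_T1_B5.TIF',
    'LC08_L1TP_026047_20150713_20170226_01_T1_B1.TIF',
]


def key(f):
    # Same key as in the answer: all fields except index 2 (coord).
    return tuple(e for i, e in enumerate(f.split('_')) if i != 2)


def with_setdefault(names):
    d = {}
    for f in names:
        d.setdefault(key(f), []).append(f)
    return list(d.values())


def with_defaultdict(names):
    d = defaultdict(list)
    for f in names:
        d[key(f)].append(f)
    return list(d.values())


# Both variants should produce identical groups.
same_result = with_setdefault(files) == with_defaultdict(files)

# Repeat the sample so the list is roughly as long as the 744 files in the question.
sample = files * 150
t_setdefault = timeit.timeit(lambda: with_setdefault(sample), number=200)
t_defaultdict = timeit.timeit(lambda: with_defaultdict(sample), number=200)
print(f"setdefault:  {t_setdefault:.4f} s")
print(f"defaultdict: {t_defaultdict:.4f} s")
```

For a few hundred file names either variant finishes almost instantly, so readability is probably the better reason to pick one over the other.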

Answer 2

If you're into pandas:

import pandas as pd
df = pd.DataFrame(files, columns=["filename"])                                                                                                                                 

# indeed define a "key" that is the whole string without 'coord' part
df["key"] = df.filename.apply(lambda s: s[:9]+s[16:])

Now all you have to do is groupby and aggregate using list:

>>> df.groupby("key").filename.apply(list).values                                                                                                                                  
array([list(['LC08_L1TP_026047_20150713_20170226_01_T1_B1.TIF']),
       list(['LM02_L1TP_028046_19760327_20180424_01_T2_B6.TIF', 'LM02_L1TP_028047_19760327_20180424_01_T2_B6.TIF']),
       list(['LT05_L1TP_026046_19951010_20170106_01_T1_B5.TIF']),
       list(['LT05_L1TP_026047_19951010_20170107_01_T1_B5.TIF'])],
      dtype=object)
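
The same slice-based key also works without pandas, and it is close to what the loop in the question was attempting. As a sketch under two assumptions: the IndexError presumably came from indexing an empty `mo_files` list with `mo_files[k]`, and the coord field always sits at positions 10 to 15. Note that `files[k][9]` and `files[k][15]` each pick a single character, whereas the key needs the slices `[:9]` and `[16:]`.

```python
files = [
    'LM02_L1TP_028046_19760327_20180424_01_T2_B6.TIF',
    'LM02_L1TP_028047_19760327_20180424_01_T2_B6.TIF',
    'LT05_L1TP_026046_19951010_20170106_01_T1_B5.TIF',
    'LT05_L1TP_026047_19951010_20170107_01_T1_B5.TIF',
    'LC08_L1TP_026047_20150713_20170226_01_T1_B1.TIF',
]

# f[:9] is the 'mission_code' part; f[16:] is everything from the
# underscore that follows 'coord' onwards.
mo_files = [f[:9] + f[16:] for f in files]

# Collect the original names under their coord-free key.
groups = {}
for key, name in zip(mo_files, files):
    groups.setdefault(key, []).append(name)

# Sort by key to mirror the key-sorted order of the groupby result above.
ord_files = [groups[key] for key in sorted(groups)]
for group in ord_files:
    print(group)
```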

By the way, if you're not sure whether the indices could change within the 700+ files, a more robust solution is to build the key by splitting on _:

df["key"] = df.filename.apply(
    lambda filename: "_".join([part for idx, part in enumerate(filename.split("_")) if idx != 2])
)
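
One way to decide between the two keys is to check whether they agree on your actual file names. The sketch below does that in plain Python; `slice_key`, `split_key`, and the file name `odd_name` are made up for illustration, and `odd_name` in particular is not a real file from the question.

```python
def slice_key(name):
    # Fixed-position key: assumes coord always occupies characters 10-15.
    return name[:9] + name[16:]


def split_key(name):
    # Field-based key: drops the third '_'-separated field, whatever its length.
    parts = name.split('_')
    return '_'.join(parts[:2] + parts[3:])


files = [
    'LM02_L1TP_028046_19760327_20180424_01_T2_B6.TIF',
    'LM02_L1TP_028047_19760327_20180424_01_T2_B6.TIF',
    'LT05_L1TP_026046_19951010_20170106_01_T1_B5.TIF',
    'LT05_L1TP_026047_19951010_20170107_01_T1_B5.TIF',
    'LC08_L1TP_026047_20150713_20170226_01_T1_B1.TIF',
]

# Names for which the two keys disagree; an empty list means slicing is safe here.
mismatches = [f for f in files if slice_key(f) != split_key(f)]
print(mismatches)

# Hypothetical name with a 7-character coord field, only to show the failure mode.
odd_name = 'LM02_L1TP_0280460_19760327_20180424_01_T2_B6.TIF'
print(slice_key(odd_name))
print(split_key(odd_name))
```

If `mismatches` is empty for all 744 names, either key gives the same groups; otherwise the split-based key is the safer choice.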

Answer 1: score 1, accepted, 2020-03-31 12:51:53
Answer 2: score 1, 2020-03-31 13:08:05
