How to group in a list matching strings from elements from other string lists in Python 3
I have 744 image files whose names follow this scheme: 'mission_code_coord_date1_date2_01_T1/2_Bnumber.TIF'. For example, as in this list:
files = [
'LM02_L1TP_028046_19760327_20180424_01_T2_B6.TIF', #--¬
'LM02_L1TP_028047_19760327_20180424_01_T2_B6.TIF', #---note match except in the 'coord' part
'LT05_L1TP_026046_19951010_20170106_01_T1_B5.TIF',
'LT05_L1TP_026047_19951010_20170107_01_T1_B5.TIF',
'LC08_L1TP_026047_20150713_20170226_01_T1_B1.TIF']
#---------^-----^
# 9 15
The objective is to group the files into sublists for those whose 'mission_code' and 'date1_date2_01_T1/2_Bnumber.TIF' parts match, so the output would be a list of lists like this:
ord_files=[
['LM02_L1TP_028046_19760327_20180424_01_T2_B6.TIF','LM02_L1TP_028047_19760327_20180424_01_T2_B6.TIF'],
['LT05_L1TP_026046_19951010_20170106_01_T1_B5.TIF','LT05_L1TP_026047_19951010_20170107_01_T1_B5.TIF'],
['LC08_L1TP_026047_20150713_20170226_01_T1_B1.TIF','']]
Some files have a pair or a triplet, and some are alone. My idea was to remove the 'coord' part of each string and store the result in a new list, mo_files, so that it would be easy to filter and then build the output list, ord_files, with a conditional.
Along those lines, so far I have tried things like:
for k in range(len(files)):
    mo_files[k][:] = files[k][9] + files[k][15]
But I only get errors like IndexError: list index out of range
Is there a simpler or better method?
Thanks.
You can use:
d = {} # you can also use collections.defaultdict
for f in files:
    d.setdefault(tuple(e for i, e in enumerate(f.split('_')) if i != 2), []).append(f)
list(d.values())
Output:
[['LM02_L1TP_028046_19760327_20180424_01_T2_B6.TIF',
'LM02_L1TP_028047_19760327_20180424_01_T2_B6.TIF'],
['LT05_L1TP_026046_19951010_20170106_01_T1_B5.TIF'],
['LT05_L1TP_026047_19951010_20170107_01_T1_B5.TIF'],
['LC08_L1TP_026047_20150713_20170226_01_T1_B1.TIF']]
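Note that the two LT05 files land in separate groups here, unlike in the question's expected output, because their date2 fields differ (20170106 vs 20170107). A minimal sketch for inspecting the keys being built follows; `grouping_key` is a hypothetical helper name introduced only for illustration, not part of the answer above:

```python
def grouping_key(filename):
    """Return every '_'-separated field except the third one (the 'coord' part)."""
    parts = filename.split('_')
    return tuple(part for i, part in enumerate(parts) if i != 2)

# Sample names taken from the question
lm02_a = 'LM02_L1TP_028046_19760327_20180424_01_T2_B6.TIF'
lm02_b = 'LM02_L1TP_028047_19760327_20180424_01_T2_B6.TIF'
lt05_a = 'LT05_L1TP_026046_19951010_20170106_01_T1_B5.TIF'
lt05_b = 'LT05_L1TP_026047_19951010_20170107_01_T1_B5.TIF'

print(grouping_key(lt05_a))
print(grouping_key(lt05_b))
print(grouping_key(lm02_a) == grouping_key(lm02_b))  # True: only 'coord' differs
print(grouping_key(lt05_a) == grouping_key(lt05_b))  # False: date2 differs as well
```

If files with different date2 values should still be grouped together, the key would also have to drop that field (index 4 after splitting), which is an assumption about the intent rather than something the question states.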
Or you can use:
from collections import defaultdict
d = defaultdict(list)
for f in files:
    d[tuple(e for i, e in enumerate(f.split('_')) if i != 2)].append(f)
list(d.values())
This version is a bit faster.
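To check the speed difference on your own machine, a rough sketch along these lines could be used. The function names `key`, `group_setdefault` and `group_defaultdict` are hypothetical wrappers around the two snippets above, and the timings depend on the machine and Python version, so no figures are quoted:

```python
import timeit
from collections import defaultdict

files = [
    'LM02_L1TP_028046_19760327_20180424_01_T2_B6.TIF',
    'LM02_L1TP_028047_19760327_20180424_01_T2_B6.TIF',
    'LT05_L1TP_026046_19951010_20170106_01_T1_B5.TIF',
    'LT05_L1TP_026047_19951010_20170107_01_T1_B5.TIF',
    'LC08_L1TP_026047_20150713_20170226_01_T1_B1.TIF',
]

def key(f):
    # All '_'-separated fields except the 'coord' one (index 2)
    return tuple(e for i, e in enumerate(f.split('_')) if i != 2)

def group_setdefault(names):
    d = {}
    for f in names:
        d.setdefault(key(f), []).append(f)
    return list(d.values())

def group_defaultdict(names):
    d = defaultdict(list)
    for f in names:
        d[key(f)].append(f)
    return list(d.values())

# Both variants should produce the same grouping, in the same order
assert group_setdefault(files) == group_defaultdict(files)

print(timeit.timeit(lambda: group_setdefault(files), number=10_000))
print(timeit.timeit(lambda: group_defaultdict(files), number=10_000))
```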
If you're into pandas:
import pandas as pd
df = pd.DataFrame(files, columns=["filename"])
# indeed define a "key" that is the whole string without 'coord' part
df["key"] = df.filename.apply(lambda s: s[:9]+s[16:])
Now all you have to do is groupby and aggregate using list:
>>> df.groupby("key").filename.apply(list).values
array([list(['LC08_L1TP_026047_20150713_20170226_01_T1_B1.TIF']),
list(['LM02_L1TP_028046_19760327_20180424_01_T2_B6.TIF', 'LM02_L1TP_028047_19760327_20180424_01_T2_B6.TIF']),
list(['LT05_L1TP_026046_19951010_20170106_01_T1_B5.TIF']),
list(['LT05_L1TP_026047_19951010_20170107_01_T1_B5.TIF'])],
dtype=object)
By the way, if you're not sure whether the character positions stay the same across all 700+ files, a more robust solution is to build the key by splitting on _:
df["key"] = df.filename.apply(
    lambda filename: "_".join([part for idx, part in enumerate(filename.split("_")) if idx != 2])
)
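The difference between the two key definitions can be sketched without pandas. `slice_key` and `split_key` are hypothetical names for the two lambdas above, and `odd_name` is a made-up filename (with a 7-digit 'coord' field) used only to show where fixed positions could break:

```python
def slice_key(s):
    # Fixed-position variant: drops characters 9..15 ('_' plus a 6-digit coord)
    return s[:9] + s[16:]

def split_key(s):
    # Field-based variant: drops the third '_'-separated field, whatever its length
    return "_".join(part for idx, part in enumerate(s.split("_")) if idx != 2)

real_name = 'LM02_L1TP_028046_19760327_20180424_01_T2_B6.TIF'   # from the question
odd_name = 'LM02_L1TP_0280461_19760327_20180424_01_T2_B6.TIF'   # hypothetical

print(slice_key(real_name) == split_key(real_name))  # True: both agree on the real scheme
print(slice_key(odd_name))   # a stray digit from 'coord' is left in the key
print(split_key(odd_name))   # same key as for real_name
```

As long as every filename follows the exact scheme from the question, both keys group the files identically; the split-based one simply tolerates fields of varying width.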