Filter a txt file for lines that satisfy some condition in Python?
I have a txt file containing names of the form subjectid_num_[dog/cat]_[option]:
ID1_0123_CAT_ANIMAL_3
ID1_0123_CAT_ANIMAL_GOOD_3
ID1_0123_ABC_3
ID2_1234_CAT_ANIMAL_3
ID2_1234_CAT_ANIMAL_GOOD_3
ID2_1234_DOG_ANIMAL_2
ID2_1234_DOG_ANIMAL_GOOD_0
ID2_1234_ABCD_3
ID3_4321_DOG_ANIMAL_1
ID3_4321_DOG_ANIMAL_GOOD_4
ID3_4321_DOG_3
I want to filter the file to get the lines that satisfy a condition. For example, the code below should keep the names that contain CAT and GOOD, but only when no name with the same subject_id and the same number num contains DOG and GOOD. However, the code does not produce my expected output. How should I fix it?
This is my code:
with open("./cat_dog.txt", 'r') as f:
    files_list = [line.rstrip('\n') for line in f]
file_filter = []
for i, cat in enumerate(files_list):
    if 'GOOD' in cat and 'CAT' in cat:
        subject_id = cat.split('_')[0]
        num_id = cat.split('_')[1]
        subject_num = subject_id + '_' + num_id
        for j, dog in enumerate(files_list):
            if subject_num in dog and 'GOOD' in dog:
                if 'GOOD' in dog and 'DOG' in dog:
                    continue;
                else:
                    file_filter.append(cat)
The current output is:
ID1_0123_CAT_ANIMAL_GOOD_3
ID2_1234_CAT_ANIMAL_GOOD_3
While the expected output is:
ID1_0123_CAT_ANIMAL_GOOD_3
Your code is wrong. Consider what happens when the line ID2_1234_CAT_ANIMAL_GOOD_3 is checked in the inner loop:
subject_id = cat.split('_')[0]  # ID2
num_id = cat.split('_')[1]  # 1234
subject_num = subject_id + '_' + num_id  # ID2_1234
for j, dog in enumerate(files_list):
    # when dog is the line ID2_1234_CAT_ANIMAL_GOOD_3
    if subject_num in dog and 'GOOD' in dog:  # this is true
        if 'GOOD' in dog and 'DOG' in dog:  # this is false
            continue;
        else:
            file_filter.append(cat)  # then it outputs it
The problem is that every line with GOOD and CAT in it will "match itself" in the inner loop: it contains subject_num and GOOD but not DOG, so the else branch appends it regardless of what the other lines say.
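A minimal sketch of one way to repair the nested-loop version, assuming the goal is to keep a GOOD CAT line only when no GOOD DOG line shares its subject_id and num. The function name filter_good_cats and the inline sample list are illustrative and not part of the original post.

```python
def filter_good_cats(lines):
    """Keep GOOD CAT lines whose subject_id/num prefix has no GOOD DOG line."""
    result = []
    for cat in lines:
        if 'GOOD' in cat and 'CAT' in cat:
            # Prefix such as "ID2_1234_"; the trailing underscore keeps
            # "ID2_1234" from also matching a longer number like "ID2_12345".
            prefix = '_'.join(cat.split('_')[:2]) + '_'
            has_good_dog = any(
                other.startswith(prefix) and 'GOOD' in other and 'DOG' in other
                for other in lines
            )
            # Decide once per CAT line, after looking at every other line.
            if not has_good_dog:
                result.append(cat)
    return result


sample = [
    'ID1_0123_CAT_ANIMAL_3',
    'ID1_0123_CAT_ANIMAL_GOOD_3',
    'ID1_0123_ABC_3',
    'ID2_1234_CAT_ANIMAL_3',
    'ID2_1234_CAT_ANIMAL_GOOD_3',
    'ID2_1234_DOG_ANIMAL_2',
    'ID2_1234_DOG_ANIMAL_GOOD_0',
    'ID2_1234_ABCD_3',
    'ID3_4321_DOG_ANIMAL_1',
    'ID3_4321_DOG_ANIMAL_GOOD_4',
    'ID3_4321_DOG_3',
]
print(filter_good_cats(sample))  # ['ID1_0123_CAT_ANIMAL_GOOD_3']
```

The key difference from the original loop is that the append happens once, after the whole list has been scanned, instead of once per matching line.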
IMHO I'd use itertools.groupby. Something along the lines of:
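from itertools import groupby

def group_key(line):
    # Group lines by (subject_id, num), e.g. ('ID2', '1234')
    return tuple(line.split('_')[:2])

file_filter = []
for _, lines in groupby(sorted(files_list, key=group_key), key=group_key):
    good_lines = [line for line in lines if 'GOOD' in line]
    # Keep the group only if its single GOOD line is a CAT line
    if len(good_lines) == 1 and 'CAT' in good_lines[0]:
        file_filter.append(good_lines[0])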
This should also be more efficient, O(n log n) vs. O(n^2), although it needs all the contents of the file in RAM.
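To see what the grouping step produces, here is a small self-contained sketch on a shortened, deliberately unsorted version of the sample lines. groupby only merges adjacent items with equal keys, which is why the list is sorted with the same key function first. The names group_key, sample, and grouped are illustrative.

```python
from itertools import groupby


def group_key(line):
    # (subject_id, num), e.g. ('ID2', '1234')
    return tuple(line.split('_')[:2])


sample = [
    'ID2_1234_DOG_ANIMAL_GOOD_0',
    'ID1_0123_CAT_ANIMAL_GOOD_3',
    'ID2_1234_CAT_ANIMAL_GOOD_3',
    'ID1_0123_ABC_3',
]

# Sort by the grouping key so that lines of the same subject are adjacent,
# then materialize each group (groupby yields one-shot iterators).
grouped = {
    key: list(lines)
    for key, lines in groupby(sorted(sample, key=group_key), key=group_key)
}

for key, lines in grouped.items():
    print(key, lines)
```

Because sorted is stable, lines inside each group keep their original relative order.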
If you have "classes" other than CAT and DOG, and you want to output all GOOD CAT lines unless the same subject_id and num also has a GOOD DOG line, you can modify the body of the groupby loop in this way:
is_good_cat = any('CAT' in line for line in good_lines)
is_good_dog = any('DOG' in line for line in good_lines)
if is_good_cat and not is_good_dog:
    file_filter.extend(line for line in good_lines if 'CAT' in line)
(You need to use .extend and the loop because we no longer know which line is the one to write, so you have to filter them.)
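Putting the pieces together, the variant with extra classes might look like the sketch below. The function name good_cats_without_good_dog and the line ID1_0123_ABC_GOOD_1 are hypothetical, added only to show a group that has a GOOD line of a third class.

```python
from itertools import groupby


def group_key(line):
    # (subject_id, num), e.g. ('ID1', '0123')
    return tuple(line.split('_')[:2])


def good_cats_without_good_dog(lines):
    """Return GOOD CAT lines from groups that contain no GOOD DOG line."""
    kept = []
    for _, group in groupby(sorted(lines, key=group_key), key=group_key):
        good_lines = [line for line in group if 'GOOD' in line]
        is_good_cat = any('CAT' in line for line in good_lines)
        is_good_dog = any('DOG' in line for line in good_lines)
        if is_good_cat and not is_good_dog:
            # Other GOOD classes may be present, so keep only the CAT lines.
            kept.extend(line for line in good_lines if 'CAT' in line)
    return kept


sample = [
    'ID1_0123_CAT_ANIMAL_GOOD_3',
    'ID1_0123_ABC_GOOD_1',  # hypothetical line with a third class
    'ID2_1234_CAT_ANIMAL_GOOD_3',
    'ID2_1234_DOG_ANIMAL_GOOD_0',
    'ID3_4321_DOG_ANIMAL_GOOD_4',
]
print(good_cats_without_good_dog(sample))  # ['ID1_0123_CAT_ANIMAL_GOOD_3']
```

With the earlier len(good_lines) == 1 check, the ID1 group would have been dropped because it has two GOOD lines; here it is kept because neither of them is a DOG line.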