I have two files. The first is a list of items, one per line. The second is a TSV file with many items per line, so some lines in the second file may contain items that are listed in the first file. I need to generate a list of the lines from the second file that contain items listed in the first file.
grep -f is being finicky for me, so I decided to write my own Python script. This is what I came up with.
The big list is the second file; the tiny list is the first file.
    def main():
        desired_subset = []
        small_list = open('tiny_list.txt','r')
        big_list = open('big_list.tsv','r')
        for i in small_list.readlines():
            i = i.rstrip('\n')
            for big_line in big_list:
                if i in big_line:
                    if i not in desired_subset:
                        desired_subset.append(big_line)
        print(desired_subset)
        print(len(desired_subset))
    main()
The problem is that the for loop is only reading through the first line. Any suggestions?
When you iterate over a file object (here, big_list), you consume it, so on the second iteration over small_list there is nothing left to read from big_list. Try reading big_list with .readlines() into a list before the main for loop, and iterate over that list instead:
    def main():
        desired_subset = []
        small_list = open('tiny_list.txt','r')
        big_list = open('big_list.tsv','r').readlines() # note here
        for i in small_list.readlines():
            i = i.rstrip('\n')
            for big_line in big_list:
                if i in big_line:
                    # compare the line, not the item, so a line is added only once
                    if big_line not in desired_subset:
                        desired_subset.append(big_line)
        print(desired_subset)
        print(len(desired_subset))
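To see the exhaustion in isolation, here is a minimal sketch. It uses io.StringIO as a stand-in for an open text file, and the sample contents are made up for illustration:

```python
import io

# io.StringIO behaves like an open text file; the contents are made up.
fake_file = io.StringIO("a\tb\nc\td\n")

first_pass = [line for line in fake_file]
second_pass = [line for line in fake_file]  # already at end of file: nothing left

# Rewinding with seek(0) makes the lines available again.
fake_file.seek(0)
third_pass = [line for line in fake_file]

print(len(first_pass), len(second_pass), len(third_pass))
```

Calling big_list.seek(0) before each inner loop would therefore also work, but reading the lines into a list once avoids re-reading the file for every item.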
Also, you never close your files, which is bad practice. I'd suggest using a context manager (open the files in a with statement) so they are closed automatically:
    def main():
        desired_subset = []
        # the backslash continues the with statement onto the next line
        with open('tiny_list.txt','r') as small_list, \
             open('big_list.tsv','r') as big_list:
            small_file_lines = small_list.readlines()
            big_file_lines = big_list.readlines()
        for i in small_file_lines:
            i = i.rstrip('\n')
            for big_line in big_file_lines:
                if i in big_line:
                    # compare the line, not the item, so a line is added only once
                    if big_line not in desired_subset:
                        desired_subset.append(big_line)
        print(desired_subset)
        print(len(desired_subset))
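If the big file is large, one possible variation is to keep only the small list in memory and stream the big file once. The sketch below is an assumption about what is wanted, not part of the original answer: matching_lines is a hypothetical helper name, it keeps the same substring matching as the code above, and it returns lines in big-file order rather than item order.

```python
def matching_lines(items, lines):
    """Return each line that contains at least one item, once, in input order."""
    # An empty string is a substring of every line, so skip blank entries.
    items = [item for item in items if item]
    return [line for line in lines if any(item in line for item in items)]


def main():
    with open('tiny_list.txt') as small_list:
        items = [line.rstrip('\n') for line in small_list]
    with open('big_list.tsv') as big_list:
        # The file object is iterated exactly once, so it is never exhausted early.
        subset = matching_lines(items, big_list)
    print(subset)
    print(len(subset))
```

Because each line of the big file is visited only once, no duplicate check is needed. Note that a blank line in tiny_list.txt would otherwise match every line, which is why empty items are filtered out.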