
Python - Search a list of strings from a text file in a large .csv file

Apologies, as I am probably making a whole host of errors here, but I am trying to search a large CSV file using a list of strings from a file (justgenes.txt) and return the lines that contain any of the strings from the justgenes list.

I've been working largely with BASH, but the command I have takes more than 100GB of memory and crashes:

grep -f justgenes.txt allDandHunique.csv > HPCgenesandbugs.csv

Therefore, I am attempting to do it in Python, assuming that it will be more efficient, but I have very little knowledge of it.

I am using this code (which I grabbed from the web), but I get an empty file at the end:

data = open('allDandHunique.csv')
                
with open('justgenes.txt', "r+") as file1:
    fileline1= file1.readlines()
    for x in data: # <--- Loop through the list to check      
        for line in fileline1: # <--- Loop through each line
            if x in line:
                 print(x)

The justgenes file looks like this:

1A0N_B
1A1A_A
1A4I_A
1A5Y_A
1ACO_A
1AGN_A
1AGS_A
1AJE_A
1AJJ_A
1AP0_A
1APQ_A

whilst the CSV looks like this:

"0403181A:PDB=1BP2,2BPP",
"0403181A:PDB=1BP2,2BPP",,,
"0706243A:PDB=1HOE,2AIT,3AIT,4AIT",
"0706243A:PDB=1HOE,2AIT,3AIT,4AIT",,,
"1309311A:PDB=1EMD,2CMD",
"1309311A:PDB=1EMD,2CMD",,,
"1513188A:PDB=1BBC,1POD",
"1513188A:PDB=1BBC,1POD",,,
0308206A,
0308206A,,,
0308221A,
0308221A,,,
0308230A,
0308230A,,,

Any help would be gratefully received.

I would use pandas to accomplish this.

Try something like:

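import pandas as pd

# The sample has no header row and up to four fields per line (an assumption
# taken from the excerpt); dtype=str keeps every ID as text
df = pd.read_csv('allDandHunique.csv', header=None, names=list(range(4)), dtype=str)

with open('justgenes.txt') as file1:
    for x in file1:
        gene = x.strip()  # drop the trailing newline
        if not gene:
            continue
        # Rows in which any column contains the gene as a literal substring
        hits = df[df.apply(lambda col: col.str.contains(gene, regex=False, na=False)).any(axis=1)]
        if not hits.empty:
            print(hits)  # do something with the matching rows here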

If you are getting a blank file when reading the file in, I would check and make sure the path is correct.
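One way to rule out a wrong path or an empty input before running the search is a quick check like the one below. This is only a sketch: describe_input is a hypothetical helper name, and the two file names are the ones from the question.

```python
from pathlib import Path


def describe_input(path):
    """Return a short status string for an input file (hypothetical helper)."""
    p = Path(path)
    if not p.is_file():
        # Relative paths are resolved against the current working directory
        return f"{p}: not found (current directory is {Path.cwd()})"
    size = p.stat().st_size
    if size == 0:
        return f"{p}: exists but is empty"
    return f"{p}: {size} bytes"


# File names taken from the question
for name in ("justgenes.txt", "allDandHunique.csv"):
    print(describe_input(name))
```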

Since I don't have the files, I couldn't test it myself, but I assume this code might help.

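with open('allDandHunique.csv') as data:
    for x in data:  # <--- Loop through each line of the large CSV
        # Reopen the gene list for every CSV line
        with open('justgenes.txt') as file1:
            for line in file1:  # <--- Loop through each gene ID
                gene = line.strip()  # drop the trailing newline
                # Test whether the gene occurs in the CSV line, not the reverse
                if gene and gene in x:
                    print(x, end='')
                    break  # print each matching CSV line only once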

For each x in data you have to loop over all lines in file1. If I'm not wrong, you need to reopen the file for each iteration; otherwise, once the file object has reached EOF, iterating over it again returns nothing.
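To illustrate the EOF behaviour mentioned above: a file object yields its lines only once, whereas the list returned by readlines() can be looped over repeatedly. A small sketch, using io.StringIO as a stand-in for the real file (an assumption made so that the snippet runs without the data):

```python
import io

# io.StringIO stands in for open('justgenes.txt'); it behaves like a text file object
file1 = io.StringIO("1A0N_B\n1A1A_A\n1A4I_A\n")

first_pass = [line.strip() for line in file1]
second_pass = [line.strip() for line in file1]  # the file object is already at EOF

print(first_pass)   # ['1A0N_B', '1A1A_A', '1A4I_A']
print(second_pass)  # []

# Rewinding (or reopening) makes the lines available again
file1.seek(0)
genes = file1.readlines()  # a plain list, independent of the file position

# A list can be looped over as many times as needed
counts = [sum(1 for _ in genes) for _ in range(2)]
print(counts)  # [3, 3]
```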

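import csv

with open('justgenes.txt') as infile:
    # Strip newlines and skip blank lines (an empty target would match every row)
    searchTargets = set(line.strip() for line in infile if line.strip())


# newline='' is how the csv module expects its input files to be opened
with open('allDandHunique.csv', newline='') as infile:
    for row in csv.reader(infile):
        # Substring test against each field, like grep; "target in row" would
        # only match a field that equals the target exactly
        if any(target in field for field in row for target in searchTargets):
            print(row)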
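The same search can also be written as a single streaming pass that writes the matching lines to a file, which is closer to what the grep command in the question was meant to do. The following is a sketch rather than a tested solution for the real data: filter_lines is a hypothetical helper name, and it assumes that plain substring matching (as with grep) is the desired behaviour.

```python
import re


def filter_lines(targets_path, csv_path, out_path):
    """Copy every line of csv_path that contains any target string to out_path.

    Hypothetical helper; returns the number of lines written.
    """
    with open(targets_path) as infile:
        # Strip newlines and drop blank lines
        targets = {line.strip() for line in infile if line.strip()}
    if not targets:
        open(out_path, "w").close()  # nothing to search for: empty output
        return 0
    # re.escape turns each target into a literal, so characters such as "."
    # in an ID are not interpreted as regex syntax
    pattern = re.compile("|".join(re.escape(t) for t in sorted(targets)))
    written = 0
    with open(csv_path) as src, open(out_path, "w") as dst:
        for line in src:  # only one CSV line is held in memory at a time
            if pattern.search(line):
                dst.write(line)
                written += 1
    return written
```

With the file names from the question, the call would be filter_lines('justgenes.txt', 'allDandHunique.csv', 'HPCgenesandbugs.csv').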
