
Comparing two large CSV files and writing another one with Python

I am trying to compare two large CSV files containing chemical data.

The first one, "file1", is 14 MB (which is not so heavy), but the second one, "file2", is 3 GB (47,798,771 lines).

Here is a sample of file1 (we'll focus on the fourth column, which contains InChIKeys):

MFCD00134034    7440-42-8   B   UORVGPXVDQYIDP-UHFFFAOYSA-N
MFCD01745487    64719-89-7  B1BBBB(BBBBB1[Li])[Li]  XZXJQLAKEUKXOT-UHFFFAOYSA-N
MFCD01310566    19287-45-7  BB  QSJRRLWJRLPVID-UHFFFAOYSA-N
MFCD00011323    10035-10-6  Br  CPELXLSAUQHCOX-UHFFFAOYSA-N
        N(CCNCCCCCCCCCCNCCN(CC)CC)(CC)CC    PISAWRHWZGEVPP-UHFFFAOYSA-N
MFCD01744969    137638-86-9 O(C(=O)C(c1ccccc1)c1ccccc1)c1cc2c(C[C@H]3N(CC[C@]2(C)C3C)Cc2ccccc2)cc1  CIRJJEXKLBHURV-MAYWEXTGSA-N
        O(CCCN1CCCC1)c1ccc(NC(=Nc2ccccc2)c2ccccc2)cc1   KETUBKLQEXFJBX-UHFFFAOYSA-N
MFCD01694581    3810-31-9   S(CCN(CCSC(N)=N)CCSC(N)=N)C(N)=N    GGDUORJVTMUGNU-UHFFFAOYSA-N
MFCD06794992    60066-94-6  Brc1cc(C(=O)c2ncccc2)c(NC(=O)CNC(=O)[C@@H](N)CCCCN)cc1  NVOGGKXDMDDFEG-HNNXBMFYSA-N
MFCD06794980    60066-98-0  Brc1cc(C(=O)c2ncccc2)c(NC(=O)CNC(=O)[C@@H](N)CCCNC(N)=N)cc1 LFCYDGUHINTBOJ-AWEZNQCLSA-N

And a sample of file2:

flat_chemical_id    stereo_chemical_id  source_cid  inchikey
CID100000001    CID000000001    1   RDHQFKQIGNGIED-UHFFFAOYSA-N
CID100000010    CID000000010    10  AUFGTPPARQZWDO-UHFFFAOYSA-N
CID100000100    CID000000100    100 UTIBHEBNILDQKX-UHFFFAOYSA-N
CID100001000    CID000001000    1000    ULSIYEODSMZIPX-UHFFFAOYSA-N
CID100010000    CID000010000    10000   ZPIFKCVYZBVZIV-UHFFFAOYSA-N
CID100100000    CID000100000    100000  SPTBIJLJJBZGDY-UHFFFAOYSA-N
CID101000000    CID001000000    1000000 XTNVYACQOFUTNH-UHFFFAOYSA-N
CID110000000    CID010000000    10000000    WUGPGGSZFRVGGA-UHFFFAOYSA-N
CID110000001    CID010000001    10000001    ANOUMYXLUIDQNL-UHFFFAOYSA-N

My goal is to compare the InChIKeys, the fourth column in both files, to see if they are the same. When that's the case, I want to extract all the info (from both files) and write it to a third one.

Here's my (naive) code:

#!/usr/bin/env python
#-*- coding: utf-8 -*-
######################
import numpy as np
import argparse 
import csv 
#################################


def compare(tab_data_inchik, stitch, output):
    dt = open(tab_data_inchik, 'rb')   # file1 (14 MB)
    st = open(stitch, 'rb')            # file2 (3 GB)
    out = open(output, 'wb')
    data = csv.reader(dt, delimiter='\t')
    database = csv.reader(st, delimiter='\t')
    result = csv.writer(out, delimiter='\t')
    for line in data:
        for row in database:
            if line[3] == row[3]:      # compare the InChIKey columns
                result.writerow((line[0], line[1], line[2], row[0], row[1], row[2], row[3]))

    dt.close()
    st.close()
    out.close()

##############################""
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("tsv1", help = "Pr Data")
    parser.add_argument("tsv2", help = "Database")
    parser.add_argument("output", help = "output file")
    args = parser.parse_args()

    compare(args.tsv1,args.tsv2,args.output)

It seems that the program does not even reach the second line of the database loop; I guess it's because the file is too large and my method is not optimized. Maybe I should use numpy.where(), but I don't see how.

Is there a way to get the information without the double loop? Thanks in advance.

Where is the problem:
In your code you are looping over millions of lines: the 3 GB document contains more than 44,000,000 lines (assuming a mean of 68 characters per line), and under the same assumption the 14 MB document contains more than 205,000 rows.
That means the comparison line below would be executed 44,000,000 * 205,000 = 9.02 * 10^12 times:

if line[3] == row[3]:

A common single CPU core can only run on the order of 10^10 low-level instructions per second, and a line of Python code usually takes far more than a single instruction to execute, so it would take the CPU a huge amount of time to complete.
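To put that in perspective, here is a rough back-of-envelope estimate (the throughput figure below is an assumption for illustration, not a measurement):

comparisons = 44000000 * 205000    # ~9.02e12 executions of the comparison line
per_second = 10 ** 7               # optimistic Python-level comparisons per second
seconds = comparisons / per_second # ~9.02e5 seconds
print(seconds / 86400.0)           # ~10.4 days of pure looping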

Python dict data structure (hash table):
A dict (hash table) is a data structure that can check whether a given key has been stored in it using a small, roughly constant number of CPU instructions, no matter how many entries it holds (it is very time efficient).

If you use something like this, it would take less than 5 minutes to complete on a common Intel Core i5 or something similar:

data_index = dict()
for line in data:  # Index the smaller file so we store less in memory.
    data_index[line[3]] = (line[0], line[1], line[2])

for row in database:  # Stream the huge file exactly once.
    if row[3] in data_index:
        line = data_index[row[3]]
        result.writerow((line[0], line[1], line[2], row[0], row[1], row[2], row[3]))
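Putting it together, the whole compare function from the question could look like this (a sketch, assuming Python 3, where CSV files are opened in text mode with newline=''):

import csv

def compare(tab_data_inchik, stitch, output):
    # Index the small file (file1) by InChIKey once: column 4 -> columns 1-3.
    with open(tab_data_inchik, newline='') as dt:
        data_index = {line[3]: (line[0], line[1], line[2])
                      for line in csv.reader(dt, delimiter='\t')}

    # Stream the 3 GB file a single time, doing one O(1) lookup per row.
    with open(stitch, newline='') as st, open(output, 'w', newline='') as out:
        result = csv.writer(out, delimiter='\t')
        for row in csv.reader(st, delimiter='\t'):
            if row[3] in data_index:
                line = data_index[row[3]]
                result.writerow((line[0], line[1], line[2],
                                 row[0], row[1], row[2], row[3]))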

If you want to know how to use Python dicts and sets, or how a hash table does its job internally, see the official Python documentation.

The issue is that when you loop through all of the rows of database for the first time, the file pointer inside of database (st) is at the end of the file, so you can't iterate again without first explicitly moving it back to the beginning. This can be done using seek.

for line in data:
    st.seek(0)    # Resets the file back to the beginning
    for row in database:
        if line[3] == row[3]:
            # Write output data

A Better Solution

Depending on the size of database, this may not be very quick, due to the fact that you are reading the entire file for every line in data. You may consider loading database once into memory to do your comparisons (though note that a 3 GB file loaded as Python lists can need several times that much RAM).

# Load entire database in
database_rows = [row for row in database]

for line in data:
    for row in database_rows:
        if line[3] == row[3]:
            # Write output data

An Even Better Solution

The better option (since data is much smaller than database) would be to load data into memory and read database directly from the file. To do this, you would reverse the ordering of your loops.

data_rows = [row for row in data]

for row in database:
    for line in data_rows:
        if line[3] == row[3]:
            # Write output data

This solution wouldn't require you to load database into memory.
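Even so, this still scans all of data_rows for every row of database, an O(n*m) double loop. Since the match key is a single column, the inner loop can be replaced by a dict lookup, as in the hash-table answer above. A sketch, reusing the data, database, and result objects from the question:

# Index the small file by InChIKey (column 4) instead of rescanning it per row.
data_index = {line[3]: line for line in data}

for row in database:                # one pass over the large file
    line = data_index.get(row[3])   # O(1) average lookup replaces the inner loop
    if line is not None:
        result.writerow((line[0], line[1], line[2],
                         row[0], row[1], row[2], row[3]))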
