I am trying to compare two large CSV files containing chemical data.
The first one, "file1", is 14 MB (which is not that large), but the second one, "file2", is 3 GB (47,798,771 lines).
Here is a sample of file1 (we'll focus on the fourth column, which contains InChIKeys):
MFCD00134034 7440-42-8 B UORVGPXVDQYIDP-UHFFFAOYSA-N
MFCD01745487 64719-89-7 B1BBBB(BBBBB1[Li])[Li] XZXJQLAKEUKXOT-UHFFFAOYSA-N
MFCD01310566 19287-45-7 BB QSJRRLWJRLPVID-UHFFFAOYSA-N
MFCD00011323 10035-10-6 Br CPELXLSAUQHCOX-UHFFFAOYSA-N
N(CCNCCCCCCCCCCNCCN(CC)CC)(CC)CC PISAWRHWZGEVPP-UHFFFAOYSA-N
MFCD01744969 137638-86-9 O(C(=O)C(c1ccccc1)c1ccccc1)c1cc2c(C[C@H]3N(CC[C@]2(C)C3C)Cc2ccccc2)cc1 CIRJJEXKLBHURV-MAYWEXTGSA-N
O(CCCN1CCCC1)c1ccc(NC(=Nc2ccccc2)c2ccccc2)cc1 KETUBKLQEXFJBX-UHFFFAOYSA-N
MFCD01694581 3810-31-9 S(CCN(CCSC(N)=N)CCSC(N)=N)C(N)=N GGDUORJVTMUGNU-UHFFFAOYSA-N
MFCD06794992 60066-94-6 Brc1cc(C(=O)c2ncccc2)c(NC(=O)CNC(=O)[C@@H](N)CCCCN)cc1 NVOGGKXDMDDFEG-HNNXBMFYSA-N
MFCD06794980 60066-98-0 Brc1cc(C(=O)c2ncccc2)c(NC(=O)CNC(=O)[C@@H](N)CCCNC(N)=N)cc1 LFCYDGUHINTBOJ-AWEZNQCLSA-N
Here is a sample of file2:
lat_chemical_id stereo_chemical_id source_cid inchikey
CID100000001 CID000000001 1 RDHQFKQIGNGIED-UHFFFAOYSA-N
CID100000010 CID000000010 10 AUFGTPPARQZWDO-UHFFFAOYSA-N
CID100000100 CID000000100 100 UTIBHEBNILDQKX-UHFFFAOYSA-N
CID100001000 CID000001000 1000 ULSIYEODSMZIPX-UHFFFAOYSA-N
CID100010000 CID000010000 10000 ZPIFKCVYZBVZIV-UHFFFAOYSA-N
CID100100000 CID000100000 100000 SPTBIJLJJBZGDY-UHFFFAOYSA-N
CID101000000 CID001000000 1000000 XTNVYACQOFUTNH-UHFFFAOYSA-N
CID110000000 CID010000000 10000000 WUGPGGSZFRVGGA-UHFFFAOYSA-N
CID110000001 CID010000001 10000001 ANOUMYXLUIDQNL-UHFFFAOYSA-N
My goal is to compare the InChIKeys, the fourth column in both files, to see whether they match. When they do, I want to extract all the information from both files and write it to a third one.
Here is my (naive) code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
######################
import numpy as np
import argparse
import csv
#################################
def compare(tab_data_inchik, stitch, output):
    dt = open(tab_data_inchik, 'rb')
    st = open(stitch, 'rb')
    out = open(output, 'wb')
    data = csv.reader(dt, delimiter='\t')
    database = csv.reader(st, delimiter='\t')
    result = csv.writer(out, delimiter='\t')
    for line in data:
        for row in database:
            if line[3] == row[3]:
                result.writerow((line[0], line[1], line[2], row[0], row[1], row[2], row[3]))
    dt.close()
    st.close()
    out.close()
##############################
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("tsv1", help="Pr Data")
    parser.add_argument("tsv2", help="Database")
    parser.add_argument("output", help="output file")
    args = parser.parse_args()
    compare(args.tsv1, args.tsv2, args.output)
It seems that the program never even reaches the second iteration of the outer loop. I guess it's because the file is too large and my method is not optimized. Maybe I should use numpy.where(), but I don't see how.
Is there a way to get the information without the double loop? Thanks in advance.
Where is the problem:
In your code you are looping over millions of lines: the 3 GB file contains more than 44,000,000 lines, and, assuming a mean of 68 characters per line, the 14 MB file contains more than 205,000 rows.
The comparison line would then be executed 44,000,000 * 205,000 = 9.02*10^12 times:
if line[3] == row[3]:
A common computer can execute fewer than 10^10 low-level instructions per second on a single CPU, and a line of Python code usually takes far more than a single instruction to execute, so it would take the CPU a huge amount of time to complete.
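To put rough numbers on that (a back-of-the-envelope sketch; the 10^7 iterations per second figure is an assumption about CPython, not a measurement):

comparisons = 44000000 * 205000       # lines in file2 * lines in file1
print(comparisons)                    # 9020000000000, i.e. about 9.02 * 10^12

# CPython manages on the order of 10^7 simple loop iterations per second,
# so the double loop alone would need roughly:
print(comparisons / 10**7 / 86400.0)  # ~10.4 (days)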
Python dict data structure (hash table):
A dict is a data structure that can efficiently check whether a given key is stored in it or not, in a roughly constant, small number of CPU instructions, so lookups are very time efficient.
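A minimal illustration of the lookup pattern, using two rows from the file2 sample above:

# A dict answers "is this key stored?" in a small, roughly constant
# number of instructions, no matter how many entries it holds.
index = {
    'RDHQFKQIGNGIED-UHFFFAOYSA-N': ('CID100000001', 'CID000000001', '1'),
    'AUFGTPPARQZWDO-UHFFFAOYSA-N': ('CID100000010', 'CID000000010', '10'),
}
key = 'RDHQFKQIGNGIED-UHFFFAOYSA-N'
if key in index:         # hash lookup, not a linear scan
    print(index[key])    # ('CID100000001', 'CID000000001', '1')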
If you use something like the following, it should take less than 5 minutes to complete on a common Intel Core i5 or something similar.
data_index = dict()
for line in data:  # Index the smaller file so we store less in memory.
    data_index[line[3]] = (line[0], line[1], line[2])
for row in database:  # Stream the huge file once.
    if row[3] in data_index:
        line = data_index[row[3]]
        result.writerow((line[0], line[1], line[2], row[0], row[1], row[2], row[3]))
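For completeness, here is a sketch of the whole compare function with that indexing folded in. It keeps the question's Python 2 style ('rb'/'wb' file modes) and assumes every row has at least four tab-separated columns; the name data_index is mine:

import csv

def compare(tab_data_inchik, stitch, output):
    # Index the small file (14 MB) by InChIKey: key -> first three columns.
    with open(tab_data_inchik, 'rb') as dt:
        data_index = {}
        for line in csv.reader(dt, delimiter='\t'):
            data_index[line[3]] = (line[0], line[1], line[2])

    # Stream the 3 GB file exactly once; each lookup is a dict hit.
    with open(stitch, 'rb') as st, open(output, 'wb') as out:
        result = csv.writer(out, delimiter='\t')
        for row in csv.reader(st, delimiter='\t'):
            if row[3] in data_index:
                line = data_index[row[3]]
                result.writerow(line + (row[0], row[1], row[2], row[3]))

This reads each file exactly once, so the total work is proportional to the number of lines rather than to their product.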
If you want to learn how to use Python dicts, look here.
If you want to know how a dict does its job under the hood, you can find out here.
The issue is that when you loop through all of the rows of database for the first time, the file pointer inside of database (st) is at the end of the file, so you can't iterate again without first explicitly moving it back to the beginning of the file. This can be done using seek.
for line in data:
    st.seek(0)  # Resets the file back to the beginning
    for row in database:
        if line[3] == row[3]:
            # Write output data
            result.writerow((line[0], line[1], line[2], row[0], row[1], row[2], row[3]))
Depending on the size of database, this may not be very quick, because you are re-reading the entire file for every line in data. You may consider loading database once into memory to do your comparisons.
# Load the entire database into memory
database_rows = [row for row in database]
for line in data:
    for row in database_rows:
        if line[3] == row[3]:
            # Write output data
            result.writerow((line[0], line[1], line[2], row[0], row[1], row[2], row[3]))
The better option (since data is much smaller than database) would be to load data into memory and read database directly from the file. To do this, you would reverse the ordering of your loops.
# Load the smaller data file into memory instead
data_rows = [row for row in data]
for row in database:
    for line in data_rows:
        if line[3] == row[3]:
            # Write output data
            result.writerow((line[0], line[1], line[2], row[0], row[1], row[2], row[3]))
This solution wouldn't require you to load database into memory.
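If you take this version one step further and index data_rows by its fourth column (the same hash-table idea as in the answer above), the inner loop disappears entirely. A sketch, with data_index as a made-up name:

# Index the small file by InChIKey once, then stream database.
data_index = {line[3]: line for line in data}

for row in database:
    match = data_index.get(row[3])
    if match is not None:
        result.writerow((match[0], match[1], match[2],
                         row[0], row[1], row[2], row[3]))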