
Python: compare 2 CSV files in parallel

I am trying to compare 2 CSV files, each containing 100,000 rows and 10 columns. When I run the code below it works, but it uses only one CPU thread while I have 8 cores. I want this code to use all CPU threads. I searched and found the idea of running the loop in parallel, but when I tried to apply it to the for loop in this Python code, it did not work. How can I apply parallelism to this code? Thank you in advance for your help!

import csv

# open csv files
f1 = open('host.csv', 'r')
f2 = open('master.csv', 'r')
f3 = open('results.csv', 'w')

c1 = csv.reader(f1)
c2 = csv.reader(f2)
next(c2, None)  # skip the header row of master.csv
c3 = csv.writer(f3)

# read the master file into memory so it can be rescanned for every host row
master_list = list(c2)

# outer loop: one pass over the host csv file
for row in c1:
    found = False
    results_row = row
    colA = str(row[0])   # protocol
    colB = str(row[11])
    colC = str(row[12])
    colD = str(row[13])
    colE = str(row[14])
    # inner loop: scan each row of the master csv file
    for master_row in master_list:
        colBf2 = str(master_row[4])
        colCf2 = str(master_row[5])
        colDf2 = str(master_row[6])
        colEf2 = str(master_row[7])
        colFf2 = str(master_row[3])
        # check condition
        if colA == 'icmp':
            # sub-condition for icmp rows
            if colB == colBf2 and colD == colDf2:
                results_row.append(colFf2)
                found = True
                break
        else:
            if colB == colBf2 and colD == colDf2 and colE == colEf2:
                results_row.append(colFf2)
                found = True
                break
    if not found:
        results_row.append('Not Match')
    c3.writerow(results_row)

f1.close()
f2.close()
f3.close()

The expensive task is the inner loop that rescans the master table for each host row. Because of CPython's global interpreter lock (search for "python GIL"), only one thread executes Python bytecode at a time, so multiple threads will not speed up a CPU-bound operation. You could spawn subprocesses instead, but then you have to weigh the cost of shipping the data to the worker processes against the speed gain.
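
If you do want to go the subprocess route, a minimal sketch with multiprocessing.Pool could look like the following. It assumes the same column layout as your code, and the helper names _init and match_row are made up for this example. Note that every worker receives its own copy of the master list, which is exactly the data-shipping cost mentioned above:

import csv
from multiprocessing import Pool

master_list = None

def _init(master):
    # runs once in each worker process; stores that worker's copy of the master list
    global master_list
    master_list = master

def match_row(row):
    # brute-force scan of the master list for a single host row
    for m in master_list:
        if row[0] == 'icmp':
            if row[11] == m[4] and row[13] == m[6]:
                return row + [m[3]]
        elif row[11] == m[4] and row[13] == m[6] and row[14] == m[7]:
            return row + [m[3]]
    return row + ['Not Match']

if __name__ == '__main__':
    with open('master.csv') as f2:
        master = list(csv.reader(f2))[1:]   # skip the header row
    with open('host.csv') as f1:
        host_rows = list(csv.reader(f1))
    with Pool(initializer=_init, initargs=(master,)) as pool:
        results = pool.map(match_row, host_rows, chunksize=1000)
    with open('results.csv', 'w', newline='') as f3:
        csv.writer(f3).writerows(results)

Even in the best case this buys you roughly a factor of the core count (8x here); the indexing approach below removes the inner scan entirely and wins by far more.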

Or, better: optimize the algorithm. Instead of running the scan in parallel, index the master file. That exchanges an expensive scan of 100,000 records per host row for a single dictionary lookup.

I took the liberty of adding with clauses to your code to save a few lines, and skipped breaking out colA etc. (using named column indexes instead) to keep the code small.

import csv

# columns of interest
A, B, C, D, E, F = 0, 11, 12, 13, 14, 3

# read and index column F in master by (B,D) and (B,D,E), discarding
# duplicates for those keys
col_index = {}
with open('master.csv') as master:
    next(master)
    for row in csv.reader(master):
        key = row[B], row[D]
        if key not in col_index:
            col_index[key] = row[F]
        key = row[B], row[D], row[E]
        if key not in col_index:
            col_index[key] = row[F]

# read host rows, look each one up in the index, write results
with open('host.csv') as f1, open('results.csv', 'w') as f3:
    c1 = csv.reader(f1)
    c3 = csv.writer(f3)
    for row in c1:
        if row[A] == "icmp":
            indexer = (row[B], row[D])
        else:
            indexer = (row[B], row[D], row[E])
        row.append(col_index.get(indexer, 'Not Match'))
        c3.writerow(row)
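
One nice property of this scheme: a two-element tuple can never equal a three-element tuple, so the icmp and non-icmp entries live safely in the same dictionary, and each host row now costs one constant-time lookup instead of a scan of up to 100,000 master rows.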
