
python parallel compare 2 csv files

I am trying to compare 2 CSV files, each containing 100000 rows and 10 columns. The code below works, but it uses only one CPU thread while I have 8 cores. I want this code to use all CPU threads. I searched and found the idea of parallelism, but when I tried to apply parallelism to the for loop in this Python code, it did not work. How can I parallelize this code? Thank you in advance for your help!

import csv

# read csv files
f1 = open('host.csv', 'r')
f2 = open('master.csv', 'r')
f3 = open('results.csv', 'w')

c1 = csv.reader(f1)
c2 = csv.reader(f2)
next(c2, None)  # skip the header row of master.csv
c3 = csv.writer(f3)

# for loop: compare each row in the host csv file
master_list = list(c2)
for row in c1:
    found = False
    results_row = row
    colA = str(row[0])   # protocol
    colB = str(row[11])
    colC = str(row[12])
    colD = str(row[13])
    colE = str(row[14])
    # loop over each row of the master csv file
    for master_row in master_list:
        colBf2 = str(master_row[4])
        colCf2 = str(master_row[5])
        colDf2 = str(master_row[6])
        colEf2 = str(master_row[7])
        colFf2 = str(master_row[3])
        # check condition
        if colA == 'icmp':
            # sub condition
            if colB == colBf2 and colD == colDf2:
                results_row.append(colFf2)
                found = True
                break
        else:
            if colB == colBf2 and colD == colDf2 and colE == colEf2:
                results_row.append(colFf2)
                found = True
                break
    if not found:
        results_row.append('Not Match')
    c3.writerow(results_row)
f1.close()
f2.close()
f3.close()

The expensive task is the inner loop that rescans the master table for each host row. Because of CPython's global interpreter lock (you can search "python GIL"), only one thread executes Python bytecode at a time, so multiple threads will not speed up a CPU-bound operation. You could spawn subprocesses instead, but then you have to weigh the cost of getting the data to the worker processes against the speed gain.
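For reference, here is a minimal sketch of that subprocess route using multiprocessing.Pool. It is an illustration, not part of the original answer: the column numbers follow the question's code, the worker count of 8 matches the asker's 8 cores, and whether the pickling overhead pays off depends on the data.

import csv
from multiprocessing import Pool

MASTER = []  # each worker gets its own copy via the initializer

def init_worker(master_list):
    global MASTER
    MASTER = master_list

def match_row(row):
    # same matching logic as the question's inner loop
    for m in MASTER:
        if row[0] == 'icmp':
            if row[11] == m[4] and row[13] == m[6]:
                return row + [m[3]]
        elif row[11] == m[4] and row[13] == m[6] and row[14] == m[7]:
            return row + [m[3]]
    return row + ['Not Match']

if __name__ == '__main__':
    with open('master.csv') as f2:
        master_reader = csv.reader(f2)
        next(master_reader, None)  # skip the header row
        master_list = list(master_reader)

    with open('host.csv') as f1:
        host_rows = list(csv.reader(f1))

    # 8 worker processes; chunksize keeps the per-task overhead down
    with Pool(processes=8, initializer=init_worker,
              initargs=(master_list,)) as pool:
        results = pool.map(match_row, host_rows, chunksize=1000)

    with open('results.csv', 'w', newline='') as f3:
        csv.writer(f3).writerows(results)

Even so, each worker still does a linear scan per host row, so on this workload the indexed version below will almost certainly beat it.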

Alternatively, optimize your code instead of running it in parallel: index the master table. That trades an expensive scan of 100000 records per host row for a single dictionary lookup.

I took the liberty of adding with clauses to your code to save a few lines, and skipped breaking out colA etc. (using named indexes instead) to keep the code small.

import csv

# columns of interest in host.csv
A, B, C, D, E = 0, 11, 12, 13, 14
# corresponding columns in master.csv (matched against B, D, E; Fm is
# the result column)
Bm, Dm, Em, Fm = 4, 6, 7, 3

# read master and index column Fm by (B,D) and (B,D,E), discarding
# duplicates for those keys
col_index = {}
with open('master.csv') as master:
    next(master)  # skip the header row
    for row in csv.reader(master):
        key = row[Bm], row[Dm]
        if key not in col_index:
            col_index[key] = row[Fm]
        key = row[Bm], row[Dm], row[Em]
        if key not in col_index:
            col_index[key] = row[Fm]

# read host rows, look each one up in the index and write the result
with open('host.csv') as f1, open('results.csv', 'w', newline='') as f3:
    c1 = csv.reader(f1)
    c3 = csv.writer(f3)
    for row in c1:
        if row[A] == "icmp":
            indexer = (row[B], row[D])
        else:
            indexer = (row[B], row[D], row[E])
        row.append(col_index.get(indexer, 'Not Match'))
        c3.writerow(row)
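One caveat in the index above: when several master rows share the same key, only the first one wins. If you need every match, a small variant of the index (a sketch, not part of the original answer) keeps them all:

import csv
from collections import defaultdict

Bm, Dm, Em, Fm = 4, 6, 7, 3  # master columns, as above

# keep a list of all column-Fm values seen for each key
col_index = defaultdict(list)
with open('master.csv') as master:
    next(master)  # skip the header row
    for row in csv.reader(master):
        col_index[(row[Bm], row[Dm])].append(row[Fm])
        col_index[(row[Bm], row[Dm], row[Em])].append(row[Fm])

# at lookup time, join the matches (or fall back to 'Not Match'):
# row.append(';'.join(col_index.get(indexer, ['Not Match'])))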
