简体   繁体   English

在PYTHON中的两个CSV文件中查找公共区域

[英]Find Common Region in two CSV File in PYTHON

I have two CSV files with 10 columns each where the first column is called the "Primary Key". 我有两个CSV文件,每个文件有10列,其中第一列称为“主键”。

I need to use Python to find the common region between the two CSV files. 我需要使用Python查找两个CSV文件之间的公共区域。 For example, I should be able to detect that rows 27-45 in CSV1 is equal to rows 125-145 in CSV2 and so on. 例如,我应该能够检测到CSV1中的第27-45行等于CSV2中的第125-145行,依此类推。

I am only comparing the Primary Key (Column One). 我只比较主键(第一列)。 The rest of the data is not considered for comparison. 其余数据不考虑进行比较。 I need to extract these common regions in two separate CSV files (one for CSV1 and one for CSV2). 我需要在两个单独的CSV文件中提取这些公共区域(一个用于CSV1,一个用于CSV2)。

I have already parsed and stored the rows of the two CSV files in two 'list of lists', lstCAN_LOG_TABLE and lstSHADOW_LOG_TABLE , so the problem reduces down to comparing these two list of lists. 我已经解析了两个CSV文件的行并将其存储在两个“列表列表” lstCAN_LOG_TABLElstSHADOW_LOG_TABLE ,因此该问题可以简化为比较这两个列表列表。

I am currently assuming is that if there are 10 subsequent matches ( MAX_COMMON_THRESHOLD ), I have reached the beginning of a common region. 我目前假设的是,如果以后有10个匹配项( MAX_COMMON_THRESHOLD ),那么我已经到达一个公共区域的开始。 I must not log single rows (comparing to true) because there would be regions equal (As per primary key) and those regions I need to identify. 我不能记录单行(与true相比),因为会有相等的区域(按主键)和需要标识的区域。

for index in range(len(lstCAN_LOG_TABLE)):
    for l_index in range(len(lstSHADOW_LOG_TABLE)):
        if(lstSHADOW_LOG_TABLE[l_index][1] == lstCAN_LOG_TABLE[index][1]):  #Consider for comparison only CAN IDs
            index_can_log = index                                           #Position where CAN Log is to be compared
            index_shadow_log = l_index                                      #Position from where CAN Shadow Log is to be considered
            start = index_shadow_log
            if((index_shadow_log + MAX_COMMON_THRESHOLD) <= (input_file_two_row_count-1)):
                end = index_shadow_log + MAX_COMMON_THRESHOLD
            else:
                end = (index_shadow_log) + ((input_file_two_row_count-1) - (index_shadow_log))
            can_index = index
            bPreScreened = 1
            for num in range(start,end):
                if(lstSHADOW_LOG_TABLE[num][1] == lstCAN_LOG_TABLE[can_index][1]):
                    if((can_index + 1) < (input_file_one_row_count-1)):
                        can_index = can_index + 1                           
                    else:
                        break   
                else:
                    bPreScreened = 0
                    print("No Match")
                    break
            #we might have found start of common region         
            if(bPreScreened == 1):      
                print("Start={0} End={1} can_index={2}".format(start,end,can_index))
                for number in range(start,end):
                    if(lstSHADOW_LOG_TABLE[number][1] == lstCAN_LOG_TABLE[index][1]):                           
                        writer_two.writerow(lstSHADOW_LOG_TABLE[number][0])
                        writer_one.writerow(lstCAN_LOG_TABLE[index][0])
                        if((index + 1) < (input_file_one_row_count-1)):
                            index = index + 1                           
                        else:
                            dump_file.close()   
                            print("\nCommon Region in Two CSVs identifed and recorded\n")                           
                            return
dump_file.close()   
print("\nCommon Region in Two CSVs identifed and recorded\n")

I am getting strange output. 我得到奇怪的输出。 Even the first CSV file has only 1880 Rows but in the common region CSV for the first CSV I am getting many more entries. 即使第一个CSV文件也只有1880行,但是在第一个CSV的公共区域CSV中,我得到了更多的条目。 I am not getting desired output. 我没有得到想要的输出。

EDITED FROM HERE 从这里编辑

CSV1: CSV1:

216 0.000238225 F4  41  C0  FB  28  0   0   0   MS CAN
109 0.0002256   15  8B  31  0   8   43  58  0   HS CAN
216 0.000238025 FB  47  C6  1   28  0   0   0   MS CAN
340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

CSV2: CSV2:

216 0.0002361   0F  5C  DB  14  28  0   0   0   MS CAN
216 0.000236225 16  63  E2  1B  28  0   0   0   MS CAN
109 0.0001412   16  A3  31  0   8   63  58  0   HS CAN
216 0.000234075 1C  6A  E9  22  28  0   0   0   MS CAN
40A 0.000259925 C1  1   46  54  30  44  47  36  HS CAN
4A  0.000565975 2   0   0   0   0   0   0   C0  MS CAN
340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

EXPECTED OUTPUT CSV1: 预期输出CSV1:

340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

EXPECTED OUTPUT CSV2: 预期输出CSV2:

340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

OBSERVED OUTPUT CSV1 观察到的输出CSV1

340 0.000240175 0A  18  0   C2  0   0   6F  FF  MS CAN
216 0.000240225 24  70  EF  28  28  0   0   0   MS CAN
216 0.000236225 2B  77  F7  2F  28  0   0   0   MS CAN
216 0.0002278   31  7D  FD  35  28  0   0   0   MS CAN

And many thousands of redundant row data 以及数千个冗余行数据

EDITED - SOLVED AS PER ADVICE (CHANGED FOR TO WHILE): 已编辑-已按建议解决(更改为白色):

LEARNING: In Python FOR Loop Index cannot be changed at RunTime 学习: 在Python中,无法在运行时更改FOR循环索引

dump_file=open("MATCH_PATTERN.txt",'w+')
print("Number of Entries CAN LOG={0}".format(len(lstCAN_LOG_TABLE)))
print("Number of Entries SHADOW LOG={0}".format(len(lstSHADOW_LOG_TABLE)))  
index = 0   
while(index < (input_file_one_row_count - 1)):
    l_index = 0
    while(l_index < (input_file_two_row_count - 1)):
        if(lstSHADOW_LOG_TABLE[l_index][1] == lstCAN_LOG_TABLE[index][1]):  #Consider for comparison only CAN IDs
            index_can_log = index                                           #Position where CAN Log is to be compared
            index_shadow_log = l_index                                      #Position from where CAN Shadow Log is to be considered
            start = index_shadow_log
            can_index = index
            if((index_shadow_log + MAX_COMMON_THRESHOLD) <= (input_file_two_row_count-1)):
                end = index_shadow_log + MAX_COMMON_THRESHOLD
            else:
                end = (index_shadow_log) + ((input_file_two_row_count-1) - (index_shadow_log))              
            bPreScreened = 1
            for num in range(start,end):
                if(lstSHADOW_LOG_TABLE[num][1] == lstCAN_LOG_TABLE[can_index][1]):                      
                    if((can_index + 1) < (input_file_one_row_count-1)):
                        can_index = can_index + 1                           
                    else:
                        break   
                else:
                    bPreScreened = 0
                    break
            #we might have found start of common region         
            if(bPreScreened == 1):      
                print("Shadow Start={0} Shadow End={1} CAN INDEX={2}".format(start,end,index))
                for number in range(start,end):
                    if(lstSHADOW_LOG_TABLE[number][1] == lstCAN_LOG_TABLE[index][1]):                           
                        writer_two.writerow(lstSHADOW_LOG_TABLE[number][0])
                        writer_one.writerow(lstCAN_LOG_TABLE[index][0])
                        if((index + 1) < (input_file_one_row_count-1)):
                            index = index + 1
                        if((l_index + 1) < (input_file_two_row_count-1)):
                            l_index = l_index + 1                               
                        else:
                            dump_file.close()   
                            print("\nCommon Region in Two CSVs identifed and recorded\n")                           
                            return
            else:
                l_index = l_index + 1
        else:
            l_index = l_index + 1
    index = index + 1   
dump_file.close()   
print("\nCommon Region in Two CSVs identifed and recorded\n")

index is the iterator in your for loop. indexfor循环中的迭代器。 Should you changed it inside the loop, it will be reassigned after each iteration. 如果您在循环内更改了它,它将在每次迭代后重新分配。

Say, when index = 5 in your for loop, and index += 1 was execute 3 times. 假设在for循环中index = 5index += 1被执行3次。 Now index = 8 . 现在index = 8 But after this iteration ends, when your code go back to for , index will be assigned to index x = 6 . 但是,在此迭代结束之后,当您的代码返回到for ,会将index x = 6分配给index x = 6

Try following example: 请尝试以下示例:

for index in range(0,5):
    print 'iterator:', index
    index = index + 2
    print 'index:', index

The output will be: 输出将是:

iterator: 0
index: 2
iterator: 1
index: 3
iterator: 2
index: 4
iterator: 3
index: 5
iterator: 4
index: 6

To fix this issue, you might want to change your for loop to while loop 要解决此问题,您可能需要将for循环更改for while循环

EDIT: If I didn't understand wrong, you were trying to find 'same' columns in two files and store them. 编辑:如果我不明白错误,您正在尝试在两个文件中查找“相同”列并将其存储。 If this is the case, actually your work can be done easily by using following code: 如果是这种情况,实际上可以使用以下代码轻松完成您的工作:

import csv # import csv module to read csv files

file1 = 'csv1.csv' # input file 1
file2 = 'csv2.csv' # input file 2
outfile = 'csv3.csv' # only have one output file since two output files will be the same

read1 = csv.reader(open(file1, 'r')) # read input file 1
write = csv.writer(open(outfile, 'w')) # write to output file

# for each row in input file 1, compare it with each row in input file 2
# if they are the same, write that row into output file
for row1 in read1:
    read2 = csv.reader(open(file2, 'r'))
    for row2 in read2:
        if row1 == row2:
            write.writerow(row1)

read1.close()
write.close()

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM