
Python: compare rows from a CSV and group identical results together for PDF layout comparison

I am trying to find a way to compare the layout of different PDF files. Using Tesseract, I can export the following data for specific keywords to a CSV file.

Consider this generated CSV file, which lists the left and top coordinates of each keyword, along with the keyword itself and the file name:

Left,Top,Text,File
118,174,INVOICE,file0
117,333,INVOICE,file0
119,525,BILLED,file0
119,1554,INVOICE,file0
322,1880,invoice,file0
118,174,INVOICE,file1
117,333,INVOICE,file1
119,525,BILLED,file1
119,1554,INVOICE,file1
322,1880,invoice,file1
1112,185,Invoice,file2
113,219,Invoice,file2
1112,212,Invoice,file3
113,219,Invoice,file3
113,217,Invoice,file3
118,174,INVOICE,file4
117,333,INVOICE,file4
119,525,BILLED,file4
119,1554,INVOICE,file4
322,1884,invoice,file4

My initial idea was to concatenate the first three columns of each row and compare the result with the other rows. This tells me which files match for each keyword, but I cannot work out which files have an overall matching layout of, for example, more than 80%.
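One possible way to put a number on an "overall match" is to treat each file as a set of (Left, Top, Text) placements and measure how much two sets overlap. The sketch below is only an illustration under assumptions: the sample rows are inlined as a string instead of being read from data.csv, and `layout_similarity` is a hypothetical helper that divides the number of shared placements by the size of the larger set, which is just one of several reasonable definitions.

```python
import io
from itertools import combinations

import pandas as pd

# Sample rows from the question, inlined so the sketch is self-contained
CSV = """Left,Top,Text,File
118,174,INVOICE,file0
117,333,INVOICE,file0
119,525,BILLED,file0
119,1554,INVOICE,file0
322,1880,invoice,file0
118,174,INVOICE,file1
117,333,INVOICE,file1
119,525,BILLED,file1
119,1554,INVOICE,file1
322,1880,invoice,file1
1112,185,Invoice,file2
113,219,Invoice,file2
1112,212,Invoice,file3
113,219,Invoice,file3
113,217,Invoice,file3
118,174,INVOICE,file4
117,333,INVOICE,file4
119,525,BILLED,file4
119,1554,INVOICE,file4
322,1884,invoice,file4
"""


def layout_similarity(a, b):
    """Hypothetical metric: shared placements relative to the larger file."""
    return len(a & b) / max(len(a), len(b))


df = pd.read_csv(io.StringIO(CSV))

# One set of (Left, Top, Text) tuples per file
layouts = {
    name: set(zip(group["Left"], group["Top"], group["Text"]))
    for name, group in df.groupby("File")
}

# Similarity score for every unordered pair of files
scores = {
    (f1, f2): layout_similarity(layouts[f1], layouts[f2])
    for f1, f2 in combinations(sorted(layouts), 2)
}

for pair, score in scores.items():
    print(pair, round(score, 2))
```

With the sample data, file0 and file1 score 1.0, file0 and file4 score 0.8 (only the Top of the last keyword differs, 1880 versus 1884), and file2 and file3 score 1/3 because they share a single exact placement. Exact matching is strict, so a tolerance on the coordinates may be worth considering for OCR output.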

Here is my code so far:

import pandas as pd
import itertools

# Loop over the CSV to get the position and text of each keyword

with open('data.csv') as file:
    results = []
    file_names = []
    for row in file:
        columns = row.split(',')
        data = columns[0] + columns[1] + columns[2]
        file_name = columns[3].rstrip()
        results = results + [data]
        file_names.append(file_name)

# Get and print the indices of the matches

indices = [] 
for a, b in itertools.combinations(results, 2):
    if a == b:
        # i - 1 because results also holds the CSV header line at index 0
        indices = indices + [[i - 1 for i, x in enumerate(results) if x == a]]
print("Indices: ", indices)

Output:

Indices:  [[0, 5, 15], [0, 5, 15], [1, 6, 16], [1, 6, 16], [2, 7, 17], [2, 7, 17], [3, 8, 18], [3, 8, 18], [4, 9], [0, 5, 15], [1, 6, 16], [2, 7, 17], [3, 8, 18], [11, 13]]

# Get and print the file names of the matches

dataset = pd.read_csv('data.csv', sep=',')
identical_files = []
for indice in indices:
    file_matches = []
    for i in indice:
        file_matches.append(dataset.iloc[i, -1])
    identical_files.append(file_matches)
print("Identical files: ", identical_files)

Output:

Identical files:  [['file0', 'file1', 'file4'], ['file0', 'file1', 'file4'], ['file0', 'file1', 'file4'], ['file0', 'file1', 'file4'], ['file0', 'file1', 'file4'], ['file0', 'file1', 'file4'], ['file0', 'file1', 'file4'], ['file0', 'file1', 'file4'], ['file0', 'file1'], ['file0', 'file1', 'file4'], ['file0', 'file1', 'file4'], ['file0', 'file1', 'file4'], ['file0', 'file1', 'file4'], ['file2', 'file3']]

So I can print the files that match for each keyword. However, after many attempts, I am still struggling to work out the logic for identifying which files share the same layout and should therefore be grouped together.

Based on this data, the output should look something like this:

[
  [file0, file1, file4],
  [file2, file3]
]
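A grouping like this can be read as connected components: two files are linked when their similarity reaches a threshold, and every chain of links forms one group. The sketch below is a minimal illustration, not a definitive method. `group_files` is a hypothetical helper, and the `scores` values are an assumption derived from the sample data by dividing the number of shared (Left, Top, Text) rows by the row count of the larger file.

```python
def group_files(files, scores, threshold):
    """Group files whose pairwise score is at least `threshold`.

    Groups are connected components, so a file joins a group as soon as
    it is similar enough to any one member of that group.
    """
    # Build an undirected graph of sufficiently similar files
    neighbours = {name: set() for name in files}
    for (a, b), score in scores.items():
        if score >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    groups, seen = [], set()
    for start in files:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`
        stack, group = [start], []
        seen.add(start)
        while stack:
            current = stack.pop()
            group.append(current)
            for nxt in neighbours[current] - seen:
                seen.add(nxt)
                stack.append(nxt)
        groups.append(sorted(group))
    return groups


files = ["file0", "file1", "file2", "file3", "file4"]

# Assumed scores for the sample data; pairs not listed share no rows
scores = {
    ("file0", "file1"): 1.0,
    ("file0", "file4"): 0.8,
    ("file1", "file4"): 0.8,
    ("file2", "file3"): 1 / 3,
}

strict = group_files(files, scores, threshold=0.8)
loose = group_files(files, scores, threshold=0.3)
print(strict)
print(loose)
```

Under these assumptions, a threshold of 0.8 keeps file2 and file3 apart, because they share only one exact row. The grouping shown above appears only with a much lower threshold such as 0.3, or with a metric that tolerates small coordinate differences.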

I am still new to Python, so I hope I have made myself clear.

I am not sure if this is exactly what you need, but try this:

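import pandas as pd
# Assumption: df is the question's CSV read as strings, which matches the sort order of the output below
df = pd.read_csv('data.csv', dtype=str)
res = (df.groupby(by=['Left', 'Top', 'Text'])
       .agg(files=pd.NamedAgg(column="File", aggfunc=', '.join)))
print(res)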

This gives:

                                 files
Left Top  Text                        
1112 185  Invoice                file2
     212  Invoice                file3
113  217  Invoice                file3
     219  Invoice         file2, file3
117  333  INVOICE  file0, file1, file4
118  174  INVOICE  file0, file1, file4
119  1554 INVOICE  file0, file1, file4
     525  BILLED   file0, file1, file4
322  1880 invoice         file0, file1
     1884 invoice                file4

That is, for every combination of "Left", "Top", and "Text", it lists the files that contain that combination.
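To go from this per-combination view to a per-file comparison, the same three keys can be pivoted into a presence table and multiplied by its own transpose. This is only a sketch under assumptions: the sample rows are inlined instead of being read from data.csv, and `presence` and `shared` are names chosen here for illustration.

```python
import io

import pandas as pd

# Sample rows from the question, inlined so the sketch is self-contained
CSV = """Left,Top,Text,File
118,174,INVOICE,file0
117,333,INVOICE,file0
119,525,BILLED,file0
119,1554,INVOICE,file0
322,1880,invoice,file0
118,174,INVOICE,file1
117,333,INVOICE,file1
119,525,BILLED,file1
119,1554,INVOICE,file1
322,1880,invoice,file1
1112,185,Invoice,file2
113,219,Invoice,file2
1112,212,Invoice,file3
113,219,Invoice,file3
113,217,Invoice,file3
118,174,INVOICE,file4
117,333,INVOICE,file4
119,525,BILLED,file4
119,1554,INVOICE,file4
322,1884,invoice,file4
"""

df = pd.read_csv(io.StringIO(CSV))

# Rows: files. Columns: every (Left, Top, Text) combination.
# Cells: 1 if the file contains that combination, else 0.
presence = pd.crosstab(
    df["File"], [df["Left"], df["Top"], df["Text"]]
).clip(upper=1)

# The matrix product counts the combinations each pair of files has in common;
# the diagonal holds each file's own number of combinations.
shared = presence.dot(presence.T)
print(shared)
```

For the sample data, file0 and file1 share 5 combinations, file0 and file4 share 4, and file2 and file3 share 1. Dividing each count by the relevant diagonal value would turn it into a percentage that can be compared against a threshold.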

Does that help?
