
Borderless PDF extraction to JSON is not working properly with the Python Camelot library

Can anyone help? After extracting a PDF to JSON with Python's Camelot library, the output does not match the PDF exactly: some content is missing after extraction.

I tried the following code:

import camelot

pdf_path = '/YOUR/FILEPATH.pdf'
tables = camelot.read_pdf(pdf_path, flavor='stream')


There are two problems here:

  • The header font is not read properly, so you get strange characters such as (cid:71) ...
  • With flavor='lattice', the table isn't detected. With flavor='stream', the table is detected, but its cells aren't detected properly.
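To check how badly the first problem affects an extracted table, one option is to scan the cell text for (cid:NN) markers. The sketch below uses only the standard library; the helper names `find_cid_codes` and `cid_ratio` are made up for this illustration and are not part of Camelot. It assumes the markers always look like `(cid:71)`, as in the output described above.

```python
import re

# Matches markers such as "(cid:71)" and captures the numeric code.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)")


def find_cid_codes(text):
    """Return the numeric codes of all (cid:NN) markers found in text."""
    return [int(code) for code in CID_PATTERN.findall(text)]


def cid_ratio(text):
    """Return the fraction of characters in text that belong to (cid:NN) markers."""
    if not text:
        return 0.0
    marker_chars = sum(len(match.group(0)) for match in CID_PATTERN.finditer(text))
    return marker_chars / len(text)
```

Running `find_cid_codes` over every cell of `tables[0].df` would show which cells (probably the headers here) came out unreadable, so they can be flagged instead of silently written to JSON.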

At the moment, I don't think Camelot can extract this table properly. The maintainers are working on fixing the second problem (the original answer linked to two related discussions).
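Until that is fixed, it may still be worth experimenting with the tuning options of the stream flavor, although nothing guarantees they will help with this particular PDF. The sketch below is only an illustration: `stream_options` and `extract_as_json` are hypothetical helper names, and the tolerance values are Camelot's documented defaults rather than values known to work for this file. It uses Camelot's `row_tol`, `edge_tol` and `columns` arguments, plus each table's `parsing_report` and `df` attributes.

```python
def stream_options(row_tol=2, edge_tol=50, columns=None):
    """Build keyword arguments for camelot.read_pdf with flavor='stream'."""
    options = {"flavor": "stream", "row_tol": row_tol, "edge_tol": edge_tol}
    if columns is not None:
        # Camelot expects one comma-separated string of x-coordinates per table.
        options["columns"] = [",".join(str(x) for x in columns)]
    return options


def extract_as_json(pdf_path, **overrides):
    """Read tables with the stream flavor and return a report and JSON per table."""
    # Imported here so stream_options can be used without Camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, **stream_options(**overrides))
    results = []
    for table in tables:
        results.append({
            # Accuracy and whitespace scores help compare different settings.
            "report": table.parsing_report,
            # One JSON object per table row.
            "json": table.df.to_json(orient="records"),
        })
    return results
```

For example, `extract_as_json('/YOUR/FILEPATH.pdf', row_tol=10)` can be compared with the default run by looking at the `report` entries; the x-coordinates passed as `columns` would have to be measured from the PDF itself.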
