Borderless PDF extraction to JSON does not work properly with the Python camelot library

Question

Can anyone help? After extracting a PDF to JSON with Python's camelot library, the output does not match the PDF exactly: some content is missing after extraction.

Answer 1

I tried the following code:

import camelot

pdf_path = '/YOUR/FILEPATH.pdf'
tables = camelot.read_pdf(pdf_path, flavor='stream')

Here are two problems:

The header font is not read properly, so strange characters such as (cid:71) appear.
With flavor='lattice', the table isn't detected. With flavor='stream', the table is detected, but its cells aren't detected properly.
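
To check how badly the first problem affects the extracted text, something like the following could be used. It is only a sketch: the helper names `count_cid_markers` and `strip_cid_markers` are made up for illustration and are not part of Camelot. It assumes the unreadable glyphs always show up in the form (cid:N), as in the example above.

```python
import re

# Matches placeholders such as "(cid:71)" that appear when a glyph
# could not be mapped to a readable character.
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def count_cid_markers(text):
    """Return how many (cid:N) placeholders appear in a piece of text."""
    return len(CID_PATTERN.findall(text))


def strip_cid_markers(text, placeholder="?"):
    """Replace every (cid:N) placeholder with a visible marker."""
    return CID_PATTERN.sub(placeholder, text)
```

Running `count_cid_markers` over each cell of an extracted table gives a rough idea of which cells (here, the headers) lost their text. It does not recover the original characters.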

At the moment, I think Camelot can't extract this table properly. The developers are working on fixing the second problem (see this and this).
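
For anyone who wants to repeat the comparison on their own file, here is a rough sketch that tries both flavors, looks at the `parsing_report` Camelot attaches to each table, and dumps the tables to JSON through the underlying pandas DataFrame. The function names (`parse_reports`, `best_flavor`, `tables_as_json`) are hypothetical, and picking the flavor by mean reported accuracy is my own assumption. That number is only a rough signal and will not reveal badly split cells.

```python
from statistics import mean


def parse_reports(pdf_path, pages="1"):
    """Read the PDF with both flavors and collect each table's parsing report."""
    import camelot  # imported here so best_flavor() can be used without it

    reports = {}
    for flavor in ("lattice", "stream"):
        tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
        reports[flavor] = [table.parsing_report for table in tables]
    return reports


def best_flavor(reports):
    """Return the flavor with the highest mean accuracy, or None if no tables."""
    scored = {
        flavor: mean(report["accuracy"] for report in flavor_reports)
        for flavor, flavor_reports in reports.items()
        if flavor_reports  # skip flavors that detected no table at all
    }
    if not scored:
        return None
    return max(scored, key=scored.get)


def tables_as_json(pdf_path, flavor, pages="1"):
    """Return one JSON string (a list of row records) per detected table."""
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    return [table.df.to_json(orient="records") for table in tables]
```

With the path from the code above, `best_flavor(parse_reports('/YOUR/FILEPATH.pdf'))` would be expected to return 'stream' for a borderless table, since 'lattice' finds nothing. The JSON still needs to be checked by eye against the PDF.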

Answer 1
0 2020-09-24 13:40:21
