
How can I stop camelot-py from splitting multi-line text in a single cell into multiple cells?

Original post: 2020-05-10 07:51:39. Tags: python, python-camelot

I am building an app that reads arbitrary PDFs and extracts tables from them, and I am using Camelot for the extraction. This works fine for tables whose cells hold single-line values. However, when a cell contains multi-line text, Camelot splits that text across multiple cells. Since Camelot is built on top of pdfminer, I tried tweaking the layout analysis parameters (specifically line_margin) to keep Camelot from splitting the lines. However, the issue remains.

What other parameters can I tweak to handle this issue? Here is an example of a table that has this issue.

I do not want to use the 'lattice' flavor, as most of the tables I expect to see do not have demarcating lines.

1 Answer

If the tables in your PDFs have lines that are lighter than the cells, as in your example, you could try the lattice flavor with process_background=True.

import camelot
tables = camelot.read_pdf('background_lines.pdf', flavor='lattice', process_background=True)

See https://camelot-py.readthedocs.io/en/master/user/advanced.html
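
If lattice is not an option and the stream output still splits wrapped text into extra rows, one possible workaround is to merge those rows after extraction. The following is only a minimal sketch, not a Camelot feature: the function name merge_continuation_rows and the sample rows are hypothetical, and it assumes that every row has the same number of cells and that a continuation line leaves the first column empty, which will not hold for every table. The input is a plain list of rows, such as what table.df.values.tolist() would give for an extracted table.

```python
def merge_continuation_rows(rows, key_col=0):
    """Fold rows whose key column is empty into the row above them.

    Assumption: a wrapped line of a multi-line cell shows up as an extra
    row in which the key column (first column by default) is blank.
    """
    merged = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        is_continuation = merged and cells[key_col] == ""
        if is_continuation:
            previous = merged[-1]
            for i, cell in enumerate(cells):
                if cell:
                    # Append the fragment to the cell above, space-separated.
                    previous[i] = (previous[i] + " " + cell).strip()
        else:
            merged.append(cells)
    return merged


# Hypothetical rows, shaped like a parse that split one wrapped cell in two.
rows = [
    ["ID", "Description", "Qty"],
    ["1", "Stainless steel", "4"],
    ["", "mounting bracket", ""],
    ["2", "Rubber gasket", "10"],
]

for row in merge_continuation_rows(rows):
    print(row)
```

The heuristic is deliberately simple: a table whose first column is legitimately empty in some rows would need a different key column or a different rule.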