How can I stop camelot-py from splitting multi-line text in a single cell into multiple cells?

I am trying to build an app that reads arbitrary PDFs and extracts tables from them, and I am using Camelot for the extraction. This works fine for tables whose cells hold single-line values. However, for tables whose cells hold multi-line values, Camelot splits the multi-line text of a single cell into multiple cells. Since Camelot is built on top of pdfminer, I tried tweaking the layout analysis parameters (specifically line_margin) to stop Camelot from splitting the lines. However, the issue remains.
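For reference, pdfminer layout parameters reach Camelot through the layout_kwargs argument of read_pdf. The following is only a minimal sketch of how such a tweak is typically passed: the helper names layout_options and read_stream_tables are hypothetical, and the line_margin value is illustrative rather than a recommended setting.

```python
def layout_options(line_margin=0.5):
    """Build pdfminer LAParams overrides to hand to Camelot.

    0.5 is pdfminer's default line_margin; any other value is an
    illustrative guess that has to be tuned per document.
    """
    return {"line_margin": line_margin}


def read_stream_tables(path, line_margin=0.5):
    """Read tables with the stream flavor, forwarding the layout overrides."""
    # Imported lazily so layout_options() can be used without Camelot installed.
    import camelot

    return camelot.read_pdf(
        path,
        flavor="stream",
        layout_kwargs=layout_options(line_margin),
    )
```

A call such as read_stream_tables('tables.pdf', line_margin=1.0) would then apply the override, where 'tables.pdf' is a placeholder file name.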

What other parameters can I tweak to handle this issue? Here is an example of a table that has this issue: [image]

I do not want to use the 'lattice' flavor, as most of the tables that I expect to see do not have demarcating lines.

If the tables in your PDFs have lines that are brighter than the cells, as in your example, then you might try the lattice flavour with process_background=True.

import camelot
tables = camelot.read_pdf('background_lines.pdf', process_background=True)  # lattice is the default flavor

See https://camelot-py.readthedocs.io/en/master/user/advanced.html
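If the lattice flavour is not an option, the stream flavour has a row_tol parameter that controls how much vertical distance is tolerated when text is grouped into rows, so raising it may keep wrapped lines together. The sketch below shows that call plus a fallback that merges leftover fragments after extraction. It is a workaround under stated assumptions, not a confirmed fix for this PDF: the function names and key_col are hypothetical, row_tol=10 is an illustrative value, and the merge step assumes that a wrapped line appears as an extra row whose key column is blank.

```python
def read_with_row_tolerance(path, row_tol=10):
    """Read tables with the stream flavor and a larger row tolerance."""
    # Imported lazily so the merge helper below works without Camelot installed.
    import camelot

    return camelot.read_pdf(path, flavor="stream", row_tol=row_tol)


def merge_continuation_rows(rows, key_col=0):
    """Fold rows whose key column is empty into the row above them.

    Assumes all rows have the same number of cells and that a blank
    key column marks a continuation of the previous row.
    """
    merged = []
    for row in rows:
        cells = [str(cell).strip() for cell in row]
        if merged and not cells[key_col]:
            previous = merged[-1]
            for i, text in enumerate(cells):
                if text:
                    # Append the wrapped text to the matching cell above.
                    previous[i] = f"{previous[i]} {text}".strip()
        else:
            merged.append(cells)
    return merged
```

The rows can be taken from a Camelot table with tables[0].df.values.tolist(). The blank-key heuristic breaks down for tables where the key column is legitimately empty, so it is worth checking the merged output against the source PDF.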
