Extracting multiple tables on different pages from a multi-page PDF using Camelot

Question

My PDF contains 16 tables on 3 pages, which I want to output to an Excel file as a single worksheet using Camelot. I can extract each page individually without problems, but I cannot figure out how to handle all 3 pages in one pass. My code is shown below:

    # Read Obslog Page 1 to extract all the required tables
    obstables = camelot.read_pdf(filepath,
                                 pages='1', \
                                 flavor='stream', \
                                 edge_tol=500, \
                                 strip_text=' °, kn, m, µbar, mbar, in³, psi,\n', \
                                 table_areas=[' 15, 750, 575, 680', \
                                              ' 15, 680, 575, 570', \
                                              ' 15, 570, 575, 460', \
                                              ' 15, 460, 575, 380', \
                                              ' 15, 380, 575, 300', \
                                              ' 15, 300, 575, 240', \
                                              ' 15, 240, 575, 180', \
                                              ' 15, 180, 575, 110'], \
                                 columns=['','','','','','','',''])
    # Read Obslog Page 2 to extract all the required tables
    obstables1 = camelot.read_pdf(filepath,
                                  pages='2', \
                                  flavor='stream', \
                                  edge_tol=500, \
                                  strip_text=' °, kn, m, µbar, mbar, in³, psi,\n', \
                                  table_areas=[' 20, 820, 575, 750', \
                                               ' 20, 730, 140, 655', \
                                               ' 20, 635, 270, 560', \
                                               ' 20, 540, 270, 470'], \
                                  columns=['','','',''])
    # Read Obslog Page 3 to extract all the required tables
    obstables2 = camelot.read_pdf(filepath,
                                  pages='3', \
                                  flavor='stream', \
                                  edge_tol=500, \
                                  strip_text=' °, kn, m, µbar, mbar, in³, psi,\n', \
                                  table_areas=[' 15, 820, 575, 750', \
                                               ' 15, 730, 575, 660', \
                                               ' 15, 640, 575, 570', \
                                               ' 15, 560, 150, 500', \
                                               ' 15, 480, 575, 390',] \
                                  columns=['','','','',''])

When I try to execute the script, the first line of the page 3 'table_areas' gives me the following syntax error:

    table_areas=[' 15, 820, 575, 750',
    ^^^^^^^^^^^^^^^^^^^^^^^^

I cannot see any syntax problem with this line.

I get the same error if I try to use the 'tables.append' option (as suggested by Anakin87 on 12/7/2021 in an answer to a similar post), in this case replacing the camelot calls for pages 2 and 3 with the following code:

    obstables._tables.append(camelot.read_pdf(filepath,
                                              pages='2', \
                                              flavor='stream', \
                                              edge_tol=500, \
                                              strip_text=' °, kn, m, µbar, mbar, in³, psi,\n', \
                                              table_areas=[' 20, 820, 575, 750', \
                                                           ' 20, 730, 140, 655', \
                                                           ' 20, 635, 270, 560', \
                                                           ' 20, 540, 270, 470'], \
                                              columns=['','','','']))

    obstables._tables.append(camelot.read_pdf(filepath,
                                              pages='3', \
                                              flavor='stream', \
                                              edge_tol=500, \
                                              strip_text=' °, kn, m, µbar, mbar, in³, psi,\n', \
                                              table_areas=[' 15, 820, 575, 750', \
                                                           ' 15, 730, 575, 660', \
                                                           ' 15, 640, 575, 570', \
                                                           ' 15, 560, 150, 500', \
                                                           ' 15, 480, 575, 390',] \
                                              columns=['','','','','']))

Appending all the tables seems a good option, as the final output will be concatenated into a single dataframe before being written to an Excel worksheet. At the moment, however, I am stuck on the cause of the syntax error.
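
The plan described above could be sketched roughly as follows. This is only a sketch under assumptions: read_all_pages, export_to_excel, PAGE_AREAS and COMMON are hypothetical names that do not appear in the original code, the reader function is passed in as a parameter so the looping logic can be tried without a PDF, and writing .xlsx with pandas assumes an Excel engine such as openpyxl is installed. It collects tables with list.extend rather than appending to the private _tables attribute, because appending a whole read_pdf result adds one nested object instead of its individual tables.

```python
# Settings shared by every page (values taken from the question).
COMMON = dict(
    flavor="stream",
    edge_tol=500,
    strip_text=" °, kn, m, µbar, mbar, in³, psi,\n",
)

# Table areas per page, copied from the question.
PAGE_AREAS = {
    "1": ["15,750,575,680", "15,680,575,570", "15,570,575,460",
          "15,460,575,380", "15,380,575,300", "15,300,575,240",
          "15,240,575,180", "15,180,575,110"],
    "2": ["20,820,575,750", "20,730,140,655",
          "20,635,270,560", "20,540,270,470"],
    "3": ["15,820,575,750", "15,730,575,660", "15,640,575,570",
          "15,560,150,500", "15,480,575,390"],
}


def read_all_pages(read_pdf, filepath, page_areas, **common):
    """Call read_pdf once per page and return every table in one flat list."""
    tables = []
    for page, areas in page_areas.items():
        result = read_pdf(
            filepath,
            pages=page,
            table_areas=areas,
            columns=[""] * len(areas),  # one empty column spec per area, as in the question
            **common,
        )
        tables.extend(result)  # extend keeps the list flat; append would nest the result
    return tables


def export_to_excel(filepath, out_path):
    """Hypothetical end-to-end helper: read all pages, concatenate, write one sheet."""
    # Imported lazily so read_all_pages can be exercised without these packages.
    import camelot
    import pandas as pd

    tables = read_all_pages(camelot.read_pdf, filepath, PAGE_AREAS, **COMMON)
    combined = pd.concat([table.df for table in tables], ignore_index=True)
    combined.to_excel(out_path, index=False, header=False)
```

Keeping the per-page areas in one dictionary means each read_pdf call is written only once, which also leaves fewer places for a misplaced comma to hide.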

Answer 1

After going through all the code, the error turned out to be a simple rookie mistake. I had been looking for the syntax error on the first line of the table_areas definition, but the problem was on the last line of the definition: I had put the comma before the ']' instead of after it, so nothing separated the table_areas argument from the columns argument that follows. I was slightly misled by the error message, which pointed to the first line of the table_areas definition rather than the last. Because I had copied and pasted the code, this was also why the 'tables.append' option failed. The offending line was

    ' 15, 480, 575, 390',] \

which should have read

    ' 15, 480, 575, 390'], \
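
The effect of the misplaced comma can be reproduced without Camelot or a PDF. The snippet below is a small illustration, not part of the original script: f is a placeholder function name, syntax_error_line is a hypothetical helper, and the argument lists are shortened. It only asks the Python parser whether each variant is valid and, if not, which line it blames. The reported line may differ between Python versions, so no specific number is assumed here.

```python
import ast

# Comma before the closing bracket, none after it: the two keyword
# arguments are no longer separated.
BROKEN = """
f(pages='3',
  table_areas=['15, 820, 575, 750',
               '15, 480, 575, 390',]
  columns=['', ''])
"""

# Comma moved after the closing bracket.
FIXED = """
f(pages='3',
  table_areas=['15, 820, 575, 750',
               '15, 480, 575, 390'],
  columns=['', ''])
"""


def syntax_error_line(source):
    """Return the line number the parser reports, or None if the source parses."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return exc.lineno
    return None


if __name__ == "__main__":
    print("broken:", syntax_error_line(BROKEN))
    print("fixed: ", syntax_error_line(FIXED))
```

A trailing comma inside the list is legal on its own, which is why the line looks harmless in isolation; the error comes from the missing separator after the ']'.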

Answer 1 is dated 2022-08-15 04:26:23.