
Reading tables as strings from a PDF with Tabula

I am using tabula-py 2.0.4 and pandas 1.17.4 on Python 3.7. I am trying to read a PDF table into a DataFrame with tabula.read_pdf:

from tabula import read_pdf
fn = "file.pdf"
print(read_pdf(fn, pages='all', multiple_tables=True)[0])

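The problem is that the values are read as floats rather than strings.

I need them read as strings, so that if a value is 20.0000 I know the precision goes to the fourth decimal place. Right now it returns 20.0 instead of 20.0000.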
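To make the precision concern concrete, here is a minimal sketch (not part of the original question) that counts decimal places from the text form of a value. The helper name `decimal_places` is hypothetical; it relies only on the standard-library `decimal` module.

```python
from decimal import Decimal


def decimal_places(text: str) -> int:
    """Return how many digits follow the decimal point in a numeric string."""
    exponent = Decimal(text).as_tuple().exponent
    if not isinstance(exponent, int):
        # NaN and Infinity carry a non-numeric exponent marker.
        raise ValueError(f"not a finite number: {text!r}")
    return max(0, -exponent)


# The string taken from the PDF keeps its trailing zeros.
print(decimal_places("20.0000"))  # 4

# After a round trip through float, the trailing zeros are gone.
print(decimal_places(str(float("20.0000"))))  # 1
```

Once the value has been parsed as a float, the original precision cannot be recovered, which is why the cells have to arrive as strings.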

The input data in the PDF looks like this: (image not included)

The output of the code above is: (image not included)

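You need to pass a few extra options to tabula.read_pdf. Here is an example that parses a PDF file and interprets the detected columns in two different ways: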

import tabula

print(tabula.environment_info())

fname = ("https://github.com/chezou/tabula-py/raw/master/tests/resources/"
         "data.pdf")

# Columns interpreted as str
col2str = {'dtype': str}
kwargs = {'output_format': 'dataframe',
          'pandas_options': col2str,
          'stream': True}
df1 = tabula.read_pdf(fname, **kwargs)

print(df1[0].dtypes)
print(df1[0].head())

# Guessing column type
col2val = {'dtype': None}
kwargs = {'output_format': 'dataframe',
          'pandas_options': col2val,
          'stream': True}
df2 = tabula.read_pdf(fname, **kwargs)

print(df2[0].dtypes)
print(df2[0].head())

This produces the following output:

Python version:
    3.7.6 (default, Jan  8 2020, 13:42:34) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
Java version:
    openjdk version "13.0.2" 2020-01-14
OpenJDK Runtime Environment (build 13.0.2+8)
OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode, sharing)
tabula-py version: 2.0.4
platform: Darwin-19.3.0-x86_64-i386-64bit
uname:
    uname_result(system='Darwin', node='MacBook-Pro-10.local', release='19.3.0', version='Darwin Kernel Version 19.3.0: Thu Jan  9 20:58:23 PST 2020; root:xnu-6153.81.5~1/RELEASE_X86_64', machine='x86_64', processor='i386')
linux_distribution: ('Darwin', '19.3.0', '')
mac_ver: ('10.15.3', ('', '', ''), 'x86_64')

None
'pages' argument isn't specified.Will extract only from page 1 by default.
Unnamed: 0    object
mpg           object
cyl           object
disp          object
hp            object
drat          object
wt            object
qsec          object
vs            object
am            object
gear          object
carb          object
dtype: object
          Unnamed: 0   mpg cyl   disp   hp  drat     wt   qsec vs am gear carb
0          Mazda RX4  21.0   6  160.0  110  3.90  2.620  16.46  0  1    4    4
1      Mazda RX4 Wag  21.0   6  160.0  110  3.90  2.875  17.02  0  1    4    4
2         Datsun 710  22.8   4  108.0   93  3.85  2.320  18.61  1  1    4    1
3     Hornet 4 Drive  21.4   6  258.0  110  3.08  3.215  19.44  1  0    3    1
4  Hornet Sportabout  18.7   8  360.0  175  3.15  3.440  17.02  0  0    3    2
'pages' argument isn't specified.Will extract only from page 1 by default.
Unnamed: 0     object
mpg           float64
cyl             int64
disp          float64
hp              int64
drat          float64
wt            float64
qsec          float64
vs              int64
am              int64
gear            int64
carb            int64
dtype: object
          Unnamed: 0   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
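The sample PDF above has no values with trailing zeros, so the two printed tables look alike even though the dtypes differ. As a rough illustration of what the `dtype` option changes, the sketch below uses plain pandas on an in-memory CSV. It does not call tabula, and the CSV text is made-up stand-in data, not output from the PDF.

```python
import io

import pandas as pd

# Made-up stand-in for table text; not taken from the PDF above.
csv_text = "name,value\na,20.0000\nb,3.1400\n"

# dtype=str keeps every cell exactly as written, trailing zeros included.
as_str = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Without dtype, pandas infers float64 and the trailing zeros are lost.
guessed = pd.read_csv(io.StringIO(csv_text))

print(as_str["value"].tolist())   # ['20.0000', '3.1400']
print(guessed["value"].tolist())  # [20.0, 3.14]
```

Applied to the call from the question, the same option would presumably be passed as read_pdf(fn, pages='all', multiple_tables=True, pandas_options={'dtype': str}).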

