How to read tables in a PDF with Python tabula-py when table cells contain line breaks?
I tried to use the Python package tabula-py to read a table from a PDF. It seems that line breaks inside the PDF table cells split the contents of a single cell into multiple cells.
I searched through various Python packages to solve this problem. tabula-py seems to be the most reliable package for converting PDF tables into pandas data. However, if this problem cannot be solved, I will have to fall back on an online service, which does produce the Excel output I want.
from tabula import read_pdf
# Read every page of the PDF with the default settings
df = read_pdf("C:/Users/Desktop/test.pdf", pages='all')
I expected this to convert the PDF table correctly.
Tabula no longer has 'spreadsheet' as an option. Instead, use the 'lattice' option to keep line breaks inside a cell from being split into new rows. Code like this:
import tabula
# Read pdf into DataFrame
df = tabula.read_pdf("FDA EPC Text Phrases (updated March 2018.pdf", pages='all',
lattice=True)
print(df)
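If the table has no ruling lines between cells, lattice mode may not help, and the split rows then have to be repaired after extraction. The following is only a minimal sketch of that idea, not part of tabula-py: the helper names `is_empty` and `merge_continuation_rows` are hypothetical, and it assumes that a continuation row can be recognised by an empty key cell (first column by default). The sample rows are an illustrative split of one entry from the FDA table.

```python
import math


def is_empty(value):
    """Treat None, NaN and blank strings as an empty cell."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def merge_continuation_rows(rows, key_index=0):
    """Fold rows whose key cell is empty into the preceding row.

    Assumption: a row with an empty key cell is the tail of a cell
    that was split by a line break in the PDF.
    """
    merged = []
    for row in rows:
        if merged and is_empty(row[key_index]):
            previous = merged[-1]
            for i, cell in enumerate(row):
                if is_empty(cell):
                    continue
                text = str(cell).strip()
                # Append the fragment to whatever the previous row holds
                if is_empty(previous[i]):
                    previous[i] = text
                else:
                    previous[i] = f"{previous[i]} {text}"
        else:
            merged.append(list(row))
    return merged


# Illustrative input: the second row is the wrapped tail of the first
rows = [
    ["zolpidem", "gamma-aminobutyric acid (GABA)"],
    [float("nan"), "A agonist"],
    ["zonisamide", "antiepileptic drug (AED)"],
]
cleaned = merge_continuation_rows(rows)
for row in cleaned:
    print(row)
```

With a DataFrame, the rows can be taken from `df.values.tolist()` and the result rebuilt with `pandas.DataFrame(cleaned, columns=df.columns)`.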
You can use the 'spreadsheet' option with the value True (in tabula-py versions that still support it) to avoid the multiple rows of NaN values caused by line breaks.
import tabula
# Read pdf into DataFrame
df = tabula.read_pdf("FDA EPC Text Phrases (updated March 2018.pdf", pages='all', spreadsheet=True)
print(df)
#print(df['Active Moiety Name'])
#print(df['FDA Established Pharmacologic Class\r(EPC) Text Phrase\rPLR regulations require that the following\rstatement is included in the Highlights\rIndications and Usage heading if a drug is a\rmember of an EPC [see 21 CFR\r201.57(a)(6)]: “(Drug) is a (FDA EPC Text\rPhrase) indicated for [indication(s)].” For\reach listed active moiety, the associated\rFDA EPC text phrase is included in this\rdocument. For more information about how\rFDA determines the EPC Text Phrase, see\rthe 2009 "Determining EPC for Use in the\rHighlights" guidance and 2013 "Determining\rEPC for Use in the Highlights" MAPP\r7400.13.'])
Output:
1758 ziconotide N-type calcium channel antagonist
1759 zidovudine HIV nucleoside analog reverse transcriptase in...
1760 zileuton 5-lipoxygenase inhibitor
1761 zinc cation copper absorption inhibitor
1762 ziprasidone atypical antipsychotic
1763 zoledronic acid bisphosphonate
1764 zoledronic acid anhydrous bisphosphonate
1765 zolmitriptan serotonin 5-HT1B/1D receptor agonist (triptan)
1766 zolmitriptan serotonin 5-HT1B/1D receptor agonist (triptan)
1767 zolpidem gamma-aminobutyric acid (GABA) A agonist
1768 zonisamide antiepileptic drug (AED)
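The commented-out column lookup in the code above shows that the line breaks survive as `\r` characters inside the header text, which makes the columns awkward to select by name. One possible clean-up is sketched below. The helper `clean_label` is a hypothetical name, and the sample header is a shortened piece of the one shown above.

```python
import re


def clean_label(text):
    """Collapse carriage returns, newlines and repeated spaces into one space."""
    return re.sub(r"\s+", " ", str(text)).strip()


# Shortened piece of the header extracted from the PDF
raw_header = "FDA Established Pharmacologic Class\r(EPC) Text Phrase"
header = clean_label(raw_header)
print(header)
```

On a DataFrame this could be applied as `df.columns = [clean_label(c) for c in df.columns]`, after which the column can be selected by its cleaned name.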