Can't make my script print output in the desired format

I'm trying to extract a certain portion of text from a PDF file, and I've used the PyPDF2 library to do that. However, when I execute the script below, the content I want to grab is printed in the console with its spaces and line breaks missing.

Here is what I've written so far:

import io
import PyPDF2
import requests

URL = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'

res = requests.get(URL)
f = io.BytesIO(res.content)
reader = PyPDF2.PdfFileReader(f)
contents = reader.getPage(0).extractText()
print(contents)

The output I'm getting:

ACCESSHEALTHCTConnecticutAllPayersClaimsDatabaseDATASUBMISSIONGUIDE
December5,2013
Version1.2(withclarifications)

The output I'd like to get:

ACCESS HEALTH CT
Connecticut All Payers Claims Database
DATA SUBMISSION GUIDE
December 5, 2013
Version 1.2 (with clarifications)

This is an issue with PyPDF2: its text extraction doesn't read the newline characters (and, as your output shows, the spaces are lost as well). Alternatively, you can use pdftotext.

It is simple and clean: you can loop over all the pages or extract a single page.

import io
import requests
import pdftotext
URL = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'
res = requests.get(URL)
f = io.BytesIO(res.content)
pdf = pdftotext.PDF(f)
print(pdf[0])
# Iterate over all the pages
# for page in pdf:
#     print(page)


I would suggest PDFMiner if installing other packages causes a dependency issue.

You can install it for Python 3.7 with pip install pdfminer.six. I've tested it and it works on my Python 3.7.

The code for getting page 0 is as follows:

import io
import requests
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

URL = 'http://www.ct.gov/hix/lib/hix/CT_DSG_-12132014_version_1.2_%28with_clarifications%29.pdf'

res = requests.get(URL)
fp = io.BytesIO(res.content)

rsrcmgr = PDFResourceManager()
retstr = io.StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

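page_no = 0
data = ''  # stays empty if the PDF has no page with this index
for pageNumber, page in enumerate(PDFPage.get_pages(fp)):
    if pageNumber == page_no:
        # Render the requested page into the StringIO buffer
        interpreter.process_page(page)
        data = retstr.getvalue()
        break  # no need to walk the remaining pages

print(data.strip())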

Output:

ACCESS HEALTH CT 

Connecticut All Payers Claims Database 

DATA SUBMISSION GUIDE 

December 5, 2013 

Version 1.2 (with clarifications) 
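
The output above still carries trailing spaces and a blank line between entries, while the format asked for in the question has neither. As a minimal sketch of a post-processing step (tidy_extracted_text is a hypothetical helper name, not part of PDFMiner, and the sample string is just the output shown above):

```python
def tidy_extracted_text(text):
    """Strip surrounding spaces from each line and drop blank lines."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Sample input copied from the PDFMiner output shown above.
raw = (
    "ACCESS HEALTH CT \n\n"
    "Connecticut All Payers Claims Database \n\n"
    "DATA SUBMISSION GUIDE \n\n"
    "December 5, 2013 \n\n"
    "Version 1.2 (with clarifications) \n"
)
print(tidy_extracted_text(raw))
```

This only normalizes whitespace between lines; it cannot restore spaces that the extractor never produced, so it would not repair the PyPDF2 output from the question.
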

The good thing about PDFMiner is that it reads your pages directly and focuses entirely on getting and analyzing text data.
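
If the resource manager and interpreter setup feels heavy for a single page, recent pdfminer.six releases also provide extract_text in pdfminer.high_level, which takes a path or file-like object and an optional list of zero-indexed page_numbers. The sketch below assumes such a release is installed; extract_page_text is a hypothetical wrapper name, not something from the library.

```python
import io


def extract_page_text(pdf_bytes, page_index=0):
    """Return the stripped text of one zero-indexed page of an in-memory PDF."""
    # Imported here so pdfminer.six is only required when the function runs.
    from pdfminer.high_level import extract_text

    # page_numbers restricts extraction to the requested page only.
    return extract_text(io.BytesIO(pdf_bytes), page_numbers=[page_index]).strip()
```

With the res object from the code above, extract_page_text(res.content) would be the call for page 0.
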
