
How to open and read a PDF (originally .html) file using Python 3

I need to open this file in Python 3:

http://www.arch.gob.ec/index.php/descargas/doc_download/478-historial-de-produccion-nacional-de-crudo-2011.html

I then have to read it and extract the data tables. I have searched for several hours, but nothing seems to work. I am new to scraping/parsing, and this is the first time I have looked into handling PDF files.

Thanks for any kind of help!

Fetching the PDF from the internet is usually called scraping. Reading the PDF to extract data from it is quite another problem!
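The download half can be sketched with the standard library alone. Because the URL in the question ends in .html yet is expected to serve a PDF, checking the first bytes for the `%PDF-` signature is a cheap way to confirm what was actually fetched. The names `looks_like_pdf` and `download_pdf` are hypothetical helpers, not part of the original posts, and the check is deliberately strict (it rejects files with junk before the header).

```python
import urllib.request

# Every well-formed PDF file begins with this signature.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF file signature."""
    return data.startswith(PDF_MAGIC)


def download_pdf(url: str, dest_path: str) -> str:
    """Fetch url, verify the payload is a PDF, and save it to dest_path."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    if not looks_like_pdf(data):
        # The server may have returned an HTML error page instead.
        raise ValueError("server did not return a PDF: %r" % data[:20])
    with open(dest_path, "wb") as out:
        out.write(data)
    return dest_path
```

Calling `download_pdf(url, "crudo-2011.pdf")` with the URL from the question would then either save the file or fail loudly if the server sends something else; the output filename here is only an example.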

There are many utilities that try to convert PDF to text, with varying degrees of success. As this article explains, PDF files are nice to use (look at), but the internals aren't nearly as elegant. The reason is that the visible text is frequently not stored directly in the document and has to be reconstructed from tables. In some cases the PDF doesn't even contain the text, but is just an image of the text.

The article lists several tools that (try to) convert PDF to text. Some have Python wrappers. A few modules sound interesting but really aren't for this purpose, such as PyPDF (which does not convert to text).

aTXT looks interesting for data mining, though I haven't tested it yet.

As mentioned above, most of these are wrappers (or GUIs) around existing, mostly command-line, tools. E.g. a simple tool on Linux (which works with your PDF!) is pdftotext. If you want to stay in Python, you can call it with subprocess's call, or even with os.system.
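A minimal sketch of calling pdftotext from Python, assuming the pdftotext binary (shipped with poppler-utils or xpdf) is installed and on the PATH. It uses `subprocess.run` rather than `call` so that a failed conversion raises an exception. The helper names are hypothetical, and the `-layout` flag is included because it asks pdftotext to keep the physical column layout, which tends to help with tables.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for a pdftotext invocation."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout keeps columns aligned roughly as they appear on the page.
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd


def pdf_to_text(pdf_path, txt_path, keep_layout=True):
    """Convert pdf_path to txt_path with pdftotext and return the text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on the PATH")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(
        build_pdftotext_command(pdf_path, txt_path, keep_layout), check=True
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    with open(txt_path, encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.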

After this, you get a text file, which you can process more easily with basic Python string functions, regular expressions, or more sophisticated tools like PyParser.
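As a rough illustration of the string-and-regex route, the sketch below pulls table rows out of layout-preserving text. It assumes columns are separated by two or more spaces (or a tab) and that data rows contain at least one numeric cell; both are assumptions about the converted text, not facts about this particular PDF. The function names are hypothetical.

```python
import re

# Assumption: two or more whitespace characters, or a tab, separate columns.
COLUMN_GAP = re.compile(r"\s{2,}|\t")

# Loose numeric pattern: digits with optional thousands/decimal separators.
NUMBER = re.compile(r"-?\d[\d.,]*")


def is_number(cell):
    """Return True if the whole cell looks like a number."""
    return NUMBER.fullmatch(cell) is not None


def parse_table_lines(text):
    """Return a list of rows (lists of cell strings) found in text."""
    rows = []
    for line in text.splitlines():
        cells = [c for c in COLUMN_GAP.split(line.strip()) if c]
        # Skip titles, blank lines, and header rows with no numeric cell.
        if len(cells) >= 2 and any(is_number(c) for c in cells):
            rows.append(cells)
    return rows
```

Real converted output usually needs extra handling for wrapped cells, page headers, and totals rows, so the filter would have to be tuned against the actual text file.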

Found a way that works for me.

import os
import urllib.request

url = 'http://www.arch.gob.ec/index.php/descargas/doc_download/478-historial-de-produccion-nacional-de-crudo-2011.html'

# urlretrieve saves the download to a temporary file and returns its path
(pdfFile, headers) = urllib.request.urlretrieve(url)
print(os.path.abspath(pdfFile))
s = pdf_convert(str(os.path.abspath(pdfFile)))

where pdf_convert is:

import sys

# Import paths assume an older pdfminer release (for example the pdfminer3k
# port for Python 3) that still provides process_pdf.
from pdfminer.cmapdb import CMapDB
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdevice import PDFDevice
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager, process_pdf

def pdf_convert(path):
    # (flag, value) pairs in getopt style; empty here, so the defaults apply
    opts = []
    # Temporary text file on the Desktop that receives the extracted text
    outfile = 'c:\\users\\yourusername\\Desktop\\temp2.txt'
    # debug option
    debug = 0
    # input options
    password = ''
    pagenos = set()
    maxpages = 0
    # output options
    outdir = None
    outtype = 'text'
    layoutmode = 'normal'
    codec = 'utf-8'
    scale = 1
    laparams = LAParams()
    for (k, v) in opts:
        if k == '-d': debug += 1
        elif k == '-p': pagenos.update( int(x)-1 for x in v.split(',') )
        elif k == '-m': maxpages = int(v)
        elif k == '-P': password = v
        elif k == '-o': outfile = v
        elif k == '-n': laparams = None
        elif k == '-A': laparams.all_texts = True
        elif k == '-V': laparams.detect_vertical = True
        elif k == '-M': laparams.char_margin = float(v)
        elif k == '-L': laparams.line_margin = float(v)
        elif k == '-W': laparams.word_margin = float(v)
        elif k == '-F': laparams.boxes_flow = float(v)
        elif k == '-Y': layoutmode = v
        elif k == '-O': outdir = v
        elif k == '-t': outtype = v
        elif k == '-c': codec = v
        elif k == '-s': scale = float(v)

    # Propagate the debug level to the pdfminer classes
    CMapDB.debug = debug
    PDFResourceManager.debug = debug
    PDFPageInterpreter.debug = debug
    PDFDevice.debug = debug

    rsrcmgr = PDFResourceManager()

    if outfile:
        outfp = open(outfile, 'w', encoding=codec)
    else:
        outfp = sys.stdout
    device = TextConverter(rsrcmgr, outfp, laparams=laparams)

    # Run the conversion; the extracted text is written to outfp
    fp = open(path, 'rb')
    process_pdf(rsrcmgr, device, fp, pagenos, maxpages=maxpages, password=password,
                check_extractable=True)
    fp.close()
    device.close()
    outfp.close()

    # Read the extracted text back from the file written above
    with open(outfile, 'r', encoding=codec) as myfile:
        data = myfile.read()
    return data
