Parsing a PDF via URL with Python using pdfminer
I am trying to parse this file without downloading it from the website. I have run this with the file on my hard drive and I am able to parse it without issue, but running this script against the URL, it trips on:
if not document.is_extractable:
raise PDFTextExtractionNotAllowed
I think I am integrating the URL incorrectly.
import sys
import getopt
import urllib2
import datetime
import re
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfdevice import PDFDevice
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter, PDFConverter
from pdfminer.layout import LAParams, LTContainer, LTText, LTTextBox, LTImage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf
from urllib2 import Request
# Define a PDF parser function
def parsePDF(url):
    # Open the url provided as an argument to the function and read the content
    data = urllib2.urlopen(Request(url)).read()
    # Cast to StringIO object
    from StringIO import StringIO
    memory_file = StringIO(data)
    # Create a PDF parser object associated with the StringIO object
    parser = PDFParser(memory_file)
    # Create a PDF document object that stores the document structure
    document = PDFDocument(parser)
    # Check if the document allows text extraction. If not, abort.
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    # Define parameters for the PDF device object
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    codec = 'utf-8'
    # Create a PDF device object
    device = PDFDevice(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page contained in the document
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)

# Construct the url
url = 'http://www.city.pittsburgh.pa.us/police/blotter/blotter_monday.pdf'
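The core trick in the answers below is wrapping the raw bytes returned by the HTTP request in an in-memory file object, so the PDF parser can seek and read as if it had an open local file, without anything touching disk. A minimal sketch of that pattern, using stand-in bytes rather than a live request:

```python
from io import BytesIO

# Stand-in for the bytes a real request would return, e.g.:
#   data = urllib2.urlopen(Request(url)).read()
data = b"%PDF-1.4\n%fake header for illustration"

# Wrap the bytes in a file-like object; no temporary file is created
fp = BytesIO(data)

# Any parser expecting a file object can now read/seek on fp
print(fp.read(8))
```

Any object with `read`/`seek` works here, which is why both `StringIO` (Python 2) and `BytesIO` (Python 3) appear in the answers.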
Building on your own answer and the function provided here, this should return a string from a PDF at a URL without downloading it:
import urllib2
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def pdf_from_url_to_txt(url):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Open the url provided as an argument to the function and read the content
    f = urllib2.urlopen(urllib2.Request(url)).read()
    # Cast to StringIO object
    fp = StringIO(f)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp,
                                  pagenos,
                                  maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text
Extending the answer above, a small tweak worked like a charm for me. Here is my version of the function, updated for Python 3:
import urllib.request
from io import BytesIO, StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def pdf_from_url_to_txt(url):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    # Read the PDF bytes and wrap them in an in-memory binary buffer
    f = urllib.request.urlopen(url).read()
    fp = BytesIO(f)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp,
                                  pagenos,
                                  maxpages=maxpages,
                                  password=password,
                                  caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text
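One detail worth checking in the Python 3 version: depending on your pdfminer.six release, TextConverter may expect a text buffer (io.StringIO, with no codec) or a binary buffer (io.BytesIO, with a codec), in which case the accumulated output must be decoded once at the end. A minimal stdlib-only sketch of that decode step (the buffer contents here are stand-ins for converter output); note the result is bound to `text` rather than shadowing the built-in `str`:

```python
from io import BytesIO

# Stand-in for a TextConverter writing encoded bytes to a binary buffer
retstr = BytesIO()
retstr.write("Extracted text, page 1\n".encode("utf-8"))

# Decode once after all pages are processed
text = retstr.getvalue().decode("utf-8")
retstr.close()
print(text)
```

If your pdfminer.six version rejects a codec argument, use `StringIO()` for the buffer, drop `codec=codec`, and return `retstr.getvalue()` directly.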