How to open a web-based PDF with pdf2txt in Python 3

I'm successfully parsing local PDFs with pdfminer's pdf2txt in Python 3. I use the following command:

python3 pdf2txt.py -A -M 15.0 -L 0.3 -W 0.2 -F 0.5 -V -o output.txt -t text input.pdf

I was wondering if there is any way to use a PDF web link instead of a local file. I'm not sure how to specify it. I tried wrapping the URL in quotes and in parentheses, but both give an error.

Python has urllib in the standard library; to retrieve the contents of a URL you can use urlretrieve:

import urllib  # Python 2: urlretrieve lives in urllib, not urllib2
urllib.urlretrieve('http://www.example.com/myfile.pdf', 'myfile_local.pdf')

In Python 3 this is tucked away slightly deeper, in urllib.request.urlretrieve.
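For Python 3, the download step might look like the sketch below. The helper name download_pdf is made up for illustration, and the URL and file name in the comment are just the placeholders used in this answer.

```python
import urllib.request


def download_pdf(url, local_path):
    """Save the resource at `url` to `local_path` and return the saved path.

    `download_pdf` is a hypothetical helper name, not part of urllib or pdfminer.
    """
    # urlretrieve returns a (filename, headers) tuple; only the filename is needed.
    saved_path, _headers = urllib.request.urlretrieve(url, local_path)
    return saved_path


# Example call with placeholder values:
# download_pdf('http://www.example.com/myfile.pdf', 'myfile_local.pdf')
```

Once the file is saved, myfile_local.pdf can be passed to pdf2txt.py like any other local PDF.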

I don't know what OS you're using, but you might also just use something like the wget program from the command line; that way you don't have to write any Python code to do the retrieval.

Unfortunately, pdf2txt.py doesn't support parsing streamed PDF documents. The internals require seeking within the file, which is difficult to achieve with a stream.

Your only option is to download the PDF document to your file system and then call pdf2txt.py on it. There are lots of tools for downloading URL resources, e.g. curl or wget, or you could write your own in Python.

You could easily write a shell, batch, or Python script that downloads the PDF to a temporary file, runs pdf2txt.py, and then cleans up.
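A minimal Python sketch of such a script follows. The function name pdf_url_to_text and the converter parameter are made up for illustration. The default assumes pdf2txt.py sits in the current directory, and only the -o and -t flags from the question's command are passed, so add the layout flags you need.

```python
import os
import subprocess
import tempfile
import urllib.request


def pdf_url_to_text(url, output_path, converter=("python3", "pdf2txt.py")):
    """Download a PDF to a temporary file, convert it, then delete the download.

    `converter` is the command prefix used to start pdf2txt.py (an assumption;
    adjust it to wherever the script lives on your system).
    """
    # Reserve a temporary file name and close the handle so other code can write to it.
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    try:
        # pdf2txt.py needs a seekable local file, so save the download to disk first.
        urllib.request.urlretrieve(url, tmp_path)
        # -o and -t are taken from the command line in the question.
        subprocess.run(
            [*converter, "-o", output_path, "-t", "text", tmp_path],
            check=True,
        )
    finally:
        # Remove the temporary PDF whether or not the conversion succeeded.
        os.remove(tmp_path)
    return output_path
```

Calling pdf_url_to_text('http://www.example.com/myfile.pdf', 'output.txt') would then write the extracted text to output.txt and leave no downloaded PDF behind.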
