Extract images from PDF with PHP

The client wants to upload a PDF containing images as a way of batch-processing multiple images at once.

I have already looked around, and out of the box PHP can't read PDFs.

What are my alternatives?

I already know the host has not installed ImageMagick or any PDF library, and the exec function is disabled. That basically leaves me with nothing to work with, I guess?

Does anyone know of an online service that can do this, with an API of sorts?

Thanks in advance.

AFAIK, there is no PHP module to do it. There is a command-line tool, pdfimages (part of xpdf). For reference, here's how it works:

pdfimages -j source.pdf image

This extracts the images from source.pdf as image-000.jpg, image-001.jpg, etc. Note that -j only produces JPEG files for images that are stored in the PDF as JPEG (DCT) data; any other image is written as a PBM or PPM file.

Possible Options

Being a command-line tool, pdfimages has to be run through exec (or system, passthru, or any of the other command-executing functions built into PHP). As your environment doesn't have that, I see four options:

  1. Beg that exec be turned on for you (your hosting provider can limit what you can exec to a single command)
  2. Change the design: how about a ZIP upload?
  3. Roll your own, using the source code of pdfimages as a model
  4. Let pdfimages do the heavy lifting, by running it on a remote host you do control
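Option 2 could look roughly like the following. This is a minimal sketch, assuming the zip extension (ZipArchive) is enabled on the host, which the question does not confirm; the function names and the extension whitelist are made up for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical whitelist of image extensions accepted from the archive.
const IMAGE_EXTENSIONS = ['jpg', 'jpeg', 'png', 'gif'];

/** True when a ZIP entry name looks like an image file (judged by extension only). */
function isImageEntry(string $entryName): bool
{
    if ($entryName === '' || substr($entryName, -1) === '/') {
        return false; // empty name or directory entry
    }
    $ext = strtolower(pathinfo($entryName, PATHINFO_EXTENSION));
    return in_array($ext, IMAGE_EXTENSIONS, true);
}

/**
 * Copy every image entry of an uploaded ZIP into $targetDir.
 * Returns the list of file paths that were written.
 */
function extractImagesFromZip(string $zipPath, string $targetDir): array
{
    $zip = new ZipArchive();
    if ($zip->open($zipPath) !== true) {
        throw new RuntimeException("Cannot open ZIP: $zipPath");
    }

    $written = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        if ($name === false || !isImageEntry($name)) {
            continue;
        }
        // basename() drops any directory part, so an entry such as "../x.jpg"
        // cannot escape $targetDir; the index prefix keeps names unique.
        $target = rtrim($targetDir, '/') . '/' . sprintf('%03d-', $i) . basename($name);
        $data = $zip->getFromIndex($i);
        if ($data !== false && file_put_contents($target, $data) !== false) {
            $written[] = $target;
        }
    }
    $zip->close();

    return $written;
}
```

In an upload handler this would be called with the temporary path of the uploaded file, for example extractImagesFromZip($_FILES['batch']['tmp_name'], $uploadDir), where the field name "batch" and $uploadDir are placeholders.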

Regarding #3, I don't think rolling your own would be too difficult if you only need to solve a very narrow set of requirements. I seem to recall that the image boundaries in PDF are well defined: read the file up to a boundary, cut to the end of the boundary, decode the data, and write it to a file, then repeat. (Image data in a PDF sits in binary stream objects encoded with a filter such as DCTDecode for JPEG, so the decoding step depends on that filter rather than being a plain base64_decode.) However, that may be too much...
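As a rough illustration of rolling your own: in a PDF, the data of an object sits between the stream and endstream keywords, and when the object's only filter is /DCTDecode those bytes are already a complete JPEG file that can be written out unchanged. The sketch below handles only that narrow case. It assumes an unencrypted PDF and scans the raw bytes instead of parsing the file properly, and the function name is made up for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Return the raw JPEG byte strings stored in a PDF as /DCTDecode streams.
 * Assumptions: the PDF is not encrypted, and /DCTDecode is the only filter
 * on the stream. Everything else (Flate-compressed images etc.) is skipped.
 */
function extractJpegStreams(string $pdf): array
{
    $images = [];
    $prevEnd = 0; // end of the previous stream: the next dictionary starts after it
    $scan = 0;    // where to look for the next "stream" keyword

    while (($keyword = strpos($pdf, 'stream', $scan)) !== false) {
        $dataStart = $keyword + strlen('stream');

        // Skip "endstream", and require the CRLF or LF that follows a real stream keyword.
        $skip = $keyword >= 3 && substr($pdf, $keyword - 3, 3) === 'end';
        if (substr($pdf, $dataStart, 2) === "\r\n") {
            $dataStart += 2;
        } elseif (substr($pdf, $dataStart, 1) === "\n") {
            $dataStart += 1;
        } else {
            $skip = true;
        }
        if ($skip) {
            $scan = $keyword + 1;
            continue;
        }

        $dataEnd = strpos($pdf, 'endstream', $dataStart);
        if ($dataEnd === false) {
            break; // truncated file
        }

        // The object's dictionary is the text between the last "obj" and the keyword.
        $header = substr($pdf, $prevEnd, $keyword - $prevEnd);
        $objPos = strrpos($header, 'obj');
        $dict = $objPos === false ? $header : substr($header, $objPos);

        // Collect filter names such as /FlateDecode or /DCTDecode.
        preg_match_all('/\/[A-Za-z0-9]+Decode\b/', $dict, $filters);
        if ($filters[0] === ['/DCTDecode']) {
            // A JPEG ends with FF D9, so trimming the end-of-line bytes before "endstream" is safe.
            $images[] = rtrim(substr($pdf, $dataStart, $dataEnd - $dataStart), "\r\n");
        }

        $prevEnd = $dataEnd + strlen('endstream');
        $scan = $prevEnd;
    }

    return $images;
}
```

Each returned string could then be saved with file_put_contents(), for example as image-000.jpg, image-001.jpg, and so on. Images stored with other filters would need real decoding, which is where this approach stops being simple.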

If rolling your own is too complicated, then option #4 is a bit like what Joel Spolsky describes for working with complicated Excel objects (see the numbered list under the bold heading "Let Office do the heavy work for you").

  • Find a cheap hosting environment (e.g. Amazon EC2) that lets you exec and curl
  • Install pdfimages
  • Write a PHP script that takes a URL to a PDF, uses curl to download that PDF, writes it to disk, passes it to pdfimages, then returns the URLs of the resulting images.

An example exchange could look like this:

GET http://www.cheaphost.com/pdfimages.php?extract=http://www.limitedhost.com/path/to/uploaded.pdf

Content-type: text/html


<html>
<body>
<ul>
<li>http://www.cheaphost.com/pdfimages.php?retrieve=ab9895v/image-000.jpg</li>
<li>http://www.cheaphost.com/pdfimages.php?retrieve=ab9895v/image-001.jpg</li>
</ul>
</body>
</html>
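On the limited host, the calling side of this exchange could be sketched as follows. The host names come from the example above; the function names are made up, and fetching a URL with file_get_contents() assumes allow_url_fopen is enabled there, which is not confirmed.

```php
<?php
declare(strict_types=1);

/** Build the extraction request URL, with the PDF URL properly URL-encoded. */
function buildExtractUrl(string $serviceUrl, string $pdfUrl): string
{
    return $serviceUrl . '?' . http_build_query(['extract' => $pdfUrl]);
}

/** Pull the image URLs out of the <li> elements of the HTML response. */
function parseImageUrls(string $html): array
{
    preg_match_all('~<li>\s*(.*?)\s*</li>~is', $html, $matches);
    return array_map('html_entity_decode', $matches[1]);
}

/** Ask the remote script to extract the images and return the retrieve URLs. */
function fetchImageUrls(string $serviceUrl, string $pdfUrl): array
{
    // Assumption: allow_url_fopen is on; otherwise the curl extension would be needed.
    $html = file_get_contents(buildExtractUrl($serviceUrl, $pdfUrl));
    if ($html === false) {
        throw new RuntimeException('Extraction request failed');
    }
    return parseImageUrls($html);
}
```

Each returned URL can then be downloaded the same way and handed to the existing image-processing code.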

So your single pdfimages.php script (running on the host with the exec functionality) can both extract images and give you access to the extracted images. When extracting, it reads the PDF at the URL you give it, runs pdfimages on it, and gives you back a list of URLs to call to retrieve the extracted images. When retrieving, it just gives you back the image itself.
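The extract half of such a script might be sketched like this. It assumes the curl extension and an installed pdfimages binary on the remote host; WORK_DIR, SELF_URL and the function names are placeholders, and the retrieve half is left out for brevity.

```php
<?php
declare(strict_types=1);

// Hypothetical locations; adjust them for the actual remote host.
const WORK_DIR = '/tmp/pdfimages-jobs';
const SELF_URL = 'http://www.cheaphost.com/pdfimages.php';

/** Shell command that extracts the images of $pdfPath into files named "$prefix-NNN.ext". */
function buildPdfimagesCommand(string $pdfPath, string $prefix): string
{
    return 'pdfimages -j ' . escapeshellarg($pdfPath) . ' ' . escapeshellarg($prefix);
}

/** Turn extracted file paths into retrieve URLs of the form ?retrieve=<job>/<file>. */
function buildRetrieveUrls(string $jobId, array $files): array
{
    return array_map(
        function (string $file) use ($jobId): string {
            return SELF_URL . '?retrieve=' . $jobId . '/' . basename($file);
        },
        $files
    );
}

/** Download the PDF, run pdfimages on it, and return the retrieve URLs. */
function handleExtract(string $pdfUrl): array
{
    // Only fetch http(s) URLs; a real deployment would restrict the allowed hosts as well.
    $scheme = parse_url($pdfUrl, PHP_URL_SCHEME);
    if (!in_array($scheme, ['http', 'https'], true)) {
        throw new InvalidArgumentException('Only http and https URLs are accepted');
    }

    $jobId = bin2hex(random_bytes(4));
    $jobDir = WORK_DIR . '/' . $jobId;
    if (!mkdir($jobDir, 0700, true)) {
        throw new RuntimeException('Cannot create job directory');
    }

    // Stream the remote PDF straight to disk.
    $pdfPath = $jobDir . '/source.pdf';
    $fh = fopen($pdfPath, 'wb');
    $ch = curl_init($pdfUrl);
    curl_setopt($ch, CURLOPT_FILE, $fh);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $ok = curl_exec($ch);
    fclose($fh);
    if ($ok === false) {
        throw new RuntimeException('Download failed');
    }

    exec(buildPdfimagesCommand($pdfPath, $jobDir . '/image'), $output, $status);
    if ($status !== 0) {
        throw new RuntimeException("pdfimages exited with status $status");
    }

    return buildRetrieveUrls($jobId, glob($jobDir . '/image-*') ?: []);
}

// Dispatch: answer ?extract=<pdf url> with an HTML list of retrieve URLs.
if (isset($_GET['extract'])) {
    header('Content-Type: text/html');
    echo '<html><body><ul>';
    foreach (handleExtract((string) $_GET['extract']) as $url) {
        echo '<li>' . htmlspecialchars($url) . '</li>';
    }
    echo '</ul></body></html>';
}
```

Giving every request its own job directory keeps the output of concurrent extractions apart and produces the short job id seen in the retrieve URLs.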

You would need to deal with cleanup; perhaps the thing to do would be to delete each image after retrieval. You would also need to handle security: I don't know what's in these images, but the content might need to be wrapped in SSL, with other precautions taken.
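Both points can be sketched for the retrieve side: validate the requested name strictly, then delete the file once it has been read. The function names, the accepted file extensions, and the directory layout (one sub-directory per job) are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Resolve a "retrieve" value such as "ab9895v/image-000.jpg" to a file inside $workDir.
 * Returns null for anything that does not have exactly that shape.
 */
function resolveRetrievePath(string $workDir, string $retrieve): ?string
{
    // Only "<alphanumeric job id>/image-<digits>.<ext>" is accepted, which also rules out "../".
    if (!preg_match('~^[a-z0-9]+/image-\d+\.(jpg|png|ppm|pbm)$~', $retrieve)) {
        return null;
    }
    $path = rtrim($workDir, '/') . '/' . $retrieve;
    return is_file($path) ? $path : null;
}

/** Read the image, delete it from disk, and return its bytes (null when it is missing). */
function takeImage(string $workDir, string $retrieve): ?string
{
    $path = resolveRetrievePath($workDir, $retrieve);
    if ($path === null) {
        return null;
    }
    $bytes = file_get_contents($path);
    unlink($path); // cleanup: each image can be retrieved only once
    return $bytes === false ? null : $bytes;
}
```

The script would send the result of takeImage() with a matching Content-Type header, or answer with a 404 status when it returns null. Deleting on first retrieval means a failed download cannot be retried, so a time-based cleanup job may be the safer choice.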

You can use pdfimages, which ships in the poppler-utils package. On Debian or Ubuntu, install it this way:

apt install poppler-utils

Then use it this way to get all the images as PNG files:

pdfimages -png mypdf.pdf image

Images will be written to the current working directory as image-000.png, image-001.png, etc.

There are many options available, including some to change the output format; more information here.

I hope this helps!
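From PHP, the same command could be wrapped as shown below. This is a sketch that assumes exec is available (unlike on the host in the question) and that pdfimages is on the PATH; the function names are made up.

```php
<?php
declare(strict_types=1);

/** True when exec() exists and is not listed in the disable_functions setting. */
function execIsAvailable(): bool
{
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
    return function_exists('exec') && !in_array('exec', $disabled, true);
}

/** Command line that writes every image of $pdfPath as "$prefix-NNN.png". */
function buildPngCommand(string $pdfPath, string $prefix): string
{
    return 'pdfimages -png ' . escapeshellarg($pdfPath) . ' ' . escapeshellarg($prefix);
}

/** Run pdfimages and return the paths of the PNG files it produced. */
function extractPngImages(string $pdfPath, string $outputDir): array
{
    if (!execIsAvailable()) {
        throw new RuntimeException('exec() is disabled on this host');
    }
    $prefix = rtrim($outputDir, '/') . '/image';
    exec(buildPngCommand($pdfPath, $prefix), $output, $status);
    if ($status !== 0) {
        throw new RuntimeException("pdfimages exited with status $status");
    }
    return glob($prefix . '-*.png') ?: [];
}
```

escapeshellarg() matters here because the PDF path usually derives from an uploaded file name, which should never reach the shell unquoted.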

Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please cite this site's URL or the address of the original post.

 