Extract images from PDF with PHP
The client wants to upload a PDF containing images as a way of batch processing multiple images at once.
I have already looked around, and out of the box PHP can't read PDFs.
What are my alternatives? I already know the host has not installed ImageMagick or any PDF library, and the exec function is disabled. That basically leaves me with nothing to work with, I guess?
Does anyone know of an online service that can do this, with an API of sorts?
Thanks in advance.
AFAIK, there is no PHP module to do it. There is a command line tool, pdfimages (part of xpdf). For reference, here's how that works:
pdfimages -j source.pdf image
This extracts all images from source.pdf as image-000.jpg, image-001.jpg, etc. Note that with -j only images stored in the PDF as JPEG (DCT) data come out as .jpg; any other images are written as PBM/PPM files.
Possible Options
Being a command line tool, pdfimages needs exec (or system, passthru, or any of the other command-executing functions built into PHP). As your environment doesn't have that, I see four options:
1. ...
2. ...
3. Roll your own, using the source code of pdfimages as a model
4. Let pdfimages do the heavy lifting, by running it on a remote host you do control
Regarding #3: I don't think rolling your own, to solve a very narrowly defined set of requirements, would be too difficult. I seem to recall that the image boundaries in a PDF are well defined: each image is a stream object delimited by the stream and endstream keywords, and a JPEG image (filter /DCTDecode) is stored as the raw JPEG bytes, so there is nothing to base64_decode. Read the file up to a boundary, cut to the end of the boundary, write the bytes to a file, and repeat. However, that may be too much...
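As a rough illustration of rolling your own, here is a minimal sketch that handles only the simplest case: images stored as raw JPEG data (the /DCTDecode filter), located by scanning for the stream and endstream keywords. The function name extractJpegStreams is made up for this example. It ignores the /Length entry, encrypted files, and images that use other filters, so treat it as a starting point under those assumptions rather than a general extractor.

```php
<?php
// Hypothetical helper: returns the raw bytes of every JPEG image stream
// found in the given PDF contents. Only plain /DCTDecode streams are handled.
function extractJpegStreams(string $pdf): array
{
    $images = [];
    $offset = 0;

    // Each JPEG image dictionary names the /DCTDecode filter.
    while (($filterPos = strpos($pdf, '/DCTDecode', $offset)) !== false) {
        // The image bytes follow the next "stream" keyword.
        $streamPos = strpos($pdf, 'stream', $filterPos);
        if ($streamPos === false) {
            break;
        }
        $start = $streamPos + strlen('stream');

        // Skip the end-of-line marker after the keyword (CRLF or LF).
        if (substr($pdf, $start, 2) === "\r\n") {
            $start += 2;
        } elseif (substr($pdf, $start, 1) === "\n") {
            $start += 1;
        }

        $end = strpos($pdf, 'endstream', $start);
        if ($end === false) {
            break;
        }

        // Assumption: a JPEG ends with FF D9, so trimming the line break
        // that precedes "endstream" does not eat image data.
        $bytes = rtrim(substr($pdf, $start, $end - $start), "\r\n");

        // Keep only data that really starts with the JPEG SOI marker; this
        // skips streams where /DCTDecode is combined with another filter.
        if (substr($bytes, 0, 2) === "\xFF\xD8") {
            $images[] = $bytes;
        }
        $offset = $end;
    }

    return $images;
}
```

Passing it the result of file_get_contents on the uploaded PDF and writing each returned element with file_put_contents would give files comparable to the image-000.jpg, image-001.jpg output of pdfimages, for the JPEG images only.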
If rolling your own is too complicated, then option #4 is kind of like what Joel Spolsky describes for working with complicated Excel objects (see the numbered list under the bold heading "Let Office do the heavy work for you").
To do that, you need a host that has exec and curl, with pdfimages installed, and a PHP script there that wraps pdfimages.
An example exchange could look like this:
GET http://www.cheaphost.com/pdfimages.php?extract=http://www.limitedhost.com/path/to/uploaded.pdf
Content-type: text/html
<html>
<body>
<ul>
<li>http://www.cheaphost.com/pdfimages.php?retrieve=ab9895v/image-000.jpg</li>
<li>http://www.cheaphost.com/pdfimages.php?retrieve=ab9895v/image-001.jpg</li>
</ul>
</body>
</html>
So your single pdfimages.php script (running on the host with the exec functionality) can both extract images and give you access to the extracted images. When extracting, it reads the PDF you point it at, runs pdfimages on it, and gives you back a list of URLs to call to retrieve the extracted images. When retrieving, it just gives you back the image itself.
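For the remote side, two small helpers might form the core of such a pdfimages.php script. This is only a sketch: the function names, the "job id/file name" layout of the retrieve parameter, and the base directory are assumptions made up for illustration, not part of pdfimages or of the exchange shown above.

```php
<?php
// Hypothetical helpers for pdfimages.php on the host where exec() works.

// Build the shell command for one extraction job.
function buildExtractCommand(string $pdfPath, string $outputPrefix): string
{
    // escapeshellarg() guards against shell injection through file names.
    return 'pdfimages -j ' . escapeshellarg($pdfPath) . ' ' . escapeshellarg($outputPrefix);
}

// Map a "retrieve" parameter such as "ab9895v/image-000.jpg" to a file path,
// or return null when the value looks unsafe.
function resolveRetrievePath(string $baseDir, string $requested): ?string
{
    // Accept exactly "<job id>/<file name>" built from safe characters.
    if (!preg_match('~^[A-Za-z0-9]+/[A-Za-z0-9._-]+$~', $requested)) {
        return null;
    }
    // Reject any attempt to climb out of the job directory.
    if (strpos($requested, '..') !== false) {
        return null;
    }
    return rtrim($baseDir, '/') . '/' . $requested;
}
```

The extract branch would pass the command to exec and list the files it produced; the retrieve branch would send the resolved file with readfile, or answer with a 404 when resolveRetrievePath returns null.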
You would need to deal with cleanup; perhaps the thing to do would be to delete each image after retrieval. You would also need to handle security. I don't know what's in these images, but the content might need to be sent over SSL, with other precautions taken.
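On the limited host, the calling side of the example exchange might look roughly like the sketch below. It assumes the curl extension is enabled there and that the remote script answers with an HTML list containing one URL per li element, as in the example. The function names and the local file naming are made up for illustration.

```php
<?php
// Sketch of the caller on the limited host (no exec needed here).

// Pull the image URLs out of the HTML list returned by the remote script.
function parseImageUrls(string $html): array
{
    preg_match_all('~<li>\s*(https?://[^<\s]+)\s*</li>~i', $html, $matches);
    return $matches[1];
}

// Fetch a URL with curl and return the response body.
function httpGet(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 60,
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    return $body;
}

// Ask the remote script to extract the images, then download each one.
function fetchExtractedImages(string $serviceUrl, string $pdfUrl, string $targetDir): array
{
    $listing = httpGet($serviceUrl . '?extract=' . urlencode($pdfUrl));
    $saved = [];
    foreach (parseImageUrls($listing) as $i => $imageUrl) {
        $path = sprintf('%s/image-%03d.jpg', rtrim($targetDir, '/'), $i);
        file_put_contents($path, httpGet($imageUrl));
        $saved[] = $path;
    }
    return $saved;
}
```

A call such as fetchExtractedImages('http://www.cheaphost.com/pdfimages.php', 'http://www.limitedhost.com/path/to/uploaded.pdf', '/path/to/images') would then return the list of locally saved files.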
You can use pdfimages and install it this way:
apt install poppler-utils
Then use it this way to get all the images as PNG files:
pdfimages -png mypdf.pdf image
Images will be placed in the current directory as image-000.png, image-001.png, etc.
There are many other options available, including some to change the output format; the pdfimages man page has more information.
I hope this helps!