
Converting doc, docx, pdf to HTML using PHP on Linux

I run a job search site, and I need to convert doc, docx and pdf files into HTML on a Linux CentOS server running PHP. People submit these files as resumes. So far, I have found PHPDocx to be great at converting docx to HTML, but I am stuck on doc and pdf. PDFTOHTML gives the error "bad color" when I run tests. As for doc, I only found wvWare, which seems complex and bulky to install.

Does anyone have any ideas on how to easily convert doc/pdf to HTML?

The only thing I can think of is FPDF. It is intended for creating PDF files in PHP but it can also open PDF files. Maybe you can use that as a base and develop some sort of toHTML function for it.

It is completely free to use and it already has some extensions. It MIGHT help you.

http://www.fpdf.org

EDIT: Thanks to Pierre for the addition to my post in the comments:

You can use FPDI: http://www.setasign.de/products/pdf-php-solutions/fpdi but the input PDF is just like an image.

I haven't taken a look at it myself so far, but this might help.

As far as .doc files go, how about trying OpenOffice/LibreOffice, something like:
lowriter --convert-to html doc_file.doc
As far as PDF goes, if the PDF is a graphical representation of text then you're out of luck; the best you can do is try to convert it to an image with ImageMagick. If it contains proper text, it should convert easily.
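
For reference, a minimal sketch of the LibreOffice route as a shell function. It assumes a LibreOffice install whose `soffice` binary is on the PATH and accepts `--headless`, `--convert-to` and `--outdir`. The function name and the `SOFFICE` and `DRY_RUN` variables are made up for this example.

```shell
#!/bin/sh
# Convert one Word document to HTML with headless LibreOffice.
# SOFFICE and DRY_RUN are hypothetical knobs for this sketch, not LibreOffice settings.

doc_to_html() {
    in_file=$1
    out_dir=$2
    soffice_bin=${SOFFICE:-soffice}

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Show the command instead of running it.
        echo "$soffice_bin --headless --convert-to html --outdir $out_dir $in_file"
        return 0
    fi

    # Create the output directory, then let LibreOffice write the .html file there.
    mkdir -p "$out_dir" || return 1
    "$soffice_bin" --headless --convert-to html --outdir "$out_dir" "$in_file"
}

# Usage: doc_to_html resume.doc /tmp/html
```

From PHP, something like this could be called through exec(), with the uploaded file name passed through escapeshellarg() first.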

There are various tools out there already to do this, such as http://dag.wieers.com/home-made/unoconv/ and http://www.phpdocx.com/ (which you've already tried).

http://www.phplivedocx.org/2009/08/13/convert-docx-doc-rtf-to-html-in-php/ looks promising.

Or you could install a portable version of LibreOffice on your server, which allows command-line conversion: https://help.libreoffice.org/Common/Starting_the_Software_With_Parameters

I'm sure there'll be tutorials out there (in the LibreOffice support area).

To easily convert pdf to HTML, I would suggest pdf2htmlEX, which produces outstanding HTML and is fast enough for runtime conversion. You should first put some effort into optimizing and building it for your system. There is a simple build how-to included at the project link.
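
Putting the suggestions in this thread together, here is a rough sketch of how uploads might be routed to a converter by file extension. It assumes `soffice` and `pdf2htmlEX` are both installed and on the PATH; the function names `pick_converter` and `convert_resume` are invented for illustration, and the mapping of extensions to tools is only one possible choice.

```shell
#!/bin/sh
# Route an uploaded resume to a converter based on its file extension.
# Function names here are hypothetical; only the tool flags come from the tools themselves.

pick_converter() {
    # Lower-case the extension so "CV.PDF" and "cv.pdf" take the same branch.
    ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
    case "$ext" in
        doc|docx) echo "soffice" ;;
        pdf)      echo "pdf2htmlEX" ;;
        *)        echo "unsupported"; return 1 ;;
    esac
}

convert_resume() {
    # $1 = input file, $2 = output directory
    tool=$(pick_converter "$1") || return 1
    case "$tool" in
        soffice)    soffice --headless --convert-to html --outdir "$2" "$1" ;;
        pdf2htmlEX) pdf2htmlEX --dest-dir "$2" "$1" ;;
    esac
}

# Usage: convert_resume uploads/cv.pdf /var/www/html_resumes
```

Checking the extension alone is a shortcut; a stricter setup might also inspect the file's MIME type before handing it to a converter.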
