
Convert PDF to HTML without losing any format

Asked 2020-03-24 14:41:05 · Tags: python, html, pdf, heroku, pdf2htmlex

I'm developing a Python Flask web app and I'm trying to convert user-uploaded PDFs to nicely formatted HTML, like the HTML produced when you display a PDF inside an iframe.

Here is what I have tried so far:

the pdfminer.six library, which produced messy HTML;
grabbing the HTML produced when rendering a PDF with pdf.js, which is apparently hidden in a Shadow DOM with no access to its inner HTML;
finally, pdf2htmlEX ( https://github.com/pdf2htmlEX/pdf2htmlEX ), which produced exactly what I wanted.
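For reference, calling pdf2htmlEX from a Flask view could look roughly like the sketch below. It is only an illustration: it assumes the pdf2htmlEX binary is on PATH, and the helper names build_command and convert_pdf, the example paths, and the 120-second timeout are hypothetical rather than taken from the question.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir, exe="pdf2htmlEX"):
    """Return the argument list for one pdf2htmlEX run (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    out_name = pdf_path.stem + ".html"
    # --dest-dir selects the output directory; the last argument names the
    # HTML file that is written inside that directory.
    return [exe, "--dest-dir", str(out_dir), str(pdf_path), out_name]


def convert_pdf(pdf_path, out_dir, timeout=120):
    """Convert one PDF and return the path of the generated HTML file."""
    if shutil.which("pdf2htmlEX") is None:
        # Assumption: the binary is installed system-wide and is on PATH.
        raise RuntimeError("pdf2htmlEX is not installed or not on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cmd = build_command(pdf_path, out_dir)
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(cmd, check=True, timeout=timeout)
    return out_dir / cmd[-1]
```

Passing the arguments as a list (no shell) keeps user-supplied file names from being interpreted by a shell, which matters when the PDFs come from uploads.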

Locally, this solution worked great, but in production (Heroku) I was unable to install it correctly. The project is deprecated and the documentation is limited and terrible. The problem has something to do with broken dependencies.

So, how can I convert PDFs to HTML effectively, without losing any formatting, using Python or any other tool?

Thanks a lot.

If anyone is willing to help me get pdf2htmlEX working on Heroku, leave a comment and I will post more details in a separate post.

1 answer

This is not going to be trivial, but I'll give some pointers.

You need an app.json in which you define your buildpacks:
https://devcenter.heroku.com/articles/app-json-schema#buildpacks

If this project is available via apt, it's going to be easy. You just use Heroku's Apt buildpack and define an Aptfile that lists the packages it needs to install (the original answer links an example). The buildpack then installs them automatically and you are done.
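To make the two files concrete, here is a small Python sketch that renders an app.json and an Aptfile for the apt route. It rests on assumptions: the package name pdf2htmlex is a guess and may not exist in the apt repositories of the Heroku stack you deploy to, and the helper names render_files and write_files are made up for this illustration. The Apt buildpack is listed before the Python one so the system packages are installed first.

```python
import json
from pathlib import Path

# Buildpacks run in order: Apt first, then the Python buildpack for the app.
APP_JSON = {
    "buildpacks": [
        {"url": "https://github.com/heroku/heroku-buildpack-apt"},
        {"url": "heroku/python"},
    ]
}

# Assumption: "pdf2htmlex" is installable via apt on the target stack.
APT_PACKAGES = ["pdf2htmlex"]


def render_files():
    """Return the file contents keyed by file name (hypothetical helper)."""
    return {
        "app.json": json.dumps(APP_JSON, indent=2) + "\n",
        # The Aptfile lists one package per line.
        "Aptfile": "\n".join(APT_PACKAGES) + "\n",
    }


def write_files(project_dir="."):
    """Write app.json and Aptfile into the project root."""
    for name, content in render_files().items():
        Path(project_dir, name).write_text(content)
```

Calling write_files() from the repository root creates both files; in practice they are short enough to write by hand, and the script mainly shows what each one should contain.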

If it is not available as a package, you will need to create your own buildpack:
https://devcenter.heroku.com/articles/buildpack-api
An example is linked in the original answer.