
Convert PDF to HTML without losing any format

I'm developing a Python Flask web app and I'm trying to convert user-uploaded PDFs to nicely formatted HTML, like the HTML produced when you display a PDF inside an iframe.

I have tried several things so far:

  • the pdfminer.six library, which produced messy HTML;
  • grabbing the HTML produced when rendering a PDF with pdf.js, which is apparently hidden in a Shadow DOM with no access to its inner HTML;
  • finally, pdf2htmlEX ( https://github.com/pdf2htmlEX/pdf2htmlEX ), which produced exactly what I wanted.

Locally, this solution worked great; however, in production (Heroku) I was unable to install it correctly. The project is deprecated and its documentation is limited and terrible. The problem has something to do with broken dependencies.
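For context, calling pdf2htmlEX from a Flask app usually comes down to shelling out to the binary. The following is only a minimal sketch, not the code from my project: the function names, the timeout, and the output naming are assumptions made for illustration. It relies on the documented `pdf2htmlEX [options] <input.pdf> [<output.html>]` usage and the `--dest-dir` option.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdf2htmlex_command(pdf_path, dest_dir, binary="pdf2htmlEX"):
    """Return the argument list for one pdf2htmlEX invocation (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    # --dest-dir selects the output directory; the positional arguments
    # are the input PDF and the name of the HTML file to create there.
    return [
        binary,
        "--dest-dir", str(dest_dir),
        str(pdf_path),
        pdf_path.stem + ".html",
    ]


def convert_pdf_to_html(pdf_path, dest_dir, timeout=120):
    """Run pdf2htmlEX and return the path of the generated HTML file."""
    # Fail early with a clear message if the binary is missing, which is
    # exactly the situation on a host where the install did not succeed.
    if shutil.which("pdf2htmlEX") is None:
        raise RuntimeError("pdf2htmlEX is not installed or not on PATH")
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    cmd = build_pdf2htmlex_command(pdf_path, dest_dir)
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(cmd, check=True, timeout=timeout)
    return dest_dir / cmd[-1]
```

Keeping the command construction separate from the `subprocess.run` call makes it possible to unit-test the arguments on a machine where pdf2htmlEX is not installed.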

So, how can I convert PDFs to HTML effectively without losing any formatting, using Python or any other tool?

Thanks a lot.

If anyone is willing to help me get pdf2htmlEX working on Heroku, leave a comment and I will post more details in a separate post.

This is not going to be trivial, but I'll give some pointers.

You need an app.json in which you define your buildpacks:
https://devcenter.heroku.com/articles/app-json-schema#buildpacks

If this project is available via apt, it's going to be easy. You just use Heroku's Apt buildpack and define an Aptfile that lists the packages it needs to install.
Then it installs them automatically and you are done.
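As a rough sketch of what those two files might contain, here is a small Python script that writes them. The buildpack names `heroku-community/apt` and `heroku/python` are the registry names I would expect to use, and the package name `pdf2htmlex` is purely an assumption: check whether such a package actually exists for your Heroku stack before relying on it. The function name `write_heroku_config` is hypothetical.

```python
import json
from pathlib import Path

# Buildpacks run in the listed order: apt first (system packages),
# then the Python buildpack for the Flask app itself.
APP_JSON = {
    "buildpacks": [
        {"url": "heroku-community/apt"},
        {"url": "heroku/python"},
    ]
}

# ASSUMPTION: a package called "pdf2htmlex" is available via apt on your
# Heroku stack. Verify this; if it is not, this approach does not apply.
APT_PACKAGES = ["pdf2htmlex"]


def write_heroku_config(project_dir):
    """Write app.json and Aptfile into project_dir (hypothetical helper)."""
    project_dir = Path(project_dir)
    (project_dir / "app.json").write_text(json.dumps(APP_JSON, indent=2) + "\n")
    # The Aptfile is plain text with one package name per line.
    (project_dir / "Aptfile").write_text("\n".join(APT_PACKAGES) + "\n")
```

Running `write_heroku_config(".")` in the project root creates both files; writing them by hand works just as well. For an app that already exists, the buildpacks can also be added with the Heroku CLI instead of app.json.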

If it is not available as a package, you will need to create your own buildpack:
https://devcenter.heroku.com/articles/buildpack-api

Another solution is to dockerize your project and run it as a Docker container.


