
Convert PDF to HTML without losing any format

2020-03-24 14:41:05 · Tags: python, html, pdf, heroku, pdf2htmlex

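I am developing a Python Flask web app, and I am trying to convert some user-uploaded PDFs into well-formatted HTML, like the HTML generated when a PDF is displayed in an iframe.

So far, I have tried several things:

The pdfminer.six library, which produced messy HTML,
Grabbing the generated HTML when the PDF is rendered with pdf.js, but it is apparently hidden in a Shadow DOM, so its inner HTML cannot be accessed,
Finally, I came across pdf2htmlEX ( https://github.com/pdf2htmlEX/pdf2htmlEX ), which produces exactly what I want.

Locally this solution works well, but in production (Heroku) I cannot get it installed properly. The project is deprecated, and its documentation is limited and poor. The problem is related to broken dependencies.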
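For context, the local setup described above might call pdf2htmlEX from Python roughly like this. This is only a sketch, not the asker's actual code: the function names `build_command` and `convert` are hypothetical, and it assumes the binary is on `PATH` under the name `pdf2htmlEX` and accepts `--dest-dir` followed by the input PDF and an output file name.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir, binary="pdf2htmlEX"):
    """Return the argv list for one conversion (no shell involved)."""
    pdf_path = Path(pdf_path)
    # Assumed CLI shape: pdf2htmlEX [options] <input.pdf> [<output.html>]
    return [
        binary,
        "--dest-dir", str(out_dir),
        str(pdf_path),
        pdf_path.stem + ".html",
    ]


def convert(pdf_path, out_dir, binary="pdf2htmlEX"):
    """Convert one PDF and return the path of the generated HTML file."""
    # Fail early with a clear message when the binary is missing,
    # which is the situation described on Heroku.
    if shutil.which(binary) is None:
        raise RuntimeError(f"{binary} is not installed or not on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_command(pdf_path, out_dir, binary), check=True)
    return Path(out_dir) / (Path(pdf_path).stem + ".html")
```

Checking for the binary with `shutil.which` before running makes a missing installation show up as an explicit error rather than an obscure subprocess failure.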

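So, how can I efficiently convert PDF to HTML without losing any formatting, using Python or any other tool?

Thanks a lot.

If anyone is willing to help me get pdf2htmlEX working on Heroku, please leave a comment and I will post more details in a separate question.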

1 Answer

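This won't be trivial, but I will give some pointers.

You need an app.json to define your buildpacks.
https://devcenter.heroku.com/articles/app-json-schema#buildpacks

If the project is available through apt, it will be easy. You just use Heroku's Apt buildpack and define an Aptfile stating which packages need to be installed. Example.
It will then install them automatically and you are done.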
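The two files mentioned above could be generated along these lines. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the Aptfile is assumed to list one package name per line, and the package name in `PACKAGES` is a placeholder to be replaced with whatever your converter actually needs.

```python
import json
from pathlib import Path


def render_app_json(buildpack_urls):
    """Serialize an app.json body listing buildpacks in the given order."""
    manifest = {"buildpacks": [{"url": url} for url in buildpack_urls]}
    return json.dumps(manifest, indent=2) + "\n"


def render_aptfile(packages):
    """Serialize an Aptfile: one package name per line."""
    return "".join(f"{name}\n" for name in packages)


def write_heroku_config(project_dir, buildpack_urls, packages):
    """Write app.json and Aptfile into the project root."""
    project_dir = Path(project_dir)
    (project_dir / "app.json").write_text(render_app_json(buildpack_urls))
    (project_dir / "Aptfile").write_text(render_aptfile(packages))


# The Apt buildpack is listed first so the system packages are installed
# before the Python buildpack runs.
BUILDPACKS = [
    "https://github.com/heroku/heroku-buildpack-apt",
    "https://github.com/heroku/heroku-buildpack-python",
]

# Placeholder only: substitute the real apt package names you need.
PACKAGES = ["your-package-name"]
```

Calling `write_heroku_config(".", BUILDPACKS, PACKAGES)` from the repository root would create both files next to the Flask app.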

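If it is not available as a package, you will need to create your own buildpack.
https://devcenter.heroku.com/articles/buildpack-api
An example of one in use is here.

Another solution is to dockerize your project and run it as a Docker container.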


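Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.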
