
Convert PDF to HTML without losing any format

I'm developing a Python Flask web app and I'm trying to convert user-uploaded PDFs to nicely formatted HTML, like the HTML that is produced when you display a PDF inside an iframe.

I tried several things so far:

  • the pdfminer.six library, which produced messy HTML;
  • grabbing the HTML that pdf.js produces when rendering a PDF, which is apparently hidden in a Shadow DOM with no access to its inner HTML;
  • finally, pdf2htmlEX ( https://github.com/pdf2htmlEX/pdf2htmlEX ), which produced exactly what I wanted.

Locally, this solution worked great, but in production (Heroku) I was unable to install it correctly. The project is deprecated and its documentation is limited and poor. The problem has something to do with broken dependencies.

So, how can I convert PDFs to HTML effectively without losing any formatting, using Python or any other tool?

Thanks a lot.

If anyone is willing to help me get pdf2htmlEX working on Heroku, leave a comment and I will post more details in a separate post.

This is not going to be trivial. But I'll give some pointers.

You need an app.json in which you define your buildpacks.
https://devcenter.heroku.com/articles/app-json-schema#buildpacks

If this project is available via apt, it's going to be easy. You just use Heroku's Apt buildpack and define an Aptfile that lists the packages it needs to install. Example
Heroku then installs them automatically and you are done.
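To make the two files concrete, here is a minimal sketch in Python that writes them out. The buildpack URLs are the public Heroku buildpack repositories; the helper names and the package name `pdf2htmlex` in the usage line are assumptions for illustration, and whether that package actually exists in the apt repositories of your Heroku stack is exactly the "if" above.

```python
import json
from pathlib import Path

# The Apt buildpack is listed first so its packages are in place
# before the Python buildpack builds the app.
APT_BUILDPACK = "https://github.com/heroku/heroku-buildpack-apt"
PYTHON_BUILDPACK = "https://github.com/heroku/heroku-buildpack-python"


def build_app_json():
    """Return the app.json content as a dict (only the buildpacks key)."""
    return {"buildpacks": [{"url": APT_BUILDPACK}, {"url": PYTHON_BUILDPACK}]}


def write_heroku_config(project_dir, apt_packages):
    """Write app.json and an Aptfile (one package per line) into project_dir."""
    project = Path(project_dir)
    (project / "app.json").write_text(json.dumps(build_app_json(), indent=2) + "\n")
    (project / "Aptfile").write_text("\n".join(apt_packages) + "\n")
```

Calling `write_heroku_config(".", ["pdf2htmlex"])` in the project root would produce both files; you can just as well write them by hand, since they are only a few lines each.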

If it is not available as a package you will need to create your own buildpack.
https://devcenter.heroku.com/articles/buildpack-api
Example used here.

Another solution is to Dockerize your project and run it as a Docker container.
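Whichever route gets the binary installed, the Flask app would then shell out to it. A rough sketch, assuming the executable is on the PATH under the name `pdf2htmlEX`; the function names and the timeout value are made up for illustration, while `--dest-dir` is the tool's option for choosing the output directory.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir, binary="pdf2htmlEX"):
    """Build the argument list: pdf2htmlEX --dest-dir <dir> <input.pdf> <output.html>."""
    pdf = Path(pdf_path)
    return [binary, "--dest-dir", str(out_dir), str(pdf), pdf.stem + ".html"]


def convert_pdf(pdf_path, out_dir, timeout=120):
    """Run the converter and return the path of the generated HTML file.

    Raises subprocess.CalledProcessError if the tool exits with an error,
    and subprocess.TimeoutExpired if it runs longer than `timeout` seconds.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # A list of arguments (no shell) avoids quoting problems with file names.
    subprocess.run(build_command(pdf_path, out), check=True, timeout=timeout)
    return out / (Path(pdf_path).stem + ".html")
```

In a Flask view you would save the upload to a temporary path first (with a sanitized file name), call `convert_pdf` on it, and serve the returned file.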
