Convert PDF to HTML without losing any format

Original post: 2020-03-24 14:41:05. Tags: python, html, pdf, heroku, pdf2htmlex

I'm developing a Python Flask web app and I'm trying to convert user-uploaded PDFs to nicely formatted HTML, like the HTML produced when you display a PDF inside an iframe.

I have tried several things so far:

the pdfminer.six library, which produced messy HTML,
grabbing the HTML produced when rendering a PDF with pdf.js, which is apparently hidden in a Shadow DOM with no access to its inner HTML,
finally, I came across pdf2htmlEX ( https://github.com/pdf2htmlEX/pdf2htmlEX ), which produced exactly what I wanted.

Locally, this solution worked great; however, in production (Heroku) I was unable to install it correctly. The project is deprecated and the documentation is limited and terrible. The problem has something to do with broken dependencies.
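For reference, calling pdf2htmlEX from a Flask app usually comes down to a thin wrapper around its command line. The following is a minimal sketch, assuming the `pdf2htmlEX` binary is on PATH. The helper names `build_command` and `convert_pdf` are hypothetical and not taken from the original setup.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir):
    """Return the pdf2htmlEX argument list for one PDF (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    # --dest-dir selects the output directory; the last argument names the output file.
    return [
        "pdf2htmlEX",
        "--dest-dir", str(out_dir),
        str(pdf_path),
        pdf_path.stem + ".html",
    ]


def convert_pdf(pdf_path, out_dir):
    """Run pdf2htmlEX and return the path of the generated HTML file."""
    if shutil.which("pdf2htmlEX") is None:
        # This is the failure mode to expect when the binary is missing in production.
        raise RuntimeError("pdf2htmlEX is not installed or not on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # Passing a list (no shell) avoids quoting problems with uploaded file names.
    subprocess.run(build_command(pdf_path, out_dir), check=True)
    return Path(out_dir) / (Path(pdf_path).stem + ".html")
```

Checking for the binary with `shutil.which` first turns a missing installation into a clear error instead of an opaque `FileNotFoundError` at request time.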

So, how can I convert PDFs to HTML effectively without losing any formatting, using Python or any other tool?

Thanks a lot.

If anyone is willing to help me get pdf2htmlEX to work on Heroku, leave a comment and I will post more details in a separate post.

1 answer

This is not going to be trivial. But I'll give some pointers.

You need an app.json in which you define your buildpacks.
https://devcenter.heroku.com/articles/app-json-schema#buildpacks

If this project is available via apt, it's going to be easy: use Heroku's Apt buildpack and define an Aptfile that lists the packages to install. The buildpack then installs them automatically and you are done.
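As a rough illustration of those two files, here is a small Python sketch that writes an `app.json` and an `Aptfile` into a project directory. The package name `pdf2htmlex` is an assumption: whether such a package exists in the apt repositories of your Heroku stack needs to be checked first. The function name `write_heroku_config` is hypothetical as well.

```python
import json
from pathlib import Path

# Assumption: placeholder package name; verify that it is available
# via apt on your Heroku stack before relying on it.
APT_PACKAGES = ["pdf2htmlex"]

APP_JSON = {
    "buildpacks": [
        # The Apt buildpack is listed first so the packages are installed
        # before the Python buildpack sets up the app.
        {"url": "https://github.com/heroku/heroku-buildpack-apt"},
        {"url": "heroku/python"},
    ]
}


def write_heroku_config(project_dir="."):
    """Write app.json and Aptfile into project_dir (hypothetical helper)."""
    root = Path(project_dir)
    (root / "app.json").write_text(json.dumps(APP_JSON, indent=2) + "\n")
    # The Aptfile lists one package per line.
    (root / "Aptfile").write_text("\n".join(APT_PACKAGES) + "\n")
```

Both files can just as well be written by hand; the script only makes their expected contents explicit.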

If it is not available as a package, you will need to create your own buildpack.
https://devcenter.heroku.com/articles/buildpack-api
Example used here.

Another solution is to dockerize your project and run it as a Docker container.

solution1 1 ACCEPTED 2020-03-24 16:19:04