Extracting all images and text from a PDF file

I need to create JSON from a PDF so that I can render the PDF content as HTML, with all of its images and text. I have tried the modules below. So far I can extract only plain images, not the graphical images or the background shadow images. Is there any module that can get these?

Modules tried

- PDFMiner (Python)
- Mammoth (Node)
- pdf2json (Node)
- PDFBox (Java)

Have a look at PyMuPDF: http://pythonhosted.org/PyMuPDF/ . Apparently it renders pages in various formats, including JSON. Although I have limited experience with it, the recipe at http://code.activestate.com/recipes/580703-extract-images-of-a-pdf-optionally-by-page-using-p/history/1/ shows how to use PyMuPDF to extract images from a PDF.
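As a rough illustration of the PyMuPDF suggestion, here is a minimal sketch that turns each page into a JSON record of text spans and embedded images. It assumes a recent PyMuPDF release where `page.get_text("dict")` returns text blocks (`type` 0) and image blocks (`type` 1); older releases spell the method `getText`. The function names `page_dict_to_record` and `pdf_to_json` and the output layout are made up for this example and are not part of the library.

```python
import base64
import json


def page_dict_to_record(page_dict):
    """Turn the dict from page.get_text("dict") into JSON-safe data.

    The record layout (width/height/text/images) is an assumption for
    this sketch, not something PyMuPDF defines.
    """
    record = {
        "width": page_dict.get("width"),
        "height": page_dict.get("height"),
        "text": [],
        "images": [],
    }
    for block in page_dict.get("blocks", []):
        if block.get("type") == 0:  # text block: lines made of spans
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    record["text"].append(
                        {"bbox": list(span["bbox"]), "text": span["text"]}
                    )
        elif block.get("type") == 1:  # image block: raw bytes plus format
            record["images"].append(
                {
                    "bbox": list(block["bbox"]),
                    "ext": block.get("ext"),
                    # bytes are not JSON-serializable, so store base64 text
                    "data": base64.b64encode(block["image"]).decode("ascii"),
                }
            )
    return record


def pdf_to_json(path):
    """Read every page of the PDF at `path` and return one JSON string."""
    import fitz  # PyMuPDF; imported here so the helper above runs without it

    doc = fitz.open(path)
    try:
        pages = [page_dict_to_record(page.get_text("dict")) for page in doc]
    finally:
        doc.close()
    return json.dumps({"pages": pages})
```

Calling `pdf_to_json("input.pdf")` (a placeholder file name) would give a string that an HTML renderer can position using the `bbox` values. Note that this only picks up embedded raster images. Shadows and other graphics drawn as vector paths will probably not appear as image blocks, so they may need a different approach, such as rendering the whole page with `page.get_pixmap()` and using that as a background.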
