Tag [pdf-parsing]: Newest Questions

How to recognize a graph in a PDF using Python?

New to PDF parsing. I want to recognize a graph in a PDF file so I can skip it and not extract this type of text. All I know about the PDF is that ...

Extract PDF text at a specific location from each page of a document using Node.js

I have a PDF document that will have multiple pages in it. Each page will have a unique ID in the footer. My job is to separate each page in the document into sep ...

How to parse text from a Vietnamese PDF with Python?

I want to parse the PDF to text. But when I use PyPDF2 or PyMuPDF to extract text from this PDF, I have a problem: it returns special characters when ...

Arabic PDF text extraction

I'm trying to extract text from Arabic PDFs (raw data extraction, not OCR). I tried many packages and tools, and none of them worked: Python packages, p ...

Apache PDFBox - vertical match between image and text position

I need help to achieve a mapping between text and image objects in a PDF document. As the first figure shows, my PDF documents have 3 images arranged ...

(while reading XRef): Error: Invalid XRef stream header?

Hi, I am trying to read a PDF in Node.js. When I try to read this PDF, it starts showing this error. Here is my code as well, but when I try to pars ...

Parse PDF shape data in Python

I am trying to put together a script to fix a large number of PDFs that have been exported from AutoCAD via their DWG2PDF print driver. When usi ...

Python - Google Cloud Document AI API - Not reading the whole .pdf file

I am trying to read a PDF stored in GCS in Python using the Google Document AI API and return the text from the PDF as a string. I do not want the parser to ...

Do PDF name objects require capitalization?

Page 17 of the PDF 1.7 spec indicates that /lime#20Green should produce Lime Green. Is this an erratum? I see nothing in the spec about capitalizing th ...
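For reference, a minimal sketch of how the `#xx` escape in a PDF name can be decoded. In PDF names, `#` followed by two hexadecimal digits stands for the character with that code, and names are case-sensitive, so decoding should leave the case of the other characters untouched. The function name `decode_pdf_name` is made up for this illustration, and the sketch assumes single-byte characters only.

```python
import re


def decode_pdf_name(token: str) -> str:
    """Decode a PDF name token such as '/lime#20Green' into its text value.

    Hypothetical helper for illustration; assumes single-byte characters.
    """
    if not token.startswith("/"):
        raise ValueError("a PDF name token must start with '/'")
    # Replace each '#' followed by two hex digits with the character it encodes.
    # Every other character is kept exactly as written, including its case.
    return re.sub(
        r"#([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        token[1:],
    )


print(decode_pdf_name("/lime#20Green"))  # prints "lime Green"
```

Under this reading, `/lime#20Green` decodes to `lime Green` with a lowercase `l`, and only `/Lime#20Green` decodes to `Lime Green`.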

How to order text extracted from a PDF?

I'm building a PDF parser that extracts text and saves it into a txt file. I'm doing that by tracing all content objects, then decoding the streams using ...
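One common way to approach the ordering problem is to keep the position of each text fragment and sort by it afterwards. The sketch below is not the asker's code: it assumes each fragment has already been reduced to a hypothetical `(x, y, text)` tuple in default PDF user space, where the origin is at the bottom-left and larger `y` means higher on the page. The `line_tolerance` parameter is also an assumption, used to treat fragments with nearly equal `y` as one line.

```python
from typing import List, Tuple

# Hypothetical fragment shape: (x, y, text) in default PDF user space.
Fragment = Tuple[float, float, str]


def order_fragments(fragments: List[Fragment], line_tolerance: float = 2.0) -> List[str]:
    """Group fragments into lines (top to bottom) and order each line left to right."""
    # Larger y is higher on the page, so sort by descending y first.
    by_height = sorted(fragments, key=lambda f: -f[1])

    lines: List[List[Fragment]] = []
    for frag in by_height:
        # Compare against the first fragment of the current line.
        if lines and abs(lines[-1][0][1] - frag[1]) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    # Within a line, read left to right by x.
    return [
        " ".join(text for _, _, text in sorted(line, key=lambda f: f[0]))
        for line in lines
    ]


sample = [(100.0, 700.0, "world"), (50.0, 700.5, "Hello"), (50.0, 680.0, "Second"), (120.0, 680.0, "line")]
print(order_fragments(sample))  # prints ['Hello world', 'Second line']
```

This simple grouping does not handle multi-column layouts or rotated text; those need extra logic.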

"'NoneType' object is not iterable" when trying to extract from PDF

I am trying to extract data from a PDF, but I keep getting a type error because my object is not iterable (on the statement for line in text: but I do ...
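This error usually means the value being looped over is `None`, for example because an extraction call returned nothing for a page. A minimal sketch of a guard, assuming the extracted text arrives as either a string or `None`; the helper name `iter_lines` is made up for this illustration and no specific PDF library is implied.

```python
from typing import Iterator, Optional


def iter_lines(text: Optional[str]) -> Iterator[str]:
    """Yield the lines of extracted text, treating None as empty text."""
    # `text or ""` turns None into an empty string, so the loop simply does nothing.
    # splitlines() is used because iterating a str directly yields characters, not lines.
    for line in (text or "").splitlines():
        yield line


print(list(iter_lines(None)))            # prints []
print(list(iter_lines("first\nsecond")))  # prints ['first', 'second']
```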

Process images extracted with PdfPig

Images extracted using PdfPig are the type of XObject Image or InlineImage (both inherit from IPdfImage). I would like to save and display them in a s ...

Call to undefined method Smalot\PdfParser\Encoding::__toString()

I am using the PdfParser library for parsing PDFs. While parsing, some pages of the 20-page PDF file are read and some pages are not. This is the code I am us ...

Check If Location Value Is Present In Array

I am writing a script to parse a LinkedIn CV. I am stuck at the work experience section. Currently I am able to extract the work experience text from th ...

Is there a way to pass credentials programmatically for using Google Document AI without reading from disk?

I am trying to run the demo code given for PDF parsing in GCP Document AI. To run the code, exporting Google credentials on the command line works fine. ...

Iterate Over Files (PDFs) to Run a Function

I am trying to read PDF files from a directory (path) to extract individual images from each PDF and write to the same directory. However, I am unable ...
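A minimal sketch of the iteration part only, using the standard library. The name `apply_to_pdfs` is hypothetical, and `func` stands in for whatever per-file work is needed (such as image extraction), which is not shown here.

```python
from pathlib import Path
from typing import Callable, Dict, TypeVar, Union

T = TypeVar("T")


def apply_to_pdfs(directory: Union[str, Path], func: Callable[[Path], T]) -> Dict[str, T]:
    """Call `func` on every .pdf file directly inside `directory`."""
    results: Dict[str, T] = {}
    # glob("*.pdf") does not recurse into subdirectories; sorting keeps the order stable.
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        results[pdf_path.name] = func(pdf_path)
    return results
```

For example, `apply_to_pdfs("some/folder", lambda p: p.stat().st_size)` would map each PDF file name to its size in bytes. On case-sensitive file systems the pattern will not match names ending in `.PDF`.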

PDF Hidden objects

I am studying marked content in PDF. I came across one PDF file which has marked content, but a few objects from the marked content are hidden. So here one b ...

How to read PDF contents in Selenium

I'm trying to verify the contents of a PDF. I'm getting the URL using href and passing it to the code below. The URL uses HTTPS, so I'm facing the below issu ...

Extract value from PDF file to variable

I am trying to get the "Invoice number", in this case INV-3337, from a PDF file and would like to store it as a variable for future use in the code. Currentl ...
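Once the PDF text is available as a string, a regular expression is one way to pull the value out. The sketch below assumes invoice numbers always look like `INV-` followed by digits, which is inferred only from the single example INV-3337; the sample string and the names `INVOICE_PATTERN` and `find_invoice_number` are made up for illustration.

```python
import re
from typing import Optional

# Assumed format: "INV-" followed by one or more digits.
INVOICE_PATTERN = re.compile(r"\bINV-\d+\b")


def find_invoice_number(text: str) -> Optional[str]:
    """Return the first invoice number found in `text`, or None if there is none."""
    match = INVOICE_PATTERN.search(text)
    return match.group(0) if match else None


# Hypothetical text as it might come out of a PDF text extractor.
invoice_number = find_invoice_number("Invoice number INV-3337")
print(invoice_number)  # prints "INV-3337"
```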

How to extract table text from PDFs using pdfminer in Python

I am looking for a script to extract table text from PDFs using pdfminer. I have tried tabula, but I am looking to integrate the normal text and table te ...