Tag [pdf-extraction]: Newest Questions

RTL (Arabic) ligatures problem when extracting text from PDF

When extracting Arabic text from a PDF file using libraries like PyMuPDF or PDFMiner, the words are returned in backward order, which is a normal beha ...

Python: extract text between two tables as a title for the table (outside the tables) from a PDF with tabula

I am trying to extract tables from PDF files. After trying multiple different packages, tabula is the best one to extract the tables from my pd ...

How is the text from this PDF encoded?

I have some PDFs with data about machine parts and I am trying to extract sizes. I extracted the text from a PDF via pypdfium2. Most of the text is ...

How to extract the anchor text/words of every hyperlink from a PDF using Python?

I am trying to extract the hyperlinks present in each page, with their anchor text, from a PDF using the PyMuPDF library. I am able to extract hyperlinks with thei ...

Convert PDF to text file using VBA and Adobe Acrobat XI standard

Part 3 of a previous post. The task: I am attempting to iterate over a series of URLs presented in Excel and generate complete text files for each. ...

How to get a text file from the Adobe Extract API, which returns a zip file containing structured JSON?

In .NET, using the Adobe Extract API for PDF to text, I'm getting structured JSON information (zipped). How can I get the normal text file using this infor ...

Get all PDF file names in the same folder and save them in Excel according to PDF file name

I have PDF files in the same folder. How can I get all the PDF file names and save them as an Excel file according to PDF file name? This is what I have tried ...

How to extract a table without all borders into text with Python?

I am trying to extract a table like this into a DataFrame. How can I do that (and extract even the names split over several lines) with Python? Also, I ...

iTextSharp extraction of Cyrillic characters

In my project I need to read a PDF document. This PDF contains Ukrainian and Russian characters. The PDFReader reads all characters in this PDF but t ...

How to build and train a model that reads data extracted from a PDF

Here I share my code main.py Result :Abdul Moeez :E-mail- amoeez14@gmail.com : Phone +1111111111 : Address Karachi, Sindh, Pakistan Ho ...

Document Understanding is extracting data from all the pages of a PDF in UiPath

I am using Document Understanding in UiPath to extract data from multiple PDFs. Each PDF file contains multiple copies of the same page which I canno ...

Problems extracting table data using Camelot, with no error message

Extracting comments/annotations from PDF sequentially - Python

I am trying to extract comments from a PDF using Python. These are the two pieces of code that I have tested: One using PyPDF2: and the other usin ...

Camelot cannot extract entire table

I'm using Camelot to extract table information from a PDF that I have converted from scanned to searchable using ocrmypdf (500 dpi). Camelot seems to be ...

How to improve Hindi text extraction?

I am trying to extract Hindi text from a PDF. I tried all the methods to extract from the PDF, but none of them worked. There are explanations why it d ...

How to convert DeviceRGB to System.Drawing.Color?

I am trying to get the fill color of paths using iText 7 with fillclr= pathrenderinfo.getfillcolor.getcolorvalue() but it gives the value in the format of dev ...

How to find table grid lines in PDF files?

To more accurately extract table-like data embedded within table cells, I would like to be able to identify table cell boundaries in PDFs like this: ...

Extract text from PDF URL with io and PyPDF2 gives no output

I'm trying to extract the text from the PDF URL. If I download the PDF I can easily extract the text with the function slate. However, when trying to ...

PDF to text in Python returning empty results in image files

I've got this PDF file: an image-based, low-resolution PDF file. I'm trying to extract the data in it and all options I've tried seem not to work. Option ...

Python - OpenCV pytesseract not extracting string from cropped image

I have an image (attached) and want to extract certain fields from the form. For example the name 'Sarah', her email address etc. I have the region of ...