Newest questions tagged [pdf-scraping]

Scraping specific pdfs from different websites

First question here. I need to download a specific pdf from every url. I need just the pdf of the european commission proposal from each url that I ha ...

Converting a scanned pdf to a searchable pdf in R

I have a pdf that's about 50 pages of scanned tables. I need to eventually scrape it into R so I can clean the data and export it as a .csv. I have ex ...

Run-time error '5' VBA when running against specific PDF

I have the following code in VBA, following an answer to my last question, which iterates over a list of URLs and generates a text file using the word ...

Is there a way to remove unwanted spaces from a string using Python or some NLP technique? (NOT trailing or extra spaces)

s = "Over 20 years, this investment is cost neutral as it is covered by a modest ‚comfort ch arge™ Œ less than the equivalent energy bills would have ...

Using Text Mining in R to find a specific set of words in a set of PDFS

I am looking at a set of 10 PDFs, and I want to write code that will tell me the number of times a couple of words I've predetermined appear in the docum ...

Scrapy script that was supposed to scrape pdf, doc files is not working properly

I am trying to implement a similar script on my project following this blog post here: https://www.imagescape.com/blog/scraping-pdf-doc-and-docx-scrap ...

Extracting and Organizing Text From A PDF

I'm currently trying to scrape a bunch of information from PDF pages. I have managed to get some text extracted but haven't been able to extract every ...

pdfminer: extract only text according to font size

I only want to extract text that has font size 9.800000000000068 and 10.000000000000057 from my pdf files. The code below returns a list of the font s ...

How to webscrape PDFs that are hidden under the selection option?

I am trying to download more than 100 PDFs from a website using Python. However, those PDFs are hidden under the selection option. For example: Option 1 ...

How to separate words from an element in a list?

My list looks like the following: ['https://www.enbridge.com/Projects-and-Infrastructure/For-Shippers/Tariffs/Enbridge-Bakken-Pipeline-Company-Inc-Bak ...

PDF scraping: get company and subsidiaries tables

I am trying to scrape this PDF containing information about company subsidiaries. I have seen many posts using the R package Tabulizer but this, unfor ...

Trying to scrape from a long PDF with different table formats

I am trying to scrape from a 276-page PDF available here: https://www.acf.hhs.gov/sites/default/files/documents/ocse/fy_2018_annual_report.pdf Not on ...

Python PDF Scraping

Task: A PDF bank statement contains the columns Date, Description, Deposits, Withdrawals, and Balance; parsing the columns with their respective fi ...

file handling + word scraping (trying to find all the words in a file that end with 'y')

ERROR: Traceback (most recent call last): File "c:\Users\Pranjal\Desktop\tstp\zen_scraper.py", line 5, in words = re.findall("$y",file) File "C:\Progr ...

Extract larger body of character data with stringr?

I am working to scrape text data from around 1000 PDF files. I have managed to import them all into RStudio, used str_subset and str_extract_all to a ...

Referencing the last page in a PDF with tabula?

I want to reference the last page from a bunch of PDF documents and parse tables from it, however the number of pages in the documents can vary. What ...

Scraping PDF in R with Nested Information

I am attempting to scrape a rather difficult PDF in R using both pdftools::pdf_text and tabulizer::extract_tables. However, in my situation, neither o ...

How do I iterate through files in my directory so they can be opened/read using PyPDF2?

I am working on an invoice scraper for work, where I have successfully written all the code to scrape the fields that I need using PyPDF2. However, I ...

Scraping large and complex PDF tables

I've been trying to scrape some data off of PDFs regarding 2020 election results in California for my own morbid curiosity. I need to scrape many tab ...

Python PdfMiner - How to get the info on the orientation of each word/sentence included in a pdf?

Target: I want to extract the info on the orientation of each word or sentence from a PDF like the attached one. The reason for this is that I want to ...