
Extracting text from PDF scientific papers

I would like to extract text from a scientific paper in PDF format. I first used PyPDF2, but random spaces appeared in the middle of several words. I am currently using PyMuPDF:

import fitz  # PyMuPDF
import re

def extract_pdf_text(pdf_file_path):
    """Extract the plain text of every page of a PDF."""
    doc = fitz.open(pdf_file_path)
    text = ""
    for page in doc:
        text += page.get_text("text")  # also tried appending .replace("\n", " ")
    doc.close()
    return text

pdf_path = "/home/xxx/Papers/xxxxx.pdf"
text = extract_pdf_text(pdf_path)
text = re.sub(r"�", " ", text)  # drop stray replacement characters
url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
text = re.sub(url_pattern, 'replaced_link.', text)  # replace every URL with a fixed word
text = re.sub(r"\s+", " ", text)  # collapse all whitespace runs into single spaces

This removes the � characters, replaces each URL with a fixed word, and collapses the extra whitespace.

The goal is to split the text into sentences (I use spaCy). But it fails in some places because the extracted text joins two distinct parts of the PDF (e.g. the title and the authors) with a space. I would like to join them with a "\n" instead.
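For reference, the sentence splitting is roughly the following (a minimal sketch, assuming spaCy and its en_core_web_sm model are installed; the actual spaCy code is not shown in the post):

import spacy

nlp = spacy.load("en_core_web_sm")  # English pipeline with sentence segmentation
doc = nlp(text)  # `text` is the cleaned string from above
sentences = [sent.text.strip() for sent in doc.sents]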

If I extract the text and split it into sentences, I get

["See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/313756771 HDSKG:", "Harvesting Domain Specific Knowledge Graph from Content of Webpages Conference Paper · February 2017"]

Instead of

["See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/313756771", "HDSKG:Harvesting Domain Specific Knowledge Graph from Content of Webpages", "Conference Paper · February 2017"]

Update: thanks to Jorj McKie, get_text(sort=True) worked for the most part; my bad for text = re.sub(r"\s+", " ", text), which was removing the "\n" characters.
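Putting both fixes together, the extraction could look like this (a sketch: sort=True asks PyMuPDF to return text blocks in reading order, and collapsing only spaces and tabs preserves the "\n" between distinct parts of the page; the final squeeze of blank lines is my own addition, not from the original post):

def extract_pdf_text(pdf_file_path):
    doc = fitz.open(pdf_file_path)
    text = ""
    for page in doc:
        # sort=True orders the text blocks top-to-bottom, left-to-right
        text += page.get_text("text", sort=True)
    doc.close()
    return text

text = extract_pdf_text(pdf_path)
text = re.sub(r"[ \t]+", " ", text)   # collapse spaces/tabs only, keeping "\n"
text = re.sub(r"\n{2,}", "\n", text)  # optionally squeeze consecutive blank lines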
