
Extracting text from PDF scientific papers

I would like to extract text from a scientific paper in PDF format. I first used PyPDF2, but random spaces appeared in the middle of several words. I am currently using PyMuPDF:

import fitz  # PyMuPDF
import re

def extract_pdf_text(pdf_file_path):
    """Extract the plain text of every page of a PDF."""
    doc = fitz.open(pdf_file_path)
    text = ""
    for page in doc:
        text += page.get_text("text")  # also tried appending .replace("\n", " ")
    doc.close()
    return text

pdf_path = "/home/xxx/Papers/xxxxx.pdf"
text = extract_pdf_text(pdf_path)
text = re.sub(r"�", " ", text)  # drop stray replacement characters
url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
text = re.sub(url_pattern, 'replaced_link.', text)  # replace every URL with a fixed word
text = re.sub(r"\s+", " ", text)  # collapse all whitespace runs into single spaces

This removes the � characters, replaces each URL with a fixed word, and collapses the extra whitespace.

The goal is to split the text into sentences (I use spaCy). But it fails in some places because the extracted text joins two distinct parts of the PDF (e.g. the title and the authors) with a space. I would like to join them with a "\n" instead.
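For reference, the sentence splitting is roughly the following (a minimal sketch, assuming spaCy and its en_core_web_sm model are installed; the actual spaCy code is not shown in the post):

import spacy

nlp = spacy.load("en_core_web_sm")  # English pipeline with sentence segmentation
doc = nlp(text)  # `text` is the cleaned string from above
sentences = [sent.text.strip() for sent in doc.sents]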

If I extract the text and split it into sentences, I get

["See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/313756771 HDSKG:", "Harvesting Domain Specific Knowledge Graph from Content of Webpages Conference Paper · February 2017"]

Instead of

["See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/313756771", "HDSKG:Harvesting Domain Specific Knowledge Graph from Content of Webpages", "Conference Paper · February 2017"]

Update: thanks to Jorj McKie, get_text(sort=True) worked for the most part; my bad for text = re.sub(r"\s+", " ", text), which was removing the "\n" characters.
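Putting both fixes together, the extraction could look like this (a sketch: sort=True asks PyMuPDF to return text blocks in reading order, and collapsing only spaces and tabs preserves the "\n" between distinct parts of the page; the final squeeze of blank lines is my own addition, not from the original post):

def extract_pdf_text(pdf_file_path):
    doc = fitz.open(pdf_file_path)
    text = ""
    for page in doc:
        # sort=True orders the text blocks top-to-bottom, left-to-right
        text += page.get_text("text", sort=True)
    doc.close()
    return text

text = extract_pdf_text(pdf_path)
text = re.sub(r"[ \t]+", " ", text)   # collapse spaces/tabs only, keeping "\n"
text = re.sub(r"\n{2,}", "\n", text)  # optionally squeeze consecutive blank lines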
