I would like to extract text from a scientific document in PDF format. I first used PyPDF2, but random spaces appeared in the middle of several words. I am currently using PyMuPDF:
import fitz
import re

def extract_pdf_text(pdf_file_path):
    doc = fitz.open(pdf_file_path)
    text = ""
    for page in doc:
        text += page.get_text("text")  # .replace("\n", " ")
    return text
pdf_path = "/home/xxx/Papers/xxxxx.pdf"
text = extract_pdf_text(pdf_path)
text = re.sub(r"�", " ", text)
url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
text = re.sub(url_pattern, 'replaced_link.', text)
text = re.sub(r"\s+", " ", text)
This removes the � characters, replaces URLs with a fixed word, and collapses extra whitespace.
The goal is to split the text into sentences (I use spaCy). But it fails in some places, because the extracted text joins two distinct parts of the PDF (e.g. title and authors) with a space. I would like to join them with a "\n" instead.
If I extract and split into sentences, I get:
["See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/313756771 HDSKG:", "Harvesting Domain Specific Knowledge Graph from Content of Webpages Conference Paper · February 2017"]
Instead of
["See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/313756771", "HDSKG:Harvesting Domain Specific Knowledge Graph from Content of Webpages", "Conference Paper · February 2017"]
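Once the block boundaries survive as "\n", sentences can be segmented per block so that a sentence can never span two distinct PDF regions. A minimal sketch of that idea, using a hypothetical `split_blocks` helper (stdlib only, spaCy would then run on each block):

```python
def split_blocks(text):
    # Each "\n"-separated block came from a distinct PDF region
    # (title, authors, metadata), so sentence boundaries never cross it.
    return [block.strip() for block in text.split("\n") if block.strip()]

blocks = split_blocks(
    "See discussions, stats, and author profiles at: https://www.researchgate.net/publication/313756771\n"
    "HDSKG: Harvesting Domain Specific Knowledge Graph from Content of Webpages\n"
    "Conference Paper · February 2017"
)
# len(blocks) → 3; each entry can be fed to spaCy separately
```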
Update: thanks to Jorj McKie, get_text("text", sort=True) helped. My bad for text = re.sub(r"\s+", " ", text), which was removing the "\n".
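Putting the fix together: keep sort=True in get_text, and collapse only spaces and tabs so the "\n" block separators survive. A sketch of the revised cleanup (clean_text is a hypothetical helper name; it assumes the text came from page.get_text("text", sort=True)):

```python
import re

def clean_text(text):
    """Normalize text produced by page.get_text("text", sort=True)."""
    text = text.replace("\ufffd", " ")                         # drop U+FFFD replacement chars
    text = re.sub(r"http[s]?://\S+", "replaced_link.", text)   # mask URLs with a fixed word
    # Collapse runs of spaces/tabs only -- r"\s+" would also eat the "\n"
    # separators that mark distinct PDF blocks (title, authors, ...).
    text = re.sub(r"[ \t]+", " ", text)
    return text
```

With this, spaCy sees the title and the author line as separate segments instead of one run-on sentence.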