Extracting text from PDF scientific papers
I would like to extract text from a scientific document in PDF format. I first used PyPDF2, but random spaces appeared in the middle of several words. I am currently using PyMuPDF:
import re

import fitz  # PyMuPDF


def extract_pdf_text(pdf_file_path):
    """Return the plain text of every page, concatenated."""
    doc = fitz.open(pdf_file_path)
    text = ""
    for page in doc:
        text += page.get_text("text")  # .replace("\n", " ")
    return text


pdf_path = "/home/xxx/Papers/xxxxx.pdf"
text = extract_pdf_text(pdf_path)
# Replace U+FFFD replacement characters with a space
text = re.sub(r"�", " ", text)
# Replace every URL with a fixed placeholder
url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
text = re.sub(url_pattern, 'replaced_link.', text)
# Collapse every whitespace run (including "\n") into a single space
text = re.sub(r"\s+", " ", text)
The code above removes the � characters, replaces each URL with a fixed word, and removes extra spaces.
The goal is to split the text into sentences (I use spaCy). This fails in some places because the extracted text joins two distinct parts of the PDF (e.g. title and author) with a space. I would like them joined with a "\n" instead.
If I extract the text and split it into sentences, I get:
["See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/313756771 HDSKG:", "Harvesting Domain Specific Knowledge Graph from Content of Webpages Conference Paper · February 2017"]
instead of:
["See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/313756771", "HDSKG:Harvesting Domain Specific Knowledge Graph from Content of Webpages", "Conference Paper · February 2017"]
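One possible way to get a "\n" between distinct parts is to extract the page block by block instead of as one string. The sketch below assumes PyMuPDF's `page.get_text("blocks")` output, where each block is a tuple whose item 4 is the text and item 6 is the block type (0 for text). The helper names `join_text_blocks` and `extract_pdf_text_by_block` are hypothetical, and whether a title and its author line actually land in separate blocks depends on the PDF.

```python
def join_text_blocks(blocks):
    """Join PyMuPDF-style block tuples into newline-separated text."""
    parts = []
    for block in blocks:
        block_text, block_type = block[4], block[6]
        if block_type != 0:  # 0 = text block, 1 = image block
            continue
        # Flatten line breaks inside a block so "\n" only marks block boundaries
        flat = " ".join(block_text.split())
        if flat:
            parts.append(flat)
    return "\n".join(parts)


def extract_pdf_text_by_block(pdf_file_path):
    """Hypothetical variant of extract_pdf_text that keeps block boundaries."""
    import fitz  # PyMuPDF, imported here so join_text_blocks works without it

    pages = []
    with fitz.open(pdf_file_path) as doc:
        for page in doc:
            pages.append(join_text_blocks(page.get_text("blocks", sort=True)))
    return "\n".join(pages)
```

Each line of the result can then be passed to the sentence splitter separately, so a sentence never spans two blocks. Any later whitespace cleanup has to leave the "\n" in place for this to help.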
Thanks to Jorj McKie: get_text(sort=True) helped somewhat. My bad about text = re.sub(r"\s+", " ", text), which was removing the "\n".
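Because `\s` also matches "\n", the cleanup step needs a pattern that skips newlines. A minimal sketch, assuming the aim is to collapse spaces and tabs while keeping one "\n" between parts (the function name `normalize_whitespace` is hypothetical):

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse whitespace runs but keep single newlines as separators."""
    # [^\S\n] matches any whitespace character except "\n"
    text = re.sub(r"[^\S\n]+", " ", text)
    # Drop the spaces left directly before or after a newline
    text = re.sub(r" ?\n ?", "\n", text)
    # Collapse repeated newlines into one
    text = re.sub(r"\n+", "\n", text)
    return text.strip()
```

This could replace the `re.sub(r"\s+", " ", text)` call, so the line breaks produced by `get_text(sort=True)` remain available to the sentence splitter.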