
Extracting text from PDF scientific papers

I would like to extract text from a scientific document in PDF format. I first used PyPDF2, but random spaces appeared in the middle of several words. I am currently using PyMuPDF:

import re

import fitz  # PyMuPDF


def extract_pdf_text(pdf_file_path):
    """Return the plain text of every page, concatenated."""
    doc = fitz.open(pdf_file_path)
    text = ""
    for page in doc:
        text += page.get_text("text")
    return text


pdf_path = "/home/xxx/Papers/xxxxx.pdf"
text = extract_pdf_text(pdf_path)
# Replace the � character with a space
text = re.sub(r"�", " ", text)
# Replace every URL with a fixed placeholder word
url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
text = url_pattern.sub('replaced_link.', text)
# Collapse every whitespace run (this includes "\n") into a single space
text = re.sub(r"\s+", " ", text)

This removes �, replaces each URL with a fixed word, and collapses extra whitespace.

The goal is to split the text into sentences (I use spaCy). But it fails in some places because the extracted text joins two distinct parts of the PDF (e.g. title and author) with a space. I would like them to be joined with a "\n" instead.
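One possible way to get a "\n" between distinct parts is to ask PyMuPDF for text blocks rather than one flat string. The following is only a sketch, assuming that `page.get_text("blocks")` yields tuples whose fifth element is the block text and whose seventh element is the block type (0 for text), and that the title and the author line end up in separate blocks, which depends on the PDF. The helper names `join_blocks` and `extract_pdf_blocks` are hypothetical.

```python
import re


def join_blocks(blocks):
    """Join PyMuPDF-style block tuples into text, one block per line."""
    parts = []
    for block in blocks:
        block_text, block_type = block[4], block[6]
        if block_type != 0:  # skip non-text (image) blocks
            continue
        # Collapse whitespace inside the block only, so wrapped lines
        # within one block become a single line
        cleaned = re.sub(r"\s+", " ", block_text).strip()
        if cleaned:
            parts.append(cleaned)
    return "\n".join(parts)


def extract_pdf_blocks(pdf_file_path):
    """Extract text with a newline between blocks (assumes PyMuPDF is installed)."""
    import fitz  # imported here so join_blocks works without PyMuPDF

    doc = fitz.open(pdf_file_path)
    return "\n".join(join_blocks(page.get_text("blocks", sort=True)) for page in doc)
```

With this approach the whitespace cleanup happens per block, so the "\n" separators between blocks are never collapsed. The result can then be split on "\n" before each piece is handed to the sentence splitter.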

If I extract the text and split it into sentences, I get

["See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/313756771 HDSKG:", "Harvesting Domain Specific Knowledge Graph from Content of Webpages Conference Paper · February 2017"]

instead of

["See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/313756771", "HDSKG:Harvesting Domain Specific Knowledge Graph from Content of Webpages", "Conference Paper · February 2017"]

Update: thanks to Jorj McKie, get_text(sort=True) helped somewhat. My bad: it was my own text = re.sub(r"\s+", " ", text) that was removing the "\n".
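To keep the "\n" while still removing extra spaces, the whitespace cleanup can be restricted to characters other than newlines. A minimal sketch (the function name `normalize_whitespace` is hypothetical):

```python
import re


def normalize_whitespace(text):
    """Collapse spaces and tabs but keep line breaks."""
    # [^\S\n] matches any whitespace character except "\n"
    text = re.sub(r"[^\S\n]+", " ", text)
    # Drop spaces around line breaks and merge repeated line breaks
    text = re.sub(r" ?\n[ \n]*", "\n", text)
    return text.strip()
```

Note that line wraps inside a paragraph also survive this cleanup, so on its own it may leave sentences broken at wrapped lines.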
