Extracting text from a scientific paper in PDF format

Question

I would like to extract text from a scientific document in PDF format. I first used PyPDF2, but random spaces appeared in the middle of several words. I am currently using PyMuPDF:

import fitz
import re

def extract_pdf_text(pdf_file_path):
    doc = fitz.open(pdf_file_path)
    text = ""
    for page in doc:
        text += page.get_text("text")#.replace("\n", " ")
    return text

pdf_path = "/home/xxx/Papers/xxxxx.pdf"
text = extract_pdf_text(pdf_path)
text = re.sub(r"�", " ", text)
url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
text = re.sub(url_pattern, 'replaced_link.', text)
text = re.sub(r"\s+", " ", text)

This removes the � characters, replaces each URL with a fixed placeholder, and collapses extra whitespace.

The goal is to split the text into sentences (I use spaCy). It fails in some places because the extracted text joins two distinct parts of the PDF (e.g. title and author) with a space. I would like them to be joined with a "\n" instead.

If I extract the text and split it into sentences, I get:

["See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/313756771 HDSKG:", "Harvesting Domain Speciﬁc Knowledge Graph from Content of Webpages Conference Paper · February 2017"]

Instead of:

["See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/313756771", "HDSKG:Harvesting Domain Speciﬁc Knowledge Graph from Content of Webpages", "Conference Paper · February 2017"]

Answer 1

Thanks to Jorj McKie: get_text(sort=True) helped somewhat. The rest was my mistake: text = re.sub(r"\s+", " ", text) was removing the "\n" characters, because \s also matches newlines.
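
To make that concrete, here is a minimal sketch of the adjusted pipeline. It is not the asker's final code: the helper name normalize_whitespace and the choice to collapse blank lines are assumptions for illustration. It keeps get_text("text", sort=True) and replaces the \s+ substitution with one that only touches horizontal whitespace, so the "\n" between title, author and other parts survives.

```python
import re


def normalize_whitespace(text):
    """Collapse spaces and tabs but keep line breaks (hypothetical helper)."""
    # Replace the Unicode replacement character, as in the question.
    text = text.replace("\ufffd", " ")
    # [^\S\n] matches any whitespace except "\n", so newlines are preserved.
    text = re.sub(r"[^\S\n]+", " ", text)
    # Drop the single space left on either side of a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Collapse runs of empty lines into a single line break.
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()


def extract_pdf_text(pdf_file_path):
    """Extract text page by page in reading order, keeping line breaks."""
    import fitz  # PyMuPDF, imported lazily so the helper above works without it

    with fitz.open(pdf_file_path) as doc:
        # sort=True asks PyMuPDF to order the text top-left to bottom-right.
        pages = [page.get_text("text", sort=True) for page in doc]
    return normalize_whitespace("\n".join(pages))


if __name__ == "__main__":
    sample = "publication/313756771 \n HDSKG:\tHarvesting   Graph\n\n\nConference Paper"
    print(normalize_whitespace(sample).split("\n"))
```

Note that lines wrapped inside a paragraph also end in "\n", so splitting on every line break may cut sentences in running text. If that becomes a problem, page.get_text("blocks") returns one entry per text block and could be used to join only at block boundaries.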

Answer 1 posted 2023-02-01 23:30:08
