Extracting text ending with escape characters in Python
I am trying to parse the key details of a PDF paper with Python and extract the paper's title, its authors, and their emails.
from PyPDF2 import PdfReader
reader = PdfReader("paper.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
This returns the raw text of the PDF:
'Title\nGoes\nHere\nAuthor Name (sdsd@mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'
I have a function that removes newlines, tabs, and similar characters:
def remove_newlines_tabs(text):
    """
    This function will remove all the occurrences of newlines, tabs, and combinations like: \\n, \\.
    arguments:
        text: "text" of type "String".
    return:
        value: "text" after removal of newlines, tabs, \\n, \\ characters.
    Example:
        Input : This is her \\ first day at this place.\n Please,\t Be nice to her.\\n
        Output : This is her first day at this place. Please, Be nice to her.
    """
    # Replace all occurrences of \n, \\n, \t, and \\ with a space, then repair ". com".
    formatted_text = text.replace('\\n', ' ').replace('\n', ' ').replace('\t', ' ').replace('\\', ' ').replace('. com', '.com')
    return formatted_text
which returns:
'Title Goes Here Author Name (sdsd@mail.net) University of Teeyab September 6, 2022 Some text in the Document. '
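This makes extracting the email easy. How can I extract the title and author of the PDF? The title is the most important, but I'm not sure of the best approach...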
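As a minimal sketch of the email step mentioned above, assuming each address appears inside parentheses as in the sample output, a regular expression can pull it out of the flattened text. The helper name `extract_emails` and the pattern are illustrative assumptions, not part of the original post, and the pattern is deliberately loose rather than a full address validator.

```python
import re

# Hypothetical helper: the name and the pattern are assumptions for illustration.
# The pattern looks for runs without whitespace, parentheses, or "@" on both
# sides of a single "@", with at least one dot in the domain part.
EMAIL_PATTERN = re.compile(r"[^\s()@]+@[^\s()@]+\.[^\s()@]+")

def extract_emails(text):
    """Return all email-like substrings found in text."""
    return EMAIL_PATTERN.findall(text)

flattened = 'Title Goes Here Author Name (sdsd@mail.net) University of Teeyab September 6, 2022 Some text in the Document. '
print(extract_emails(flattened))  # ['sdsd@mail.net']
```

Because the parentheses are excluded from the character classes, the surrounding `(` and `)` are not included in the result.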
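Here is a solution using regular expressions, based on the assumptions that the title and author are separated by \n and that the email is wrapped in ().
import re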
test_string = 'Title\nGoes\nHere\nAuthor Name (sdsd@mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'
# \w matches letters, digits, and underscore
# \s matches whitespace, including \t\n\r\f\v
# first, let's extract string that appears before parentheses
result = re.search(r"([\w\s]+)", test_string)
print(result) # <re.Match object; span=(0, 28), match='Title\nGoes\nHere\nAuthor Name '>
# clean up leading and trailing whitespaces using strip() and
# split the string by \n to separate title and author
title_author = result[0].strip().split("\n")
print(title_author) # ['Title', 'Goes', 'Here', 'Author Name']
# join the words of title as a single string
title = " ".join(title_author[:-1])
author = title_author[-1]
print(title) # Title Goes Here
print(author) # Author Name
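The steps above can be folded into one reusable function. This is only a sketch under the same two assumptions (lines separated by newlines, email wrapped in parentheses); the name `extract_title_author` and the choice to return `None` when the layout does not match are assumptions added here, not part of the original answer.

```python
import re

def extract_title_author(raw_text):
    """Return (title, author) from raw PDF text, or None if the layout does not match.

    Assumes the title lines and the author line are separated by newlines and
    that the author name is followed by an email wrapped in parentheses.
    """
    # Require the opening parenthesis so text without an email does not match.
    match = re.match(r"([\w\s]+)\(", raw_text)
    if match is None:
        return None
    lines = [line.strip() for line in match.group(1).strip().split("\n")]
    if len(lines) < 2:
        return None  # need at least one title line plus the author line
    # Everything before the last line is the title; the last line is the author.
    return " ".join(lines[:-1]), lines[-1]

sample = 'Title\nGoes\nHere\nAuthor Name (sdsd@mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'
print(extract_title_author(sample))  # ('Title Goes Here', 'Author Name')
```

Note that `[\w\s]` does not match punctuation, so a title containing a colon, hyphen, or comma would make this sketch return `None`; the character class would need to be widened for such papers.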