Extracting text ending with escape characters in Python

I'm trying to parse key details from PDF papers with Python and extract each paper's title, its authors, and their email addresses.

from PyPDF2 import PdfReader

reader = PdfReader("paper.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"

This returns the raw text of the PDF:

'Title\nGoes\nHere\nAuthor Name (sdsd@mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'

I have a function that removes newlines, tabs, and similar characters:

def remove_newlines_tabs(text):
    """
    Replace all occurrences of newlines, tabs, and combinations like: \\n, \\ with a space.

    arguments:
        text: the input text, of type str.

    return:
        value: the text with newlines, tabs, \\n, and \\ characters replaced by spaces.

    Example:
    Input : This is her \\ first day at this place.\n Please,\t Be nice to her.\\n
    Output (up to extra spaces) : This is her first day at this place. Please, Be nice to her.
    """
    # Replace every \\n, \n, \t, and \\ with a space, then rejoin '. com' into '.com'.
    formatted_text = (
        text.replace('\\n', ' ')
        .replace('\n', ' ')
        .replace('\t', ' ')
        .replace('\\', ' ')
        .replace('. com', '.com')
    )
    return formatted_text

which returns:

'Title Goes Here Author Name (sdsd@mail.net) University of Teeyab September 6, 2022 Some text in the Document. '

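This makes extracting the email easy. How can I extract the title and the authors of the PDF? The title is the most important, but I'm not sure of the best approach...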
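For reference, the email step could look something like the following. This is only a minimal sketch: the variable names `flattened` and `emails` are made up for illustration, and the pattern is deliberately loose (any run of non-space characters around a single "@"), so it is not a full email validator.

```python
import re

# The flattened text returned by remove_newlines_tabs() in the example above.
flattened = 'Title Goes Here Author Name (sdsd@mail.net) University of Teeyab September 6, 2022 Some text in the Document. '

# Loose pattern (an assumption, not a validator): characters that are not
# whitespace, parentheses, or "@", on both sides of a single "@".
emails = re.findall(r"[^\s()@]+@[^\s()@]+", flattened)
print(emails)  # ['sdsd@mail.net']
```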

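Here is a solution using regular expressions, based on the following assumptions:

  • Each word of the title is separated by a newline (\n)
  • Each word of the author's name is separated by a space
  • The email address is always wrapped in parentheses ()

import re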


test_string = 'Title\nGoes\nHere\nAuthor Name (sdsd@mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'

# \w matches letters, digits, and underscore
# \s matches whitespace, including \t\n\r\f\v
# first, let's extract the string that appears before the parentheses
result = re.search(r"([\w\s]+)", test_string)
print(result) # <re.Match object; span=(0, 28), match='Title\nGoes\nHere\nAuthor Name '>

# clean up leading and trailing whitespaces using strip() and
# split the string by \n to separate title and author
title_author = result[0].strip().split("\n")
print(title_author) # ['Title', 'Goes', 'Here', 'Author Name']

# join the words of title as a single string
title = " ".join(title_author[:-1])
author = title_author[-1]

print(title) # Title Goes Here
print(author) # Author Name
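
The steps above could be wrapped into one helper that also returns the email and avoids an exception when the text does not match the assumed layout. This is a sketch under the same three assumptions; the function name `parse_header` and the email pattern are hypothetical, not part of the original answer.

```python
import re


def parse_header(raw_text):
    """Return (title, author, email), or None if the assumed layout is absent."""
    # Group 1: words and whitespace before the first "(" (title lines + author).
    # Group 2: the parenthesised text, assumed to be the email address.
    match = re.search(r"([\w\s]+)\(([^()\s]+@[^()\s]+)\)", raw_text)
    if match is None:
        return None
    lines = match.group(1).strip().split("\n")
    title = " ".join(lines[:-1])  # every line except the last is a title word
    author = lines[-1]            # the last line is the author's name
    return title, author, match.group(2)


sample = 'Title\nGoes\nHere\nAuthor Name (sdsd@mail.net)\nUniversity of Teeyab\nSeptember 6, 2022\nSome text in the Document.\n'
print(parse_header(sample))  # ('Title Goes Here', 'Author Name', 'sdsd@mail.net')
```

Note that this has to run on the raw text, before the newlines are replaced, because the newlines are what separate the title from the author.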

