Python - Extract specific portion of text from a text variable in a DataFrame

I am trying to extract a specific portion of text from the "text" variable in a DataFrame and need some help!

I have the following DataFrame:

filepath filename text
/Users/user/Dropbox/SEC 調查... _0000886982_18795_2687.txt 0000950123-11-059690.txt: 20110...
/Users/user/Dropbox/SEC 調查... _0001068875_16706_4152.txt 0001193125-05-191846.txt: 20050...

In Python, I am trying to extract the portion of the "text" variable that always follows a line beginning with "Item 5.02".

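I need the text between the "Item 5.02" line and the first occurrence of any of the following terms: "Item 8.01", "Item 9.01", or "SIGNATURES". In some cases the text may not have "Item 8.01" but may contain "Item 9.01". In some cases it may contain all of the terms. At least one of these terms always appears after the "Item 5.02" line. In addition, the text between the "Item 5.02" line and one of the terms may span multiple paragraphs. I only need the text between the line starting with "Item 5.02" and the first occurrence of one of the terms!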

Here is an example where the "Item 5.02" line is followed by an "Item 9.01" line:

    Item 5.02 Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers; Compensatory Arrangement of Certain Officers.

    On September 29, 2015, AAR CORP. (the Company) announced that Michael J. Sharp was elected Chief Financial Officer of the Company on September 28, 2015, with such election to be effective on October 5, 2015. Mr. Sharp will replace John C. Fortson, who is resigning effective October 5, 2015 to take a Chief Financial Officer position with a non-aviation company.

    Mr. Sharp, 53, is a 19-veteran of the Company and will continue to serve as the Companys Vice President and Chief Accounting Officer. Mr. Sharp previously served as interim Chief Financial Officer of the Company from October 2012 to July 2013. Prior to joining the Company, Mr. Sharp worked in management positions with Kraft Foods and KPMG, LLP.

    As Chief Financial Officer of the Company, Mr. Sharp will receive the following compensation for the fiscal year ending May 31, 2016: an annual base salary of $400,000; an annual cash bonus opportunity equal to 70% of his annual base salary if certain performance goals are met at a target level; and total stock awards valued at $500,000 on the date of grant. Mr. Sharp continues to be eligible for other benefits provided to executive officers of the Company as described in the Companys proxy statement filed with the Securities and Exchange Commission on August 28, 2015. Mr. Sharp has a severance and change in control agreement with the Company (see Exhibit 10.10 to the Companys annual report on Form 10-K for the fiscal year ended May 31, 2001).

    A copy of the Companys press release announcing Mr. Sharps appointment is attached hereto as Exhibit 99.1 and is incorporated herein by reference.

    Item 9.01 Financial Statements and Exhibits.

I would like to extract the following:

    On September 29, 2015, AAR CORP. (the Company) announced that Michael J. Sharp was elected Chief Financial Officer of the Company on September 28, 2015, with such election to be effective on October 5, 2015. Mr. Sharp will replace John C. Fortson, who is resigning effective October 5, 2015 to take a Chief Financial Officer position with a non-aviation company. Mr. Sharp, 53, is a 19-veteran of the Company and will continue to serve as the Companys Vice President and Chief Accounting Officer. Mr. Sharp previously served as interim Chief Financial Officer of the Company from October 2012 to July 2013. Prior to joining the Company, Mr. Sharp worked in management positions with Kraft Foods and KPMG, LLP. As Chief Financial Officer of the Company, Mr. Sharp will receive the following compensation for the fiscal year ending May 31, 2016: an annual base salary of $400,000; an annual cash bonus opportunity equal to 70% of his annual base salary if certain performance goals are met at a target level; and total stock awards valued at $500,000 on the date of grant. Mr. Sharp continues to be eligible for other benefits provided to executive officers of the Company as described in the Companys proxy statement filed with the Securities and Exchange Commission on August 28, 2015. Mr. Sharp has a severance and change in control agreement with the Company (see Exhibit 10.10 to the Companys annual report on Form 10-K for the fiscal year ended May 31, 2001). A copy of the Companys press release announcing Mr. Sharps appointment is attached hereto as Exhibit 99.1 and is incorporated herein by reference.

So far I have the following code, which does not account for all of the terms and pulls only the first paragraph after the "Item 5.02" line. Code:

    def extractPassage(text):
        lines = text.split("\n\n")
        for i,line in enumerate(lines):
            if line.startswith("Item 5.02"):
                return lines[i+1]
        #raise Exception("No line found starting with Item 5.02")

    pd_00['important_text'] = pd_00['text'].apply(extractPassage)

Any help is greatly appreciated!

You can use

    pattern = r'\bItem\s+5\.02\s*([\w\W]*?)(?=\s*(?:Item\s+[89]\.01|SIGNATURES)\b)'
    pd_00['important_text'] = pd_00['text'].str.findall(pattern)
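To make the behavior of `str.findall` concrete, here is a minimal sketch on a small made-up DataFrame. The sample strings are hypothetical stand-ins for the real filings, and the result is kept in a separate variable called `matches` purely for illustration. Each row yields a list of captured strings, and the list is empty when the row contains no "Item 5.02" section. Note that the capture starts right after "Item 5.02", so it includes the rest of the heading line.

```python
import pandas as pd

pattern = r'\bItem\s+5\.02\s*([\w\W]*?)(?=\s*(?:Item\s+[89]\.01|SIGNATURES)\b)'

# Hypothetical sample data, not taken from the real filings
pd_00 = pd.DataFrame({
    "text": [
        "Item 5.02 Departure of Directors.\n\nFirst paragraph.\n\nSecond paragraph.\n\nItem 9.01 Exhibits.",
        "Item 5.02 Heading.\n\nOnly paragraph.\n\nSIGNATURES",
        "No relevant item here.",
    ]
})

# One list per row; each list holds the group 1 capture of every match
matches = pd_00["text"].str.findall(pattern)
print(matches.tolist())
```

Because the result is a list per row, this form is mostly useful when a single filing might contain more than one "Item 5.02" section.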

If you need one match per record:

    pattern = r'\bItem\s+5\.02\s*([\w\W]*?)(?=\s*(?:Item\s+[89]\.01|SIGNATURES)\b)'
    pd_00['important_text'] = pd_00['text'].str.extract(pattern, expand=False)

See the regex demo.

Details
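As a rough illustration of the `str.extract` variant, the sketch below runs it on hypothetical sample data and then applies an optional cleanup step. The cleanup is an assumption about the desired output, based on the example in the question: it drops the remainder of the heading line and collapses the paragraph breaks into single spaces. It only works when the heading sits on a single line of its own.

```python
import pandas as pd

pattern = r'\bItem\s+5\.02\s*([\w\W]*?)(?=\s*(?:Item\s+[89]\.01|SIGNATURES)\b)'

# Hypothetical sample data, not taken from the real filings
pd_00 = pd.DataFrame({
    "text": [
        "Item 5.02 Heading.\n\nFirst paragraph.\n\nSecond paragraph.\n\nItem 9.01 Exhibits.",
        "No relevant item here.",
    ]
})

# With one capture group and expand=False the result is a Series;
# rows without a match become NaN
extracted = pd_00["text"].str.extract(pattern, expand=False)

# Optional cleanup (assumed, not part of the original answer):
# 1) remove the rest of the heading line, 2) collapse whitespace runs, 3) trim
pd_00["important_text"] = (
    extracted
    .str.replace(r'\A[^\n]*\n', '', regex=True)
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
)
print(pd_00["important_text"].tolist())
```

Rows where nothing matched stay NaN through the cleanup, so they can be filtered afterwards with `dropna()` or `isna()` if needed.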

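  • \b - a word boundary
  • Item - a fixed word
  • \s+ - one or more whitespace characters
  • 5\.02 - the literal string 5.02
  • \s* - zero or more whitespace characters
  • ([\w\W]*?) - Group 1: any zero or more characters, as few as possible (newlines included)
  • (?=\s*(?:Item\s+[89]\.01|SIGNATURES)\b) - a positive lookahead that requires the following to appear immediately to the right of the current position:
    • \s* - zero or more whitespace characters
    • (?:Item\s+[89]\.01|SIGNATURES) - Item, one or more whitespace characters, 8 or 9, then .01, or SIGNATURES
    • \b - a word boundary.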
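The breakdown above can also be written into the pattern itself using Python's `re.VERBOSE` flag, which ignores unescaped whitespace in the pattern and allows `#` comments. This is only a sketch to show how the pieces fit together; the `sample` string is made up, and the pattern is meant to be equivalent to the one-line version.

```python
import re

verbose_pattern = re.compile(r"""
    \bItem\s+5\.02      # the fixed heading text at a word boundary
    \s*                 # optional whitespace after the item number
    ([\w\W]*?)          # group 1: as few characters as possible, newlines included
    (?=                 # lookahead: stop right before the first terminator
        \s*
        (?:Item\s+[89]\.01|SIGNATURES)
        \b
    )
""", re.VERBOSE)

# Hypothetical sample containing both Item 8.01 and Item 9.01
sample = "Item 5.02 Heading.\n\nBody text.\n\nItem 8.01 Other Events.\n\nItem 9.01 Exhibits."
match = verbose_pattern.search(sample)
print(repr(match.group(1)))
```

Because the group is lazy, the match ends at whichever terminator comes first, here "Item 8.01", even though "Item 9.01" also appears later in the text.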

