
Python: Using regex to find the last pair of occurrences

Attached is a text file that I want to parse. I want to select the text between the last occurrences of these two headings:

  • (1) Item 7 Management Discussion Analysis

  • (2) Item 8 Financial Statements

I would usually use a regex as follows:

re.findall(r"Item(?:(?!Item).)*7(?:(?!Item|7).)*Management(?:(?!Item|7|Management).)*Analysis[\s\S]*Item(?:(?!Item).)*8(?:(?!Item|8).)*Financial(?:(?!Item|8|Financial).)*Statements",text, re.DOTALL)

As you can see in the text file, the combination of Item 7 and Item 8 occurs often, but if I find the last match of (1) and the last match of (2), I greatly increase the probability of grabbing the desired text.

The desired text in my text file starts with:

"'This Item 7, Management's Discussion and Analysis of Financial Condition and Results of Operations, and other parts of this Form 10-K contain forward-looking statements, within the meaning of the Private Securities Litigation Reform Act of 1995, that involve risks and..... "

and ends with:

"Item 8. Financial Statements and Supplementary Data"

How can I adapt my regex code to grab this last pair between Item 7 and Item 8?

UPDATE:

I am also trying to parse this file using the same items.

This code has been rewritten. It now works with both the original data file (Output2.txt) and the newly added data file (Output2012.txt).

import re

discussions = []
for input_file_name in ['Output2.txt', 'Output2012.txt']:
    with open(input_file_name) as f:
        doc = f.read()

    item7 = r"Item 7\.*\s*Management.s Discussion and Analysis of Financial Condition and Results of Operations"
    discussion_text = r"[\S\s]*?"  # lazy, so each Item 7/Item 8 pair is a separate match
    item8 = r"Item 8\.*\s*Financial Statements"

    discussion_pattern = item7 + discussion_text + item8
    results = re.findall(discussion_pattern, doc)

    # Some input files have a table of contents and others don't,
    # so just keep the last match.
    discussion = results[-1]

    discussions.append((input_file_name, discussion))

The discussions variable contains the results for each of the data files.
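
As a rough alternative sketch (not part of the original answer), the headings can be located separately and the text sliced between them, so the result does not depend on how greedy the middle of the pattern is. The helper name `last_section`, the shortened heading patterns, and the sample text are all assumptions made for illustration.

```python
import re

# Shortened heading patterns, assumed for this illustration only.
ITEM7 = re.compile(r"Item 7\.*\s*Management.s Discussion and Analysis")
ITEM8 = re.compile(r"Item 8\.*\s*Financial Statements")


def last_section(doc):
    """Return the text from the last Item 7 heading that is followed by an
    Item 8 heading, up to and including that Item 8 heading, or None."""
    starts = list(ITEM7.finditer(doc))
    # Walk backwards so the last usable Item 7 heading wins.
    for start in reversed(starts):
        end = ITEM8.search(doc, start.end())
        if end:
            return doc[start.start():end.end()]
    return None


# Hypothetical sample: a table of contents followed by the real section.
sample = (
    "Item 7. Management's Discussion and Analysis .... 25\n"
    "Item 8. Financial Statements .... 40\n"
    "Item 7. Management's Discussion and Analysis\n"
    "This Item 7 contains forward-looking statements.\n"
    "Item 8. Financial Statements and Supplementary Data\n"
)
section = last_section(sample)
```

Here `section` starts at the second Item 7 heading and skips the table-of-contents entries. Whether this behaves better than the single-pattern approach on the real filings would need to be checked against the actual files.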


This is the original solution. It does not work for the new file but does show the use of named groups. I am not familiar with StackOverflow protocol here. Should I delete this old code?

By using longer match strings, the number of matches can be reduced to just 2 for both item 7 and item 8 - the table of contents and the actual section.

So search for the second occurrence of item 7, and keep all text until item 8. This code uses Python named groups.

import re

with open('Output2.txt') as f:
    doc = f.read()

item7 = r"Item 7\.*\s*Management.s Discussion and Analysis of Financial Condition and Results of Operations"
item8 = r"Item 8\.*\s*Financial Statements"

discussion_pattern = re.compile(
    r"(?P<item7>" + item7 + ")"
    r"([\S\s]*)"
    r"(?P<item7heading>" + item7 +")"
    r"(?P<discussion>[\S\s]*)"
    r"(?P<item8heading>" + item8 + ")"
)       

match = re.search(discussion_pattern, doc)
discussion = match.group('discussion')
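
To see how the named groups behave, here is a small self-contained sketch on made-up text. The shortened headings, the group name `toc7`, the lazy quantifiers, and the sample string are assumptions for illustration and differ slightly from the pattern above.

```python
import re

# Shortened headings, assumed for this illustration only.
heading7 = r"Item 7\.*\s*Management.s Discussion"
heading8 = r"Item 8\.*\s*Financial Statements"

pattern = re.compile("".join([
    r"(?P<toc7>", heading7, r")",          # first occurrence (table of contents)
    r"[\S\s]*?",                           # skip ahead lazily
    r"(?P<item7heading>", heading7, r")",  # second occurrence (real heading)
    r"(?P<discussion>[\S\s]*?)",           # the text we want
    r"(?P<item8heading>", heading8, r")",  # next Item 8 heading
]))

# Hypothetical sample: table of contents, then the actual section.
sample = (
    "Item 7. Management's Discussion ... 25\n"
    "Item 8. Financial Statements ... 40\n"
    "Item 7. Management's Discussion\n"
    "Body text here.\n"
    "Item 8. Financial Statements and Supplementary Data\n"
)

match = pattern.search(sample)
# re.search returns None when nothing matches, so guard before using groups.
discussion = match.group("discussion").strip() if match else None
```

Note that this only works when the Item 7 heading really appears twice; for a file without a table of contents, `match` would be `None`.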

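Use this pattern with the s (DOTALL) option:

.*(Item 7.*?Item 8)

The result is in capturing group #1. The leading greedy .* consumes as much text as possible, so the group captures the last "Item 7" that is still followed by an "Item 8".

.               # Any character (including line breaks, because of the s option)
*               # (zero or more)(greedy)
(               # Capturing Group (1)
  Item 7        # "Item 7"
  .             # Any character (including line breaks, because of the s option)
  *?            # (zero or more)(lazy)
  Item 8        # "Item 8"
)               # End of Capturing Group (1)
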
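
In Python, the s option corresponds to `re.DOTALL`. A minimal sketch of applying the pattern, with a made-up sample string standing in for the real file:

```python
import re

# re.DOTALL is Python's equivalent of the "s" option: "." also matches newlines.
pattern = re.compile(r".*(Item 7.*?Item 8)", re.DOTALL)

# Hypothetical sample: table-of-contents entries, then the actual section.
sample = (
    "Item 7 overview (table of contents)\n"
    "Item 8 statements (table of contents)\n"
    "Item 7 Management's Discussion\n"
    "forward-looking statements...\n"
    "Item 8 Financial Statements\n"
)

match = pattern.search(sample)
# Group 1 holds the last "Item 7 ... Item 8" span, or there is no match at all.
last_pair = match.group(1) if match else None
```

One caveat: the pattern anchors on the literal text "Item 7", so a later mention inside the body (for example a sentence beginning "This Item 7, ...") would become the start of the captured span.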
re.findall(r"Item(?:(?!Item).)*7(?:(?!Item|7).)*Management(?:(?!Item|7|Management).)*Analysis[\s\S]*Item(?:(?!Item).)*8(?:(?!Item|8).)*Financial(?:(?!Item|8|Financial).)*Statements(?!.*?(?:Item(?:(?!Item).)*7)|(?:Item(?:(?!Item).)*8))",text, re.DOTALL)

Try this. I added a lookahead.
