
Python: Using regex to find the last pair of occurrences

Attached is a text file that I want to parse. I want to select the text between the last occurrence of each of these two headings:

  • (1) Item 7 Management Discussion Analysis

  • (2) Item 8 Financial Statements

I would usually use a regex like this:

re.findall(r"Item(?:(?!Item).)*7(?:(?!Item|7).)*Management(?:(?!Item|7|Management).)*Analysis[\s\S]*Item(?:(?!Item).)*8(?:(?!Item|8).)*Financial(?:(?!Item|8|Financial).)*Statements",text, re.DOTALL)

As you can see in the text file, the combination of Item 7 and Item 8 occurs often, but if I find the last match of (1) and the last match of (2), I greatly increase the probability of grabbing the desired text.

The desired text in my text file starts with:

"This Item 7, Management's Discussion and Analysis of Financial Condition and Results of Operations, and other parts of this Form 10-K contain forward-looking statements, within the meaning of the Private Securities Litigation Reform Act of 1995, that involve risks and....."

and ends with:

"Item 8. Financial Statements and Supplementary Data"

How can I adapt my regex to grab the text between this last pair of Item 7 and Item 8?

UPDATE:

I am also trying to parse this file using the same items.

This code has been rewritten. It now works with both the original data file (Output2.txt) and the newly added data file (Output2012.txt).

import re

# "." in "Management.s" matches any apostrophe variant
item7 = r"Item 7\.*\s*Management.s Discussion and Analysis of Financial Condition and Results of Operations"
# Greedy: a match runs up to the last Item 8 heading in the file
discussion_text = r"[\S\s]*"
item8 = r"Item 8\.*\s*Financial Statements"
discussion_pattern = re.compile(item7 + discussion_text + item8)

discussions = []
for input_file_name in ['Output2.txt', 'Output2012.txt']:
    with open(input_file_name) as f:
        doc = f.read()

    results = discussion_pattern.findall(doc)
    if not results:
        continue  # this file has no Item 7 ... Item 8 span

    # Some input files have a table of contents and others don't,
    # so just keep the last match
    discussion = results[-1]

    discussions.append((input_file_name, discussion))

The discussions variable contains the results for each of the data files.
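
The "last pair" idea can also be made explicit by locating the last Item 7 heading first and then searching for the first Item 8 heading after it. The following is a minimal sketch under that assumption, not the code from the answer above: the helper name last_section, the shortened heading patterns, and the sample text are all hypothetical.

```python
import re

# Shortened heading patterns (an assumption, to keep the sample small)
ITEM7 = re.compile(r"Item 7\.*\s*Management.s Discussion and Analysis")
ITEM8 = re.compile(r"Item 8\.*\s*Financial Statements")


def last_section(doc):
    """Return the text between the last Item 7 heading and the next Item 8 heading.

    Hypothetical helper: returns None if either heading is missing.
    """
    starts = list(ITEM7.finditer(doc))
    if not starts:
        return None
    start = starts[-1]                     # last Item 7 heading in the document
    end = ITEM8.search(doc, start.end())   # first Item 8 heading after it
    if end is None:
        return None
    return doc[start.end():end.start()]


# Made-up sample: a table of contents followed by the actual section
sample = (
    "Item 7. Management's Discussion and Analysis ... 25\n"
    "Item 8. Financial Statements ... 40\n"
    "Item 7. Management's Discussion and Analysis\n"
    "This Item 7 contains forward-looking statements.\n"
    "Item 8. Financial Statements and Supplementary Data\n"
)
print(last_section(sample))
```

Because the heading pattern requires "Management" after "Item 7", the mention of "This Item 7" inside the body text is not mistaken for a heading.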


This is the original solution. It does not work for the new file but does show the use of named groups. I am not familiar with StackOverflow protocol here. Should I delete this old code?

By using longer match strings, the number of matches for both Item 7 and Item 8 can be reduced to just two: the table of contents entry and the actual section heading.

So search for the second occurrence of Item 7, and keep all text until Item 8. This code uses Python named groups.

import re

with open('Output2.txt') as f:
    doc = f.read()

item7 = r"Item 7\.*\s*Management.s Discussion and Analysis of Financial Condition and Results of Operations"
item8 = r"Item 8\.*\s*Financial Statements"

# item7: table of contents entry; item7heading: the actual section heading
discussion_pattern = re.compile(
    r"(?P<item7>" + item7 + ")"
    r"([\S\s]*)"
    r"(?P<item7heading>" + item7 + ")"
    r"(?P<discussion>[\S\s]*)"
    r"(?P<item8heading>" + item8 + ")"
)

match = discussion_pattern.search(doc)
# search() returns None unless Item 7 occurs twice and is followed by Item 8
discussion = match.group('discussion') if match else None
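
To see what the named groups capture, the same pattern shape can be run on a short made-up string. This is only an illustrative sketch: the sample text, the shortened heading patterns, and the group name toc_item7 are assumptions chosen so the example fits in a few lines.

```python
import re

# Shortened heading patterns (an assumption, to keep the sample small)
item7 = r"Item 7\.*\s*Management.s Discussion and Analysis"
item8 = r"Item 8\.*\s*Financial Statements"

pattern = re.compile(
    "(?P<toc_item7>" + item7 + ")"          # first occurrence: table of contents
    + r"[\S\s]*"                            # everything up to the section heading
    + "(?P<item7heading>" + item7 + ")"     # later occurrence: section heading
    + r"(?P<discussion>[\S\s]*)"            # the text we want
    + "(?P<item8heading>" + item8 + ")"     # heading that ends the section
)

sample = (
    "Item 7. Management's Discussion and Analysis 25\n"
    "Item 8. Financial Statements 40\n"
    "Item 7. Management's Discussion and Analysis\n"
    "Revenue grew during the year.\n"
    "Item 8. Financial Statements and Supplementary Data\n"
)

match = pattern.search(sample)
discussion = match.group("discussion").strip()
print(discussion)
```

Each group can be read back by name, for example match.group("item8heading") or match.start("item7heading"), which is easier to follow than numeric group indexes.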

Use this pattern with the s option (re.DOTALL in Python):

.*(Item 7.*?Item 8)

The result is in capturing group #1.

.               # Any character (including line breaks, given the s option)
*               # (zero or more)(greedy)
(               # Capturing Group (1)
  Item 7        # "Item 7"
  .             # Any character (including line breaks, given the s option)
  *?            # (zero or more)(lazy)
  Item 8        # "Item 8"
)               # End of Capturing Group (1)

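In Python, the pattern above could be applied roughly as follows. This is a small sketch on a made-up sample string, not on the file from the question.

```python
import re

# Made-up sample: a contents line followed by the actual section
sample = (
    "Contents: Item 7 p.25, Item 8 p.40\n"
    "Item 7 real discussion\n"
    "Item 8 statements"
)

# The greedy ".*" consumes as much as possible, so group 1 starts at the
# last "Item 7" that is still followed by an "Item 8"
match = re.search(r".*(Item 7.*?Item 8)", sample, re.DOTALL)
last_pair = match.group(1)
print(last_pair)
```

One caveat: the match starts at the last "Item 7" before an "Item 8", so if the section body itself mentions "Item 7" again, the captured text begins at that later mention.
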
re.findall(r"Item(?:(?!Item).)*7(?:(?!Item|7).)*Management(?:(?!Item|7|Management).)*Analysis[\s\S]*Item(?:(?!Item).)*8(?:(?!Item|8).)*Financial(?:(?!Item|8|Financial).)*Statements(?!.*?(?:Item(?:(?!Item).)*7)|(?:Item(?:(?!Item).)*8))",text, re.DOTALL)

Try this. I added a lookahead.
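
The lookahead idea can be shown in a smaller form. The sketch below is not the pattern from this answer; it is a simplified variant, assumed for illustration, that accepts an "Item 7" only if no further "Item 7" occurs later in the text. The sample string is made up.

```python
import re

# Negative lookahead: reject any "Item 7" that has another "Item 7" after it
last_pair = re.compile(r"Item 7(?!.*Item 7)(.*?)Item 8", re.DOTALL)

sample = (
    "Item 7 Management Discussion ... 25\n"
    "Item 8 Financial Statements ... 40\n"
    "Item 7 Management Discussion\n"
    "Body of the discussion.\n"
    "Item 8 Financial Statements\n"
)

match = last_pair.search(sample)
body = match.group(1).strip()
print(body)
```

The lookahead rescans the rest of the text at every candidate position, so on very large files this may be slower than finding the heading positions directly.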
