Extracting a portion of a string between start and end matches using regular expressions in Python

I am trying to extract one portion of a string using regular expressions in Python, delimited by two specific matches. To be specific, here is an example text:

example = """
    The forward-looking statements are made as of the date of this report,
    and the Company assumes no obligation to update the forward-looking statements 
    or to update the reasons why actual results could differ from those projected 
    in the forward-looking statements. PART 1. ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a community bank operating 
    in Northwest Ohio since 1897.ITEM 2. PROPERTIES Our principal office is located in Archbold, Ohio.
    The Bank operates from the facilities at 307 North Defiance Street. 
    In addition, the Bank owns the property from 200 to 208 Ditto Street, 
    Archbold, Ohio, which it uses for Bank parking and a community mini-park area.
    """

I would like to extract the portion of the text that starts at the start match 'ITEM 1.' and ends just before the end match 'ITEM 2.', so the final result should look like this:

final_result = """
    ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a community bank operating 
    in Northwest Ohio since 1897.
    """

In fact, the example above is just one of a large collection of similar texts, so I hope the answer can be fairly general, so that I can adapt it to the different conditions that other texts might have. Thank you in advance!

import re

example = """
The forward-looking statements are made as of the date of this report,
and the Company assumes no obligation to update the forward-looking statements 
or to update the reasons why actual results could differ from those projected 
in the forward-looking statements. PART 1. ITEM 1. BUSINESS 
General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
incorporated under the laws of Ohio in 1985 and elected to become a financial 
holding company under the Federal Reserve in 2014. Our primary subsidiary, 
The Farmers & Merchants State Bank (Bank) is a community bank operating 
in Northwest Ohio since 1897.ITEM 2. PROPERTIES Our principal office is located in Archbold, Ohio.
The Bank operates from the facilities at 307 North Defiance Street. 
In addition, the Bank owns the property from 200 to 208 Ditto Street, 
Archbold, Ohio, which it uses for Bank parking and a community mini-park area.
"""


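def get_text_between(text, mark1, mark2):
    # Escape the markers so characters such as '.' are matched literally.
    regex = '({}.*?){}'.format(re.escape(mark1), re.escape(mark2))
    # Search the text argument; re.DOTALL lets '.' match newlines as well.
    match = re.search(regex, text, re.DOTALL)
    if match:
        return match.group(1)
    return None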

if __name__ == '__main__':
    text = get_text_between(example, 'ITEM 1', 'ITEM 2')
    if text:
        print(text)
example = """
    The forward-looking statements are made as of the date of this report,
    and the Company assumes no obligation to update the forward-looking statements 
    or to update the reasons why actual results could differ from those projected 
    in the forward-looking statements. PART 1. ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a community bank operating 
    in Northwest Ohio since 1897.ITEM 2. PROPERTIES Our principal office is located in Archbold, Ohio.
    The Bank operates from the facilities at 307 North Defiance Street. 
    In addition, the Bank owns the property from 200 to 208 Ditto Street, 
    Archbold, Ohio, which it uses for Bank parking and a community mini-park area.
    """
import re

# Replace newlines with spaces so '.' can span the whole text without re.DOTALL.
example2 = " ".join(example.split("\n"))
match = re.search(r"(ITEM 1.*?)ITEM 2", example2)
if match:
    print(match.group(1))

This should work.
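Since the question asks for something that can be adapted to other texts, here is a minimal sketch of a more tolerant variant. The helper name `extract_section` and its behaviour (case-insensitive matching, any run of whitespace accepted between the words of a marker) are assumptions made for illustration, not part of the original answers.

```python
import re


def extract_section(text, start, end):
    """Return the text from `start` up to (not including) `end`, or None.

    Hypothetical helper: markers are matched case-insensitively, and the
    spaces inside a marker match any run of whitespace, including newlines.
    """
    def to_pattern(marker):
        # Escape each word so '.' is literal, then allow flexible whitespace.
        return r"\s+".join(re.escape(word) for word in marker.split())

    pattern = "({}.*?){}".format(to_pattern(start), to_pattern(end))
    match = re.search(pattern, text, re.DOTALL | re.IGNORECASE)
    return match.group(1) if match else None


# Made-up sample in which the start marker is split across two lines.
sample = "intro. Item\n1. BUSINESS text here.ITEM 2. PROPERTIES more"
print(repr(extract_section(sample, "ITEM 1.", "ITEM 2.")))
```

If the markers are missing from a given text, the function returns None, so the caller can skip that document instead of failing.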

This way you can capture the part of the string you want to extract.

import re
example = """
    The forward-looking statements are made as of the date of this report,
    and the Company assumes no obligation to update the forward-looking statements 
    or to update the reasons why actual results could differ from those projected 
    in the forward-looking statements. PART 1. ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a community bank operating 
    in Northwest Ohio since 1897.ITEM 2. PROPERTIES Our principal office is located in Archbold, Ohio.
    The Bank operates from the facilities at 307 North Defiance Street. 
    In addition, the Bank owns the property from 200 to 208 Ditto Street, 
    Archbold, Ohio, which it uses for Bank parking and a community mini-park area.
"""
final_result = ""
# A raw string avoids invalid-escape warnings; the lazy *? stops at the first 'ITEM 2'.
search = re.search(r'(ITEM 1[\s\S]*?)ITEM 2', example)
if search:
    final_result = search.group(1)
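Whether the quantifier is greedy (`*`) or lazy (`*?`) starts to matter as soon as the end marker occurs more than once in a text. A small sketch with a made-up string (not taken from the question) shows the difference:

```python
import re

# Made-up text in which the end marker "ITEM 2" occurs twice.
text = "ITEM 1 first part ITEM 2 second part, see ITEM 2 again"

# Greedy: runs up to the LAST occurrence of the end marker.
greedy = re.search(r"(ITEM 1[\s\S]*)ITEM 2", text).group(1)
# Lazy: stops at the FIRST occurrence of the end marker.
lazy = re.search(r"(ITEM 1[\s\S]*?)ITEM 2", text).group(1)

print(repr(greedy))  # 'ITEM 1 first part ITEM 2 second part, see '
print(repr(lazy))    # 'ITEM 1 first part '
```

For texts where a later section refers back to 'ITEM 2', the lazy form is usually the safer choice.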
