python3 extract string between two strings in a txt file

I am new to Python. I am trying to extract one string ("concluded that our disclosure controls were effective as of") from a txt file ("infile.txt"). The file is relatively large, and I need to look for the above string in one particular section (between "ITEM&nbsp;9A" and "ITEM&nbsp;9B"). An example of such a section follows:

</A>ITEM&nbsp;9A. CONTROLS AND PROCEDURES. </B></FONT></P> <P STYLE="margin-top:6px;margin-bottom:0px"><FONT STYLE="font-family:Times New Roman" SIZE="2"><B>Evaluation of Disclosure Controls and Procedures </B></FONT> STYLE="margin-top:6px;margin-bottom:0px; text-indent:4%"><FONT STYLE="font-family:Times New Roman" SIZE="2">Under the supervision and with the participation of our management, including our Chief Executive Officer and Chief Financial Officer, we conducted an evaluation of the effectiveness of our disclosure controls and procedures (as defined in Rules 13a-15(e) and 15d-15(e) under the Securities Exchange Act of 1934, as amended (Exchange Act)), as of the end of the period covered by this Annual Report on Form 10-K. Management recognizes that any controls and procedures, no matter how well designed and operated, can provide only reasonable assurance of achieving their objectives and management necessarily applies its judgment in evaluating the cost-benefit relationship of possible controls and procedures. Based on such evaluation, our Chief Executive Officer and Chief Financial Officer concluded that our disclosure controls and procedures were effective as of September&nbsp;28, 2012. </FONT></P> <P STYLE="margin-top:18px;margin-bottom:0px"><FONT STYLE="font-family:Times New Roman" SIZE="2"><B>Management&#146;s Annual Report on Internal Control over Financial Reporting </B></FONT> <P STYLE="margin-top:6px;margin-bottom:0px; text-indent:4%"><FONT STYLE="font-family:Times New Roman" SIZE="2">This Annual Report does not include a report of management&#146;s assessment regarding internal control over financial reporting or an attestation report of the company&#146;s registered public accounting firm due to a transition period established by rules of the Securities and Exchange Commission for newly public companies. 
</FONT> <P STYLE="margin-top:18px;margin-bottom:0px"><FONT STYLE="font-family:Times New Roman" SIZE="2"><B>Changes in Internal Control over Financial Reporting </B></FONT></P> <P STYLE="margin-top:6px;margin-bottom:0px; text-indent:4%"><FONT STYLE="font-family:Times New Roman" SIZE="2">There were no changes in our internal control over financial reporting (as defined in Rule&nbsp;13a-15(f) under the Exchange Act) during the quarter ended September&nbsp;28, 2012, that have materially affected, or are reasonably likely to materially affect, our internal control over financial reporting. </FONT> <P STYLE="margin-top:18px;margin-bottom:0px"><FONT STYLE="font-family:Times New Roman" SIZE="2"><B><A NAME="tx431171_16"></A>ITEM&nbsp;9B. OTHER INFORMATION.

If the section contains the desired string "concluded that our disclosure controls were effective as of" (the section above has it approximately in the middle), I would like to write "1" to a separate "output.csv" file; if it does not, write "not found". The starting point of the section does not always coincide with the start of a line. I am sorry, but I could not figure out how to start. I am using Python 3.6.

Thank you very much in advance!

You can use regular expressions to extract text between a given opener and closer:

import re

opener = re.escape(r"ITEM&nbsp;9A")
closer = re.escape(r"ITEM&nbsp;9B")

You can loop over the extracts with re.finditer and then keep only the extracts that contain the target string, using the in operator:

target_string = "concluded that our disclosure controls were effective as of"
for mo in re.finditer(opener + '(.*?)' + closer, inputstring, re.DOTALL):
    extract = mo.group(1)
    if target_string in extract:
        ...

Hopefully, this is enough to get you started :-)
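The loop above can be extended into an end-to-end sketch that reads the file, checks the section, and writes the result. This is only one possible arrangement: the helper names `section_has_target` and `write_result` are hypothetical, and the UTF-8 encoding with replacement of undecodable bytes is an assumption about "infile.txt", not something stated in the question.

```python
import csv
import re

# Same opener/closer idea as above, compiled once.
OPENER = re.escape("ITEM&nbsp;9A")
CLOSER = re.escape("ITEM&nbsp;9B")
SECTION_RE = re.compile(OPENER + "(.*?)" + CLOSER, re.DOTALL)


def section_has_target(text, target):
    """Return True if any 9A..9B section of `text` contains `target`."""
    return any(target in mo.group(1) for mo in SECTION_RE.finditer(text))


def write_result(in_path, out_path, target):
    """Write "1" or "not found" to `out_path` (hypothetical helper)."""
    # Read the whole file at once, so it does not matter whether the
    # section starts at the beginning of a line. Encoding is an assumption.
    with open(in_path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    result = "1" if section_has_target(text, target) else "not found"
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerow([result])
    return result
```

A call would look like `write_result("infile.txt", "output.csv", "concluded that our disclosure controls were effective as of")`. Note that the sample section in the question reads "disclosure controls and procedures were effective as of", so an exact substring test with the shorter target string would report "not found" for that sample.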

You can use re.findall:

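import re

# string_data_from_file is assumed to hold the full text of infile.txt.
# Note: this pattern captures only the text between the "ITEM&nbsp;9A. "
# heading and the first </B> on the same line, not the whole section.
the_data = re.findall(r"</A>ITEM&nbsp;9A\. (.*?)</B>", string_data_from_file)

if len(the_data) > 0:
    print("1")
else:
    print("Not found")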
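A variant of the re.findall approach that looks inside the whole 9A to 9B section might be sketched as follows. Because the sample section says "disclosure controls and procedures were effective as of", this sketch assumes that an optional "and procedures" and arbitrary whitespace between words should be tolerated; that tolerance, and the name `find_effective_statement`, are assumptions rather than part of the original answer.

```python
import re

# re.DOTALL lets (.*?) span line breaks inside the section.
SECTION_RE = re.compile(r"ITEM&nbsp;9A(.*?)ITEM&nbsp;9B", re.DOTALL)

# Assumed wording tolerance: optional "and procedures", flexible whitespace.
TARGET_RE = re.compile(
    r"concluded\s+that\s+our\s+disclosure\s+controls"
    r"(?:\s+and\s+procedures)?\s+were\s+effective\s+as\s+of",
    re.IGNORECASE,
)


def find_effective_statement(text):
    """Return True if a 9A..9B section contains the (loosened) statement."""
    sections = SECTION_RE.findall(text)
    return any(TARGET_RE.search(section) for section in sections)
```

If exact matching is required instead, the plain `in` test on each extracted section is simpler and stricter.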
