Extracting specific section from txt file - python

I want to extract "MANAGEMENT'S DISCUSSION AND ANALYSIS" section from the website https://www.sec.gov/Archives/edgar/data/3662/0000950170-98-000413.txt

I want to generalize the process so it works with other files on the same website: https://www.sec.gov/

You can do this while iterating over the file line by line. Start recording lines into a list when you reach the beginning of the section, and stop when you reach the end of the section or the start of the next one. Once the section has been collected into a list of lines, join the list with newline characters to output it. For your particular example, here is something you could do:

import re

your_file = "sec.txt"
# The dots are escaped so they match a literal "." rather than any character.
start_pattern = r"^ITEM 7\. MANAGEMENT'S DISCUSSION AND ANALYSIS"
stop_pattern = r"^ITEM 8\."
recording = False
output_section = []

with open(your_file) as f:
    for line in f:
        if not recording:
            # Start recording at the section heading.
            if re.search(start_pattern, line):
                recording = True
                output_section.append(line.strip())
        elif re.search(stop_pattern, line):
            # Next section reached: leave the loop so the print below still runs.
            break
        else:
            output_section.append(line.strip())

print('\n'.join(output_section))

That final print statement prints the section starting at the line that begins with "ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS" and ending just before the line that begins with "ITEM 8." Note that the caret character (^) matches the beginning of the line. I tested this locally by downloading the document you pointed to as sec.txt, and it worked for me.
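To illustrate the anchoring, here is a small sketch with made-up lines (not taken from the filing). It shows that the anchored pattern only matches at the start of a line, so a passing mention of "ITEM 8." in the middle of a sentence would not end the section early. The dot is escaped so that it matches a literal period, which is an optional tightening.

```python
import re

stop_pattern = r"^ITEM 8\."

# Hypothetical sample lines, invented for illustration.
heading = "ITEM 8. FINANCIAL STATEMENTS AND SUPPLEMENTARY DATA"
mention = "See the notes in ITEM 8. for further detail."

# re.search scans the whole string, but "^" pins the match to position 0.
heading_matches = re.search(stop_pattern, heading) is not None
mention_matches = re.search(stop_pattern, mention) is not None

print(heading_matches)  # True
print(mention_matches)  # False
```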

You could generalize this for other documents by setting start_pattern and stop_pattern from arguments passed on the command line. For example, merge the following with the code I posted above:

import sys

start_pattern = sys.argv[1]
stop_pattern = sys.argv[2]

Then you could call your script like this to get the same result as hard-coded above:

python name_of_your_script.py "^ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS" "^ITEM 8."
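If you would rather not edit the script for each filing, the same loop can be wrapped in a function. The sketch below is one way to do that. The name extract_section and the optional third argument for the file path are my own additions for illustration, not part of the code above.

```python
import re
import sys


def extract_section(lines, start_pattern, stop_pattern):
    """Return lines from the first start match up to, but not including, the next stop match."""
    recording = False
    section = []
    for line in lines:
        if not recording:
            if re.search(start_pattern, line):
                recording = True
                section.append(line.strip())
        elif re.search(stop_pattern, line):
            break
        else:
            section.append(line.strip())
    return section


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Hypothetical usage: python name_of_your_script.py START STOP [FILE]
    path = sys.argv[3] if len(sys.argv) > 3 else "sec.txt"
    with open(path) as f:
        print("\n".join(extract_section(f, sys.argv[1], sys.argv[2])))
```

Because the function accepts any iterable of lines, it can be tried on a plain list of strings before pointing it at a downloaded filing. If the start pattern never matches, it returns an empty list.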

I hope this helps.

Using this, you can extract the content of a specific section:

# re.DOTALL lets "." match newlines, so the match can span multiple lines.
extract = re.findall(r'(?<=ITEM 7\.)(.*?)(?=ITEM 8\.)', text, flags=re.DOTALL)
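As a rough sketch of how this behaves, here it is applied to a short invented string rather than a real filing. The sample text and the variable names pattern and section are assumptions for illustration. The pattern is written with re.DOTALL passed as a flag and with the dots escaped. Since findall returns every non-overlapping match, a document that repeats the headings would yield more than one candidate.

```python
import re

# Invented sample standing in for the contents of a filing.
text = (
    "ITEM 6. SELECTED FINANCIAL DATA\n"
    "Five-year summary.\n"
    "ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS\n"
    "Revenue increased.\n"
    "Costs decreased.\n"
    "ITEM 8. FINANCIAL STATEMENTS\n"
)

pattern = r"(?<=ITEM 7\.)(.*?)(?=ITEM 8\.)"
extract = re.findall(pattern, text, flags=re.DOTALL)

# findall returns a list with one string per match; take the first here.
section = extract[0].strip()
print(section)
```

The "ITEM 7." prefix itself is not part of the result, because it sits inside the lookbehind.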
