Python: scrape the string that follows a given string

Question

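I have 1,000+ .txt files to scrape (Python). I have created a file_list variable that lists the paths of all the .txt files. I want to scrape five fields: file form, date, company, company ID, and price range. The first four give me no trouble, because they appear in a well-structured way on separate lines at the start of every .txt file: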

FILE FORM:      10-K
DATE:           20050630
COMPANY:        APPLE INC
COMPANY CIK:    123456789

I use the following code for these four fields:

    import sys, os, re

    exemptions = []
    for eachfile in file_list:
        line2 = ""  # accumulates the full text of the file while it is read line by line
        with open(eachfile, 'r') as f:
            for line in f:
                line2 += line  # append each line to the full text
                if "FILE FORM" in line:
                    # strip the trailing newline and remove the "FILE FORM:" label
                    exemptions.append(line.strip('\n').replace("FILE FORM:", ""))
                elif "COMPANY CIK" in line:
                    # checked before "COMPANY", which would otherwise also match this line
                    exemptions.append(line.rstrip('\n').replace("COMPANY CIK:", ""))
                elif "COMPANY" in line:
                    exemptions.append(line.rstrip('\n').replace("COMPANY:", ""))
                elif "DATE" in line:
                    exemptions.append(line.rstrip('\n').replace("DATE:", ""))
    print(exemptions)

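As in the example above, this gives me the list exemptions with all the associated values. However, the price range field sits in the middle of the .txt file, in a sentence like this: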

We anticipate that the initial public offering price will be between $&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;and
$&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;per share.

I don't know how to capture the "$whateveritis and $whateveritis per share." part as my fifth and final variable. The good news is that many files use the same structure, and sometimes there are dollar amounts in place of the &nbsp; entities. Example: We anticipate that the initial public offering price will be between $12.00 and $15.00  per share.

I would like this "12.00; and; 15.00" as the fifth variable in my exemptions list (or something similar that I can easily work with in a CSV file afterwards).

Thank you very much in advance.

Answer 1

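It looks like you have already imported the regular-expression module, so why not use it? A regex such as \$[\d.]+ and \$[\d.]+ should match the prices, and you can easily refine it from there: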

    import sys, os, re

    exemptions = []
    for eachfile in file_list:
        line2 = ""
        with open(eachfile, 'r') as f:
            for line in f:
                line2 = line2 + line

                # raw string so the backslashes reach the regex engine unchanged
                m = re.search(r'\$[\d.]+ and \$[\d.]+', line)

                if "FILE FORM" in line:
                    .
                    .
                    .
                elif m:
                    exemptions.append(m.group(0))   # m.group(0) is the first match on the line; refine it from there

    print(exemptions)
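
As a possible refinement, here is a minimal sketch that also copes with the blank &nbsp; placeholders and with the sentence being split across two lines, as in the sample from the question. It assumes the wording is always "between $... and $... per share" and that it is run on the full file text (for example the line2 string) rather than on a single line. The names PRICE_RANGE and extract_price_range are made up for illustration.

```python
import html
import re

# Assumed sentence shape: "between $<low> and $<high> per share".
# Both amounts are optional so that blank placeholders still match.
PRICE_RANGE = re.compile(
    r"between\s+\$\s*(?P<low>\d[\d,]*(?:\.\d+)?)?"
    r"\s*and\s+\$\s*(?P<high>\d[\d,]*(?:\.\d+)?)?"
    r"\s*per\s+share",
    re.IGNORECASE,
)

def extract_price_range(text):
    """Return (low, high) as strings, or None if the sentence is absent.

    A blank placeholder yields an empty string for that amount.
    """
    # html.unescape turns "&nbsp;" into U+00A0; normalise it to a plain space
    cleaned = html.unescape(text).replace("\xa0", " ")
    m = PRICE_RANGE.search(cleaned)
    if m is None:
        return None
    return (m.group("low") or "", m.group("high") or "")
```

Inside the loop this could be called once per file, after reading, with something like `"; and; ".join(result)` to build the "12.00; and; 15.00" style value when a result is found. Files where the amounts are still blank return two empty strings, so they can be told apart from files where the sentence is missing altogether.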

Solution 1
0 2019-08-16 23:18:07
