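Python: scrape the strings that follow a given string
I have 1000+ .txt files to scrape (Python). I have already created a file_list variable that lists the paths of all the .txt files. I need to scrape five fields: file form, date, company, company ID, and price range. The first four give me no trouble, because they appear at the top of every .txt file in a very structured way, each on its own line: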
FILE FORM: 10-K
DATE: 20050630
COMPANY: APPLE INC
COMPANY CIK: 123456789
I use the following code for these four:
import sys, os, re
exemptions = []
for eachfile in file_list:
    line2 = ""  # accumulates the full text of the file, line by line
    with open(eachfile, 'r') as f:
        for line in f:
            line2 += line  # append each line
            if "FILE FORM" in line:
                exemptions.append(line.rstrip('\n').replace("FILE FORM:", ""))  # strip the trailing newline and drop the "FILE FORM:" label
            elif "COMPANY CIK" in line:  # test this before "COMPANY", which is a substring of it
                exemptions.append(line.rstrip('\n').replace("COMPANY CIK:", ""))  # add field
            elif "COMPANY" in line:
                exemptions.append(line.rstrip('\n').replace("COMPANY:", ""))  # rstrip removes the trailing '\n'
            elif "DATE" in line:
                exemptions.append(line.rstrip('\n').replace("DATE:", ""))  # add field
print(exemptions)
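As in the example above, this gives me the list exemptions with all the associated values. However, the "price range" field sits in the middle of the .txt file, in a sentence like this: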
We anticipate that the initial public offering price will be between $ and
$ per share.
and I don't know how to capture the "$whateveritis; and $whateveritis; per share." part as my fifth and final variable. The good news is that many files use the same structure, and sometimes they contain actual $ amounts instead of "&nbsp". Example:
We anticipate that the initial public offering price will be between $12.00 and $15.00 per share.
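I would like to get this "12.00; and; 15.00" as the fifth variable in my exemptions list (or something similar that I can easily work with afterwards in a csv file).
Thank you very much in advance.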
It looks like you have already imported re, so why not use it? A regular expression such as \$[\d.]+ and \$[\d.]+ should match the prices, and you can easily refine it from there:
import sys, os, re
exemptions = []
for eachfile in file_list:
    line2 = ""
    with open(eachfile, 'r') as f:
        for line in f:
            line2 = line2 + line
            # raw string so the backslashes reach the regex engine unchanged
            m = re.search(r'\$[\d.]+ and \$[\d.]+', line)
            if "FILE FORM" in line:
                .
                .
                .
            elif m:
                exemptions.append(m.group(0))  # m.group(0) is the full text of the first match on this line; refine it from there
print(exemptions)
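One caveat with searching line by line: in the sample from the question the sentence wraps ("... between $ and" on one line, "$ per share." on the next), so a per-line search may miss it. Below is a minimal sketch of one way around that, assuming you search the whole file text (for example the line2 string built in the loop) after the loop instead. The names PRICE_RANGE and extract_price_range are made up for illustration, and the pattern assumes the amounts are written as plain numbers with an optional decimal part.

```python
import re

# Hypothetical pattern, not from the original answer: two named groups capture
# the low and high price, and \s+ lets the sentence wrap across lines.
PRICE_RANGE = re.compile(
    r"between\s+\$\s*(?P<low>\d[\d,]*(?:\.\d+)?)"
    r"\s+and\s+\$\s*(?P<high>\d[\d,]*(?:\.\d+)?)"
    r"\s+per\s+share",
    re.IGNORECASE,
)


def extract_price_range(text):
    """Return (low, high) as strings, or None when no filled-in range is found."""
    # Treat "&nbsp;" entities and the U+00A0 character as ordinary spaces.
    cleaned = text.replace("&nbsp;", " ").replace("\u00a0", " ")
    m = PRICE_RANGE.search(cleaned)
    if m is None:
        return None  # e.g. the placeholder form "between $ and $ per share"
    return m.group("low"), m.group("high")


sample = (
    "We anticipate that the initial public offering price will be between $12.00 and\n"
    "$15.00 per share."
)
print(extract_price_range(sample))
```

If that suits your files, you could call extract_price_range(line2) once per file after the inner loop and append something like "; and; ".join(result) to exemptions when the result is not None. Files that only contain the blank placeholder would simply give None.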