
How to use a "for loop" in Python to extract the year and firm name (for earnings call transcripts) from a txt file
I have a txt file of this type:
Thomson Reuters StreetEvents Event Transcript
E D I T E D V E R S I O N
Q3 2003 ABM Industries Earnings Conference Call
SEPTEMBER 10, 2003 / 1:00PM GMT
================================================================================
Corporate Participants
================================================================================
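My txt file is saved at: C:\sam\2003-Sep-10-ABM.N-140985434256-Transcript.txt.
I only want to extract the transcript year (e.g. 2003) and the firm name (e.g. ABM Industries). I used the code below, but it returns every four-digit number in the file, so I end up with many years.
Code: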
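import re
# Read the whole transcript; "with" closes the file afterwards
with open("C:\\sam\\2003-Sep-10-ABM.N-140985434256-Transcript.txt", 'r') as f:
    content = f.read()
# Matches every run of four digits anywhere in the file
pattern = r"\d{4}"
years = re.findall(pattern, content)
for year in years:
    print(year)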
My output: 2003 2003 2003 2003 2003 2003 2003 2003 2002 2003 2003 2003 2003 2003 2003 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2002 2003 2003 2003 2003 2003 2003 2003 2003 2003 2003 2003 2004 2004 2004 2004 2004 2019 2019
Expected output: 2003 ABM Industries
If I understand correctly, this should work:
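import re
content = """Q3 2003 ABM Industries Earnings Conference Call
SEPTEMBER 10, 2003 / 1:00PM GMT"""
# Four digits followed by two whitespace-separated words; raw string
# avoids escape warnings, and "{4}+" is dropped because it raises
# "multiple repeat" on Python versions before 3.11
pattern = r"\d{4}\s\w+\s\w+"
# findall returns all matches; the first one is on the title line
years = re.findall(pattern, content)[0]
print(years)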
Output: "2003 ABM Industries"
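Note that a pattern taking exactly two words after the year would cut off a longer firm name. One possible alternative is to anchor on the title line itself. The sketch below assumes every transcript contains a line of the form "Q<n> <year> <firm> Earnings Conference Call"; that layout is inferred from the single sample in the question, and the names TITLE_RE and extract_year_and_firm are made up for illustration.

```python
import re

# Assumed title-line layout (taken from the one sample in the question):
#   "Q3 2003 ABM Industries Earnings Conference Call"
TITLE_RE = re.compile(
    r"^Q[1-4]\s+(?P<year>\d{4})\s+(?P<firm>.+?)\s+Earnings\s+Conference\s+Call\s*$",
    re.MULTILINE,
)


def extract_year_and_firm(content):
    """Return (year, firm) from the first matching title line, or None."""
    match = TITLE_RE.search(content)
    if match is None:
        return None
    return match.group("year"), match.group("firm")


sample = """Thomson Reuters StreetEvents Event Transcript
E D I T E D V E R S I O N
Q3 2003 ABM Industries Earnings Conference Call
SEPTEMBER 10, 2003 / 1:00PM GMT"""

result = extract_year_and_firm(sample)
if result is not None:
    year, firm = result
    print(year, firm)
```

To handle several transcripts, the same function could be called inside a for loop over the file paths, reading each file's contents first. Titles that do not follow the assumed layout return None and would need their own pattern.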