Python - Regex - combination of letters and numbers (undefined length)
I am trying to get a file ID from a text file. In the above example the file name is d735023ds1.htm, which I want to extract in order to build another URL. However, these file names differ in length, so I need a single regular expression that covers all possibilities, for example:
d804478ds1a.htm
d618448ds1a.htm
d618448.htm
for cik in leftover_cik_list:
    r = requests.get(filing.url)
    content = str(r.content)
    fileID = None
    for line in content.split("\n"):
        if fileID == None:
            fileIDIndex = line.find("<FILENAME>")
            if fileIDIndex != -1:
                trimmedText = line[fileIDIndex:]
                result = RegEx.search(r"^[\w\d.htm]*$", trimmedText)
                if result:
                    fileID = result.group()
                    print("fileID", fileID)
    document_link = "https://www.sec.gov/Archives/edgar/data/{0}/{1}/{2}.htm".format(cik, accession_number, fileID)
    print("Document Link to S-1:", document_link)
import re
...
# Raw string so that \d and \. reach the regex engine unchanged
result = re.search(r'^d\d{1,6}.+\.htm$', trimmedText)
if result:
    fileID = result.group()
^d = Start with a d
\d{1,6} = Look for 1-6 digits; if there could be an unlimited number of digits, replace it with \d{1,}
.+ = Wildcard (one or more of any character)
\.htm$ = End in .htm
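As a quick check of the pattern explained above, the following sketch runs it against the sample file names from the question. The `samples` list and variable names are illustrative only. Note that the `^` anchor requires the string to begin with the file name itself, so a line that still starts with a `<FILENAME>` tag would not match.

```python
import re

# Pattern from the answer above, written as a raw string.
pattern = re.compile(r"^d\d{1,6}.+\.htm$")

# Sample file names taken from the question.
samples = ["d735023ds1.htm", "d804478ds1a.htm", "d618448ds1a.htm", "d618448.htm"]

# Keep every name the pattern accepts.
matched = [name for name in samples if pattern.search(name)]
print(matched)

# Because of the ^ anchor, a line that still carries the tag does not match.
tagged = pattern.search("<FILENAME>d735023ds1.htm")
print(tagged)
```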
You should try re.match(), which searches for a pattern at the beginning of the input string. Also, your regex is incorrect: you have to add a backslash before the ., because a dot means "any character" in regex.
import re
# Raw string so that \w and \. reach the regex engine unchanged
result = re.match(r'[\w]+\.htm', trimmedText)
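To illustrate the difference between matching at the start and searching anywhere, here is a small sketch. It assumes a line of the form `<FILENAME>d735023ds1.htm`, which is a guess at the input layout based on the `line.find("<FILENAME>")` call in the question; the variable names are hypothetical.

```python
import re

# Assumed input line: the tag followed directly by the file name.
line = "<FILENAME>d735023ds1.htm"
pattern = r"[\w]+\.htm"

# re.match only tries position 0, where "<" is not a word character.
at_start = re.match(pattern, line)
print(at_start)

# re.search scans forward and finds the file name after the tag.
anywhere = re.search(pattern, line)
print(anywhere.group())
```

If the input really does begin with the tag, `re.match` would only succeed after the tag has been sliced off, whereas `re.search` works on the line as it is.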
Try this regex:
import re

files = [
    "d804478ds1a.htm",
    "d618448ds1a.htm",
    "d618448.htm"
]

for f in files:
    match = re.search(r"d\w+\.htm", f)
    print(match.group())
d804478ds1a.htm
d618448ds1a.htm
d618448.htm
The assumptions in the above are that the file name starts with a d, ends with .htm, and contains only letters, digits and underscores.
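Under the same assumptions, the pattern can be tied to the `<FILENAME>` tag with a capture group, so the file name is pulled out of the surrounding text in one step. This is only a sketch: `extract_file_name`, `FILENAME_RE` and `sample_text` are hypothetical names, and the sample excerpt is made up for illustration, so the real filing layout may differ.

```python
import re

# Capture a name that starts with "d", contains only word characters,
# and ends in ".htm", directly after the <FILENAME> tag.
FILENAME_RE = re.compile(r"<FILENAME>\s*(d\w+\.htm)")


def extract_file_name(text):
    """Return the first file name after a <FILENAME> tag, or None."""
    match = FILENAME_RE.search(text)
    return match.group(1) if match else None


# Hypothetical excerpt of a filing; the real layout may differ.
sample_text = "<TYPE>S-1\n<SEQUENCE>1\n<FILENAME>d735023ds1.htm\n<DESCRIPTION>FORM S-1\n"

file_name = extract_file_name(sample_text)
print(file_name)
```

Two points are worth checking against the question's code. The captured name already ends in .htm, so a URL template that appends ".htm" again would double the extension. Also, in Python 3 `str(r.content)` returns the printable representation of a bytes object rather than decoded text, so `r.text` is probably the better input for a line-by-line search.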