Python - Regex - combination of letters and digits (undefined length)

Question

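I am trying to get a file ID from a text file. In the example above, the filename is d735023ds1.htm, and I want to extract it to build another URL. These filenames vary in length, so I need a general regular expression that covers all the possibilities.

Example filenames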

d804478ds1a.htm
d618448ds1a.htm
d618448.htm

My code

for cik in leftover_cik_list:

    r = requests.get(filing.url)
    content = str(r.content)
    fileID = None

    for line in content.split("\n"):
    
        if fileID == None:
            fileIDIndex = line.find("<FILENAME>")
            
            if fileIDIndex != -1:
                trimmedText = line[fileIDIndex:]
                result = RegEx.search(r"^[\w\d.htm]*$", trimmedText)
            
                if result:
                    fileID = result.group()

    print ("fileID",fileID)

    document_link = "https://www.sec.gov/Archives/edgar/data/{0}/{1}/{2}.htm".format(cik, accession_number, fileID)

    print ("Document Link to S-1:", document_link)

Answer 1

import re

...
# Raw string so the backslashes reach the regex engine unchanged.
result = re.search(r'^d\d{1,6}.+\.htm$', trimmedText)
if result:
    fileID = result.group()

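^d = starts with a d

\d{1,6} = looks for 1 to 6 digits; if there can be an unlimited number of digits, replace it with \d{1,}

.+ = wildcard (one or more of any character)

\.htm$ = ends with .htm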
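The pieces above can be checked against the sample filenames from the question. This is only a minimal sketch: the names `PATTERN`, `PATTERN_UNBOUNDED`, and `samples`, and the non-matching name `x123.htm`, are made up for illustration.

```python
import re

# Pattern from the answer above, written as a raw string.
PATTERN = re.compile(r"^d\d{1,6}.+\.htm$")

# Variant with no upper bound on the digit count: \d{1,} is the same as \d+.
PATTERN_UNBOUNDED = re.compile(r"^d\d+.+\.htm$")

# The three filenames from the question plus one name that should not match.
samples = ["d804478ds1a.htm", "d618448ds1a.htm", "d618448.htm", "x123.htm"]

# Keep only the names the pattern accepts.
matched = [name for name in samples if PATTERN.search(name)]
print(matched)
```

Note that `.+` accepts any character, digits included, so a name with more than six digits after the `d` would probably still match the first pattern; the `{1,6}` bound on its own does not reject it.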

Answer 2

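You should try re.match(), which searches for the pattern at the beginning of the input string. Also, your regex is not right: you have to put a backslash (\) before the ., because a dot in a regex means "any character".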

import re
result = re.match(r'[\w]+\.htm', trimmedText)
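To make the two points in this answer concrete, here is a small sketch comparing `re.match()` with `re.search()`, and an escaped dot with an unescaped one. The input strings are assumptions chosen for illustration; in particular, it assumes the line may still begin with the `<FILENAME>` tag, as `trimmedText` in the question appears to.

```python
import re

pattern = r"\w+\.htm"

# Hypothetical inputs: a bare filename, and a line that still carries the tag.
bare_name = "d735023ds1.htm"
tagged_line = "<FILENAME>d735023ds1.htm"

# re.match() only tries the pattern at position 0 of the string.
match_bare = re.match(pattern, bare_name)
match_tagged = re.match(pattern, tagged_line)   # "<" is not a word character

# re.search() scans forward until the pattern fits somewhere.
search_tagged = re.search(pattern, tagged_line)

# An unescaped dot matches any character, so "xhtm" is accepted as well.
loose = re.search(r"\w+.htm", "d735023ds1xhtm")
strict = re.search(r"\w+\.htm", "d735023ds1xhtm")

print(match_bare, match_tagged, search_tagged, loose, strict)
```

If the text being matched does start with the tag, `re.search()` (or slicing the tag off first) may be the safer choice.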

Answer 3

Try this regex:

import re
files = [
    "d804478ds1a.htm",
    "d618448ds1a.htm",
    "d618448.htm"
]
for f in files:
    match = re.search(r"d\w+\.htm", f)
    print(match.group())

d804478ds1a.htm
d618448ds1a.htm
d618448.htm

The above assumes that the filename starts with d, ends with .htm, and contains only letters, digits, and underscores.
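Putting this together with the question's goal, the sketch below pulls the filename stem out of text containing a `<FILENAME>` line and builds the URL from it. It is an illustration under assumptions, not a tested client: `sample_text`, `extract_file_id`, and the `cik` and `accession_number` values are all hypothetical. Two things may be worth checking in the original code. In Python 3, `str(r.content)` gives the repr of a bytes object, so the text contains no real newlines to split on; `r.text` returns a decoded string instead. Also, the URL template already appends `.htm`, so the sketch captures only the part before the extension.

```python
import re

# Hypothetical excerpt of a filing's text; real responses will differ.
sample_text = (
    "<TYPE>S-1\n"
    "<FILENAME>d735023ds1.htm\n"
    "<DESCRIPTION>FORM S-1\n"
)

# Capture the stem (without ".htm") that follows the <FILENAME> tag.
FILENAME_RE = re.compile(r"<FILENAME>\s*(d\w+)\.htm")


def extract_file_id(text):
    """Return the filename stem after the first <FILENAME> tag, or None."""
    match = FILENAME_RE.search(text)
    return match.group(1) if match else None


file_id = extract_file_id(sample_text)

# Placeholder values for illustration only.
cik = "0000000000"
accession_number = "000000000000000000"

document_link = "https://www.sec.gov/Archives/edgar/data/{0}/{1}/{2}.htm".format(
    cik, accession_number, file_id
)
print(document_link)
```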

Answer 1
0 2020-03-20 09:14:35

Answer 2
0 2020-03-20 10:02:22

Answer 3
0 2020-03-20 10:04:21
