Python - Regex - combination of letters and numbers (undefined length)

I am trying to extract a file ID from a text file. In my example the filename is d735023ds1.htm, which I need in order to build another URL. The filenames differ in length, however, so I need a general regex that covers all possibilities.

Example filenames

d804478ds1a.htm
d618448ds1a.htm
d618448.htm

My code

for cik in leftover_cik_list:

    r = requests.get(filing.url)
    content = str(r.content)
    fileID = None

    for line in content.split("\n"):
    
        if fileID == None:
            fileIDIndex = line.find("<FILENAME>")
            
            if fileIDIndex != -1:
                trimmedText = line[fileIDIndex:]
                result = RegEx.search(r"^[\w\d.htm]*$", trimmedText)
            
                if result:
                    fileID = result.group()

    print ("fileID",fileID)

    document_link = "https://www.sec.gov/Archives/edgar/data/{0}/{1}/{2}.htm".format(cik, accession_number, fileID)

    print ("Document Link to S-1:", document_link)
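
import re

...
# Raw string so the backslashes reach the regex engine unchanged.
# ^ and $ anchor the pattern, so trimmedText must start with "d" and end with ".htm".
result = re.search(r'^d\d{1,6}.+\.htm$', trimmedText)
if result:
    fileID = result.group()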

^d = Start with a d

\d{1,6} = Look for 1-6 digits; if there could be any number of digits, replace with \d{1,} (equivalent to \d+)

.+ = One or more of any character (wildcard)

\.htm$ = End in .htm
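
As a quick check, the pattern above can be run against the sample file names from the question. This is only a sketch: the `tagged` string is a hypothetical example of what trimmedText might look like when it still begins with the `<FILENAME>` tag, in which case the `^` anchor prevents a match until the tag is sliced off.

```python
import re

# Pattern from the answer above, written as a raw string.
PATTERN = re.compile(r'^d\d{1,6}.+\.htm$')

# Sample file names taken from the question.
samples = ["d804478ds1a.htm", "d618448ds1a.htm", "d618448.htm"]
matches = [bool(PATTERN.search(s)) for s in samples]
print(matches)

# Hypothetical line that still starts with the tag (an assumption, not real filing data).
tagged = "<FILENAME>d735023ds1.htm"
with_tag = PATTERN.search(tagged)                         # ^d cannot match "<"
without_tag = PATTERN.search(tagged[len("<FILENAME>"):])  # tag sliced off first
print(with_tag, without_tag)
```

If the tag cannot be removed beforehand, dropping the `^` anchor lets re.search find the file name further into the line.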

You should try re.match(), which only looks for a pattern at the beginning of the input string. Your regex also needs fixing: put a backslash before the ., because an unescaped dot means "any character" in regex.

import re
result = re.match(r'\w+\.htm', trimmedText)
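
The two points made here (re.match only tries the start of the string, and an unescaped dot matches any character) can be illustrated with a small sketch. The input strings below are made up for the demonstration and are not taken from a real filing.

```python
import re

name = "d618448ds1a.htm"

# re.match only tries the pattern at position 0 of the string.
at_start = re.match(r"\w+\.htm", name)
after_prefix = re.match(r"\w+\.htm", "<FILENAME>" + name)

# re.search scans forward until the pattern matches somewhere.
found = re.search(r"\w+\.htm", "<FILENAME>" + name)

# An unescaped dot matches any character, so "xhtm" is accepted too.
loose = re.match(r"\w+.htm", "d618448xhtm")
strict = re.match(r"\w+\.htm", "d618448xhtm")

print(at_start, after_prefix, found, loose, strict)
```

So re.match is a reasonable choice only when the string passed in already begins with the file name; otherwise re.search is the safer call.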

Try this regex:

import re
files = [
    "d804478ds1a.htm",
    "d618448ds1a.htm",
    "d618448.htm"
]
for f in files:
    match = re.search(r"d\w+\.htm", f)
    print(match.group())

d804478ds1a.htm
d618448ds1a.htm
d618448.htm

The assumptions above are that the file name starts with a d, ends with .htm, and contains only letters, digits, and underscores.
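
To tie this back to the loop in the question, one possible approach is to match the literal `<FILENAME>` tag and capture what follows it. The sketch below is an illustration under assumptions: the helper name `extract_filename` and the `sample` text are hypothetical, and real filings may be laid out differently. It also captures the stem without the extension, since the URL template in the question already appends .htm.

```python
import re

# Group 1 is the full file name, group 2 is the stem without ".htm".
FILENAME_RE = re.compile(r"<FILENAME>((\w+)\.htm)")

def extract_filename(text):
    """Return (full_name, stem) for the first <FILENAME> entry, or None."""
    m = FILENAME_RE.search(text)
    if m is None:
        return None
    return m.group(1), m.group(2)

# Hypothetical excerpt used only for this demonstration.
sample = "<TYPE>S-1\n<SEQUENCE>1\n<FILENAME>d735023ds1.htm\n<DESCRIPTION>FORM S-1\n"
result = extract_filename(sample)
missing = extract_filename("no tag in this text")
print(result, missing)
```

Unlike the pattern above, this version does not require the name to start with a d; it relies on the tag to locate the file name instead.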
