
Python - Regex - combination of letters and numbers (undefined length)

I am trying to get a file ID from a text file. In my example the filename is d735023ds1.htm, which I want to extract in order to build another URL. The filenames differ in length, however, so I need a single regular expression that covers all the possibilities.

Example filenames

d804478ds1a.htm
d618448ds1a.htm
d618448.htm

My code

for cik in leftover_cik_list:

    r = requests.get(filing.url)
    content = str(r.content)
    fileID = None

    for line in content.split("\n"):
    
        if fileID == None:
            fileIDIndex = line.find("<FILENAME>")
            
            if fileIDIndex != -1:
                trimmedText = line[fileIDIndex:]
                result = RegEx.search(r"^[\w\d.htm]*$", trimmedText)
            
                if result:
                    fileID = result.group()

    print ("fileID",fileID)

    document_link = "https://www.sec.gov/Archives/edgar/data/{0}/{1}/{2}.htm".format(cik, accession_number, fileID)

    print ("Document Link to S-1:", document_link)
import re

...
# Raw string so that \d and \. reach the regex engine unchanged
result = re.search(r'^d\d{1,6}.+\.htm$', trimmedText)
if result:
    fileID = result.group()

^d = Start with a d

\d{1,6} = Look for 1-6 digits; if there could be an unlimited number of digits, replace with \d{1,}

.+ = Wildcard: one or more of any character

\.htm$ = End in .htm
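As a quick check, the pattern can be run against the example filenames from the question. The last sample string is a hypothetical input, added to show that the `^` anchor only works when the text really begins with the filename; if the `<FILENAME>` tag is still attached (as `trimmedText` in the question appears to be), the pattern does not match.

```python
import re

# Pattern from the answer above, written as a raw string.
pattern = re.compile(r'^d\d{1,6}.+\.htm$')

samples = [
    "d804478ds1a.htm",
    "d618448ds1a.htm",
    "d618448.htm",
    "<FILENAME>d618448.htm",  # hypothetical: tag still attached to the text
]

# Map each sample to whether the pattern matched it.
results = {s: bool(pattern.search(s)) for s in samples}
for s, matched in results.items():
    print(s, "->", matched)
```

The three bare filenames match, while the tagged string does not, so the tag would need to be stripped first or the `^` anchor dropped.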

You should try re.match(), which looks for a pattern only at the beginning of the input string. Also, your regex is not quite right: you have to put a backslash before the ., because a dot means "any character" in regex.

import re
# Raw string keeps the backslashes intact for the regex engine
result = re.match(r'[\w]+\.htm', trimmedText)
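To illustrate the difference between the two functions, here is a small sketch comparing re.match() and re.search() on a bare filename and on a line that still carries the `<FILENAME>` tag. The tagged line is an assumption about what the input might look like, not something taken from a real filing.

```python
import re

pattern = r'\w+\.htm'

bare = "d618448.htm"
tagged = "<FILENAME>d618448.htm"  # hypothetical input line

# re.match only attempts a match at position 0 of the string.
match_bare = re.match(pattern, bare)
match_tagged = re.match(pattern, tagged)    # None: '<' is not a word character

# re.search scans forward until it finds a position where the pattern fits.
search_tagged = re.search(pattern, tagged)

print(match_bare.group())
print(match_tagged)
print(search_tagged.group())
```

If the text passed in still begins with the tag, re.search() (or slicing the tag off first) is the safer choice.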

Try this regex:

import re
files = [
    "d804478ds1a.htm",
    "d618448ds1a.htm",
    "d618448.htm"
]
for f in files:
    match = re.search(r"d\w+\.htm", f)
    print(match.group())

d804478ds1a.htm
d618448ds1a.htm
d618448.htm

The assumptions above are that the file name starts with a d, ends with .htm, and contains only letters, digits and underscores in between.
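Putting this together with the question's goal, one possible sketch uses a capturing group to pull out only the part before .htm, since the URL template in the question appends the extension itself. The function name `extract_file_id`, the sample text, and the `cik` and `accession_number` values are all hypothetical placeholders; real filings may be laid out differently.

```python
import re

# Capture the part before ".htm" so the URL template can add the extension.
FILENAME_RE = re.compile(r"<FILENAME>(d\w+)\.htm")


def extract_file_id(text):
    """Return the file name stem after the first <FILENAME> tag, or None."""
    match = FILENAME_RE.search(text)
    return match.group(1) if match else None


# Hypothetical document text, for illustration only.
sample = "<TYPE>S-1\n<FILENAME>d735023ds1.htm\n<DESCRIPTION>FORM S-1\n"
file_id = extract_file_id(sample)

# Placeholder identifiers, not real values.
cik = "0000000000"
accession_number = "000000000000000000"

document_link = "https://www.sec.gov/Archives/edgar/data/{0}/{1}/{2}.htm".format(
    cik, accession_number, file_id
)
print(document_link)
```

Searching the whole text at once avoids the line-by-line loop. Under Python 3 it is probably also worth passing `r.text` rather than `str(r.content)`, because converting bytes with `str()` yields the `b'...'` representation with escaped newlines.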


