Extracting numbers from a string using regex in Python
I have a list of URLs that I want to parse:
['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']
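I would like to use a regex to build a new list containing the digits at the end of each string, together with any letters that appear before the punctuation. (Some strings contain digits in two places, as the first string in the list above shows.) The new list should look like this: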
['20170303', '20160929a', '20161005a']
Here is what I tried, without any luck:
code = re.search(r'?[0-9a-z]*', urls)
Update:
Running:
[re.search(r'(\d+)\D+$', url).group(1) for url in urls]
I get the following error:
AttributeError: 'NoneType' object has no attribute 'group'
Also, it does not seem to pick up a letter that follows the digits, if there is one.
Given:
>>> lios=['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']
You can do:
import re

for s in lios:
    # Capture the last run of digits plus any word characters right after it
    m = re.search(r'(\d+\w*)\D+$', s)
    if m:
        print(m.group(1))
Prints:
20170303
20160929a
20161005a
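Based on this regex:
(\d+\w*)\D+$
 ^^^           one or more digits
    ^^^        zero or more word characters (e.g. a trailing letter)
        ^^^    one or more non-digits (e.g. ".pdf" or ".htm")
           ^   end of string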
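As a minimal sketch, the same pattern can also build the list the question asks for, instead of printing each match. The helper name `extract_suffix` is hypothetical, and the last URL is made up to show the no-match case; items without a match are skipped rather than raising `AttributeError`.

```python
import re

# Same pattern as above, compiled once so it can be reused
PATTERN = re.compile(r'(\d+\w*)\D+$')

def extract_suffix(url):
    """Hypothetical helper: return the captured group, or None if nothing matches."""
    m = PATTERN.search(url)
    return m.group(1) if m else None

lios = [
    'https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf',
    'http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm',
    'http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm',
    'http://www.federalreserve.gov/newsevents/speech/',  # made-up URL without digits
]

# Keep only the URLs that actually produced a match
result = [s for s in map(extract_suffix, lios) if s is not None]
print(result)
```

Returning `None` from the helper keeps the "no match" case explicit, so the caller can decide whether to drop it or report it.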
# python3
import re
from os.path import basename
from urllib.parse import urlparse

def extract_id(url):
    """Return the ID found in the URL's file name, or None if there is none."""
    path = urlparse(url).path
    resource = basename(path)
    # Match from the first digit up to (but not including) the next dot
    _id = re.search(r'\d[^.]*', resource)
    if _id:
        return _id.group(0)
    return None

urls =['https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf','http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm','http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm']
# Careful: the ids list contains None for any URL whose file name has no match
ids = [extract_id(url) for url in urls]
print(ids)
Output:
['20170303', '20160929a', '20161005a']
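To illustrate why the file name is isolated before matching, here is a small sketch that runs the same pattern `\d[^.]*` against the whole URL and against the base name only. It uses the first URL from the question; the variable names `whole` and `name_only` are just for illustration.

```python
import re
from os.path import basename
from urllib.parse import urlparse

url = ('https://www.richmondfed.org/-/media/richmondfedorg/press_room/speeches/'
       'president_jeff_lacker/2017/pdf/lacker_speech_20170303.pdf')

# Searching the whole URL starts at the "2017" directory and runs to the next dot
whole = re.search(r'\d[^.]*', url).group(0)

# Searching only the file name ignores the directory components
name_only = re.search(r'\d[^.]*', basename(urlparse(url).path)).group(0)

print(whole)      # 2017/pdf/lacker_speech_20170303
print(name_only)  # 20170303
```

Parsing the URL first keeps the regex simple, since it never sees the digits in the directory path.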
import re

patterns = {
    'url_refs': re.compile(r"(\d+[a-z]*)\."),  # YCF_L
}

def scan(iterable, pattern=None):
    """Scan for matches in an iterable."""
    for item in iterable:
        # if you want exactly one match per item, unpack it with a comma:
        #     reference, = pattern.findall(item)
        # but it's less reusable.
        matches = pattern.findall(item)
        yield matches
Then you can do the following:
hits = scan(urls, pattern=patterns['url_refs'])
references = (item[0] for item in hits)
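Feed references to your other functions. Because scan and references are both lazy generators, you can go through larger inputs this way, and I suppose it is faster too.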
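One caveat with `item[0]`: `findall` returns an empty list when nothing matches, so indexing it would raise `IndexError`. Below is a minimal sketch of a more forgiving variant. The name `first_matches` is hypothetical, and the last URL is made up to show the no-match case.

```python
import re

URL_REFS = re.compile(r"(\d+[a-z]*)\.")

def first_matches(iterable, pattern):
    """Yield the first match for each item, or None when there is none."""
    for item in iterable:
        matches = pattern.findall(item)
        yield matches[0] if matches else None

urls = [
    'http://www.federalreserve.gov/newsevents/speech/powell20160929a.htm',
    'http://www.federalreserve.gov/newsevents/speech/fischer20161005a.htm',
    'http://www.federalreserve.gov/newsevents/speech/index.htm',  # made up, no digits
]

references = list(first_matches(urls, URL_REFS))
print(references)  # ['20160929a', '20161005a', None]
```

The generator stays lazy; `list(...)` is only used here to show the result.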