Extracting .zip file names from a given URL in Python using regex

Question

I want to extract the .zip file names from a given URL. Here is my code:

import re

print(re.findall(r'href=[\'"]?([^\'" >]+)','<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> <a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>'))

For example:

Input: <a href="http://www.example.com/files/world_data1.zip">World Data Part 1</a> <a href="http://www.example.com/files/world_data2.zip">World Data Part 2</a>

Expected output: world_data1.zip, world_data2.zip

I tried various patterns ending in .zip$, but each one returned an empty list. Can anyone help me with this?

Answer 1

You can use:

import re

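# Sample HTML to search.
html = '&nbsp;<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> <a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>'

# Capture the quoted value of each href attribute; \1 matches the same
# quote character that opened the value.
rx = re.compile(r"""href=(["'])(.*?)\1""")
# Keep the last path segment of each href, but only if it ends with .zip.
links = [filename
    for m in rx.finditer(html)
    for filename in [m.group(2).split('/')[-1]]
    if filename.endswith('.zip')]
print(links)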

This yields:

['world_data1.zip', 'world_data2.zip']

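The idea is to capture the href attribute first, split it on /, and then check whether the last part ends with .zip.
However, consider using a real HTML parser instead, such as BeautifulSoup (which offers CSS selectors rather than XPath) or lxml (which supports XPath queries).
For the expression, see the demo on regex101.com.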
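
As a rough illustration of the parser route suggested above, here is a minimal sketch that uses only the standard library (html.parser and urllib.parse) instead of BeautifulSoup. The names ZipLinkParser and extract_zip_names are hypothetical, chosen for this example only. It also assumes you want to ignore any query string or fragment on the URL and to match the .zip extension case-insensitively, neither of which the question asks for explicitly.

```python
from html.parser import HTMLParser
from posixpath import basename
from urllib.parse import urlsplit


class ZipLinkParser(HTMLParser):
    """Collects the file names of .zip links found in <a href="..."> attributes."""

    def __init__(self):
        super().__init__()
        self.zip_names = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names already lowercased.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Look only at the URL path, so "?x=1" or "#top" is ignored.
                filename = basename(urlsplit(value).path)
                if filename.lower().endswith(".zip"):
                    self.zip_names.append(filename)


def extract_zip_names(html):
    """Return the .zip file names linked from the given HTML string."""
    parser = ZipLinkParser()
    parser.feed(html)
    parser.close()
    return parser.zip_names


html = ('<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> '
        '<a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>')
print(extract_zip_names(html))  # ['world_data1.zip', 'world_data2.zip']
```

Unlike the regex, this version does not depend on how the attribute is quoted or spaced, because the parser normalizes that before handle_starttag is called.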

Answer 2

You can try the following:

import re

s = '&nbsp;<a href="http://www.example.com/files/world_data1.zip"><b>World Data Part 1</b></a> <br/> <a href="http://www.example.com/files/world_data2.zip"><b>World Data Part 2</b></a>'

print(re.findall(r'href="[^"]+?/([^/"]+\.zip)"', s))

Or, more rigorously, use the following:

import os

from pyquery import PyQuery as pq

doc = pq(s)  # `s` is the HTML string defined in the previous snippet.
a_list = doc('a[href]')  # Get all `a` elements that have a `href` attrib.
hrefs = [os.path.basename(a.attrib['href']) for a in a_list]
print(list(filter(lambda x: x.endswith('.zip'), hrefs)))
