Search a large string for a file path and return the path plus file name

Question

I've got a little project where I'm trying to download a series of wallpapers from a web page. I'm new to Python.

I'm using the urllib library, which returns a long string of web page data that includes links such as

<a href="http://website.com/wallpaper/filename.jpg">

I know that every file I need to download has a URL that starts with

'http://website.com/wallpaper/'

How can I search the page source for this portion of text and return the rest of the image link, ending with the ".jpg" extension?

r'http://website.com/wallpaper/ xxxxxx .jpg'

I'm wondering whether I could write a regular expression in which the xxxx portion is not evaluated: just check for the path and the .jpg extension, then return the whole string once a match is found.

Am I on the right track?

Answer 1

I think a very basic regex will do, like:

(http:\/\/website\.com\/wallpaper\/[\w\d_-]*?\.jpg)

If you use $1 (match.group(1) in Python), this will return the whole string.

And if you use

(http:\/\/website\.com\/wallpaper\/([\w\d_-]*?)\.jpg)

then $1 will give the whole string and $2 will give the file name only.

Note: escaping the forward slash ( \/ ) is language dependent, so use what Python supports. Python's re module does not require the forward slash to be escaped.
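As a rough Python sketch of the two-group pattern above, the following uses re.findall on a made-up page source. The sample file names, the variable names, and the find_wallpapers helper are assumptions for illustration only; the character class is written as [\w-] because \w already covers digits and the underscore.

```python
import re

# Sample page source; the file names here are invented for illustration.
page_source = '''
<a href="http://website.com/wallpaper/sunset_01.jpg">Sunset</a>
<a href="http://website.com/about.html">About</a>
<a href="http://website.com/wallpaper/city-night.jpg">City</a>
'''

# Group 1 is the whole URL, group 2 is the file name without the extension.
# Python does not need "/" to be escaped.
WALLPAPER_RE = re.compile(r'(http://website\.com/wallpaper/([\w-]*?)\.jpg)')


def find_wallpapers(text):
    """Return a list of (full_url, file_name) tuples found in text."""
    return WALLPAPER_RE.findall(text)


matches = find_wallpapers(page_source)
for full_url, name in matches:
    print(full_url, name)
```

With more than one group in the pattern, re.findall returns one tuple per match, so the first element plays the role of $1 and the second the role of $2.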

Answer 2

BeautifulSoup is pretty convenient for this sort of thing.

import re
import urllib3
from bs4 import BeautifulSoup

jpg_regex = re.compile(r'\.jpg$')
site_regex = re.compile(r'website\.com/wallpaper/')

pool = urllib3.PoolManager()
response = pool.request('GET', 'http://your_website.com/')
# response.data holds the raw HTML bytes of the page
soup = BeautifulSoup(response.data, 'html.parser')

# <a> tags whose href contains the wallpaper path
site_links = soup.find_all(name='a', attrs={'href': site_regex})

# Keep only the wallpaper links that also end in .jpg
result_list = [a['href'] for a in site_links if jpg_regex.search(a['href'])]

Answer 3

Don't use a regular expression against HTML.

Instead, use an HTML parsing library.

BeautifulSoup is a library for parsing HTML, and urllib2 is a built-in Python 2 module for fetching URLs (in Python 3 the equivalent is urllib.request, used here).

from urllib.request import urlopen  # Python 3 equivalent of urllib2.urlopen
from bs4 import BeautifulSoup as bs

content = urlopen('http://website.com/wallpaper/index.html').read()
html = bs(content, 'html.parser')
links = []  # collected wallpaper URLs

for link in html.find_all('a'):
    href = link.get('href')
    # Skip <a> tags that have no href, and keep only wallpaper links
    if href and '/wallpaper/' in href:
        links.append(href)
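If installing BeautifulSoup is not an option, the same parse-then-filter idea can be sketched with the standard library's html.parser module instead. This swaps in a different parser from the one this answer recommends; the WallpaperLinkParser class, the extract_wallpaper_links helper, and the sample HTML are hypothetical names and data chosen for illustration.

```python
from html.parser import HTMLParser


class WallpaperLinkParser(HTMLParser):
    """Collect href values of <a> tags that point at wallpaper .jpg files."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; href is None if missing
        href = dict(attrs).get('href')
        if href and href.startswith(self.prefix) and href.endswith('.jpg'):
            self.links.append(href)


def extract_wallpaper_links(html_text, prefix='http://website.com/wallpaper/'):
    """Return all wallpaper .jpg URLs found in <a href="..."> attributes."""
    parser = WallpaperLinkParser(prefix)
    parser.feed(html_text)
    return parser.links


sample = (
    '<a href="http://website.com/wallpaper/filename.jpg">one</a>'
    '<a name="anchor-without-href">two</a>'
    '<a href="http://website.com/wallpaper/index.html">three</a>'
)
print(extract_wallpaper_links(sample))
```

Only the first link in the sample passes both checks: the second tag has no href at all, and the third is under the wallpaper path but is not a .jpg file.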

Answer 4

Search for the "http://website.com/wallpaper/" substring in the URL and then check that the URL ends with ".jpg", as shown below:

domain = "http://website.com/wallpaper/"
url = "your URL"
extension = ".jpg"
# Accept the URL only if it contains the wallpaper path and ends in .jpg
if domain in url and url.endswith(extension):
    pass  # do something with url
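To get back the file name the question asks for, the same string checks can be wrapped in a small helper and applied to a list of candidate URLs. This is only a sketch: the wallpaper_file_name function and the candidate URLs are assumptions made for illustration, and it uses startswith rather than a plain substring test so that the prefix can be sliced off safely.

```python
PREFIX = "http://website.com/wallpaper/"
EXTENSION = ".jpg"


def wallpaper_file_name(url):
    """Return the file name if url is a wallpaper .jpg link, else None."""
    if url.startswith(PREFIX) and url.endswith(EXTENSION):
        # Everything after the known prefix is the file name
        return url[len(PREFIX):]
    return None


# Hypothetical candidate URLs, e.g. hrefs already pulled from the page
candidates = [
    "http://website.com/wallpaper/filename.jpg",
    "http://website.com/wallpaper/index.html",
    "http://other.com/picture.jpg",
]

file_names = [
    name for name in map(wallpaper_file_name, candidates) if name is not None
]
print(file_names)
```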


4 answers

Answer 1
3 2015-06-02 04:58:15

Answer 2
3 accepted 2015-06-02 05:07:00

Answer 3
3 2015-06-02 05:25:17

Answer 4
2 2015-06-02 05:31:58
