
regex python simple findall start and end points known

import bs4
from urllib.request import urlopen
import re
import os
html=urlopen('https://www.flickr.com/search/?text=dog')
soup=bs4.BeautifulSoup(html,'html.parser')
print(soup.title)
x=soup.text
y=[]
for i in re.findall('c1.staticflickr.com\.jpg',x):
    print(i)

I know the images start with c1.staticflickr.com and end with .jpg. How can I print each image link? (I'm a bit rusty on regex; I tried adding some things but it didn't work.)

There are two ways to gather what you want, but regex seems the better fit here because the URLs have a canonical format. Extracting the URLs with bs4 is a bit more involved, since they sit inside the style attribute of each tile.

import bs4
import requests
import re

# Fetch the search page, then pull URLs out with a regex.
# The dots are escaped, and the non-greedy .*? stops each match at the first .jpg.
resp = requests.get('https://www.flickr.com/search/?text=dog')
html = resp.text
result = re.findall(r'c1\.staticflickr\.com/.*?\.jpg', html)
print(len(result))
print(result[:5])

# Alternative: parse with bs4 and apply the regex to each tile's style attribute,
# since the image URLs live in inline background-image styles.
soup = bs4.BeautifulSoup(html, 'html.parser')
result2 = [re.findall(r'c1\.staticflickr\.com/.*?\.jpg', ele.get("style"))[0]
           for ele in soup.find_all("div", class_="view photo-list-photo-view requiredToShowOnServer awake")]
print(len(result2))
print(result2[:5])
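Two details make the pattern work: the dots in the host name are escaped so `.` matches a literal dot, and `.*?` is non-greedy so each match stops at the first `.jpg` instead of spanning several URLs. A minimal offline sketch of the difference (the sample HTML snippet below is made up for illustration):

```python
import re

# Made-up snippet with two image URLs in inline styles, like the search page markup.
sample = ('<div style="background-image:url(//c1.staticflickr.com/1/abc/1_m.jpg)"></div>'
          '<div style="background-image:url(//c1.staticflickr.com/2/def/2_n.jpg)"></div>')

greedy = re.findall(r'c1\.staticflickr\.com/.*\.jpg', sample)   # .* runs to the LAST .jpg
lazy = re.findall(r'c1\.staticflickr\.com/.*?\.jpg', sample)    # .*? stops at the FIRST .jpg

print(greedy)  # one over-long match spanning both URLs
print(lazy)    # two clean matches, one per URL
```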

Edit: you can get the extra information through this special URL instead of using selenium. I have not checked whether it also returns the information from page one.

import requests

url = "https://api.flickr.com/services/rest?sort=relevance&parse_tags=1&content_type=7&extras=can_comment,count_comments,count_faves,description,isfavorite,license,media,needs_interstitial,owner_name,path_alias,realname,rotation,url_c,url_l,url_m,url_n,url_q,url_s,url_sq,url_t,url_z&per_page={per_page}&page={page}&lang=en-US&text=dog&viewerNSID=&method=flickr.photos.search&csrf=&api_key=352afce50294ba9bab904b586b1b4bbd&format=json&hermes=1&hermesClient=1&reqId=c1148a88&nojsoncallback=1"

with requests.Session() as s:
    # resp = s.get(url.format(per_page=100, page=1))
    resp2 = s.get(url.format(per_page=100, page=2))

    for each in resp2.json().get("photos").get("photo")[:5]:
        print(each.get("url_n_cdn"))
        print(each.get("url_m"))  # more URL sizes exist in the JSON: url_q, url_s, url_sq, url_t, url_z
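The loop above assumes the response is shaped like `{"photos": {"photo": [...]}}`, with per-photo `url_*` fields supplied by the `extras` query parameter. That assumed structure can be illustrated offline with a mocked response (field values invented for illustration):

```python
# Mocked response mirroring the assumed flickr.photos.search JSON shape.
mock = {
    "photos": {
        "photo": [
            {"url_m": "https://live.staticflickr.com/1/a_m.jpg",
             "url_n_cdn": "https://c1.staticflickr.com/1/a_n.jpg"},
            {"url_m": "https://live.staticflickr.com/2/b_m.jpg",
             "url_n_cdn": "https://c1.staticflickr.com/2/b_n.jpg"},
        ]
    }
}

# Same access pattern as the request loop above, minus the network call.
urls = [each.get("url_m") for each in mock.get("photos").get("photo")]
print(urls)
```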

Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site or link to the original. For any questions, contact: yoyou2525@163.com.

 