Python 3: How to extract image URLs?
The URLs I want to extract all have the same pattern:
"begin" : "url_I_want_extract"
They look like:
"begin" : "https://k2.website.com/images/0x0/0x0/0/16576946054146395951.jpeg"
"begin" : "https://k2.website.com/images/0x0/0x0/0/9460365509030976330.jpeg"
"begin" : "https://k2.website.com/images/0x0/0x0/0/9361112829030898475.jpeg"
"begin" : "https://k3.website.com/images/0x0/0x0/0/14705723619301900580.jpeg"
"begin" : "https://k3.website.com/images/8x36/922x950/0/1368601155311066426.jpeg"
I used this code to extract them, but I am getting unexpected results:
r = re.findall('https://k(.?).website.com/images/0x0/0x0/0/(.*?).jpeg', response.text)
The output I got:
[('2', '16576946054146395951'), ('2', '9460365509030976330'), ('2', '9361112829030898475'), ('3', '14705723619301900580')]
The output I want:
https://k2.website.com/images/0x0/0x0/0/16576946054146395951.jpeg
https://k2.website.com/images/0x0/0x0/0/9460365509030976330.jpeg
https://k2.website.com/images/0x0/0x0/0/9361112829030898475.jpeg
https://k3.website.com/images/0x0/0x0/0/14705723619301900580.jpeg
https://k3.website.com/images/8x36/922x950/0/1368601155311066426.jpeg
How can I use a regex to scrape the URLs that follow the word "begin"? Thank you :)
The parentheses surround the capturing groups, and when a pattern contains groups, findall returns only those groups. Right now your capturing groups are k(.?) and (.*?).jpeg. Remove those parentheses and instead capture the entire URL.
Also, to match both the URLs with "/0x0/0x0/0/" and the ones with "/8x36/922x950/0/", replace "/0x0/0x0/0/" in the regex with "/.*/.*/.*/":
r = re.findall('(https://k.?.website.com/images/.*/.*/.*/.*?.jpeg)', response.text)
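To see the difference the parentheses make, here is a minimal sketch. It assumes the page text looks like the samples in the question; `sample` is a stand-in for `response.text`. The escaped dots and the `[^/"]+` path segments are a tightening of my own, not part of the answer above: they keep a match from running past a slash or a closing quote when several URLs share one line.

```python
import re

# Stand-in for response.text, modelled on the URLs shown in the question.
sample = '''
"begin" : "https://k2.website.com/images/0x0/0x0/0/16576946054146395951.jpeg"
"begin" : "https://k3.website.com/images/8x36/922x950/0/1368601155311066426.jpeg"
'''

# With capturing groups, findall returns a tuple of the groups for each match.
with_groups = re.findall(
    r'https://k(.?)\.website\.com/images/[^/"]+/[^/"]+/[^/"]+/([^/"]+?)\.jpeg',
    sample,
)

# With no groups at all, findall returns each whole match as a string.
whole_urls = re.findall(
    r'https://k.?\.website\.com/images/[^/"]+/[^/"]+/[^/"]+/[^/"]+\.jpeg',
    sample,
)

print(with_groups)
for url in whole_urls:
    print(url)
```

The first call reproduces the tuple output from the question; the second returns the complete URLs, including the one under "/8x36/922x950/0/".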
This one may do the trick for a more general server path layout:
https?.*(jpeg|jpg|png|tiff|gif)
Start the match at http (with an optional 's' for SSL servers) and end it at an image file extension. (Please note that I included 5 types just as an example...)
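One caveat when this general pattern is passed to `re.findall`: the parenthesised extension list is itself a capturing group, so findall returns only the extension, and the greedy `.*` can swallow several URLs on one line. The sketch below shows that behaviour and one possible adjustment using a non-capturing group `(?:...)`. The `example.com` URL and the variable names are made up for illustration, and the adjusted pattern is my assumption of what was intended, not the answerer's own.

```python
import re

# Two URLs on a single line; the second one is a hypothetical example.
sample = ('"begin" : "https://k2.website.com/images/0x0/0x0/0/9460365509030976330.jpeg" '
          '"begin" : "http://example.com/pics/logo.png"')

# As written: one greedy match across the whole line, and findall reports
# only the captured extension of that match.
as_written = re.findall(r'https?.*(jpeg|jpg|png|tiff|gif)', sample)

# Adjusted: (?:...) groups without capturing, and [^\s"]+? stops the match
# from crossing whitespace or a closing quote.
generic_urls = re.findall(r'https?://[^\s"]+?\.(?:jpeg|jpg|png|tiff|gif)', sample)

print(as_written)
for url in generic_urls:
    print(url)
```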
Hope that helps!
I think what you're asking for is to extract only the URLs that come after "begin" :. For this you'd want:
r = re.findall('"begin" : "(https://k.*?.jpeg)"', response.text)
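Here the single pair of parentheses is what makes findall return just the URL and not the surrounding "begin" text. A small sketch of the same idea follows, assuming the response is JSON-like text; the "other" key and the `sample` string are invented to show that non-"begin" entries are skipped. The `\s*` around the colon and the `[^"]*?` inside the group are optional hardening I added, in case the spacing varies or several entries share a line.

```python
import re

# Stand-in for response.text; the "other" entry is hypothetical.
sample = '''{
"begin" : "https://k2.website.com/images/0x0/0x0/0/16576946054146395951.jpeg",
"other" : "https://k2.website.com/images/0x0/0x0/0/ignored.jpeg",
"begin" : "https://k3.website.com/images/8x36/922x950/0/1368601155311066426.jpeg"
}'''

# Only the parenthesised part is returned by findall; the "begin" key and
# the quotes are matched but not captured.
pattern = re.compile(r'"begin"\s*:\s*"(https://k[^"]*?\.jpeg)"')
urls = pattern.findall(sample)

for url in urls:
    print(url)
```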