Python 3: How to extract image URLs?

Question

The URLs I want to extract all follow the same pattern:

"begin" : "url_I_want_extract"

They look like:

"begin" : "https://k2.website.com/images/0x0/0x0/0/16576946054146395951.jpeg"
"begin" : "https://k2.website.com/images/0x0/0x0/0/9460365509030976330.jpeg"
"begin" : "https://k2.website.com/images/0x0/0x0/0/9361112829030898475.jpeg"
"begin" : "https://k3.website.com/images/0x0/0x0/0/14705723619301900580.jpeg"
"begin" : "https://k3.website.com/images/8x36/922x950/0/1368601155311066426.jpeg"

I used this code to extract them, but I'm getting unexpected results.

r = re.findall('https://k(.?).website.com/images/0x0/0x0/0/(.*?).jpeg', response.text)

The output I got:

 [('2', '16576946054146395951'), ('2', '9460365509030976330'), ('2', '9361112829030898475'), ('3', '14705723619301900580')]

The output I want:

https://k2.website.com/images/0x0/0x0/0/16576946054146395951.jpeg
https://k2.website.com/images/0x0/0x0/0/9460365509030976330.jpeg
https://k2.website.com/images/0x0/0x0/0/9361112829030898475.jpeg
https://k3.website.com/images/0x0/0x0/0/14705723619301900580.jpeg
https://k3.website.com/images/8x36/922x950/0/1368601155311066426.jpeg

How can I use a regex to scrape the URLs that follow the "begin" key? Thank you :)

Answer 1

The parentheses surround capturing groups, and when a pattern contains groups, findall returns the groups instead of the whole match. Right now your capturing groups are (.?) and (.*?). Remove those parentheses and capture the entire URL instead.

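Also, to match both the URLs with "/0x0/0x0/0/" and those with "/8x36/922x950/0/", replace "/0x0/0x0/0/" in the regex with "/[^/]+/[^/]+/[^/]+/". (A plain "/.*/.*/.*/" also works when each URL sits on its own line, but .* is greedy and can run across several URLs on one line.) Escaping the literal dots keeps them from matching arbitrary characters:

r = re.findall(r'(https://k.?\.website\.com/images/[^/]+/[^/]+/[^/]+/.*?\.jpeg)', response.text)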
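
To make the effect of the capturing groups concrete, here is a minimal sketch. The `sample` string is an assumed stand-in for `response.text`, built from two of the URLs in the question; the real response may be laid out differently.

```python
import re

# Hypothetical stand-in for response.text, using two URLs from the question.
sample = (
    '"begin" : "https://k2.website.com/images/0x0/0x0/0/16576946054146395951.jpeg" '
    '"begin" : "https://k3.website.com/images/8x36/922x950/0/1368601155311066426.jpeg"'
)

# Two capturing groups: findall returns one tuple of group contents per match,
# and the hard-coded /0x0/0x0/0/ path skips the second URL.
with_groups = re.findall(
    r'https://k(.?)\.website\.com/images/0x0/0x0/0/(.*?)\.jpeg', sample
)

# One group around the whole pattern: findall returns the full URL strings.
# [^/]+ matches a single path segment, so a match cannot span two URLs.
whole_url = re.findall(
    r'(https://k.?\.website\.com/images/[^/]+/[^/]+/[^/]+/.*?\.jpeg)', sample
)

print(with_groups)
for url in whole_url:
    print(url)
```

The first call reproduces the tuple output from the question; the second returns plain strings that can be printed or downloaded directly.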

Answer 2

This one may do the trick for a more general server path layout:

https?://.*?\.(?:jpeg|jpg|png|tiff|gif)

The match starts at http (with an optional 's' for SSL servers) and ends at an image file extension. The lazy .*? stops at the first extension it finds, and the non-capturing group (?:...) makes findall return the whole URL rather than only the extension. (Please note that I included five types just as an example.)

Hope that helps!
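
A minimal sketch of a generic image-URL pattern used with findall follows. The `sample` text, the "thumb" key, and the .png URL are made up for illustration. Note that the extension alternatives sit in a non-capturing group, `(?:...)`, so that findall returns the whole match, and `[^\s"]` keeps a match from running past the closing quote.

```python
import re

# Hypothetical input; the "thumb" entry and its .png URL are invented for the demo.
sample = (
    '"begin" : "https://k2.website.com/images/0x0/0x0/0/9460365509030976330.jpeg", '
    '"thumb" : "http://k3.website.com/images/8x36/922x950/0/1368601155311066426.png"'
)

# http or https, then any run of non-space, non-quote characters (lazy),
# then a dot and one of the listed image extensions.
image_url = re.compile(r'https?://[^\s"]+?\.(?:jpeg|jpg|png|tiff|gif)')

urls = image_url.findall(sample)
for url in urls:
    print(url)
```

This version is not tied to one host or path layout, at the cost of also picking up image URLs that do not follow "begin".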

Answer 3

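I think what you're asking for is to extract only the URLs that follow "begin" : . Since the pattern below has a single capturing group, findall returns just the URL inside the quotes. For this you'd want:

r = re.findall(r'"begin" : "(https://k.*?\.jpeg)"', response.text)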
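
As a sketch of how anchoring on the key behaves, the example below runs a similar pattern on an assumed sample. The "end" entry is hypothetical and is there only to show that URLs under other keys are skipped; `\s*` is a small liberty that tolerates different spacing around the colon.

```python
import re

# Hypothetical input; the "end" line is invented to show it is ignored.
sample = '''
"begin" : "https://k2.website.com/images/0x0/0x0/0/16576946054146395951.jpeg"
"end" : "https://k2.website.com/images/0x0/0x0/0/9361112829030898475.jpeg"
"begin" : "https://k3.website.com/images/8x36/922x950/0/1368601155311066426.jpeg"
'''

# The only capturing group wraps the URL, so findall returns a list of URLs.
# [^"]*? keeps the match inside the quoted value.
begin_url = re.compile(r'"begin"\s*:\s*"(https://k[^"]*?\.jpeg)"')

urls = begin_url.findall(sample)
for url in urls:
    print(url)
```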

3 answers

solution1
2 2016-08-20 01:46:22

solution2
1 2016-08-20 03:33:42

solution3
1 ACCEPTED 2016-08-20 03:38:46
