
Python bs4: Get only the URLs that have a certain string in it

I am making an image scraper and want to take some of the photos from the link below and save them in a folder named dribblephotos: https://dribbble.com/search/shots/popular/illustration?q=sneaker%20

Here are the links I've retrieved:

https://static.dribbble.com/users/458522/screenshots/6040912/nike_air_huarache_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/105681/screenshots/3944640/hype_1x.png
https://static.dribbble.com/users/105681/avatars/mini/avatar-01-01.png?1377980605
https://static.dribbble.com/users/923409/screenshots/7179093/basketball_marly_gallardo_1x.jpg
https://static.dribbble.com/users/923409/avatars/mini/bc17b2db165c31804e1cbb1d4159462a.jpg?1596192494
https://static.dribbble.com/users/458522/screenshots/6034458/nike_air_jordan_i_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/1237425/screenshots/5071294/customize_air_jordan_web_2x.png
https://static.dribbble.com/users/1237425/avatars/mini/87ae45ac7a07dd69fe59985dc51c7f0f.jpeg?1524130139
https://static.dribbble.com/users/1174720/screenshots/6187664/adidas_2x.png
https://static.dribbble.com/users/1174720/avatars/mini/9de08da40078e869f1a680d2e43cdb73.png?1588733495
https://static.dribbble.com/users/179617/screenshots/4426819/ultraboost_1x.png
https://static.dribbble.com/users/179617/avatars/mini/2d545dc6c0dffc930a2b20ca3be88802.jpg?1596735027
https://static.dribbble.com/users/458522/screenshots/6126041/nike_air_max_270_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/60266/screenshots/6698826/nike_shoe_2x.jpg
https://static.dribbble.com/users/60266/avatars/mini/64826d925db1d4178258d17d8826842b.png?1549028805
https://static.dribbble.com/users/78464/screenshots/4950025/8x600_1x.jpg
https://static.dribbble.com/users/78464/avatars/mini/a9ae6a559ab479d179e8bd22591e4028.jpg?1465908886
https://static.dribbble.com/users/458522/screenshots/6118702/adidas_nmd_r1_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/458522/screenshots/6098953/nike_lebron_10_je_icon_qs_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/879147/screenshots/7152093/img_0966_2x.png
https://static.dribbble.com/users/879147/avatars/mini/e095f3837f221bb2ef652dcc966b99f7.jpg?1568473177
https://static.dribbble.com/users/458522/screenshots/6128979/nerd_x_adidas_pharrell_hu_nmd_trail_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/879147/screenshots/11064235/26fa4a2d-9033-4953-b48f-4c0e8a93fc9d_2x.png
https://static.dribbble.com/users/879147/avatars/mini/e095f3837f221bb2ef652dcc966b99f7.jpg?1568473177
https://static.dribbble.com/users/458522/screenshots/6132938/nike_moon_racer_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/1823684/screenshots/5973495/jordannn1_2x.png
https://static.dribbble.com/users/1823684/avatars/mini/f6041c082aec67302d4b78b8d203f02b.png?1509719582
https://static.dribbble.com/users/552027/screenshots/4666241/airmax270_1x.jpg
https://static.dribbble.com/users/552027/avatars/mini/35bb0dcb5a6619f68816290898bff6cc.jpg?1535884243
https://static.dribbble.com/users/458522/screenshots/6044426/adidas_pharrell_hu_nmd_trail_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/220914/screenshots/11295053/woman_shoe_tree_floating2_2x.png
https://static.dribbble.com/users/220914/avatars/mini/d364a9c166edb6d96cc059a836219a7d.jpg?1590773568
https://static.dribbble.com/users/4040486/screenshots/7079508/___2x.png
https://static.dribbble.com/users/4040486/avatars/mini/f31e9b50df877df815177e2015135ff7.png?1582521697
https://static.dribbble.com/users/57602/screenshots/12909636/d2_2x.png
https://static.dribbble.com/users/57602/avatars/mini/b4c27f3be2c61d82fbc821433d058b04.jpg?1575089000
https://static.dribbble.com/users/458522/screenshots/6049522/nike_x_john_elliott_lebron_10_soldier_1x.jpg
https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767
https://static.dribbble.com/users/1025917/screenshots/9738550/vans-2020-pixelwolfie-dribbble_2x.png
https://static.dribbble.com/users/1025917/avatars/mini/87fdcb145eab0b47eda29fc873f25f8c.png?1594466719
https://static.dribbble.com/assets/icon-backtotop-1b04df73090f6b0f3192a3b71874ca3b3cc19dff16adc6cf365cd0c75897f6c0.png
https://static.dribbble.com/assets/dribbble-ball-icon-e94956d5f010d19607348176b0ae90def55d61871a43cb4bcb6d771d8d235471.svg
https://static.dribbble.com/assets/icon-shot-x-light-40c073cd65443c99d4ac129b69bf578c8cf97d69b78990c00c4f8c5873b0d601.png
https://static.dribbble.com/assets/icon-shot-prev-light-ca583c76838d54eca11832ebbcaba09ba8b2bf347de2335341d244ecb9734593.png
https://static.dribbble.com/assets/icon-shot-next-light-871a18220c4c5a0325d1353f8e4cc204c3b49beacc63500644556faf25ded617.png
https://static.dribbble.com/assets/dribbble-square-c8c7a278e96146ee5a9b60c3fa9eeba58d2e5063793e2fc5d32366e1b34559d3.png
https://static.dribbble.com/assets/dribbble-ball-192-ec064e49e6f63d9a5fa911518781bee0c90688d052a038f8876ef0824f65eaf2.png
https://static.dribbble.com/assets/icon-overlay-x-2x-b7df2526b4c26d4e8410a7c437c433908be0c7c8c3c3402c3e578af5c50cf5a5.png

However, I only want to grab the URLs that contain the string "screenshots", so I tried writing a function that picks out just those images. For example:

https://static.dribbble.com/users/923409/screenshots/7179093/basketball_marly_gallardo_1x.jpg

To check whether this even worked, I first wrote a function that prints only the links I want. However, it didn't work. Here is the function:

def art_links():
    images = []
    for img in x:
        images.append(img['src'])
    images = soup2.find_all("screenshots")
    print(images)

Here is my full code:

from bs4 import BeautifulSoup
import requests as rq 
import os 

r2 = rq.get("https://dribbble.com/search/shots/popular/illustration?q=sneaker%20")
soup2 = BeautifulSoup(r2.text, "html.parser")

links = []

x = soup2.select('img[src^="https://static.dribbble.com"]')

for img in x: 
    links.append(img['src'])

def art_links():
    images = []
    for img in x:
        images.append(img['src'])
    images = soup2.find_all("screenshots")
    print(images)
    

os.mkdir('dribblephotos') 


for index, img_link in enumerate(links):
    if "screenshots" in images:
    img_data = r.get(img_link).content
        with open("dribblephotos/" + str(index + 1) + '.jpg', 'wb+') as f:
            f.write(img_data)
        
    else:
        break
art_links()

There is a syntax problem in the if statement at the end of your code (the line under the if is not indented), so I reformatted it a bit to get it closer to what you wanted. I think the main problem is the break in the else branch of your final for loop: as soon as one link does not contain "screenshots", the loop stops entirely instead of moving on to the next link. There is a keyword, continue, that could be used here, but it is enough to simply leave out the else branch. You are also checking for "screenshots" in images, but the variable that holds each link in your for loop is img_link. In addition, the download call uses r.get, although requests is imported as rq. Try this for the for loop at the end and see what you get:

for index, img_link in enumerate(links):
    # Only download links that contain "screenshots"; other links are simply skipped
    if "screenshots" in img_link:
        img_data = rq.get(img_link).content
        with open("dribblephotos/" + str(index + 1) + '.jpg', 'wb+') as f:
            f.write(img_data)
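
To see why the else: break ends the loop early, here is a small sketch that downloads nothing and only runs the two loop shapes over the first three links from the question. The function names with_break and without_else are made up for this illustration.

```python
# First three image links from the question, in page order
sample_links = [
    "https://static.dribbble.com/users/458522/screenshots/6040912/nike_air_huarache_1x.jpg",
    "https://static.dribbble.com/users/458522/avatars/mini/0e524c2621e12569378282793e1ce72b.png?1580329767",
    "https://static.dribbble.com/users/105681/screenshots/3944640/hype_1x.png",
]


def with_break(links):
    """Hypothetical helper: mirrors the original loop, which breaks on the first non-match."""
    kept = []
    for link in links:
        if "screenshots" in link:
            kept.append(link)
        else:
            break  # leaves the loop entirely, so later links are never checked
    return kept


def without_else(links):
    """Hypothetical helper: the same loop with the else branch removed."""
    kept = []
    for link in links:
        if "screenshots" in link:
            kept.append(link)
    return kept


print(len(with_break(sample_links)))    # 1: the loop stops at the avatar link
print(len(without_else(sample_links)))  # 2: the avatar link is just skipped
```

Because screenshot and avatar links alternate on the page, the version with break would stop after the very first image.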

If you need the links themselves rather than the downloaded files, you can collect them in the same for loop by appending each link to a new list whenever it is a screenshot link.
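
As a rough sketch of that idea, the filtering can be done on plain strings without touching the network. The helper name screenshot_links and its keyword parameter are assumptions for this example, and the sample list is a handful of the links from the question.

```python
def screenshot_links(urls, keyword="screenshots"):
    """Hypothetical helper: return only the URLs that contain `keyword`."""
    return [url for url in urls if keyword in url]


# A few of the links from the question
all_links = [
    "https://static.dribbble.com/users/923409/screenshots/7179093/basketball_marly_gallardo_1x.jpg",
    "https://static.dribbble.com/users/923409/avatars/mini/bc17b2db165c31804e1cbb1d4159462a.jpg?1596192494",
    "https://static.dribbble.com/assets/icon-backtotop-1b04df73090f6b0f3192a3b71874ca3b3cc19dff16adc6cf365cd0c75897f6c0.png",
    "https://static.dribbble.com/users/1174720/screenshots/6187664/adidas_2x.png",
]

# Prints the two screenshot links and leaves out the avatar and the icon
for link in screenshot_links(all_links):
    print(link)
```

In the scraper itself, the list passed in would be the src values collected from the img tags.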

UPDATE: This newest version works for me. I removed the function that filtered the links after they had already been collected in a loop, since it meant going through the images twice. One for loop is all you need: I check each URL the first time it is iterated over and only add it to the links list if it is a screenshot link.

from bs4 import BeautifulSoup
import requests as rq
import os

r2 = rq.get("https://dribbble.com/search/shots/popular/illustration?q=sneaker%20")
soup2 = BeautifulSoup(r2.text, "html.parser")

links = []

x = soup2.select('img[src^="https://static.dribbble.com"]')

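# exist_ok=True avoids a FileExistsError when the folder is already there on a second run
os.makedirs('dribblephotos', exist_ok=True)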

# Only one for loop required, shouldn't iterate twice if not required
for index, img in enumerate(x):
    # Store the current url from the image result
    url = img["src"]
    # Check the url for screenshot before putting in the links
    if "screenshot" in url:
        links.append(img['src'])
        # Download the image
        img_data = rq.get(url).content
        # Put the image into the file
        with open("dribblephotos/" + str(index + 1) + '.jpg', 'wb+') as f:
            f.write(img_data)

print(links)
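
One thing to be aware of: the loop above saves every file with a .jpg extension, although several of the screenshot links end in .png. A possible way around that, sketched below, is to take the extension from the URL itself. The helper name local_name and its folder parameter are assumptions for this example and not part of the original code.

```python
import os
from urllib.parse import urlparse


def local_name(url, number, folder="dribblephotos"):
    """Hypothetical helper: build a numbered file path that keeps the URL's extension."""
    # urlparse(...).path leaves out any query string such as "?1580329767"
    path = urlparse(url).path
    # Fall back to .jpg when the URL path has no extension at all
    extension = os.path.splitext(path)[1] or ".jpg"
    return os.path.join(folder, str(number) + extension)


print(local_name("https://static.dribbble.com/users/105681/screenshots/3944640/hype_1x.png", 2))
```

In the loop above it could replace the hard-coded path, for example open(local_name(url, index + 1), 'wb+').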


 