I'm making a simple program with Selenium that goes to Flickr.com, searches for the term the user inputs, and then prints out the URLs of all those images.
I'm struggling with the final part: getting just the URLs of the images. I've been using the class_= search to get the portion of the HTML where the URLs are. This returns the following multiple times when searching for 'apples':
<div class="view photo-list-photo-view requiredToShowOnServer awake"
data-view-signature="photo-list-photo-view__engagementModelName_photo-lite-
models__excludePeople_false__id_6246270647__interactionViewName_photo-list-
photo-interaction- view__isOwner_false__layoutItem_1__measureAFT_true__model_1__modelParams_1_ _parentContainer_1__parentSignature_photolist-
479__requiredToShowOnClient_true__requiredToShowOnServer_true__rowHeightMod _1__searchTerm_apples__searchType_1__showAdvanced_true__showSort_true__show Tools_true__sortMenuItems_1__unifiedSubviewParams_1__viewType_jst"
style="transform: translate(823px, 970px); -webkit-transform: translate(823px, 970px); -ms-transform: translate(823px, 970px); width:
237px; height: 178px; background-image:
url(//c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg)">
<div class="interaction-view"></div>
All I want is for the URL of each image to be like this:
c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg
Since there are no a or href tags, I'm struggling to filter them out.
I tried doing some regex as well at the end such as the following:
print(soup.find_all(re.compile(r'^url\.jpg$')))
But that didn't work.
Here's my full code below anyway, thanks.
import os
import re
import urllib.request as urllib2
import bs4
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
os.makedirs('My_images', exist_ok=True)
browser = webdriver.Chrome()
browser.implicitly_wait(10)
print("Opening Flickr.com")
siteChoice = 'http://www.flickr.com'
browser.get(siteChoice)
print("Enter your search term: ")
term = input("> ")
searchField = browser.find_element_by_id('search-field')
searchField.send_keys(term)
searchField.submit()
url = siteChoice + '/search/?text=' + term
html = urllib2.urlopen(url)
soup = bs4.BeautifulSoup(html, "html.parser")
print(soup.find_all(class_='view photo-list-photo-view requiredToShowOnServer awake', style = re.compile('staticflickr')))
My changed code:
p = re.compile(r'url\(\/\/([^\)]+)\)')
test_str = str(soup)
all_urls = re.findall(p, test_str)
print('Exporting to file')
with open('flickr_urls.txt', 'w') as f:
    for i in all_urls:
        f.write("%s\n" % i)
print('Done')
Try this pattern:
url\(\/\/([^\)]+)\)
import re
p = re.compile(r'url\(\/\/([^\)]+)\)')
test_str = u"<div class=\"view photo-list-photo-view requiredToShowOnServer awake\" \ndata-view-signature=\"photo-list-photo-view__engagementModelName_photo-lite-\nmodels__excludePeople_false__id_6246270647__interactionViewName_photo-list-\nphoto-interaction- view__isOwner_false__layoutItem_1__measureAFT_true__model_1__modelParams_1_ _parentContainer_1__parentSignature_photolist-\n479__requiredToShowOnClient_true__requiredToShowOnServer_true__rowHeightMod _1__searchTerm_apples__searchType_1__showAdvanced_true__showSort_true__show Tools_true__sortMenuItems_1__unifiedSubviewParams_1__viewType_jst\"\n style=\"transform: translate(823px, 970px); -webkit-transform: translate(823px, 970px); -ms-transform: translate(823px, 970px); width:\n 237px; height: 178px; background-image:\n url(//c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg)\">\n<div class=\"interaction-view\"></div>"
m = re.search(p, test_str)
print(m.group(1))
Output:
c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg
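If the page contains many such div elements, the same pattern can be applied to every style attribute instead of to the whole document. Below is a minimal sketch using only the standard library. The class name StyleUrlCollector and the sample markup are made up for illustration, and it assumes the image URLs always appear as url(//...) inside a style attribute, as in the snippet from the question.

```python
import re
from html.parser import HTMLParser

# Matches url(//host/path) and captures everything after the leading slashes.
URL_PATTERN = re.compile(r'url\(//([^)]+)\)')


class StyleUrlCollector(HTMLParser):
    """Hypothetical helper: collects url(//...) values found in style attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == 'style' and value:
                self.urls.extend(URL_PATTERN.findall(value))


# Made-up sample markup, shortened from the snippet in the question.
sample_html = '''
<div class="view photo-list-photo-view" style="width: 237px;
 background-image: url(//c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg)">
<div class="interaction-view"></div>
<div class="view photo-list-photo-view" style="background-image: url(//example.invalid/a/b_m.jpg)">
</div>
'''

collector = StyleUrlCollector()
collector.feed(sample_html)
for url in collector.urls:
    print(url)
```

Restricting the search to style attributes avoids picking up url(...) text that happens to appear in scripts or stylesheets elsewhere on the page.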
To scrape all the png/jpg links from a page with Selenium:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("https://www.flickr.com/")
links = driver.execute_script(
    "return document.body.innerHTML.match("
    r"/https?:\/\/[a-z_\/0-9\-\#=&.\@]+\.(jpg|png)/gi)")
print(links)
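The same match can also be done on the Python side, for example against the string returned by driver.page_source. The sketch below mirrors the JavaScript pattern with the re module. The sample_html string is a made-up stand-in for real page source. Note that this pattern only matches absolute http(s) URLs, so protocol-relative values such as url(//c3.staticflickr.com/...) from the question would not be picked up by it.

```python
import re

# Python counterpart of the JavaScript pattern above:
# absolute http(s) URLs ending in .jpg or .png, case-insensitive.
IMAGE_URL = re.compile(r'https?://[a-z_/0-9\-#=&.@]+\.(?:jpg|png)', re.IGNORECASE)

# Made-up stand-in for driver.page_source.
sample_html = (
    '<img src="https://c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg">'
    '<img src="http://example.invalid/Logo.PNG">'
    '<div style="background-image: url(//example.invalid/skipped.jpg)"></div>'
)

links = IMAGE_URL.findall(sample_html)
print(links)
```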