
Extracting multiple URLs with no 'a' or 'href' tags from web page with BS4

I'm making a simple program with Selenium that goes to Flickr.com, searches for the term the user inputs, and then prints out the URLs of all those images.

I'm struggling on the final part, getting just the URLs of the images. I've been using the class_= search to get the portion of the HTML where the URLs are. This returns the following multiple times when searching for 'apples':

<div class="view photo-list-photo-view requiredToShowOnServer awake" 
   data-view-signature="photo-list-photo-view__engagementModelName_photo-lite-
   models__excludePeople_false__id_6246270647__interactionViewName_photo-list-
   photo-interaction-    view__isOwner_false__layoutItem_1__measureAFT_true__model_1__modelParams_1_    _parentContainer_1__parentSignature_photolist-
   479__requiredToShowOnClient_true__requiredToShowOnServer_true__rowHeightMod    _1__searchTerm_apples__searchType_1__showAdvanced_true__showSort_true__show    Tools_true__sortMenuItems_1__unifiedSubviewParams_1__viewType_jst"
   style="transform: translate(823px, 970px); -webkit-transform:     translate(823px, 970px); -ms-transform: translate(823px, 970px); width:
   237px; height: 178px; background-image:
   url(//c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg)">
<div class="interaction-view"></div>

All I want is for the URL of each image to be like this:

c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg

Since there are no a tags or href attributes, I'm struggling to filter them out.

I tried doing some regex as well at the end such as the following:

print(soup.find_all(re.compile(r'^url\.jpg$')))

But that didn't work.

Here's my full code below anyway, thanks.

import os
import re
import urllib.request as urllib2
import bs4
from selenium import webdriver
from selenium.webdriver.common.keys import Keys 

os.makedirs('My_images', exist_ok=True)

browser = webdriver.Chrome()
browser.implicitly_wait(10)

print("Opening Flickr.com")

siteChoice = 'http://www.flickr.com'

browser.get(siteChoice)

print("Enter your search term: ")

term = input("> ")

searchField = browser.find_element_by_id('search-field')
searchField.send_keys(term)
searchField.submit()

url = siteChoice + '/search/?text=' + term

html = urllib2.urlopen(url)

soup = bs4.BeautifulSoup(html, "html.parser")

print(soup.find_all(class_='view photo-list-photo-view requiredToShowOnServer awake', style = re.compile('staticflickr')))

My changed code:

p = re.compile(r'url\(\/\/([^\)]+)\)')

test_str = str(soup)

all_urls = re.findall(p, test_str)


print('Exporting to file')


with open('flickr_urls.txt', 'w') as f:
    for i in all_urls:
        f.write("%s\n" % i)

print('Done')

Try this regex:

url\(\/\/([^\)]+)\)

Demo

import re
# A plain raw string is used here: the ur'' prefix is a syntax error in Python 3.
p = re.compile(r'url\(\/\/([^\)]+)\)')
test_str = u"<div class=\"view photo-list-photo-view requiredToShowOnServer awake\" \ndata-view-signature=\"photo-list-photo-view__engagementModelName_photo-lite-\nmodels__excludePeople_false__id_6246270647__interactionViewName_photo-list-\nphoto-interaction-    view__isOwner_false__layoutItem_1__measureAFT_true__model_1__modelParams_1_    _parentContainer_1__parentSignature_photolist-\n479__requiredToShowOnClient_true__requiredToShowOnServer_true__rowHeightMod    _1__searchTerm_apples__searchType_1__showAdvanced_true__showSort_true__show    Tools_true__sortMenuItems_1__unifiedSubviewParams_1__viewType_jst\"\n style=\"transform: translate(823px, 970px); -webkit-transform:     translate(823px, 970px); -ms-transform: translate(823px, 970px); width:\n 237px; height: 178px; background-image:\n url(//c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg)\">\n<div class=\"interaction-view\"></div>"

m = re.search(p, test_str)
print(m.group(1))

Output:

c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg
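
The demo above returns only the first match. For a whole results page, re.findall with the same pattern should collect every background-image URL. Here is a minimal sketch, assuming the page HTML is already available as a string. The function name extract_image_urls, the sample markup (including its second, made-up URL), and the choice of https as the scheme for the protocol-relative URLs are all assumptions for illustration.

```python
import re

# Same idea as the pattern above: capture what follows "url(//" up to ")".
URL_PATTERN = re.compile(r'url\(//([^)]+)\)')


def extract_image_urls(html, scheme="https"):
    """Return every url(//...) target found in html, with a scheme prepended."""
    return ["%s://%s" % (scheme, path) for path in URL_PATTERN.findall(html)]


# Hypothetical markup modelled on the div shown in the question.
# The second URL is invented purely to show that multiple matches are returned.
sample_html = (
    '<div style="width: 237px; background-image: '
    'url(//c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg)"></div>'
    '<div style="background-image: '
    'url(//c1.staticflickr.com/1/2/3_abc_m.jpg)"></div>'
)

for image_url in extract_image_urls(sample_html):
    print(image_url)
```

Prepending the scheme makes the results directly usable with urllib.request.urlopen, which does not accept protocol-relative URLs.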

To scrape all the png/jpg links from a page with Selenium:

from selenium import webdriver
driver = webdriver.Firefox()
driver.get("https://www.flickr.com/")
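# Run a JavaScript regex over the rendered page and return every jpg/png URL.
# Raw strings keep the backslashes intact for the JavaScript regex.
links = driver.execute_script(
    r"return document.body.innerHTML.match("
    r"/https?:\/\/[a-z_\/0-9\-\#=&.\@]+\.(jpg|png)/gi)"
)
print(links)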
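
If running a regex over the whole page feels too broad, the style attributes can be read tag by tag instead. Below is a sketch that uses only the standard library's html.parser; the same idea should carry over to BeautifulSoup by reading each tag's style attribute. The class name BackgroundImageCollector and the trimmed sample markup are hypothetical, and the HTML is assumed to be the rendered page source (for example, Selenium's page_source) rather than the raw server response.

```python
import re
from html.parser import HTMLParser

# Capture what follows "url(//" up to the closing ")".
BACKGROUND_URL = re.compile(r'url\(//([^)]+)\)')


class BackgroundImageCollector(HTMLParser):
    """Collect url(//...) targets from the style attribute of every tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        style = dict(attrs).get("style") or ""
        self.urls.extend(BACKGROUND_URL.findall(style))


# Trimmed version of the div from the question, line break included.
sample_html = '''<div class="view photo-list-photo-view"
   style="width: 237px; height: 178px; background-image:
   url(//c3.staticflickr.com/7/6114/6246270647_edc7387cfc_m.jpg)">
<div class="interaction-view"></div></div>'''

collector = BackgroundImageCollector()
collector.feed(sample_html)
print(collector.urls)
```

Restricting the search to style attributes avoids picking up url(...) strings that appear in inline scripts or stylesheets elsewhere on the page.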

