[英]How to scrape src from img html in python
我正在尝试抓取img的src,但是我发现的代码返回了许多img src,但不是我想要的那个。 我不知道自己在做什么错。 我在“ https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html ”上抓取TripAdvisor
这是我要从中提取的HTML代码段:
<div class="restaurants-detail-overview-cards-LocationOverviewCard__cardColumn--2ALwF"><h6>Placering og kontaktoplysninger</h6><span><div><span data-test-target="staticMapSnapshot" class=""><img class="restaurants-detail-overview-cards-LocationOverviewCard__mapImage--22-Al" src="https://trip-raster.citymaps.io/staticmap?scale=1&zoom=15&size=347x137&language=da&center=55.687988,12.596316&markers=icon:http%3A%2F%2Fc1.tacdn.com%2F%2Fimg2%2Fmaps%2Ficons%2Fcomponent_map_pins_v1%2FR_Pin_Small.png|55.68799,12.596316"></span></div></span>
我希望代码返回:(来自src的子字符串)
55.68799,12.596316
我努力了:
import pandas as pd
pd.options.display.max_colwidth = 200
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
import re
web_url = "https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html"
url = urlopen(web_url)
url_html = url.read()
soup = bs(url_html, 'lxml')
soup.find_all('img')
for link in soup.find_all('img'):
print(link.get('src'))
回报是沿着这个路线,但不是我需要的src:
https://static.tacdn.com/img2/branding/rebrand/TA_logo_secondary.svg
https://static.tacdn.com/img2/branding/rebrand/TA_logo_primary.svg
https://static.tacdn.com/img2/branding/rebrand/TA_logo_secondary.svg
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
硒是一种解决方法,我对其进行了测试,并且看起来很有魅力。 这个给你:
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')
driver.get("https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html")
links = driver.find_elements_by_xpath("//*[@src]")
urls = []
for link in links:
url = link.get_attribute('src')
if '|' in url:
urls.append(url.split('|')[1]) # saves in a list only the numbers you want i.e. 55.68799,12.596316
print(url)
print(urls)
以上['55.68799,12.596316']
如果您以前没有使用过selenium
,则可以找到一个网络驱动程序https://chromedriver.storage.googleapis.com/index.html?path=2.46/
或在这里
https://sites.google.com/a/chromium.org/chromedriver/downloads
您可以只请求并重新执行此操作。 只是src的坐标部分是基于位置的变量。
import requests, re
p = re.compile(r'"coords":"(.*?)"')
r = requests.get('https://www.tripadvisor.dk/Restaurant_Review-g189541-d15804886-Reviews-The_Pescatarian-Copenhagen_Zealand.html')
coords = p.findall(r.text)[1]
src = f'https://trip-raster.citymaps.io/staticmap?scale=1&zoom=15&size=347x137&language=da¢er={coords}&markers=icon:http://c1.tacdn.com//img2/maps/icons/component_map_pins_v1/R_Pin_Small.png|{coords}'
print(src)
print(coords)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.