Random </div> interfering with Beautiful Soup
I'm attempting to extract a table from a link. I've done this on a variety of websites, but here I'm running into a strange error.
import requests
from bs4 import BeautifulSoup
#Preliminary get request to website
url = 'https://www.target.com/store-locator/find-stores/10470'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"}
response = requests.get(url, headers=headers, timeout=(3,30))
soup = BeautifulSoup(response.content, 'html.parser')
# Up to here, everything works as would be expected.
# This returns None; the element is never found, despite being visible when the page is inspected.
desired_table = soup.find('div', class_="Row-uds8za-0 gUzGLa h-padding-h-default")
I believe what is going on is that there is an extra </div>. If you inspect the page in the browser and follow div id="root", to div id="viewport", to div id="mainContainer", to div data-component="COMPONENT-222040", then you'll see an extra </div>.
If I were to say
root_table = soup.find(id="root")
print(root_table.prettify())
then you can see that the HTML ends at this extra </div>, even though there is more information after it that I want to access.
Any advice on how to solve this problem would be very much appreciated.
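For what it's worth (an aside, not part of the original question): with the html.parser builder, a stray close tag on its own does not normally cut the tree short; content after it is still parsed. A minimal, self-contained sketch:

```python
from bs4 import BeautifulSoup

# One extra </div> before the <p>; html.parser skips the unmatched
# close tag and keeps parsing the content that follows it.
broken = '<div id="root"><div id="inner"></div></div></div><p id="after">still parsed</p>'
soup = BeautifulSoup(broken, 'html.parser')
print(soup.find(id='after').text)
```

If the tree really were truncated by malformed markup, switching to a more forgiving parser such as html5lib would be the usual fix; here, though, the missing content is rendered by JavaScript, as the answer below explains.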
The data about the shops is loaded dynamically via JavaScript, so it never appears in the raw HTML that requests downloads. You can simulate the Ajax call with the requests library. For example:
import re
import json
import requests
url = 'https://www.target.com/store-locator/find-stores/10470'
ajax_url = 'https://redsky.target.com/v3/stores/nearby/{zip_no}?key={api_key}&limit=20&within=100&unit=mile'
api_key = re.search(r'"apiKey":"(.*?)"', requests.get(url).text).group(1)
zip_no = url.rsplit('/', maxsplit=1)[-1]
data = requests.get(ajax_url.format(zip_no=zip_no, api_key=api_key)).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for location in data[0]['locations']:
    name = location['location_names'][0]['name']
    h = location['rolling_operating_hours']['regular_event_hours']['days'][0]['hours'][0]
    hours = '{} - {}'.format(h['begin_time'], h['end_time'])
    print('{:<50}{}'.format(name, hours))
Prints:
Bronx Riverdale 08:00:00 - 21:00:00
Mt Vernon 08:00:00 - 21:00:00
Bronx-Throggs Neck 08:00:00 - 21:00:00
Bronx Terminal 08:00:00 - 21:00:00
Closter NJ 08:00:00 - 21:00:00
Harlem 08:00:00 - 21:00:00
College Point 08:00:00 - 21:00:00
Edgewater 08:00:00 - 21:00:00
Hackensack 08:00:00 - 21:00:00
Port Washington North 07:00:00 - 21:00:00
Flushing 08:00:00 - 21:00:00
Upper East Side 70th and 3rd 07:00:00 - 21:00:00
Paramus 08:00:00 - 21:00:00
North Bergen Commons 08:00:00 - 21:00:00
White Plains 08:00:00 - 21:00:00
Queens Place 08:00:00 - 21:00:00
Manhattan Herald Square 07:00:00 - 21:00:00
Kips Bay 07:00:00 - 21:00:00
Forest Hills 07:00:00 - 21:00:00
Manhattan East Village 07:00:00 - 21:00:00
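As a side note on building the Ajax URL: instead of interpolating the query string with str.format, the parameters can be kept in a dict (requests.get also accepts such a dict directly via its params= argument). A standard-library sketch, with a placeholder key:

```python
from urllib.parse import urlencode

base = 'https://redsky.target.com/v3/stores/nearby/10470'
# 'APIKEY' is a placeholder -- the real key is scraped from the page HTML.
params = {'key': 'APIKEY', 'limit': 20, 'within': 100, 'unit': 'mile'}
ajax_url = base + '?' + urlencode(params)
print(ajax_url)
```

This keeps the key URL-encoded automatically, which matters if the scraped value ever contains characters that are not query-string safe.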