
Random </div> interfering with Beautiful Soup

I'm attempting to extract a table from a link. I've done this on a variety of websites, but here I'm running into a strange error.

import requests
from bs4 import BeautifulSoup

#Preliminary get request to website
url = 'https://www.target.com/store-locator/find-stores/10470'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"}

response = requests.get(url, headers=headers, timeout=(3,30))

soup = BeautifulSoup(response.content, 'html.parser')

# Up to here, everything works as would be expected. 

# This will return None: nothing is found, despite the element being visible when the page is inspected.
desired_table = soup.find('div', class_="Row-uds8za-0 gUzGLa h-padding-h-default")

I believe what is going on is that there is an extra </div>. If you inspect the page in the browser and follow div id="root", to div id="viewport", to div id="mainContainer", to div data-component="COMPONENT-222040", then you'll see an extra </div>.

If I were to say:

root_table = soup.find(id="root")
print(root_table.prettify())

then you can see that the HTML ends at this extra </div>, despite there being more information that I want access to.
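
One way to test whether the stray </div> is a parsing artifact is to compare a strict parser against a more lenient one. html5lib (installed separately with pip install html5lib) repairs broken markup the way a browser would, so if html.parser is truncating the tree at the bad tag, the two parses will differ. A minimal sketch, assuming the same request as above:

import requests
from bs4 import BeautifulSoup

url = 'https://www.target.com/store-locator/find-stores/10470'
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=(3, 30))

# html.parser stops recovering at the stray </div>; html5lib rebuilds the
# tree the way a browser would, keeping the content after the bad tag
soup_strict = BeautifulSoup(response.content, 'html.parser')
soup_lenient = BeautifulSoup(response.content, 'html5lib')

print(len(soup_strict.find_all('div')), len(soup_lenient.find_all('div')))

Note that even a lenient parser only repairs the markup it receives; if the table is injected by JavaScript after the page loads, it will not appear in either parse (which, as the answer below explains, is the case here).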

Any advice on how to solve this problem would be very much appreciated.

The data about the shops is loaded dynamically via JavaScript. You can simulate the Ajax calls with the requests library.

For example:

import re
import json
import requests

url = 'https://www.target.com/store-locator/find-stores/10470'
ajax_url = 'https://redsky.target.com/v3/stores/nearby/{zip_no}?key={api_key}&limit=20&within=100&unit=mile'

# the page embeds a redsky.target.com API key in its HTML; extract it with a regex
api_key = re.search(r'"apiKey":"(.*?)"', requests.get(url).text).group(1)
# the ZIP code is the last path segment of the store-locator URL
zip_no = url.rsplit('/', maxsplit=1)[-1]

data = requests.get(ajax_url.format(zip_no=zip_no, api_key=api_key)).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for location in data[0]['locations']:
    name = location['location_names'][0]['name']
    # first regular operating-hours entry for the first listed day
    h = location['rolling_operating_hours']['regular_event_hours']['days'][0]['hours'][0]
    hours = '{} - {}'.format(h['begin_time'], h['end_time'])
    print('{:<50}{}'.format(name, hours))

Prints:

Bronx Riverdale                                   08:00:00 - 21:00:00
Mt Vernon                                         08:00:00 - 21:00:00
Bronx-Throggs Neck                                08:00:00 - 21:00:00
Bronx Terminal                                    08:00:00 - 21:00:00
Closter NJ                                        08:00:00 - 21:00:00
Harlem                                            08:00:00 - 21:00:00
College Point                                     08:00:00 - 21:00:00
Edgewater                                         08:00:00 - 21:00:00
Hackensack                                        08:00:00 - 21:00:00
Port Washington North                             07:00:00 - 21:00:00
Flushing                                          08:00:00 - 21:00:00
Upper East Side 70th and 3rd                      07:00:00 - 21:00:00
Paramus                                           08:00:00 - 21:00:00
North Bergen Commons                              08:00:00 - 21:00:00
White Plains                                      08:00:00 - 21:00:00
Queens Place                                      08:00:00 - 21:00:00
Manhattan Herald Square                           07:00:00 - 21:00:00
Kips Bay                                          07:00:00 - 21:00:00
Forest Hills                                      07:00:00 - 21:00:00
Manhattan East Village                            07:00:00 - 21:00:00
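
If you want to keep the results rather than just print them, a minimal follow-up sketch could write the same fields to CSV. It repeats the fetch from above so it runs on its own; the output file name store_hours.csv is an arbitrary choice for this example:

import csv
import re
import requests

url = 'https://www.target.com/store-locator/find-stores/10470'
ajax_url = 'https://redsky.target.com/v3/stores/nearby/{zip_no}?key={api_key}&limit=20&within=100&unit=mile'

# same API-key and ZIP extraction as in the snippet above
api_key = re.search(r'"apiKey":"(.*?)"', requests.get(url).text).group(1)
zip_no = url.rsplit('/', maxsplit=1)[-1]
data = requests.get(ajax_url.format(zip_no=zip_no, api_key=api_key)).json()

with open('store_hours.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['store', 'open', 'close'])
    for location in data[0]['locations']:
        name = location['location_names'][0]['name']
        h = location['rolling_operating_hours']['regular_event_hours']['days'][0]['hours'][0]
        writer.writerow([name, h['begin_time'], h['end_time']])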
