Random </div> interfering with Beautiful Soup
I'm attempting to extract a table from a link. I've done this on a variety of websites, but here I'm running into a strange error.
import requests
from bs4 import BeautifulSoup
#Preliminary get request to website
url = 'https://www.target.com/store-locator/find-stores/10470'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"}
response = requests.get(url, headers=headers, timeout=(3,30))
soup = BeautifulSoup(response.content, 'html.parser')
# Up to here, everything works as would be expected.
# This returns None; the element is never found, despite being visible when the page is inspected.
desired_table = soup.find('div', class_="Row-uds8za-0 gUzGLa h-padding-h-default")
I believe what is going on is that there is an extra </div>. If you inspect the page in the browser and follow div id="root", to div id="viewport", to div id="mainContainer", to div data-component="COMPONENT-222040", then you'll see an extra </div>.
If I were to say
root_table = soup.find(id="root")
print(root_table.prettify())
then you can see that the HTML ends at this extra </div>, even though there is more information after it that I want to access.
Any advice on how to solve this problem would be very much appreciated.
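For what it's worth (an aside, not part of the original question): with the html.parser builder, a stray close tag on its own does not normally cut the tree short; content after it is still parsed. A minimal, self-contained sketch:

```python
from bs4 import BeautifulSoup

# One extra </div> before the <p>; html.parser skips the unmatched
# close tag and keeps parsing the content that follows it.
broken = '<div id="root"><div id="inner"></div></div></div><p id="after">still parsed</p>'
soup = BeautifulSoup(broken, 'html.parser')
print(soup.find(id='after').text)
```

If the tree really were truncated by malformed markup, switching to a more forgiving parser such as html5lib would be the usual fix; here, though, the missing content is rendered by JavaScript, as the answer below explains.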
The data about the shops is loaded dynamically via JavaScript, so it never appears in the raw HTML that requests downloads. You can simulate the Ajax call with the requests library. For example:
import re
import json
import requests
url = 'https://www.target.com/store-locator/find-stores/10470'
ajax_url = 'https://redsky.target.com/v3/stores/nearby/{zip_no}?key={api_key}&limit=20&within=100&unit=mile'
api_key = re.search(r'"apiKey":"(.*?)"', requests.get(url).text).group(1)
zip_no = url.rsplit('/', maxsplit=1)[-1]
data = requests.get(ajax_url.format(zip_no=zip_no, api_key=api_key)).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for location in data[0]['locations']:
    name = location['location_names'][0]['name']
    h = location['rolling_operating_hours']['regular_event_hours']['days'][0]['hours'][0]
    hours = '{} - {}'.format(h['begin_time'], h['end_time'])
    print('{:<50}{}'.format(name, hours))
Prints:
Bronx Riverdale 08:00:00 - 21:00:00
Mt Vernon 08:00:00 - 21:00:00
Bronx-Throggs Neck 08:00:00 - 21:00:00
Bronx Terminal 08:00:00 - 21:00:00
Closter NJ 08:00:00 - 21:00:00
Harlem 08:00:00 - 21:00:00
College Point 08:00:00 - 21:00:00
Edgewater 08:00:00 - 21:00:00
Hackensack 08:00:00 - 21:00:00
Port Washington North 07:00:00 - 21:00:00
Flushing 08:00:00 - 21:00:00
Upper East Side 70th and 3rd 07:00:00 - 21:00:00
Paramus 08:00:00 - 21:00:00
North Bergen Commons 08:00:00 - 21:00:00
White Plains 08:00:00 - 21:00:00
Queens Place 08:00:00 - 21:00:00
Manhattan Herald Square 07:00:00 - 21:00:00
Kips Bay 07:00:00 - 21:00:00
Forest Hills 07:00:00 - 21:00:00
Manhattan East Village 07:00:00 - 21:00:00
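As a side note on building the Ajax URL: instead of interpolating the query string with str.format, the parameters can be kept in a dict (requests.get also accepts such a dict directly via its params= argument). A standard-library sketch, with a placeholder key:

```python
from urllib.parse import urlencode

base = 'https://redsky.target.com/v3/stores/nearby/10470'
# 'APIKEY' is a placeholder -- the real key is scraped from the page HTML.
params = {'key': 'APIKEY', 'limit': 20, 'within': 100, 'unit': 'mile'}
ajax_url = base + '?' + urlencode(params)
print(ajax_url)
```

This keeps the key URL-encoded automatically, which matters if the scraped value ever contains characters that are not query-string safe.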