
Python web-scraping using BeautifulSoup: Lowe's stores

I am new to this. I have been asked to get a list of store numbers, cities, and states from this website: https://www.lowes.com/Lowes-Stores

Below is what I have tried so far. Since the structure has no attributes, I am not sure how to proceed with my code. Please advise!

import requests
from bs4 import BeautifulSoup
import json
from pandas import DataFrame as df

url = "https://www.lowes.com/Lowes-Stores"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

page = requests.get(url, headers=headers)
page.encoding = 'ISO-8859-1'
soup = BeautifulSoup(page.text, 'html.parser')

lowes_list = soup.find_all(class_="list unstyled")
for i in lowes_list[:2]:
    print(i)

example = lowes_list[0]
example_content = example.contents
example_content

You have already found the list elements containing the links you need for the state store lookup in your for loop. You need to get the href attribute from the 'a' tag inside each 'li' element.

That is only the first step, since you need to follow those links to get the store results for each state.

Since you know the structure of these state link results, you can simply do:

for i in lowes_list:
    list_items = i.find_all('li')
    for x in list_items:
        for link in x.find_all('a'):
            print(link['href'])

There are certainly more efficient ways to do this, but the list is quite small, so this works.
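
For instance, here is a minimal sketch of one such shortcut using a CSS selector. It assumes the matched elements are ul tags with the classes "list" and "unstyled", which the markup above suggests:

# a more compact alternative to the nested loops above, assuming the
# matched elements are <ul class="list unstyled"> tags
state_links = [a["href"] for a in soup.select("ul.list.unstyled li a")]
# each href is expected to look like "/Lowes-Stores/<StateName>/<AB>",
# which is the shape the split logic in the full example below relies on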

Once you have the links for each state, you can make another request for each one to reach those store results pages, then grab the href attribute from the search result links on each state page.

<a href="/store/AK-Anchorage/0289">Anchorage Lowe's</a>

which contains both the city and the store number.
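
As a quick illustration of pulling both out of that href (a minimal sketch built only from the example link above; the same split logic appears in the full script below):

href = "/store/AK-Anchorage/0289"
parts = href.split("/")               # ['', 'store', 'AK-Anchorage', '0289']
store_number = parts[3]               # '0289'
store_city = parts[2].split("-")[1]   # 'Anchorage'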

Here is a complete example. I have included plenty of comments to illustrate the points.

You had almost everything up to line 27, but you need to follow the links for each state. A good way to work through problems like this is to first test the paths in your web browser with the dev tools open and watch the HTML, so you have a good idea of where to start with your code.

This script will get the data you need, but it does not do any presentation of that data.

import requests
from bs4 import BeautifulSoup as bs


url = "https://www.lowes.com/Lowes-Stores"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}

page = requests.get(url, headers=headers, timeout=5)
page.encoding = "ISO-8859-1"
soup = bs(page.text, "html.parser")

lowes_state_lists = soup.find_all(class_="list unstyled")

# we will store the links for each state in this array
state_stores_links = []

# now we populate the state_stores_links array by finding the href in each li tag
for ul in lowes_state_lists:
    list_items = ul.find_all("li")
    # now we have all the list items from the page, we have to extract the href
    for li in list_items:
        for link in li.find_all("a"):
            state_stores_links.append(link["href"])

# This next part is what the original question was missing, following the state links to their respective search result pages. 

# at this point we have to request a new page for each state and store the results
# you can use pandas, but a dict works too.
states_stores = {}


for link in state_stores_links:
    # splitting up the link on the / gives us the parts of the URLs.
    # by inspecting with Chrome DevTools, we can see that each state follows the same pattern (state name and state abbreviation)
    link_components = link.split("/")
    state_name = link_components[2]
    state_abbreviation = link_components[3]

    # let's use the state_abbreviation as the dict's key, and we will have a stores array that we can do reporting on
    # the type and shape of this dict are not important at this point; this example illustrates how to obtain the info you're after
    # in the end, the states_stores[state_abbreviation]['stores'] array will hold dicts, each with a store_number and a city key
    states_stores[state_abbreviation] = {"state_name": state_name, "stores": []}

    try:
        # simple error catching in case something goes wrong, since we are sending many requests.
        # our link is just the second half of the URL, so we have to craft the new one.
        new_link = "https://www.lowes.com" + link
        state_search_results = requests.get(new_link, headers=headers, timeout=5)
        stores = []
        if state_search_results.status_code == 200:
            store_directory = bs(state_search_results.content, "html.parser")
            store_directory_div = store_directory.find("div", class_="storedirectory")
            # now we get the links inside the storedirectory div
            individual_store_links = store_directory_div.find_all("a")
            # we now have all the stores for this state! Let's parse and save them into our store dict
            # the store's city is after the state's abbreviation followed by a dash, the store number is the last thing in the link
            # example: "/store/AK-Wasilla/2512"
            for store in individual_store_links:
                href = store["href"]
                try:
                    # by splitting the href which looks to be consistent throughout the site, we can get the info we need
                    split_href = href.split("/")
                    store_number = split_href[3]
                    # the store city is after the -, so we have to split that element up into its two parts and access the second part.
                    store_city = split_href[2].split("-")[1]
                    # creating our store dict
                    store_object = {"city": store_city, "store_number": store_number}
                    # adding the dict to our state's dict
                    states_stores[state_abbreviation]["stores"].append(store_object)
                except Exception as e:
                    print(
                        "Error getting store info from {0}. Exception: {1}".format(
                            split_href, e
                        )
                    )

            # let's print something so we can confirm our script is working
            print(
                "State store count for {0} is: {1}".format(
                    states_stores[state_abbreviation]["state_name"],
                    len(states_stores[state_abbreviation]["stores"]),
                )
            )
        else:
            print(
                "Error fetching: {0}, error code: {1}".format(
                    link, state_search_results.status_code
                )
            )
    except Exception as e:
        print("Error fetching: {0}. Exception: {1}".format(state_abbreviation, e))
