Scraping multiple pages with beautifulsoup4 using Python 3.6.3

Question

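I am trying to loop through multiple pages, but my code isn't extracting anything. I'm fairly new to scraping, so please bear with me. I made a container so that I can target each listing. I also created a variable to target the anchor tag you would press to go to the next page. I would really appreciate any help I can get. Thank you.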

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

for page in range(0,25):
    file = "breakfeast_chicago.csv"
    f = open(file, "w")
    Headers = "Nambusiness_name, business_address, business_city, business_region, business_phone_number\n"
f.write(Headers)

my_url = 'https://www.yellowpages.com/search?search_terms=Stores&geo_location_terms=Chicago%2C%20IL&page={}'.format(page)

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()   

# html parsing
page_soup = soup(page_html, "html.parser")

# grabs each listing
containers = page_soup.findAll("div",{"class": "result"})

new = page_soup.findAll("a", {"class":"next ajax-page"})

for i in new:
    try:
        for container in containers:
            b_name = i.find("container.h2.span.text").get_text()
            b_addr = i.find("container.p.span.text").get_text()

            city_container = container.findAll("span",{"class": "locality"})
            b_city = i.find("city_container[0].text ").get_text()

            region_container = container.findAll("span",{"itemprop": "postalCode"})
            b_reg = i.find("region_container[0].text").get_text()

            phone_container = container.findAll("div",{"itemprop": "telephone"})
            b_phone = i.find("phone_container[0].text").get_text()

            print(b_name, b_addr, b_city, b_reg, b_phone)
            f.write(b_name + "," +b_addr + "," +b_city.replace(",", "|") + "," +b_reg + "," +b_phone + "\n")
    except: AttributeError
f.close()

Answer 1

If you are using BS4, try find_all (findAll is the older spelling of the same method).
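As a rough illustration of that suggestion, here is a minimal sketch of find_all and find called with a tag name and attribute filters, rather than with a string containing a Python expression as in the question. The SAMPLE_HTML markup and the helper names text_or_blank and parse_listings are made up for this example; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one page of search results.
SAMPLE_HTML = """
<div class="result">
  <h2><span>Acme Hardware</span></h2>
  <span class="locality">Chicago</span>
  <div itemprop="telephone">(312) 555-0100</div>
</div>
<div class="result">
  <h2><span>Lakeview Books</span></h2>
  <span class="locality">Chicago</span>
</div>
"""


def text_or_blank(tag):
    """Return the stripped text of a tag, or "" when find() returned None."""
    return tag.get_text(strip=True) if tag is not None else ""


def parse_listings(html):
    """Collect (name, city, phone) tuples from every div with class "result"."""
    page_soup = BeautifulSoup(html, "html.parser")
    rows = []
    # find_all takes a tag name and attribute filters, not an expression string.
    for container in page_soup.find_all("div", {"class": "result"}):
        name = text_or_blank(container.find("h2"))
        city = text_or_blank(container.find("span", {"class": "locality"}))
        phone = text_or_blank(container.find("div", {"itemprop": "telephone"}))
        rows.append((name, city, phone))
    return rows


print(parse_listings(SAMPLE_HTML))
```

Checking for None before calling get_text avoids the AttributeError that the question's try/except was trying to swallow, so a listing with a missing field still produces a row.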

Try dropping in a breakpoint with import pdb;pdb.set_trace() and debug what is actually being selected in the for loop.

Also, some content may be hidden if it is loaded via JavaScript.

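Each anchor tag or href that you would "click" is just another network request. If you plan to follow those links, consider slowing down the rate of your requests so that you don't get blocked.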
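One way to slow the requests down, sketched here under assumptions: build each page URL from the pattern in the question and sleep between consecutive requests. The function names fetch and fetch_pages, the fetcher and sleep parameters, and the 2-second default delay are illustrative choices, not something the site documents.

```python
import time
from urllib.request import urlopen

# URL pattern taken from the question; the page number fills the placeholder.
BASE_URL = ("https://www.yellowpages.com/search?search_terms=Stores"
            "&geo_location_terms=Chicago%2C%20IL&page={}")


def fetch(url):
    """Download one page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def fetch_pages(first, last, delay_seconds=2.0, fetcher=fetch, sleep=time.sleep):
    """Fetch pages first..last inclusive, pausing between consecutive requests."""
    pages = []
    for page in range(first, last + 1):
        if pages:
            sleep(delay_seconds)  # pause before every request except the first
        pages.append(fetcher(BASE_URL.format(page)))
    return pages


# Example call (performs real network requests):
# html_pages = fetch_pages(1, 3)
```

Passing the fetcher and sleep functions as parameters keeps the loop easy to try out without touching the network.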

Answer 2

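You can try something like the script below. It walks through the different pages via pagination and collects the name and phone number from each container.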

import requests
from bs4 import BeautifulSoup

my_url = "https://www.yellowpages.com/search?search_terms=Stores&geo_location_terms=Chicago%2C%20IL&page={}"
for link in [my_url.format(page) for page in range(1,5)]:
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")

    for item in soup.select(".info"):
        try:
            name = item.select(".business-name [itemprop='name']")[0].text
        except Exception:
            name = ""
        try:
            phone = item.select("[itemprop='telephone']")[0].text
        except Exception:
            phone = ""

        print(name,phone)
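The script above only prints its results, while the question wanted a CSV file. The question's code also opens the file in "w" mode inside the page loop, which truncates it on every pass. A minimal sketch of the writing step with the standard csv module, which quotes fields containing commas so no manual replace is needed, might look like this. The column names, the write_rows helper, and the sample rows are assumptions for illustration.

```python
import csv
import io

# Assumed column names for the two fields the script collects.
FIELDS = ["business_name", "business_phone_number"]


def write_rows(rows, out):
    """Write a header line and (name, phone) pairs as CSV to an open text file."""
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    writer.writerows(rows)


# Hypothetical rows standing in for the (name, phone) pairs the loop prints.
sample = [("Acme Hardware, Inc.", "(312) 555-0100"), ("Lakeview Books", "")]

buffer = io.StringIO()
write_rows(sample, buffer)
print(buffer.getvalue())

# To write a real file, open it once, outside the page loop:
# with open("stores_chicago.csv", "w", newline="") as f:
#     write_rows(all_rows, f)
```

In the scraping loop, each (name, phone) pair would be appended to a list instead of printed, and the list written once at the end.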

Answer 1: score 1, 2017-11-27 23:03:54

Answer 2: score 0, accepted, 2017-11-28 06:23:48
