
Scraping multiple pages with beautifulsoup4 using python 3.6.3

I am trying to loop through multiple pages and my code doesn't extract anything. I am kind of new to scraping so bear with me. I made a container so I can target each listing. I also made a variable to target the anchor tag that you would press to go to the next page. I would really appreciate any help I could get. Thanks.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

for page in range(0, 25):
    file = "breakfeast_chicago.csv"
    f = open(file, "w")
    Headers = "business_name, business_address, business_city, business_region, business_phone_number\n"
    f.write(Headers)

    my_url = 'https://www.yellowpages.com/search?search_terms=Stores&geo_location_terms=Chicago%2C%20IL&page={}'.format(page)

    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()

    # html parsing
    page_soup = soup(page_html, "html.parser")

    # grabs each listing
    containers = page_soup.findAll("div", {"class": "result"})

    new = page_soup.findAll("a", {"class": "next ajax-page"})

    for i in new:
        try:
            for container in containers:
                b_name = i.find("container.h2.span.text").get_text()
                b_addr = i.find("container.p.span.text").get_text()

                city_container = container.findAll("span", {"class": "locality"})
                b_city = i.find("city_container[0].text ").get_text()

                region_container = container.findAll("span", {"itemprop": "postalCode"})
                b_reg = i.find("region_container[0].text").get_text()

                phone_container = container.findAll("div", {"itemprop": "telephone"})
                b_phone = i.find("phone_container[0].text").get_text()

                print(b_name, b_addr, b_city, b_reg, b_phone)
                f.write(b_name + "," + b_addr + "," + b_city.replace(",", "|") + "," + b_reg + "," + b_phone + "\n")
        except: AttributeError
    f.close()

If you are using BS4, try find_all.

Try dropping into a trace using import pdb; pdb.set_trace() and debug what is actually being selected in the for loop.

Also, some content may be hidden if it is loaded via javascript.

Each anchor tag or href for "clicking" is just another network request, and if you plan to follow the links, consider adding a pause between requests so you don't get blocked.
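That throttling advice can be sketched as follows; fetch_politely and its delay parameter are illustrative names, not part of the original answer:

```python
import time

def fetch_politely(urls, fetch, delay=2.0):
    """Call fetch(url) for each url, sleeping `delay` seconds between
    requests so the target site is not flooded."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no pause is needed before the very first request
            time.sleep(delay)
        results.append(fetch(url))
    return results

# With requests it would be used like:
#   pages = fetch_politely(page_urls, requests.get, delay=2.0)
```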

You can try the script below. It will traverse different pages through pagination and collect the name and phone number from each container.

import requests
from bs4 import BeautifulSoup

my_url = "https://www.yellowpages.com/search?search_terms=Stores&geo_location_terms=Chicago%2C%20IL&page={}"
for link in [my_url.format(page) for page in range(1,5)]:
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")

    for item in soup.select(".info"):
        try:
            name = item.select(".business-name [itemprop='name']")[0].text
        except Exception:
            name = ""
        try:
            phone = item.select("[itemprop='telephone']")[0].text
        except Exception:
            phone = ""

        print(name,phone)
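The original question also writes the rows to a CSV file, escaping embedded commas by hand with replace(",", "|"); the standard csv module handles quoting automatically. A minimal sketch of that output step (the sample row is a placeholder, not scraped data):

```python
import csv

# Placeholder row standing in for scraped values; in the real script each
# row would be built from the name/phone fields parsed above.
rows = [("Example Store", "123 Main St", "Chicago", "60601", "(312) 555-0100")]

with open("breakfeast_chicago.csv", "w", newline="") as f:
    writer = csv.writer(f)  # quotes fields containing commas automatically
    writer.writerow(["business_name", "business_address", "business_city",
                     "business_region", "business_phone_number"])
    writer.writerows(rows)
```

Opening the file once, before the page loop, also avoids the bug in the question's code, where open(file, "w") inside the loop truncates the file on every page.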
