How to loop & scrape data for multiple pages using python and beautifulsoup4

Scraping multiple pages with beautifulsoup4 using python 3.6.3
I am trying to loop through multiple pages, but my code doesn't extract anything. I'm somewhat new to scraping, so bear with me. I made a container so I can target each listing, and I also created a variable targeting the anchor tag you would press to go to the next page. I'd really appreciate any help I can get. Thanks.
```python
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

for page in range(0, 25):
    file = "breakfeast_chicago.csv"
    f = open(file, "w")
    Headers = "business_name, business_address, business_city, business_region, business_phone_number\n"
    f.write(Headers)

    my_url = 'https://www.yellowpages.com/search?search_terms=Stores&geo_location_terms=Chicago%2C%20IL&page={}'.format(page)
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()

    # html parsing
    page_soup = soup(page_html, "html.parser")

    # grabs each listing
    containers = page_soup.findAll("div", {"class": "result"})
    new = page_soup.findAll("a", {"class": "next ajax-page"})
    for i in new:
        try:
            for container in containers:
                b_name = i.find("container.h2.span.text").get_text()
                b_addr = i.find("container.p.span.text").get_text()
                city_container = container.findAll("span", {"class": "locality"})
                b_city = i.find("city_container[0].text ").get_text()
                region_container = container.findAll("span", {"itemprop": "postalCode"})
                b_reg = i.find("region_container[0].text").get_text()
                phone_container = container.findAll("div", {"itemprop": "telephone"})
                b_phone = i.find("phone_container[0].text").get_text()
                print(b_name, b_addr, b_city, b_reg, b_phone)
                f.write(b_name + "," + b_addr + "," + b_city.replace(",", "|") + "," + b_reg + "," + b_phone + "\n")
        except AttributeError:
            pass
    f.close()
```
If you're using BS4, try `find_all`.

Try dropping `import pdb; pdb.set_trace()` inside the loop and debug what is actually being selected in the for loop.

Also, some content may be hidden if it is loaded via javascript.

Every anchor tag or href used for "clicking" is just another network request, and if you intend to follow those links, consider slowing down between requests so you don't get blocked.
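The slow-down advice above can be sketched with a simple delay between page requests. This is a minimal illustration, not the answerer's code: the `fetch_politely` helper and the 2-second default delay are assumptions, and the fetcher is passed in so the sketch runs without network access.

```python
import time

BASE_URL = ("https://www.yellowpages.com/search?search_terms=Stores"
            "&geo_location_terms=Chicago%2C%20IL&page={}")

def fetch_politely(pages, fetch, delay=2.0):
    """Call fetch(url) for each page URL, sleeping `delay` seconds
    between requests so the server is less likely to block us."""
    results = []
    for page in pages:
        results.append(fetch(BASE_URL.format(page)))
        time.sleep(delay)
    return results

# Stub fetcher that just records each URL it was given,
# so we can inspect what would be requested:
visited = fetch_politely(range(1, 4), fetch=lambda url: url, delay=0.0)
print(visited)
```

In real use you would pass `fetch=lambda url: requests.get(url).text` (or similar) and keep a nonzero delay.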
You can try a script like the one below. It traverses the different pages through pagination and collects the name and phone number from each container.
```python
import requests
from bs4 import BeautifulSoup

my_url = "https://www.yellowpages.com/search?search_terms=Stores&geo_location_terms=Chicago%2C%20IL&page={}"

for link in [my_url.format(page) for page in range(1, 5)]:
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".info"):
        try:
            name = item.select(".business-name [itemprop='name']")[0].text
        except Exception:
            name = ""
        try:
            phone = item.select("[itemprop='telephone']")[0].text
        except Exception:
            phone = ""
        print(name, phone)
```
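If you then want to save the results, the question's hand-rolled CSV writing (replacing commas with `|`) can be avoided with the stdlib `csv` module, which quotes embedded commas automatically. A small sketch with made-up rows:

```python
import csv
import io

# Hypothetical scraped rows; a name may contain a comma.
rows = [("Lou's Diner", "(312) 555-0100"),
        ("Stores, Inc.", "(312) 555-0101")]

buf = io.StringIO()  # in real use: open("stores_chicago.csv", "w", newline="")
writer = csv.writer(buf)
writer.writerow(["business_name", "business_phone_number"])
writer.writerows(rows)  # csv.writer quotes the embedded comma for us
print(buf.getvalue())
```

Inside the scraping loop you would call `writer.writerow([name, phone])` once per listing instead of building the line by string concatenation.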