Web scraping in Python (BeautifulSoup)

Question

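I am trying to do some web scraping and am currently stuck on how to continue with my code. I am trying to write code that scrapes the first 80 Yelp reviews. Since there are only 20 reviews per page, I am also stuck on how to create a loop that moves the page on to the next 20 reviews.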

from bs4 import BeautifulSoup
import requests
import time

all_reviews = ''

def get_description(pullman):
    url = 'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
    # get webpage data from url
    response = requests.get(url)
    # sleep for 2 seconds
    time.sleep(2)
    # get html document from web page data
    html_doc = response.text
    # parser
    soup = BeautifulSoup(html_doc, "lxml")
    page_title = soup.title.text
    # get a tag's content based on its class
    p_tag = soup.find_all('p', class_='lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_')[0]
    # return the text within the tag
    return p_tag.text

Answer 1

General note/tip: use the "Inspect" tool on the page you want to scrape.

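As for your question, it will also work better if you fetch the site once, parse it with BeautifulSoup, and then use the soup object inside your functions: visit once, parse as many times as you like. That way you are less likely to be blacklisted by the site for requesting it too often. Below is an example structure.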

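from bs4 import BeautifulSoup
import requests
import time

url = 'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
# get webpage data from url
response = requests.get(url)
# sleep for 2 seconds
time.sleep(2)
# get html document from web page data
html_doc = response.text
# parse once, then reuse the same soup object
soup = BeautifulSoup(html_doc, "lxml")
# your own functions take the parsed soup instead of fetching the page again
get_description(soup)
get_reviews(soup)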

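If you inspect the page, each review appears as a copy of the same template. If you treat each review as a separate object and parse it, you can get the reviews you are looking for. The review template has the class attribute: lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT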
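As a rough sketch of treating each review as its own object, something like the following might work. The SAMPLE_HTML string is a made-up, heavily trimmed stand-in for the real page, and get_reviews is only one possible shape for that function. The class names are the generated ones quoted in this thread, so they may well have changed on the live site.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed stand-in for a Yelp reviews page (not real page content).
SAMPLE_HTML = """
<ul>
  <li class="lemon--li__373c0__1r9wz u-space-b3 u-padding-b3">
    <p class="lemon--p__373c0__3Qnnj comment__373c0__3EKjH">Great burgers.</p>
  </li>
  <li class="lemon--li__373c0__1r9wz u-space-b3 u-padding-b3">
    <p class="lemon--p__373c0__3Qnnj comment__373c0__3EKjH">Slow service.</p>
  </li>
</ul>
"""


def get_reviews(soup):
    """Return the text of every review found in an already-parsed page."""
    reviews = []
    # Passing a single class name matches any tag that carries that class,
    # so the full multi-class string does not have to be repeated here.
    for item in soup.find_all("li", class_="lemon--li__373c0__1r9wz"):
        comment = item.find("p", class_="comment__373c0__3EKjH")
        if comment is not None:  # skip list items that are not reviews
            reviews.append(comment.get_text(strip=True))
    return reviews


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(get_reviews(soup))  # ['Great burgers.', 'Slow service.']
```

Matching on one class name rather than the whole class string makes the lookup a little less brittle if the other classes on the tag change.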

As for pagination, the page numbers are contained in a template with class="lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 border-color--default__373c0__2oFDT nowrap__373c0__1_N1j"

The individual page-number links are contained in <a> tags with href attributes, so just write a for loop to iterate over the links.
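The loop over the pagination links could look roughly like this. The markup in SAMPLE_HTML and the ?start=... hrefs are assumptions made for illustration, and get_page_links is a hypothetical helper name. The real links may look different.

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE_URL = "https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city"

# Hypothetical pagination markup; the href values are assumed, not copied from Yelp.
SAMPLE_HTML = """
<div class="lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6">
  <a href="/biz/pullman-bar-and-diner-iowa-city?start=20">2</a>
  <a href="/biz/pullman-bar-and-diner-iowa-city?start=40">3</a>
  <a href="/biz/pullman-bar-and-diner-iowa-city?start=60">4</a>
</div>
"""


def get_page_links(soup, base_url):
    """Collect absolute URLs for every link inside the pagination container."""
    container = soup.find("div", class_="pagination-links__373c0__2ZHo6")
    if container is None:  # no pagination block on this page
        return []
    links = []
    for a_tag in container.find_all("a", href=True):
        # urljoin turns a relative href into a full URL
        links.append(urljoin(base_url, a_tag["href"]))
    return links


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
for link in get_page_links(soup, BASE_URL):
    print(link)
```

Each returned URL can then be fetched and parsed in the same way as the first page.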

Answer 2

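To get the next page, you have to follow the "Next" link. The problem here is that the link is the same as the current URL with a # appended. Open the Inspector [Ctrl-Shift-I in Chrome, Firefox], switch to the Network tab, then click the Next button, and you will see a request to something like: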

https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed?rl=en&sort_by=relevance_desc&q=&start=40

The response looks like:

{"reviews": [{"comment": {"text": "Such a great experience every time you come into this place ......

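This is JSON. The only catch is that you need to fool Yelp's servers into thinking you are browsing the site by sending them headers; otherwise you get different data that does not look like reviews.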

In Chrome, the request headers are listed in the Network tab under the request's Headers section.

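My usual approach is to copy-paste the headers that are not prefixed with a colon (ignoring :authority etc.) directly into a triple-quoted string called raw_headers, then run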

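# split each "name: value" line at the first colon and trim the surrounding whitespace
headers = {k.strip(): v.strip() for k, _, v in (h.partition(':') for h in raw_headers.strip().split('\n'))}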

on them, and pass the result as an argument to requests:

requests.get(url, headers=headers)
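For illustration, here is a slightly more defensive version of that header-parsing step. The header values in raw_headers are placeholders rather than a copy of a real browser session, and parse_headers is a hypothetical helper name.

```python
# Placeholder headers; paste the ones from your own browser session instead.
raw_headers = """
accept: application/json
accept-language: en-US,en;q=0.9
referer: https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city
user-agent: Mozilla/5.0 (X11; Linux x86_64)
"""


def parse_headers(raw):
    """Turn copied 'name: value' lines into a dict usable by requests."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # skip blank lines and pseudo-headers such as :authority
        if not line or line.startswith(":"):
            continue
        # split at the first colon only, so URLs in values stay intact
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


headers = parse_headers(raw_headers)
print(headers["referer"])
```

The resulting dict can be passed as the headers argument exactly as in the call above.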

Some of the headers are not required, cookies may expire, and all sorts of other problems may come up, but this at least gives you a fighting chance.
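Returning to the review_feed request shown earlier in this answer, the first 80 reviews could be covered by stepping the start parameter in increments of 20. The sketch below only builds the URLs and pulls the text out of a response body. It assumes that start=0 returns the first page and that every review has the "comment" / "text" shape seen in the snippet above. feed_urls, review_texts, and SAMPLE_PAYLOAD are hypothetical names introduced for this example.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the request observed in the Network tab above.
FEED_URL = "https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed"


def feed_urls(total=80, per_page=20):
    """Build one review_feed URL per page of reviews."""
    urls = []
    for start in range(0, total, per_page):  # 0, 20, 40, 60
        query = urlencode(
            {"rl": "en", "sort_by": "relevance_desc", "q": "", "start": start}
        )
        urls.append(f"{FEED_URL}?{query}")
    return urls


def review_texts(payload):
    """Extract the review text from one JSON response body."""
    data = json.loads(payload)
    return [review["comment"]["text"] for review in data["reviews"]]


# Made-up response body with the same shape as the snippet above.
SAMPLE_PAYLOAD = '{"reviews": [{"comment": {"text": "Such a great experience"}}]}'

for url in feed_urls():
    print(url)
print(review_texts(SAMPLE_PAYLOAD))  # ['Such a great experience']
```

In a real run, each URL would be fetched with requests.get(url, headers=headers), ideally with a short pause between requests, and the response text passed to review_texts.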


Answer 1
0 2019-11-14 04:01:29

Answer 2
0 2019-11-14 04:21:34
