Comments are visible on the web page, but the HTML object returned by BeautifulSoup does not contain the comment section

Question

I tried to extract the text of the comments from a web page, given its URL, using BeautifulSoup for scraping. The comments are visible on the page when I open the URL in a browser, but the HTML object returned by BeautifulSoup does not contain the corresponding tags or text.

I used BeautifulSoup with 'html.parser' to do the scraping. I successfully extracted the number of likes, views, and comments for the video on the page, but the content of the comment section was not included in the returned HTML. The browser I used was Chrome, and the system is Ubuntu 18.04.1 LTS.

This is the code I used (in Python):

from urllib.error import HTTPError
from urllib.request import urlopen
from bs4 import BeautifulSoup

webpage_link = "https://www.airvuz.com/video/Majestic-Beast-Nanuk?id=59b2a56141ab4823e61ea901"

try:
    page = urlopen(webpage_link)
except HTTPError as err:  # e.g. the webpage cannot be found
    print("ERROR! %s (%s)" % (webpage_link, err))
    raise  # stop here: `page` is undefined if the request failed

# Parse the static HTML returned by the server
soup = BeautifulSoup(page, 'html.parser')

I expected the soup object to contain all the content visible on the web page, especially the text of the comments (such as "Not being there I enjoyed a lot seeing the life style of white bear. Thanks to the provider for such documentary." and "WOOOW... amazing..."). However, I could not find the corresponding nodes in the soup object. Any help would be appreciated!

Answer 1

The comments are generated by JavaScript via an Ajax request, so they are not part of the HTML that urlopen receives. You can send the same request yourself and read the comments from the JSON response. You can find the request in the Network tab of the browser's developer tools.
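A quick way to confirm that content is injected by a script, rather than shipped in the page source, is to search the raw response for a string you can see in the browser. The sketch below is only an illustration: `text_in_raw_html` is a hypothetical helper name, and `sample_html` is a made-up stand-in for the real page so that the example runs without a network request.

```python
def text_in_raw_html(raw_html, needle):
    """Return True if `needle` occurs in the HTML as received from the server."""
    if isinstance(raw_html, bytes):
        # urlopen(...).read() returns bytes; decode before searching
        raw_html = raw_html.decode("utf-8", errors="replace")
    return needle in raw_html


# Hypothetical stand-in for a page whose comments are filled in later by JavaScript
sample_html = """
<html><body>
  <h1>Majestic Beast Nanuk</h1>
  <div id="comments"></div>
  <script>/* comments are loaded here by a later request */</script>
</body></html>
"""

title_found = text_in_raw_html(sample_html, "Majestic Beast Nanuk")
comment_found = text_in_raw_html(sample_html, "WOOOW... amazing...")
print(title_found, comment_found)
```

If a string is visible in the browser but missing from the raw response, no HTML parser will find it, and the data has to come from whichever request the page makes afterwards.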

import json
from urllib.request import urlopen

# Comments API endpoint, found via the browser's network tab
webpage_link = "https://www.airvuz.com/api/comments/video/59b2a56141ab4823e61ea901?page=1&limit=20"
page = urlopen(webpage_link).read()
comments_json = json.loads(page)  # parse the JSON response body
# Each entry in 'data' holds one comment
for comment_info in comments_json['data']:
    print(comment_info['comment'].strip())

Output

Not being there I enjoyed a lot seeing the life style of white bear. Thanks to the provider for  such documentary.
WOOOW... amazing...
I've been photographing polar bears for years, but to see this footage from a drones perspective was epic! Well done and congratz on the Nominee! Well deserved.
You are da man Florian!
Absolutely outstanding!
This is incredible
jaw dropping
This is wow amazing, love it.
So cool! Did the bears react to the drone at all?
Congratulations! It's awesome! I am watching in tears....
Awesome!
perfect video awesome
It is very, very beautiful !!! Sincere congratulations
Made my day, exquisite, thank you
Wow
Super!
Marvelous!
Man this is incredible!
Material is good, but  edi is bad. This history about  beer's family...
Muy bueno!
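To reuse this for other videos, the API URL can be derived from the video page URL instead of being hard-coded. This is a rough sketch under two assumptions that the answer above does not confirm: that the `id` query parameter of the page URL is always the id the comments endpoint expects (it matches for this video), and that other videos return the same `data` / `comment` JSON shape. The names `build_comments_url` and `extract_comments` are hypothetical, and `sample_payload` is made-up data so that the example runs offline.

```python
import json
from urllib.parse import urlparse, parse_qs, urlencode

# URL pattern taken from the request shown in the answer above
API_TEMPLATE = "https://www.airvuz.com/api/comments/video/{video_id}?{query}"


def build_comments_url(video_page_url, page=1, limit=20):
    """Build the comments API URL from a video page URL (assumes an `id` query parameter)."""
    query = parse_qs(urlparse(video_page_url).query)
    video_id = query["id"][0]
    return API_TEMPLATE.format(
        video_id=video_id,
        query=urlencode({"page": page, "limit": limit}),
    )


def extract_comments(payload):
    """Return the stripped comment texts from a parsed JSON response."""
    return [item["comment"].strip() for item in payload.get("data", [])]


video_page = "https://www.airvuz.com/video/Majestic-Beast-Nanuk?id=59b2a56141ab4823e61ea901"
api_url = build_comments_url(video_page)
print(api_url)

# Made-up payload in the shape the answer's code relies on
sample_payload = json.loads('{"data": [{"comment": " WOOOW... amazing... "}, {"comment": "Awesome!"}]}')
print(extract_comments(sample_payload))
```

For a live request, pass `api_url` to `urlopen` and feed the result of `json.loads` to `extract_comments`. Whether raising `page` walks through further comments is an assumption worth checking in the network tab.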

Answer 1: accepted, 2019-03-25 01:26:12
