Scraping HTML by element in Python using BeautifulSoup

Question

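I'm trying to sum the values I scraped from an HTML page, but the total looks wrong. (It is clearly lower than the actual value.)

I looked at other people's code and noticed that they use re.findall() to find the numbers in the HTML.

My question is: why can't I just take the contents of each element directly from the HTML? My code is the first snippet below; the second snippet is the part of someone else's code that differs from mine.

Thanks in advance for your answers!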

# load in the required packages for reading HTML

from urllib.request import urlopen
from bs4 import BeautifulSoup #parser for HTML
import ssl
import re
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

#open the url
url = 'http://py4e-data.dr-chuck.net/comments_874984.html'
html = urlopen(url, context = ctx).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve the information from the URL
spans = soup('span')
sum = 0
for span in spans:
    x = span.contents[0]
    for n in x:
        sum = sum + int(n)
print(sum)

# The other person's version
sum=0
# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
    # Look at the parts of a tag
    y=str(tag)
    x= re.findall("[0-9]+",y)
    for i in x:
        i=int(i)
        sum=sum+i
print(sum)

Answer 1

If I understand you correctly, this should get you there:

counter = 0
for comment in soup.select('span.comments'):
    counter+=int(comment.text)
print(counter)

Or even shorter:

comments = [int(comment.text) for comment in soup.select('span.comments')]
print(sum(comments))

Both versions produce the same output.
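
As for why the loop in the question comes up short, here is a minimal sketch using made-up values (the span_texts list and the tag_markup string are hypothetical stand-ins, not data from the page). span.contents[0] is a string, so the inner "for n in x" visits it one character at a time and int(n) adds single digits rather than whole numbers. re.findall("[0-9]+", ...) avoids this because it returns each run of digits as one string.

```python
import re

# Hypothetical stand-ins for what span.contents[0] yields on each pass;
# the real page's values are not reproduced here.
span_texts = ["97", "5", "100"]

# Logic from the question: the inner loop walks each string character by
# character, so "97" contributes 9 + 7 instead of 97.
digit_total = 0
for text in span_texts:
    for ch in text:
        digit_total += int(ch)

# Converting the whole string once gives the intended sum.
number_total = 0
for text in span_texts:
    number_total += int(text)

# The regex version works because "[0-9]+" matches a whole run of digits.
tag_markup = '<span class="comments">97</span>'  # assumed markup shape
matches = re.findall("[0-9]+", tag_markup)

print(digit_total)   # 22
print(number_total)  # 202
print(matches)       # ['97']
```

Note that the accumulators here are not called sum. If sum = 0 from the question has already run in the same session, it shadows the built-in function, and the sum(comments) call in the shorter version would raise a TypeError.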

Answer 1: accepted 2020-10-09 22:01:51
