Scraping HTML by elements in Python with BeautifulSoup

I tried to sum the values that I scraped from the HTML, but the total looks wrong: it is obviously lower than the actual value.

I looked over other people's code and noticed that they use re.findall() to find the numbers in the HTML.

My question is: why can't I read the content of each element directly from the HTML? My code comes first below, followed by the part of other people's code that differs from mine.

Thank you in advance for your answer!

# load in the required packages for reading HTML

from urllib.request import urlopen
from bs4 import BeautifulSoup #parser for HTML
import ssl
import re
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

#open the url
url = 'http://py4e-data.dr-chuck.net/comments_874984.html'
html = urlopen(url, context = ctx).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve the information from the URL
spans = soup('span')
sum = 0
for span in spans:
    x = span.contents[0]
    for n in x:
        sum = sum + int(n)
print(sum)

# Other people's version: find the numbers with re.findall()
sum = 0
# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
    # Convert the tag to a string and pull out every run of digits
    y = str(tag)
    x = re.findall("[0-9]+", y)
    for i in x:
        sum = sum + int(i)
print(sum)

If I understand you correctly, this should get you there:

counter = 0
for comment in soup.select('span.comments'):
    counter+=int(comment.text)
print(counter)

Or even shorter:

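# Note: sum() must be the built-in here; it fails if sum was reassigned to an integer earlier, as in the question's code
comments = [int(comment.text) for comment in soup.select('span.comments')]
print(sum(comments))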

Output, in both cases:

2266
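
As for why the first loop in the question comes out lower: span.contents[0] is a string (a NavigableString, which subclasses str), so the inner `for n in x` loop visits one character at a time and adds up individual digits instead of whole numbers. re.findall("[0-9]+", ...) avoids this because it returns each run of digits as a single string. Here is a minimal sketch of the difference. The list `texts` and its values are made up for illustration and are not taken from the actual page:

```python
# Hypothetical span texts standing in for span.contents[0] values;
# these numbers are illustrative, not the real page's data.
texts = ["97", "90", "5"]

# The question's approach: the inner loop walks each string one character
# at a time, so "97" contributes 9 + 7 = 16 instead of 97.
digit_total = 0
for text in texts:
    for ch in text:
        digit_total += int(ch)

# Converting the whole string at once gives the intended value.
value_total = 0
for text in texts:
    value_total += int(text)

print(digit_total)  # 30
print(value_total)  # 192
```

So dropping the inner loop in the question's code, i.e. `sum = sum + int(span.contents[0])`, should give the same total as the re.findall() version, assuming each span contains nothing but a number.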
