Extracting numbers from a website using BeautifulSoup in Python

Question

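I am trying to use urllib to fetch an HTML page and then use BeautifulSoup to extract the data. I want to get all the numbers from comments_42.html, print their sum, and then show how many numbers there are. Here is my code; I tried using a regular expression, but it isn't working for me.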

import urllib
from bs4 import BeautifulSoup
url = 'http://python-data.dr-chuck.net/comments_42.html'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html,"html.parser")
tags = soup('span')
for tag in tags:
    print tag

Answer 1

Use BeautifulSoup's findAll() method to extract all the span tags with the class "comments", since they contain the information you need. You can then do whatever you need with them.

soup = BeautifulSoup(html,"html.parser")
data = soup.findAll("span", { "class":"comments" })
numbers = [d.text for d in data]

Here is the output:

[u'100', u'97', u'87', u'86', u'86', u'78', u'75', u'74', u'72', u'72',   u'72', u'70', u'70', u'66', u'66', u'65', u'65', u'63', u'61', u'60', u'60', u'59', u'59', u'57', u'56', u'54', u'52', u'52', u'51', u'47', u'47', u'41', u'41', u'41', u'38', u'35', u'32', u'31', u'24', u'19', u'19', u'18', u'17', u'16', u'13', u'8', u'7', u'1', u'1', u'1']
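
The list above holds strings, so each entry has to be converted before the sum and count the question asks for can be computed. A minimal sketch of that last step, assuming each number sits in a `<span class="comments">` as in the answer above. The inline `sample_html` is a made-up stand-in for the downloaded page so the snippet runs without a network call:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
sample_html = """
<table>
<tr><td>Leven</td><td><span class="comments">100</span></td></tr>
<tr><td>Mahdiya</td><td><span class="comments">97</span></td></tr>
<tr><td>Ajayraj</td><td><span class="comments">87</span></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
spans = soup.find_all("span", {"class": "comments"})

# Convert each span's text to an integer before summing.
numbers = [int(span.text) for span in spans]

count = len(numbers)
total = sum(numbers)
print("count:", count)
print("sum:", total)
```

With the real page, `sample_html` would be replaced by the `html` value read from the URL.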

Answer 2

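I'm taking the same Coursera course as you. Instead of the solution above, you might try this one. I think it stays within what we have learned up to this problem. It definitely worked for me.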

import urllib
import re
from bs4 import *

url = 'http://python-data.dr-chuck.net/comments_216543.html'
html = urllib.urlopen(url).read()

soup = BeautifulSoup(html,"html.parser")
sum=0
# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
    # Look at the parts of a tag
    y=str(tag)
    x= re.findall("[0-9]+",y)
    for i in x:
        i=int(i)
        sum=sum+i
print sum
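
One caveat with running the regex over `str(tag)`: that string includes the tag's attributes, so any digits inside an attribute value get summed as well. A small sketch of the difference, using a made-up span whose `id` contains digits (an assumption for illustration; the course page may not have such attributes):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the id attribute contains digits that are not data.
sample_html = '<span class="comments" id="c42">97</span>'
tag = BeautifulSoup(sample_html, "html.parser").span

# Searches the whole tag markup, attributes included.
from_markup = re.findall("[0-9]+", str(tag))
# Searches only the text inside the tag.
from_text = re.findall("[0-9]+", tag.get_text())

print(from_markup)
print(from_text)
```

If the spans carry no attributes with digits, both approaches give the same result.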

Answer 3

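@Learner's solution is perfectly correct! But if you want to do more with the names and comments, you can do the following, which returns a list of names and comments: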

from BeautifulSoup import BeautifulSoup
import re
import urllib
url = 'http://python-data.dr-chuck.net/comments_42.html'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
all  = soup.findAll('span',{'class':'comments'},text=re.compile(r'[0-9]{0,4}')) #use regex to extract only numbers
cleaned = filter(lambda x: x!=u'\n',all)[4:]
In [18]: cleaned
Out[18]: 
[u'Leven',
 u'100',
 u'Mahdiya',
 u'97',
 u'Ajayraj',
 u'87',
 u'Lillian',
 u'86',
 u'Aon',
 u'86',
 u'Ruaraidh',
 u'78',
 u'Gursees',
 u'75',
 u'Emmanuel',
 u'74',
 u'Christy',
 u'72',
 u'Annoushka',
 u'72',
 u'Inara',
 u'72',
 u'Caite',
 u'70',
 u'Rosangel',
 u'70',
 u'Iana',
 u'66',
 u'Anise',
 u'66',
 u'Jaosha',
 u'65',
 u'Cadyn',
 u'65',
 u'Edward',
 u'63',
 u'Charlotte',
 u'61',
 u'Sammy',
 u'60',
 u'Zarran',
 u'60',.....] #
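
Because names and counts alternate in `cleaned`, they can be paired up for further processing. A sketch assuming the list always alternates name, count as in the output above; `cleaned` here is a shortened copy of that output, and `pairs` is a name introduced for illustration:

```python
# Shortened copy of the output above (first three name/count pairs).
cleaned = [u'Leven', u'100', u'Mahdiya', u'97', u'Ajayraj', u'87']

# Even positions hold names, odd positions hold counts.
pairs = [(name, int(count)) for name, count in zip(cleaned[::2], cleaned[1::2])]

# Sum only the numeric half of each pair.
total = sum(count for _, count in pairs)
print(pairs)
print(total)
```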

Answer 4

Don't forget that you have to import the regular expression module before you can use it in your code.

import re

Answer 5

Doing it the basic way...

# Retrieve all of the span tags
tags = soup('span')
sum = 0
count = 0
for tag in tags:
    # Look at the parts of a tag

    #print tag.contents[0]
    num = float(tag.contents[0])
    #print num
    sum = sum + num
    count = count + 1

print 'count:',count   
print 'sum:',sum

Answer 6

I did this on Coursera, and it gave me all the correct answers. Hope this helps ;)

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html,"html.parser")

# Retrieve all of the span tags
tags = soup('span')
sum = 0
count = 0
for tag in tags:
    # Look at the parts of a tag

    #print tag.contents[0]
    num = float(tag.contents[0])
    #print num
    sum = sum + num
    count = count + 1

print ('count:', count)  
print ('sum:', sum)

Answer 7

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
import re
lst = list()
sum = 0

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

tags = soup('span')
for tag in tags:
    strtag = str(tag)
    lst = re.findall('[0-9]+', strtag)
    sum = sum + int(lst[0])
print(sum)

Answer 8

I just combined the first two solutions into one general solution.

Take a look:

import urllib.request
import re

from bs4 import BeautifulSoup

url = input('Enter: ')
tag = input("input the html tag to search: ")
parameter = input("Enter the html  parameter of the tag for better selection (optional): ")
p_value = input("Enter the parameter value (optional): ")
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
if not parameter == "" and not p_value == "":
    numbers = soup(tag, {parameter: p_value})
else:
    numbers = soup(tag)
sumation = 0
for number in numbers:
    n = str(number)
    x = re.findall('([0-9]+)', n)
    for item in x:
        sumation += int(item)
print(sumation)

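It takes the HTML address, the HTML tag to search for, and two optional inputs to narrow the search further.

tag: the HTML tag to search for
parameter: an HTML attribute of that tag, such as class or id
p_value: the class name or id value to match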
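
The same idea can be wrapped in a function so it can be reused without the `input()` prompts. This is a sketch, not the answer's own code: `sum_numbers_in_tags` and the sample HTML are hypothetical, and it searches `get_text()` instead of `str(number)` so that digits inside attribute values are not counted.

```python
import re
from bs4 import BeautifulSoup

def sum_numbers_in_tags(html, tag, parameter="", p_value=""):
    """Sum every run of digits found in the text of the matching tags."""
    soup = BeautifulSoup(html, "html.parser")
    # Filter by attribute only when both optional inputs are given.
    if parameter and p_value:
        matches = soup.find_all(tag, {parameter: p_value})
    else:
        matches = soup.find_all(tag)
    total = 0
    for match in matches:
        for item in re.findall(r"[0-9]+", match.get_text()):
            total += int(item)
    return total

# Hypothetical sample markup for illustration.
sample_html = (
    '<span class="comments">10</span>'
    '<span class="other">5</span>'
    '<span class="comments">7</span>'
)
all_spans_total = sum_numbers_in_tags(sample_html, "span")
comments_total = sum_numbers_in_tags(sample_html, "span", "class", "comments")
print(all_spans_total)
print(comments_total)
```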

Answer 9

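import urllib.request, urllib.parse, urllib.error
import re

from bs4 import BeautifulSoup

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the span tags
tags = soup('span')
sum = 0
for tag in tags:
    # re.findall needs a string, so search the tag's text, not the Tag object
    x = re.findall("[0-9]+", tag.text)
    for i in x:
        z = int(i)
        # Add the converted integer, not the matched string
        sum = sum + z

print(sum)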

9 answers

Answer 1: score 8, accepted, 2015-12-13 09:14:26
Answer 2: score 2, 2016-01-14 11:52:50
Answer 3: score 0, 2015-12-13 09:35:03
Answer 4: score 0, 2015-12-21 02:35:48
Answer 5: score 0, 2016-01-20 05:36:55
Answer 6: score 0, 2017-07-27 23:52:43
Answer 7: score 0, 2020-05-08 06:39:30
Answer 8: score 0, 2020-09-24 14:37:41
Answer 9: score -1, 2017-09-20 23:18:42
