Scraping numbers from HTML using Python and BeautifulSoup

Question

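This is my assignment:

In this assignment you will write a Python program similar to http://www.py4e.com/code3/urllink2.py. The program will use urllib to read the HTML from the data files below, parse the data, extract the numbers, and compute the sum of the numbers in the file.

We provide two files for this assignment. One is a sample file, for which we give you the sum for testing, and the other is the actual data you need to process for the assignment.

Sample data: http://py4e-data.dr-chuck.net/comments_42.html (Sum=2553)

Actual data: http://py4e-data.dr-chuck.net/comments_228869.html (Sum ends with 10)

You do not need to save these files to your folder, since your program will read the data directly from the URL. Note: each student has a distinct data URL for the assignment, so only use your own data URL for analysis.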
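The steps the assignment describes (read the HTML, parse it, pull out the numbers, add them up) can be sketched offline as follows. This is a minimal sketch, not the course's reference solution: the helper name `sum_span_numbers` and the inline `SAMPLE_HTML` are made up for illustration, and the assumption that each number sits in its own `<span>` tag comes from the code posted in this thread rather than from the assignment text.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real file is assumed
# to hold one number per <span> tag.
SAMPLE_HTML = """
<table>
<tr><td>Name</td><td>Comments</td></tr>
<tr><td>Ann</td><td><span class="comments">97</span></td></tr>
<tr><td>Bob</td><td><span class="comments">3</span></td></tr>
</table>
"""


def sum_span_numbers(html):
    """Parse html and return the sum of the integers inside <span> tags."""
    soup = BeautifulSoup(html, "html.parser")
    return sum(int(tag.text) for tag in soup("span"))


print(sum_span_numbers(SAMPLE_HTML))  # 100
```

With network access, the same helper could be given the bytes returned by `urllib.request.urlopen(url).read()` instead of the inline string.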

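I want to fix my code, since this is what I have learned so far. I get an error saying the name url is not defined, and if I change the imports I run into socket problems instead.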

import urllib
import re
from bs4 import BeautifulSoup


url = input('Enter - ')
html = urlib.request(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")


sum=0
# Retrieve all of the anchor tags
tags = soup('span')
for tag in tags:
    # Look at the parts of a tag
    y=str(tag)
    x= re.findall("[0-9]+",y)
    for i in x:
        i=int(i)
        sum=sum+i
print(sum)

Answer 1

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = input('Enter - ')
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the span tags
tags = soup('span')
numlist = list()
for tag in tags:
    # Look at the parts of a tag
    y = str(tag)
    num = re.findall('[0-9]+',y)
    numlist = numlist + num

sum = 0
for i in numlist:
    sum = sum + int(i)

print(sum)
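One caveat with running the regex over `str(tag)`, as the code above does: `str(tag)` is the whole tag including its attributes, so digits inside an attribute value are matched too. The sketch below illustrates this with a made-up `id="c2"` attribute (an assumption for the demo; the assignment pages may not contain such attributes) and compares it with reading only the tag's text.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical tag: the id attribute is invented to show the pitfall.
html = '<span class="comments" id="c2">97</span>'
tag = BeautifulSoup(html, "html.parser").span

# The regex sees the full markup, so it also picks up the "2" in id="c2".
print(re.findall("[0-9]+", str(tag)))  # ['2', '97']

# tag.text contains only the text between the opening and closing tag.
print(int(tag.text))  # 97
```

When the tags carry no digits in their attributes, both approaches give the same sum.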

Answer 2

Typo: you have urlib, it should be urllib. context=ctx is not necessary:

import re
import urllib.request
from bs4 import BeautifulSoup

# url = 'http://py4e-data.dr-chuck.net/comments_42.html'
url = 'http://py4e-data.dr-chuck.net/comments_228869.html'

soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html.parser')
s = sum(int(td.text) for td in soup.select('td:last-child')[1:])

print(s)

Prints:

EDIT: Running your script:

import urllib.request
import re
from bs4 import BeautifulSoup


html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_228869.html').read()
soup = BeautifulSoup(html, "html.parser")

sum=0
# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
    # Look at the parts of a tag
    y=str(tag)
    x= re.findall("[0-9]+",y)
    for i in x:
        i=int(i)
        sum=sum+i
print(sum)

Prints:
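The `td:last-child` selector in the first snippet of this answer may deserve a short illustration. The sketch below assumes a table whose first row is a header made of `<td>` cells, which would explain the `[1:]` slice; the inline HTML is invented for the demo and only approximates the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical page layout: a header row followed by one row per comment.
html = """
<table>
<tr><td>Name</td><td>Comments</td></tr>
<tr><td>Ann</td><td><span class="comments">97</span></td></tr>
<tr><td>Bob</td><td><span class="comments">3</span></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Selects the last <td> of every row, the header row included.
cells = soup.select("td:last-child")
print([td.text for td in cells])  # ['Comments', '97', '3']

# Skip the header cell, then convert the remaining cell texts to integers.
total = sum(int(td.text) for td in cells[1:])
print(total)  # 100
```

Without the `[1:]` slice, `int('Comments')` would raise a `ValueError` on this layout.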

Answer 3

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

sum=0
# Retrieve all of the span tags
tags = soup('span')
for tag in tags:
    # Look at the parts of a tag
    y=str(tag)
    x= re.findall("[0-9]+",y)
    for i in x:
        i=int(i)
        sum=sum+i
print(sum)
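For readers wondering what the three `ctx` lines in the code above change, here is a small sketch that compares a default SSL context with one configured the same way. It only inspects the settings and makes no network request; the variable name `default_ctx` is introduced for the comparison.

```python
import ssl

# Default context: certificates and host names are verified.
default_ctx = ssl.create_default_context()
print(default_ctx.check_hostname)                  # True
print(default_ctx.verify_mode == ssl.CERT_REQUIRED)  # True

# Same settings as in the code above: verification switched off.
# check_hostname has to be cleared first; setting CERT_NONE while it is
# still enabled raises ValueError.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
print(ctx.check_hostname)                 # False
print(ctx.verify_mode == ssl.CERT_NONE)   # True
```

The context only matters for https URLs. The data URLs in the question use plain http, which is consistent with the remark elsewhere in this thread that `context=ctx` is not required.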

Answer 4

import urllib
import re
from bs4 import BeautifulSoup


urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_228869.html').read()
soup = BeautifulSoup(html, "html.parser")



sum=0
# Retrieve all of the anchor tags
tags = soup('span')
for tag in tags:
    # Look at the parts of a tag
    y=str(tag)
    x= re.findall("[0-9]+",y)
    for i in x:
        i=int(i)
        sum=sum+i
print(sum)

And I get the error 'urllib' has no attribute 'request'... I'm a dummy at this right now.
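The error above is expected behaviour: `import urllib` on its own does not load the `request` submodule unless something else has already imported it, so `urllib.request` may not exist as an attribute. Importing the submodule explicitly avoids this. A minimal sketch follows; `read_url` is a hypothetical helper, and the `data:` URL is used only to keep the demo offline.

```python
import urllib.request  # loads the submodule, so urllib.request is defined


def read_url(url):
    """Return the raw bytes behind url (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


# A data: URL keeps the demo offline; a real run would pass the http URL.
demo = read_url("data:text/plain,12%2034")
print(demo)  # b'12 34'
```

Note that the snippet above the error message also never assigns the result of `urlopen(...).read()` to `html`, so the `BeautifulSoup(html, ...)` line would fail next even after the import is fixed.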

Answer 1: score 2, 2020-07-01 11:58:36
Answer 2: score 1, accepted, 2019-07-23 14:00:06
Answer 3: score 1, 2020-06-04 13:27:12
Answer 4: score 0, 2019-07-23 14:53:23
