
Scraping numbers from HTML using Python and BeautifulSoup

Here is my homework:

In this assignment you will write a Python program similar to http://www.py4e.com/code3/urllink2.py. The program will use urllib to read the HTML from the data files below, parse the data, extract the numbers, and compute their sum.

We provide two files for this assignment. One is a sample file where we give you the sum for your testing, and the other is the actual data you need to process for the assignment.

Sample data: http://py4e-data.dr-chuck.net/comments_42.html (Sum=2553)

Actual data: http://py4e-data.dr-chuck.net/comments_228869.html (Sum ends with 10)

You do not need to save these files to your folder, since your program will read the data directly from the URL. Note: each student has a distinct data URL for the assignment, so only use your own data URL for analysis.
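The comments pages wrap each count in a span, so the extract-and-sum step the assignment asks for can be sketched on a toy snippet with nothing but the stdlib re module. The names and counts below are made up, and the span markup is assumed to match the sample page:

```python
import re

# Toy stand-in for comments_42.html: each count sits inside a <span>.
html = ('<tr><td>Romina</td><td><span class="comments">97</span></td></tr>'
        '<tr><td>Laurie</td><td><span class="comments">45</span></td></tr>')

# Pull every digit run found inside a comments span, then sum them.
counts = [int(n) for n in re.findall(r'<span class="comments">([0-9]+)</span>', html)]
print(sum(counts))  # 142
```

The same regex-on-string idea is what the question's code does with `re.findall("[0-9]+", str(tag))` after BeautifulSoup has isolated the span tags.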

I want to fix my code, since it reflects what I have learned so far. I am getting a NameError:

name 'urlib' is not defined

If I play with the imports, then I run into a problem with sockets instead.

import urllib
import re
from bs4 import BeautifulSoup


url = input('Enter - ')
html = urlib.request(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")


sum=0
# Retrieve all of the anchor tags
tags = soup('span')
for tag in tags:
    # Look at the parts of a tag
    y=str(tag)
    x= re.findall("[0-9]+",y)
    for i in x:
        i=int(i)
        sum=sum+i
print(sum)
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = input('Enter - ')
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags
tags = soup('span')
numlist = list()
for tag in tags:
    # Look at the parts of a tag
    y = str(tag)
    num = re.findall('[0-9]+',y)
    numlist = numlist + num

sum = 0
for i in numlist:
    sum = sum + int(i)

print(sum)

Typo: you have urlib ; it should be urllib . The context=ctx isn't necessary:

import re
import urllib
from bs4 import BeautifulSoup

# url = 'http://py4e-data.dr-chuck.net/comments_42.html'
url = 'http://py4e-data.dr-chuck.net/comments_228869.html'

soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html.parser')
s = sum(int(td.text) for td in soup.select('td:last-child')[1:])

print(s)

Prints:

2410

EDIT: Running your script:

import urllib.request
import re
from bs4 import BeautifulSoup


html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_228869.html').read()
soup = BeautifulSoup(html, "html.parser")

sum=0
# Retrieve all of the anchor tags
tags = soup('span')
for tag in tags:
    # Look at the parts of a tag
    y=str(tag)
    x= re.findall("[0-9]+",y)
    for i in x:
        i=int(i)
        sum=sum+i
print(sum)

Prints:

2410
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

sum=0
# Retrieve all of the anchor tags
tags = soup('span')
for tag in tags:
    # Look at the parts of a tag
    y=str(tag)
    x= re.findall("[0-9]+",y)
    for i in x:
        i=int(i)
        sum=sum+i
print(sum)
import urllib
import re
from bs4 import BeautifulSoup


urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_228869.html').read()
soup = BeautifulSoup(html, "html.parser")



sum=0
# Retrieve all of the anchor tags
tags = soup('span')
for tag in tags:
    # Look at the parts of a tag
    y=str(tag)
    x= re.findall("[0-9]+",y)
    for i in x:
        i=int(i)
        sum=sum+i
print(sum)

and there is an error: 'urllib' has no attribute 'request' ... I am being a dummy now..
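That AttributeError comes from `import urllib` alone: in Python 3, urllib is a bare package, and its submodules (request, parse, error) are only bound once they are imported explicitly. A quick demonstration, run in fresh subprocess interpreters so that imports elsewhere in the session cannot mask the effect:

```python
import subprocess
import sys

# "import urllib" does not load urllib.request, so touching it fails.
bad = subprocess.run(
    [sys.executable, "-c", "import urllib; urllib.request"],
    capture_output=True, text=True)
print("AttributeError" in bad.stderr)  # True

# "import urllib.request" binds the submodule, so the same access works.
good = subprocess.run(
    [sys.executable, "-c", "import urllib.request; urllib.request.urlopen"],
    capture_output=True, text=True)
print(good.returncode)  # 0
```

So the last script above only needs its first line changed to `import urllib.request` (as in the EDIT earlier), plus assigning the `.read()` result to `html` before passing it to BeautifulSoup.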

Notice: the technical posts on this site follow the CC BY-SA 4.0 license; if you repost, please credit this site or the original source. For any questions, contact yoyou2525@163.com.
