Get the integers from a list, created with BeautifulSoup in Python

I'm a beginner in Python and I need some help with this code:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

req = Request("https://adrianchifu.com/teachings/AMSE/MAG1/project/Xlrda/dsuR/2/J9ED27Y.html")
a = urlopen(req).read()
soup=BeautifulSoup(a,'html.parser')
nombres=[]
tout = (soup.find_all('td'))
str_tout=str(tout)     
tout = [float(s) for s in re.findall(r'\d+\.\d+', str_tout)]
nombres.append(tout)
print(nombres)

From a website, I need to get all the numeric values it contains (the code above is just one part of the whole program). I have succeeded in extracting the floats, but I can't get the integers. I have tried many things but couldn't figure out how to do it. Thanks for your help.

EDIT: For this link ( https://adrianchifu.com/teachings/AMSE/MAG1/project/Xlrda/dsuR/2/9GYIGO.html ), the method given just below doesn't work because the list contains integers and floats but also character strings. Some of those strings start with a digit, which complicates things. How can I catch the integers but not the strings that start with a number?

Integers don't have the form \d+\.\d+ , so let's make the decimal point and the digits after it optional with ^\d+(?:\.\d+)?$ (note the non-capturing group; it is important, because with a capturing group functions such as re.findall return only the group's contents instead of the whole number).
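
The pattern can be tried on a few strings before pointing it at the page. This is only a small sketch: the sample cell texts are made up for illustration and are not taken from the site.

```python
import re

# Hypothetical cell texts (assumption: not copied from the real page).
cells = ["89.169", "6190", "OIK3XF02PS", "2ABC"]

# Whole-string match: digits, optionally followed by a dot and more digits.
pattern = re.compile(r'^\d+(?:\.\d+)?$')
matches = [c for c in cells if pattern.match(c)]
print(matches)  # ['89.169', '6190']

# With findall, a capturing group changes what is returned.
capturing = re.findall(r'\d+(\.\d+)?', "89.169 6190")
non_capturing = re.findall(r'\d+(?:\.\d+)?', "89.169 6190")
print(capturing)      # ['.169', '']
print(non_capturing)  # ['89.169', '6190']
```

Note that "2ABC" is rejected because the anchors force the whole string to be numeric, which is what keeps out strings that merely start with a digit.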

Then, I'd try to match each td.text by itself:

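from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

req = Request("https://adrianchifu.com/teachings/AMSE/MAG1/project/Xlrda/dsuR/2/J9ED27Y.html")
a = urlopen(req).read()
soup = BeautifulSoup(a,'html.parser')
nombres = []
tds = soup.find_all('td')
for td in tds:
    # Keep only cells whose whole text is an integer or a decimal number
    if re.match(r'^\d+(?:\.\d+)?$', td.text):
        nombres.append(float(td.text))
print(nombres)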

This outputs:

[89.169, 54.893, 19.212, 87.045, 2.248, 99.947, 6190.0, 83.096]

As a last improvement, I'd use a list comprehension with a compiled regex to improve the performance a bit:

req = Request("https://adrianchifu.com/teachings/AMSE/MAG1/project/Xlrda/dsuR/2/J9ED27Y.html")
a = urlopen(req).read()
soup = BeautifulSoup(a,'html.parser')
tds = soup.find_all('td')
numbers_regex = re.compile(r'^\d+(?:\.\d+)?$')
nombres = [float(td.text) for td in tds if numbers_regex.match(td.text)]
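
If the integers should stay integers (6190 rather than 6190.0), one possible variation is to test for an integer first and a float second. This is a sketch under assumptions: the helper name `parse_number` is hypothetical, `re.fullmatch` is swapped in for `re.match` with `^...$`, and the cell texts are reconstructed from the outputs shown in the answers here rather than fetched from the page.

```python
import re

INT_RE = re.compile(r'\d+')
FLOAT_RE = re.compile(r'\d+\.\d+')

def parse_number(text):
    """Return an int or float for a purely numeric string, else None.

    Hypothetical helper, not part of BeautifulSoup or the original answer.
    """
    text = text.strip()
    if INT_RE.fullmatch(text):
        return int(text)
    if FLOAT_RE.fullmatch(text):
        return float(text)
    return None

# Stand-ins for the td.text values (assumption: reconstructed from the
# outputs quoted in the answers).
cell_texts = ["89.169", "54.893", "19.212", "OIK3XF02PS", "87.045",
              "2.248", "99.947", "6190", "83.096", "E2RYAFAE"]

nombres = [n for n in map(parse_number, cell_texts) if n is not None]
integers = [n for n in nombres if isinstance(n, int)]
print(nombres)   # [89.169, 54.893, 19.212, 87.045, 2.248, 99.947, 6190, 83.096]
print(integers)  # [6190]
```

With real data, `cell_texts` would be replaced by `[td.text for td in tds]`. A string that only starts with a digit, such as "12AB", returns None because `fullmatch` requires the whole string to match.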

You can keep your own approach and finish the job by using split.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

req = Request("https://adrianchifu.com/teachings/AMSE/MAG1/project/Xlrda/dsuR/2/J9ED27Y.html")
a = urlopen(req).read()
soup = BeautifulSoup(a,'html.parser')
tout = [ele.text for ele in soup.find_all('td')]
# Truncate floats to their integer part; every other cell text is left unchanged
tout = [text if not re.findall(r"^\d+\.\d+",text) else int(text.split(".")[0]) for text in tout]
print(tout)
# [89, 54, 19, 'OIK3XF02PS', 87, 2, 99, '6190', 83, 'E2RYAFAE']

If you are looking for a regex that matches integers:

^[1-9][0-9]{0,2}$

This matches all positive non-zero integers between 1 and 999. You can adjust the upper range of this expression by changing the second number (i.e. 2) in the {0,2} part of the expression.

Courtesy: http://regexlib.com
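
A quick way to check that pattern and the adjustment of its upper range is sketched below. The helper `positive_int_pattern` and its `max_digits` parameter are hypothetical names introduced for illustration, and the sample strings are made up.

```python
import re

def positive_int_pattern(max_digits=3):
    """Build ^[1-9][0-9]{0,n}$ for positive integers of up to max_digits digits.

    Hypothetical helper: n is max_digits - 1, so the default gives {0,2}.
    """
    if max_digits < 1:
        raise ValueError("max_digits must be at least 1")
    return re.compile(r'^[1-9][0-9]{0,%d}$' % (max_digits - 1))

up_to_999 = positive_int_pattern()     # same as ^[1-9][0-9]{0,2}$
up_to_9999 = positive_int_pattern(4)   # same as ^[1-9][0-9]{0,3}$

# Made-up sample strings.
samples = ["1", "999", "6190", "0", "042", "89.169", "2ABC"]
small = [s for s in samples if up_to_999.match(s)]
large = [s for s in samples if up_to_9999.match(s)]
print(small)  # ['1', '999']
print(large)  # ['1', '999', '6190']
```

As the output suggests, the original {0,2} form would miss a four-digit value such as 6190, so the upper bound likely needs raising for this page.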
