
Get the integers from a list, created with BeautifulSoup in Python

I'm a beginner in Python and I need some help with this code:

from urllib.request import *
from bs4 import BeautifulSoup
import re

req = Request("https://adrianchifu.com/teachings/AMSE/MAG1/project/Xlrda/dsuR/2/J9ED27Y.html")
a = urlopen(req).read()
soup=BeautifulSoup(a,'html.parser')
nombres=[]
tout = (soup.find_all('td'))
str_tout=str(tout)     
tout = [float(s) for s in re.findall(r'\d+\.\d+', str_tout)]
nombres.append(tout)
print(nombres)

I need to get all the numeric values contained in a web page (this is just one part of the whole program). I have succeeded in extracting the floats, but I can't get the integers. I have tried many things but couldn't figure out how to do it. Thanks for your help.

EDIT: For this link ( https://adrianchifu.com/teachings/AMSE/MAG1/project/Xlrda/dsuR/2/9GYIGO.html ), the method given just below doesn't work, because the list contains integers and floats but also character strings, and some of those strings start with a digit, which complicates things. How can I catch the integers but not the strings that start with a digit?

Integers don't have the form \d+\.\d+ , so let's make the decimal point and the digits after it optional with ^\d+(?:\.\d+)?$ (note the non-capturing group; it is important, because with a capturing group re.findall would return only the group's contents).
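As a small offline sketch of how this pattern behaves, the snippet below applies it to a few sample strings. The strings are made up for illustration (they only resemble the cell contents on the pages above), and the name NUMBER is hypothetical. The last two lines show why the non-capturing group matters when the same idea is used with re.findall.

```python
import re

# Anchored pattern: digits, optionally followed by a decimal point and more digits.
NUMBER = re.compile(r'^\d+(?:\.\d+)?$')

# Hypothetical cell texts, similar to what td.text might contain.
samples = ['89.169', '6190', 'OIK3XF02PS', '9GYIGO', '12abc']

# Strings that merely start with a digit are rejected by the $ anchor.
matches = [s for s in samples if NUMBER.match(s)]
print(matches)  # ['89.169', '6190']

# With a capturing group, re.findall returns only the group's contents.
capturing = re.findall(r'\d+(\.\d+)?', '89.169 6190')
non_capturing = re.findall(r'\d+(?:\.\d+)?', '89.169 6190')
print(capturing)      # ['.169', '']
print(non_capturing)  # ['89.169', '6190']
```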

Then, I'd try to match each td.text by itself:

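from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

req = Request("https://adrianchifu.com/teachings/AMSE/MAG1/project/Xlrda/dsuR/2/J9ED27Y.html")
a = urlopen(req).read()
soup = BeautifulSoup(a, 'html.parser')
nombres = []
tds = soup.find_all('td')
for td in tds:
    # Keep only the cells whose whole text is an integer or a decimal number
    if re.match(r'^\d+(?:\.\d+)?$', td.text):
        nombres.append(float(td.text))
print(nombres)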

This outputs

[89.169, 54.893, 19.212, 87.045, 2.248, 99.947, 6190.0, 83.096]

As a last improvement, I'd use a list comprehension with a compiled regex, which may improve performance a bit:

req = Request("https://adrianchifu.com/teachings/AMSE/MAG1/project/Xlrda/dsuR/2/J9ED27Y.html")
a = urlopen(req).read()
soup = BeautifulSoup(a,'html.parser')
tds = soup.find_all('td')
numbers_regex = re.compile(r'^\d+(?:\.\d+)?$')
nombres = [float(td.text) for td in tds if numbers_regex.match(td.text)]
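The code above converts every match to float, so 6190 comes back as 6190.0. If the integers should stay integers, one possible variant is sketched below. It works on a plain list of strings so that it runs without network access; the function name parse_cells and the sample list are assumptions for illustration, and in the real script the list could be built with [td.text for td in tds].

```python
import re

INT_RE = re.compile(r'\d+')          # whole string is digits only
FLOAT_RE = re.compile(r'\d+\.\d+')   # digits, a decimal point, digits

def parse_cells(cell_texts):
    """Return the ints and floats found in cell_texts, skipping everything else."""
    numbers = []
    for text in cell_texts:
        text = text.strip()
        # fullmatch requires the entire string to match, so '9GYIGO' is skipped.
        if INT_RE.fullmatch(text):
            numbers.append(int(text))
        elif FLOAT_RE.fullmatch(text):
            numbers.append(float(text))
    return numbers

# Hypothetical cell texts resembling the table contents.
cells = ['89.169', 'OIK3XF02PS', '6190', '9GYIGO', ' 2.248 ']
result = parse_cells(cells)
print(result)  # [89.169, 6190, 2.248]
```

Using re.fullmatch instead of the ^...$ anchors is only a stylistic choice here; both reject strings that start with a digit but continue with letters.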

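You can keep your own approach and finish the job with split. Note that the code below truncates each float to its integer part and leaves every other cell, including '6190', as a string: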

from urllib.request import *
from bs4 import BeautifulSoup
import re

req = Request("https://adrianchifu.com/teachings/AMSE/MAG1/project/Xlrda/dsuR/2/J9ED27Y.html")
a = urlopen(req).read()
soup = BeautifulSoup(a,'html.parser')
nombres = []
tout = [ele.text for ele in soup.find_all('td')]
tout = [text if not re.findall(r"^\d+\.\d+",text) else int(text.split(".")[0]) for text in tout]
print(tout)
# [89, 54, 19, 'OIK3XF02PS', 87, 2, 99, '6190', 83, 'E2RYAFAE']

If you are looking for a regex that matches integers:

^[1-9][0-9]{0,2}$

This matches all positive integers from 1 to 999, with no leading zeros. You can adjust the upper limit by changing the second number (i.e. the 2) in the {0,2} part of the expression.

Courtesy: http://regexlib.com
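A quick sketch of how that expression and a widened variant behave; the sample strings and the variable names are made up for illustration. Raising the upper bound in {0,2} to {0,3} extends the range to 9999, which is what a value such as 6190 would need.

```python
import re

up_to_999 = re.compile(r'^[1-9][0-9]{0,2}$')    # 1 to 999
up_to_9999 = re.compile(r'^[1-9][0-9]{0,3}$')   # 1 to 9999

# Hypothetical inputs, including a string that starts with a digit.
samples = ['7', '999', '6190', '0', '042', '9GYIGO']

small = [s for s in samples if up_to_999.match(s)]
large = [s for s in samples if up_to_9999.match(s)]
print(small)  # ['7', '999']
print(large)  # ['7', '999', '6190']
```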


 