I'm a beginner in Python and I need some help with this code:
from urllib.request import *
from bs4 import BeautifulSoup
import re
req = Request("https://adrianchifu.com/teachings/AMSE/MAG1/project/Xlrda/dsuR/2/J9ED27Y.html")
a = urlopen(req).read()
soup=BeautifulSoup(a,'html.parser')
nombres=[]
tout = (soup.find_all('td'))
str_tout=str(tout)
tout = [float(s) for s in re.findall(r'\d+\.\d+', str_tout)]
nombres.append(tout)
print(nombres)
From a website, I need to get all the numeric values it contains (this is just one part of the whole code). I have succeeded in extracting the floats, but I can't get the integers. I have tried many things but couldn't figure out how to do it. Thanks for your help.
EDIT: For this link ( https://adrianchifu.com/teachings/AMSE/MAG1/project/Xlrda/dsuR/2/9GYIGO.html ), the method given just below isn't working because the list contains integers and floats but also character strings. Some of those strings start with a digit, which complicates things. How can I catch the integers but not the strings that start with a digit?
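Integers don't have the form \d+\.\d+, so let's make the decimal point and the digits after it optional with ^\d+(?:\.\d+)?$ (note the non-capturing group; it is important, because with re.findall a capturing group would make it return only the group's contents instead of the whole match).
Then, I'd try to match each td.text by itself: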
req = Request("https://adrianchifu.com/teachings/AMSE/MAG1/project/Xlrda/dsuR/2/J9ED27Y.html")
a = urlopen(req).read()
soup = BeautifulSoup(a,'html.parser')
nombres = []
tds = soup.find_all('td')
for td in tds:
    if re.match(r'^\d+(?:\.\d+)?$', td.text):
        nombres.append(float(td.text))
print(nombres)
This outputs
[89.169, 54.893, 19.212, 87.045, 2.248, 99.947, 6190.0, 83.096]
As a last improvement, I'd use a list comprehension with a compiled regex to improve the performance a bit:
req = Request("https://adrianchifu.com/teachings/AMSE/MAG1/project/Xlrda/dsuR/2/J9ED27Y.html")
a = urlopen(req).read()
soup = BeautifulSoup(a,'html.parser')
tds = soup.find_all('td')
numbers_regex = re.compile(r'^\d+(?:\.\d+)?$')
nombres = [float(td.text) for td in tds if numbers_regex.match(td.text)]
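To see how the anchored pattern treats cells that merely start with a digit (the case raised in the question's edit), here is a small offline sketch. The `cells` list is made-up sample data standing in for the `td.text` values, not the actual page content.

```python
import re

# Anchored pattern from the answer above: digits, optionally followed by
# a decimal point and more digits, and nothing else.
NUMBER_RE = re.compile(r'^\d+(?:\.\d+)?$')

# Hypothetical cell texts standing in for td.text values.
cells = ['89.169', 'OIK3XF02PS', '6190', '9GYIGO', '2.248', '12abc', '']

# Strings such as '9GYIGO' start with a digit but fail at the '$' anchor,
# so they are skipped.
numbers = [float(text) for text in cells if NUMBER_RE.match(text)]
print(numbers)  # [89.169, 6190.0, 2.248]
```

Because the pattern is anchored at both ends, no extra filtering step is needed for mixed strings.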
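You can keep your own approach and finish the job by using split. Note that the code below truncates each float to its integer part and leaves every other cell, including '6190', as a string.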
from urllib.request import *
from bs4 import BeautifulSoup
import re
req = Request("https://adrianchifu.com/teachings/AMSE/MAG1/project/Xlrda/dsuR/2/J9ED27Y.html")
a = urlopen(req).read()
soup = BeautifulSoup(a,'html.parser')
nombres = []
tout = [ele.text for ele in soup.find_all('td')]
tout = [text if not re.findall(r"^\d+\.\d+",text) else int(text.split(".")[0]) for text in tout]
print(tout)
# [89, 54, 19, 'OIK3XF02PS', 87, 2, 99, '6190', 83, 'E2RYAFAE']
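If the goal is to keep integers as int and floats as float while dropping the non-numeric cells, one possible variant is sketched below. The helper `parse_cell` and the sample `cells` list are hypothetical, introduced only for illustration; in the real script the texts would come from `soup.find_all('td')`.

```python
import re

INT_RE = re.compile(r'\d+')
FLOAT_RE = re.compile(r'\d+\.\d+')

def parse_cell(text):
    """Return an int or float for purely numeric text, or None otherwise."""
    text = text.strip()
    if INT_RE.fullmatch(text):      # whole string is digits only
        return int(text)
    if FLOAT_RE.fullmatch(text):    # whole string is digits.digits
        return float(text)
    return None

# Hypothetical cell texts standing in for td.text values.
cells = ['89.169', 'OIK3XF02PS', '6190', 'E2RYAFAE', '2.248']
parsed = [parse_cell(t) for t in cells]
nombres = [v for v in parsed if v is not None]
print(nombres)  # [89.169, 6190, 2.248]
```

`re.fullmatch` requires the entire string to match, so it plays the same role as the `^...$` anchors.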
If you are looking for a regex that matches integers:
^[1-9][0-9]{0,2}$
This matches all positive, non-zero integers between 1 and 999. You can adjust the upper range of this expression by changing the second number (i.e. 2) in the {0,2} part of the expression.
Courtesy: http://regexlib.com
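As a rough illustration of adjusting the upper range, the sketch below builds the same expression for a chosen maximum number of digits. The function name `bounded_int_regex` and the sample strings are assumptions made for this example.

```python
import re

def bounded_int_regex(max_digits):
    """Build ^[1-9][0-9]{0,n}$ for positive integers of at most max_digits digits."""
    return re.compile(r'^[1-9][0-9]{0,%d}$' % (max_digits - 1))

up_to_999 = bounded_int_regex(3)      # same as ^[1-9][0-9]{0,2}$
up_to_99999 = bounded_int_regex(5)    # same as ^[1-9][0-9]{0,4}$

samples = ['0', '7', '042', '999', '1000', '6190', '9GYIGO']
small = [s for s in samples if up_to_999.match(s)]
large = [s for s in samples if up_to_99999.match(s)]
print(small)  # ['7', '999']
print(large)  # ['7', '999', '1000', '6190']
```

With the original {0,2} bound, a value such as 6190 is rejected, so the bound has to be raised for pages that contain larger integers.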