I am able to fetch the text in a td element with BeautifulSoup. However, it includes characters that I don't want. I just want the numbers. How can I remove the other characters?
The code looks like this:
import requests
import pandas as pd
from bs4 import BeautifulSoup

record = []
hksi = ['CKH']
url = "http://www.etnet.com.hk/www/tc/futures/futures_stockoptions.php?atscode={}&month=202101"
for s in hksi:
    response = requests.get(url.format(s))
    info = response.text
    soup = BeautifulSoup(info, "lxml")
    bid = soup.find('td', {'style': 'padding:10px 0 5px 10px; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text
    ratio = soup.find('td', {'style': 'padding:10px 0 5px 0; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text
    ask = soup.find('td', {'style': 'padding:10px 10px 5px 0; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text
    record.append({
        'symbol' : s,
        'bid' : bid,
        'ask' : ask,
        'ratio': ratio
    })
    print(bid)
Output:
認購總數 1441
I want to remove the "認購總數", and the output should look like this:
1441
To handle all three variables bid, ratio and ask, a neat approach is to use re.sub() with this substitution:
val = re.sub(r'[^!-~]', '', val)
applied to each of bid, ratio and ask.
This removes anything but printable ASCII characters, and also removes spaces. If you want to keep the spaces, then instead do:
val = re.sub(r'[^ -~]', '', val)
You could also make the pattern more specific, keeping only digits, ., :, % or whatever characters make sense for the other fields you may need to extract, e.g.:
val = re.sub(r'[^0-9:\.%]', '', val)
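To make the difference between the three patterns concrete, here is a small sketch that applies each of them to a fixed string instead of the live page. The bid string is the one from the question; the ratio string is a made-up example, since the real ratio text is not shown.

```python
import re

# 'bid' is the sample from the question; 'ratio' is a hypothetical value
# used only to illustrate the more specific pattern.
bid = '認購總數 1441'
ratio = '比率 1.5:1'

no_spaces = re.sub(r'[^!-~]', '', bid)       # keep printable ASCII, drop spaces
with_spaces = re.sub(r'[^ -~]', '', bid)     # keep printable ASCII and spaces
specific = re.sub(r'[^0-9:\.%]', '', ratio)  # keep only digits, ':', '.' and '%'

print(repr(no_spaces))
print(repr(with_spaces))   # note the leading space that survives
print(repr(specific))
```

With the space-preserving pattern the separator between the label and the number is kept, so you will probably want to call .strip() on the result.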
Here is a full working version:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

record = []
hksi = ['CKH']
url = "http://www.etnet.com.hk/www/tc/futures/futures_stockoptions.php?atscode={}&month=202101"
for s in hksi:
    response = requests.get(url.format(s))
    info = response.text
    soup = BeautifulSoup(info, "lxml")
    bid = soup.find('td', {'style': 'padding:10px 0 5px 10px; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text
    ratio = soup.find('td', {'style': 'padding:10px 0 5px 0; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text
    ask = soup.find('td', {'style': 'padding:10px 10px 5px 0; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text
    # Clean the values before storing them, so that record holds the stripped text
    bid, ratio, ask = (re.sub(r'[^!-~]', '', val) for val in (bid, ratio, ask))
    record.append({
        'symbol' : s,
        'bid' : bid,
        'ask' : ask,
        'ratio': ratio
    })
    for val in [bid, ratio, ask]:
        print(val)
You can use a regex to get only the numbers from your string. The pattern ' (\d+)' matches only digits that follow a space.
import re
bid = '認購總數 1441'
number = re.findall(r' (\d+)', bid)  # raw string, so \d is not treated as a string escape
print(int(number[0]))
Output:
1441
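One thing to watch with this approach: re.findall() returns an empty list when nothing matches, so number[0] raises an IndexError on a cell with no digits. A minimal sketch of a more defensive variant follows. The helper name extract_int is hypothetical, and using \s instead of a literal space is my assumption, in case the page uses a non-breaking space rather than a plain one.

```python
import re

def extract_int(text):
    """Return the first run of digits preceded by whitespace as an int, or None."""
    match = re.search(r'\s(\d+)', text)
    return int(match.group(1)) if match else None

print(extract_int('認購總數 1441'))
print(extract_int('認購總數'))  # no digits, so None instead of an exception
```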
Alternatively, if bid will always have the same structure, i.e. characters followed by a space followed by digits, you can split on the space and take the last element:
bid = '認購總數 1441'
number = bid.split(' ')[-1]
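Splitting on a single literal space is fragile if the text has trailing or repeated whitespace, because split(' ')[-1] then returns an empty string. A small sketch of a whitespace-tolerant version, assuming the number is always the last token (last_token is a hypothetical helper name):

```python
def last_token(text):
    """Return the last whitespace-separated token of text, or '' if there is none."""
    # With no separator given, rsplit treats any run of whitespace as one
    # delimiter and ignores leading/trailing whitespace.
    parts = text.rsplit(maxsplit=1)
    return parts[-1] if parts else ''

print(last_token('認購總數 1441'))
print(last_token('認購總數   1441 '))
```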
You can use a regular expression to search only for text that matches the pattern you expect.
If you are looking for something like 2.3 or 2:3, use (for 2.3):
\d+(?:\.)+\d+
or (for 2:3):
\d+(?:\:)+\d+
The following code works with input like 23:2:
import requests
import pandas as pd
import re
from bs4 import BeautifulSoup

record = []
hksi = ['CKH']
url = "http://www.etnet.com.hk/www/tc/futures/futures_stockoptions.php?atscode={}&month=202101"
for s in hksi:
    response = requests.get(url.format(s))
    info = response.text
    soup = BeautifulSoup(info, "lxml")
    bid = soup.find('td', {'style': 'padding:10px 0 5px 10px; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text
    ratio = soup.find('td', {'style': 'padding:10px 0 5px 0; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text
    ask = soup.find('td', {'style': 'padding:10px 10px 5px 0; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text
    record.append({
        'symbol' : s,
        'bid' : bid,
        'ask' : ask,
        'ratio': ratio
    })
    numbers = re.compile(r'\d+(?:\:)+\d+')
    output = numbers.findall(bid)
    # findall returns an empty list when nothing matches, so guard the index
    print(output[0] if output else None)
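Note that the colon pattern requires at least one ':' between digits, so it finds nothing in a plain value such as "認購總數 1441". If the same code has to handle plain integers, decimals and ratios, one option is a single combined pattern. This is only a sketch: the pattern and the helper name first_number are my own, not part of the answer above.

```python
import re

# Hypothetical combined pattern: a run of digits, optionally followed by
# further digit groups separated by '.' or ':'.
NUMBER = re.compile(r'\d+(?:[.:]\d+)*')

def first_number(text):
    """Return the first number-like substring of text, or None if there is none."""
    match = NUMBER.search(text)
    return match.group(0) if match else None

for sample in ['認購總數 1441', '比率 23:2', '2.3', 'no digits here']:
    print(repr(sample), '->', first_number(sample))
```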
To get only the digits, filter the characters of the text by checking each one with isdigit().
For example, create a function to filter the digits:
def filter_digits(tag):
    return ''.join(element for element in tag if element.isdigit())

...
for s in hksi:
    response = requests.get(url.format(s))
    info = response.text
    soup = BeautifulSoup(info, "lxml")
    bid = filter_digits(soup.find('td', {'style': 'padding:10px 0 5px 10px; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text)
    ratio = filter_digits(soup.find('td', {'style': 'padding:10px 0 5px 0; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text)
    ask = filter_digits(soup.find('td', {'style': 'padding:10px 10px 5px 0; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text)
    print(bid)
    print(ratio)
    print(ask)
Output (currently):
556
5644
434
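One caveat with the isdigit() filter: it also drops separators such as '.' and ':', so a value like "1.5:1" would collapse into "151". If a field can contain such characters, a variant that keeps an explicit set of allowed characters may be safer. The sketch below assumes that set; filter_allowed and its default argument are hypothetical names chosen for illustration.

```python
def filter_digits(text):
    """Keep only the characters for which str.isdigit() is true."""
    return ''.join(ch for ch in text if ch.isdigit())

def filter_allowed(text, allowed='0123456789.:%'):
    """Keep only the characters listed in allowed (an assumed set)."""
    return ''.join(ch for ch in text if ch in allowed)

print(filter_digits('認購總數 1441'))
print(filter_digits('比率 1.5:1'))   # separators are lost
print(filter_allowed('比率 1.5:1'))  # separators are kept
```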