
Extract the text in an element using BeautifulSoup

I can fetch the text of a td element with BeautifulSoup, but it includes characters that I don't want. I just want the numbers. How can I remove the other characters?

The code looks like this:

import requests
import pandas as pd
from bs4 import BeautifulSoup

record = []
hksi = ['CKH']

url = "http://www.etnet.com.hk/www/tc/futures/futures_stockoptions.php?atscode={}&month=202101"

for s in hksi:
    response = requests.get(url.format(s))
    info = response.text
    soup = BeautifulSoup(info, "lxml")
    
    bid = soup.find('td', {'style': 'padding:10px 0 5px 10px; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text
    ratio = soup.find('td', {'style': 'padding:10px 0 5px 0; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text
    ask = soup.find('td', {'style': 'padding:10px 10px 5px 0; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text

    record.append({
        'symbol' : s,
        'bid' : bid,
        'ask' : ask,
        'ratio': ratio
    })
print(bid)

Output:

認購總數 1441

I want to remove the "認購總數", and the output should look like this:

1441

To handle all three variables (bid, ratio and ask), a neat approach is to use re.sub() with this substitution:

val = re.sub(r'[^!-~]', '', val)

applied to each of bid, ratio and ask.

This removes anything but printable ASCII characters, and also removes spaces. If you want to keep the spaces, then instead do:

val = re.sub(r'[^ -~]', '', val)

You could also make the pattern more specific, keeping only digits, ".", ":", "%" or whatever characters make sense for the other fields you may need to extract, e.g.:

val = re.sub(r'[^0-9:\.%]', '', val)
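As a quick check, the three substitutions above can be tried on the sample string from the question without fetching the page. This is only a sketch; the variable names are chosen here for illustration.

```python
import re

sample = "認購總數 1441"  # cell text shown in the question

# Keep printable ASCII except the space (characters '!' through '~').
no_space = re.sub(r"[^!-~]", "", sample)

# Keep printable ASCII including the space (characters ' ' through '~').
with_space = re.sub(r"[^ -~]", "", sample)

# Keep only digits and a few separators.
digits_only = re.sub(r"[^0-9:.%]", "", sample)

print(repr(no_space))     # '1441'
print(repr(with_space))   # ' 1441'
print(repr(digits_only))  # '1441'
```

Note that the second variant leaves the leading space in place, so a call to strip() may still be needed.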

Here is a full working version:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

record = []
hksi = ['CKH']

url = "http://www.etnet.com.hk/www/tc/futures/futures_stockoptions.php?atscode={}&month=202101"

for s in hksi:
    response = requests.get(url.format(s))
    info = response.text
    soup = BeautifulSoup(info, "lxml")
    
    bid = soup.find('td', {'style': 'padding:10px 0 5px 10px; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text
    ratio = soup.find('td', {'style': 'padding:10px 0 5px 0; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text
    ask = soup.find('td', {'style': 'padding:10px 10px 5px 0; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text

    record.append({
        'symbol' : s,
        'bid' : bid,
        'ask' : ask,
        'ratio': ratio
    })
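# Print the cleaned values. Note that `record` still holds the raw text;
# apply the same re.sub() before record.append() to store cleaned values.
for val in [bid, ratio, ask]:
    val = re.sub(r'[^!-~]', '', val)
    print(val)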

You can use regex to only get the numbers from your string. Using ' (\d+)' will only match digits that follow a space.

import re

bid = '認購總數 1441'
number = re.findall(r' (\d+)', bid)  # raw string, so \d is not treated as a string escape
print(int(number[0]))

Output:

1441

Alternatively, if bid always has the same structure, i.e. characters followed by a space followed by digits, you can split on the space and take the last element:

bid = '認購總數 1441'
number = bid.split(' ')[-1]
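Both variants in this answer assume the text really does contain a space followed by digits. A minimal sketch of a more defensive helper, assuming you want the last run of digits and None when there is none (the name last_number is hypothetical):

```python
import re

def last_number(text):
    """Return the last run of digits in text as an int, or None if there is none."""
    matches = re.findall(r"\d+", text)
    return int(matches[-1]) if matches else None

print(last_number("認購總數 1441"))  # 1441
print(last_number("認購總數"))       # None
```

Returning None instead of indexing directly avoids an IndexError when a cell is empty or the page layout changes.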

You can use a regular expression to search only for text that matches the pattern you expect.

If you are looking for something like 2.3 or 2:3, use this for the 2.3 case:

\d+(?:\.)+\d+

or this for the 2:3 case:

\d+(?:\:)+\d+

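This code works when the text contains a value like 23:2. If nothing matches, findall() returns an empty list and output[0] raises an IndexError: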

import requests
import pandas as pd
import re
from bs4 import BeautifulSoup


record = []
hksi = ['CKH']

url = "http://www.etnet.com.hk/www/tc/futures/futures_stockoptions.php?atscode={}&month=202101"

for s in hksi:
    response = requests.get(url.format(s))
    info = response.text
    soup = BeautifulSoup(info, "lxml")
    
    bid = soup.find('td', {'style': 'padding:10px 0 5px 10px; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text
    ratio = soup.find('td', {'style': 'padding:10px 0 5px 0; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text
    ask = soup.find('td', {'style': 'padding:10px 10px 5px 0; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text

    record.append({
        'symbol' : s,
        'bid' : bid,
        'ask' : ask,
        'ratio': ratio
    })
numbers = re.compile(r'\d+(?:\:)+\d+')
output = numbers.findall(bid)
print(output[0])
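The two patterns from this answer can be checked on sample strings without fetching the page. The sketch below uses simplified forms that allow exactly one separator (the originals, with a repeated group such as (?:\.)+, also accept doubled separators like 2..3) and guards against the empty list that findall() returns when nothing matches. The helper name first_match and the sample strings other than the one from the question are made up for illustration.

```python
import re

decimal_pattern = re.compile(r"\d+\.\d+")  # values like 2.3
ratio_pattern = re.compile(r"\d+:\d+")     # values like 23:2

def first_match(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    found = pattern.findall(text)
    return found[0] if found else None

print(first_match(decimal_pattern, "value 2.3"))   # 2.3
print(first_match(ratio_pattern, "value 23:2"))    # 23:2
print(first_match(ratio_pattern, "認購總數 1441"))  # None
```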

To get only the digits, keep just the characters for which str.isdigit() returns True.

For example, create a function to filter the digits:

def filter_digits(tag):
    return ''.join(element for element in tag if element.isdigit())

...
for s in hksi:
    response = requests.get(url.format(s))
    info = response.text
    soup = BeautifulSoup(info, "lxml")

    bid = filter_digits(soup.find('td', {'style': 'padding:10px 0 5px 10px; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text)
    ratio = filter_digits(soup.find('td', {'style': 'padding:10px 0 5px 0; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text)
    ask = filter_digits(soup.find('td', {'style': 'padding:10px 10px 5px 0; border-top:1px dotted #e2e2e2; font-weight:bold;'}).text)

    print(bid)
    print(ratio)
    print(ask)

Output (currently):

556
5644
434
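One caveat with this approach: filtering on isdigit() also drops separators such as "." and ":", so a value like 1.25:1 collapses into a single run of digits. A minimal sketch of the difference, where filter_chars and its keep parameter are hypothetical additions and the string "1.25:1" is an invented sample:

```python
def filter_digits(text):
    # Same idea as the helper above: keep only characters where isdigit() is true.
    return "".join(ch for ch in text if ch.isdigit())

def filter_chars(text, keep="0123456789.:%"):
    # Hypothetical variant: keep an explicit whitelist so separators survive.
    return "".join(ch for ch in text if ch in keep)

print(filter_digits("認購總數 1441"))  # 1441
print(filter_digits("1.25:1"))        # 1251
print(filter_chars("1.25:1"))         # 1.25:1
```

If the fields you extract are plain integers, the isdigit() filter is enough; the whitelist version is only worth it when decimals or ratios need to survive.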
