
I want to scrape a table with BeautifulSoup

Hi, I am new to Stack Overflow.

I am trying to scrape the table that appears under the heading "Import VAT and excise" on this website for the commodity code "1704906500". I know for certain that the table always falls under "Import VAT and excise". I have several commodity codes and will loop through all of them. The problem is that I cannot point to the table under "Import VAT and excise" in order to scrape it.

Please advise.


import re

import pandas as pd
import requests
from bs4 import BeautifulSoup, Tag

comCode = "1704906500"
url = "https://www.trade-tariff.service.gov.uk/commodities/" + comCode + "?currency=GBP#import"
url_request = requests.get(url).text
soup = BeautifulSoup(url_request, "lxml")

# Walk the siblings that follow each matching heading until the next <h3>
for header in soup.find_all('h3', string=re.compile('Import VAT and excise')):
    nextNode = header
    while True:
        nextNode = nextNode.next_sibling
        if nextNode is None:
            break
        if isinstance(nextNode, Tag):
            if nextNode.name == "h3":
                break
            print(nextNode)
            #comm_table = pd.read_html(nextNode.text, attrs = {"table class":"small-table measures govuk-table"} )

You could select the heading first and then call .find_next('table') on it:

soup.find('h3', string=re.compile('Import VAT and excise')).find_next('table')

or, as an alternative, with a CSS selector:

soup.select_one('h3:-soup-contains("Import VAT and excise")').find_next('table')
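To see both variants in isolation, here is a minimal sketch that runs against a small inline HTML string instead of the live page. The markup in SAMPLE_HTML is made up for illustration and is only assumed to resemble the real page's structure (an h3 heading followed, somewhere later, by a table).

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page: two headings, each
# followed by its own table.
SAMPLE_HTML = """
<h3>Import duties</h3>
<table><tr><th>Measure</th></tr><tr><td>Third country duty</td></tr></table>
<h3>Import VAT and excise</h3>
<p>Some introductory text.</p>
<table><tr><th>Measure</th></tr><tr><td>Value added tax</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Variant 1: match the heading by its string, then jump to the next <table>.
heading = soup.find("h3", string=re.compile("Import VAT and excise"))
table = heading.find_next("table")

# Variant 2: the same lookup with a CSS selector.
table_css = soup.select_one('h3:-soup-contains("Import VAT and excise")').find_next("table")

cells = [td.get_text(strip=True) for td in table.find_all("td")]
print(cells)
```

Note that string= compares against the tag's .string, which is None when the heading contains more than one child node, whereas :-soup-contains looks at the text of the element and its descendants. If the real heading has nested markup, the CSS variant is probably the safer of the two.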

Example

Iterate over a list of commodity codes and concatenate all the tables into one DataFrame:

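from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

comCode = ["1704906500"]  # add further commodity codes here

data = []

for c in comCode:
    url = f'https://www.trade-tariff.service.gov.uk/commodities/{c}?currency=GBP#import'
    # Name the parser explicitly to avoid the "no parser was explicitly specified" warning
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    table = soup.select_one('h3:-soup-contains("Import VAT and excise")').find_next('table')
    # Recent pandas versions expect a file-like object rather than a literal HTML string
    data.append(pd.read_html(StringIO(str(table)))[0])

# One DataFrame holding the tables for all codes
df = pd.concat(data)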
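When looping over many codes, some pages may not contain the heading at all, in which case select_one returns None and the chained .find_next call raises an AttributeError. Below is a rough sketch of a guard for that case. The helper name extract_table, the pages dictionary, the second commodity code and the sample markup are all hypothetical. To keep the sketch free of network access and extra parser dependencies, it reads pre-fetched HTML strings and builds the DataFrame by hand rather than calling pd.read_html.

```python
import pandas as pd
from bs4 import BeautifulSoup


def extract_table(html, heading_text="Import VAT and excise"):
    """Return the first table after the heading as a DataFrame, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.select_one(f'h3:-soup-contains("{heading_text}")')
    if heading is None:
        return None
    table = heading.find_next("table")
    if table is None:
        return None
    # Collect the cell text row by row; the first row is treated as the header.
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    if not rows:
        return None
    return pd.DataFrame(rows[1:], columns=rows[0])


# Hypothetical pre-fetched pages keyed by commodity code (made-up markup).
pages = {
    "1704906500": (
        "<h3>Import VAT and excise</h3>"
        "<table><tr><th>Measure</th><th>Duty rate</th></tr>"
        "<tr><td>Value added tax</td><td>20%</td></tr></table>"
    ),
    "0000000000": "<h3>Import duties</h3><p>No VAT table on this page.</p>",
}

frames = []
for code, html in pages.items():
    df = extract_table(html)
    if df is None:
        continue  # skip codes whose page lacks the heading or the table
    df.insert(0, "commodity_code", code)  # remember which code each row came from
    frames.append(df)

result = pd.concat(frames, ignore_index=True)
print(result)
```

Adding the commodity code as a column is a design choice, not something the original answer does; without it the concatenated rows can no longer be traced back to the page they came from.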


 