
Extracting table data using beautifulsoup in python

I have a webpage - https://www.1800wheelchair.com/category/369/transport-wheelchairs/ - from which I want to extract the name, URL, SKU, and specifications (from the table) of each product. I wrote the code below but I am getting an empty Excel file. I have been trying to fix it for a long time but can't figure out what is going wrong.

import requests
import xlsxwriter
from bs4 import BeautifulSoup 
def cpap_spider(max_pages):
    global row_i
    page=1
    while page<=max_pages:
        url= "https://www.1800wheelchair.com/category/369/transport-wheelchairs/?p=" +str(page)
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
        soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
        for link in soup.findAll("h2", {"class":"product-name"}):
            href=link.find("a")['href']
            title = link.string
            worksheet.write(row_i, 0, title)
            each_item(href)
            print(href)
            #print(title)
        page+=1

def each_item(item_url):
    global cols_names, row_i
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
    soup = BeautifulSoup(requests.get(item_url, headers=headers).content, 'html.parser')
    table=soup.find("table", {"class":"specifications "})
    if table:
        table_rows = table.find_all('tr')
    else:
        return
    for row in table_rows:
      cols = row.find_all('td')
      for ele in range(0,len(cols)):
        temp = cols[ele].text.strip()
        if temp:
          if temp[-1:] == ":":
            temp = temp[:-1]
          # Name of column
          if ele == 0:
            try:
              cols_names_i = cols_names.index(temp)
            except:
              cols_names.append(temp)
              cols_names_i = len(cols_names) - 1
              worksheet.write(0, cols_names_i + 1, temp)
              continue
          worksheet.write(row_i, cols_names_i + 1, temp)
    row_i += 1
    
cols_names=[]
cols_names_i = 0
row_i = 1
workbook = xlsxwriter.Workbook('all_appended.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, "Title")
    
cpap_spider(1)
workbook.close()

You have an extra space in your class name, {"class":"specifications "}. With it removed, the Excel file is generated with multiple spec columns and data rows.
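For reference, the corrected lookup is the same call without the trailing space; a CSS selector is an equally valid alternative and is more tolerant of surrounding whitespace or extra classes:

table = soup.find("table", {"class": "specifications"})  # exact class match, no trailing space
table = soup.select_one("table.specifications")          # equivalent CSS-selector lookup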

As a suggestion, if you're willing to add some extra libraries, you can use pandas to read the specification tables as data frames with pd.read_html, and use the built-in df.to_excel to write an Excel file (which can use the same xlsxwriter engine you're already using) without worrying about incrementing rows and columns.

import requests
from bs4 import BeautifulSoup

import pandas as pd
from functools import reduce

AGENT = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
BASE_URL = "https://www.1800wheelchair.com/"
CATG_URL = "category/369/transport-wheelchairs/?p="


def cpap_spider(max_pages):
    chair_names = ["Specs"]
    chair_tables = ''
    page = 1
    while page <= max_pages:
        url = BASE_URL+CATG_URL+str(page)
        soup = BeautifulSoup(requests.get(
            url, headers=AGENT).content, 'html.parser')
        for link in soup.findAll("h2", {"class": "product-name"}):
            href = link.find("a")['href']
            # use the product slug (trimmed to 20 chars) as a short column label
            chair_name = href.replace(BASE_URL+"product/", "")
            table_html = each_item(href)
            # only keep products whose page actually has a specifications table,
            # so the column labels stay in sync with the collected tables
            if table_html:
                chair_names.append(chair_name[:20])
                chair_tables += table_html
            print(href)
        page += 1
    return [chair_names, chair_tables]


def each_item(item_url):
    soup = BeautifulSoup(requests.get(
        item_url, headers=AGENT).content, 'html.parser')
    table = soup.find("table", {"class": "specifications"})
    # return the specifications table HTML, or None if the page has no such table
    if table:
        return str(table)


chair_name, chair_list = cpap_spider(1)

# create a list of dataframes from html tables
df = pd.read_html(chair_list)
# merge the spec. tables list into one dataframe
all_chairs = reduce(lambda left, right: pd.merge(left, right, on=[0], how='outer'), df)

# add chair names as indices
all_chairs.columns = chair_name
all_chairs.set_index("Specs", drop=True, inplace=True)

# transpose to get chairs as index and specs as columns
all_chairs = all_chairs.T

all_chairs.to_excel("all_appended.xlsx")

Output from all_appended.xlsx (screenshot of the generated spreadsheet).
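To illustrate what the pandas pipeline above does, here is a minimal, self-contained sketch. The two sample tables below are made up for demonstration (the real specification tables have more rows), and pd.read_html needs lxml or html5lib installed:

from io import StringIO
from functools import reduce
import pandas as pd

# two made-up two-column specification tables, mimicking the scraped HTML
table_a = ("<table><tr><td>Weight Capacity:</td><td>300 lbs</td></tr>"
           "<tr><td>Seat Width:</td><td>19 in</td></tr></table>")
table_b = ("<table><tr><td>Weight Capacity:</td><td>250 lbs</td></tr>"
           "<tr><td>Product Weight:</td><td>23 lbs</td></tr></table>")

# read_html returns one DataFrame per <table> found in the markup
frames = pd.read_html(StringIO(table_a + table_b))

# outer-merge on column 0 (the spec name) so every spec becomes one row,
# exactly like the reduce(...) call above
merged = reduce(lambda left, right: pd.merge(left, right, on=[0], how='outer'), frames)
merged.columns = ["Specs", "chair-a", "chair-b"]

# one row per chair, one column per spec (NaN where a chair lacks that spec)
print(merged.set_index("Specs").T)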

