
Getting table data from web page using python beautifulsoup

I have a webpage that displays some products. I need to go to each of these products, obtain the table data under the tab called Technical Details, and get this data into one big table in Excel. I wrote the following code, but I seem to get a blank Excel file. Where is it going wrong?

import requests
import xlsxwriter
from bs4 import BeautifulSoup


def cpap_spider(url):
    global row_i
    
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for link in soup.findAll('td', {'class': 'name name2_padd'}):
        href = link.get('href')
        title = link.string
        worksheet.write(row_i, 0, title)
        each_item(href)
        print(href)
        

def each_item(item_url):
    global cols_names, row_i
    
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    table = soup.find('table', {'class': 'width_table'})
    if table:
        table_rows = table.find_all('tr')
    else:
        return
    for row in table_rows:
        cols = row.select('td')
        for ele in range(0, len(cols)):
            temp = cols[ele].text.strip()
            if temp:
                if temp[-1] == ':':
                    temp = temp[:-1]
                # Name of column
                if ele == 0:
                    try:
                        cols_names_i = cols_names.index(temp)
                    except ValueError:
                        cols_names.append(temp)
                        cols_names_i = len(cols_names) - 1
                        worksheet.write(0, cols_names_i + 1, temp)
                        continue
                worksheet.write(row_i, cols_names_i + 1, temp)
    row_i += 1
    
cols_names = []
cols_names_i = 0
row_i = 1
workbook = xlsxwriter.Workbook('st.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, 'Title')
    
cpap_spider('https://www.respshop.com/cpap-machines/manual/')

workbook.close()

The product info is loaded via Ajax from another URL.

This script will load all technical parameters along with the name and URL of each product:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup


url = 'https://www.respshop.com/cpap-masks/nasal/'
product_info_url = 'https://www.respshop.com/product_info.php'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

all_data = []
for item in soup.select('td.name a'):
    s = BeautifulSoup(requests.get(item['href'], headers=headers).content, 'html.parser')
    sku = s.select_one('[itemprop="mpn"]').text
    print(item.text, sku)
    products_id = re.search(r'p-(\d+)\.html', item['href'])[1]

    s = BeautifulSoup(requests.post(product_info_url, data={'products_id': products_id, 'tab': 3}, headers=headers).content, 'html.parser')

    row = {'Name': item.text, 'SKU': sku, 'URL': item['href']}
    for k, v in zip(s.select('#cont_3 td.main:nth-child(1)'),
                    s.select('#cont_3 td.main:nth-child(2)')):
        row[k.get_text(strip=True)] = v.get_text(strip=True)
    all_data.append(row)

df = pd.DataFrame(all_data)
df.to_csv('data.csv')
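The `products_id` that the Ajax endpoint expects is parsed out of each product link with a regular expression. On a hypothetical URL in the same shape as the site's links, the extraction works like this:

```python
import re

# Hypothetical product URL, following the site's "...-p-<id>.html" pattern
href = 'https://www.respshop.com/cpap-masks/nasal/example-mask-p-12345.html'

# re.search returns a Match object; indexing with [1] reads
# the first capture group, i.e. the numeric product id
products_id = re.search(r'p-(\d+)\.html', href)[1]
print(products_id)  # 12345
```

That id is then sent in the POST body (together with `tab: 3`, the Technical Details tab) to `product_info.php`, which returns the HTML fragment the page normally loads via Ajax.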

Prints:

ComfortGel Blue Nasal CPAP Mask - Philips Respironics  1070038, 1070037, 1070039, 1070040, 1070050, 1070051, 1070052, 1070049
Wisp Nasal Mask - Philips Respironics  1094051, 1094050, 1109298
Dreamwear Nasal Mask - Philips Respironics 1116700, 1116680, 1116681, 1116682, 1116683, 1116685, 1116686, 1116687, 1116688, 1116690, 1116691, 1116692, 1116693
Airfit N20 Nasal CPAP Mask by ResMed w/ 5 Free Cushions 63536, 63538, 63539
Airfit N30i - ResMed Nasal Mask  63800, 63801
New Respironics DreamWear Nasal Mask With Headgear Arm FitPack 1142376
ResMed AirFit N30 CPAP Nasal Cradle Mask 64222, 64223, 64224

...etc.

Creates data.csv (screenshot from LibreOffice):
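Since the original question asked for one big table in Excel rather than CSV, the same DataFrame can be written with `to_excel` instead — a minimal sketch, assuming the `xlsxwriter` package from the question's code is installed (the sample rows here are made up):

```python
import pandas as pd

# Sample rows in the same shape the scraper builds
all_data = [
    {'Name': 'Wisp Nasal Mask', 'SKU': '1094051', 'Material': 'Silicone'},
    {'Name': 'AirFit N30', 'SKU': '64222', 'Material': 'Silicone'},
]
df = pd.DataFrame(all_data)

# index=False drops pandas' numeric row index from the spreadsheet
df.to_excel('data.xlsx', index=False, engine='xlsxwriter')
```

Because each row is a dict keyed by parameter name, products that lack a given parameter simply get an empty cell in that column.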

