Extracting table data using BeautifulSoup in Python
I have a webpage - https://www.1800wheelchair.com/category/369/transport-wheelchairs/ - from which I want to extract the name, URL, SKU and specifications (from a table) of each product. I wrote the code below, but I am getting an empty Excel file. I have been trying to fix it for a long time but can't work out what is going wrong.
import requests
import xlsxwriter
from bs4 import BeautifulSoup

def cpap_spider(max_pages):
    global row_i
    page=1
    while page<=max_pages:
        url= "https://www.1800wheelchair.com/category/369/transport-wheelchairs/?p=" +str(page)
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
        soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
        for link in soup.findAll("h2", {"class":"product-name"}):
            href=link.find("a")['href']
            title = link.string
            worksheet.write(row_i, 0, title)
            each_item(href)
            print(href)
            #print(title)
        page+=1

def each_item(item_url):
    global cols_names, row_i
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
    soup = BeautifulSoup(requests.get(item_url, headers=headers).content, 'html.parser')
    table=soup.find("table", {"class":"specifications "})
    if table:
        table_rows = table.find_all('tr')
    else:
        return
    for row in table_rows:
        cols = row.find_all('td')
        for ele in range(0,len(cols)):
            temp = cols[ele].text.strip()
            if temp:
                if temp[-1:] == ":":
                    temp = temp[:-1]
                # Name of column
                if ele == 0:
                    try:
                        cols_names_i = cols_names.index(temp)
                    except:
                        cols_names.append(temp)
                        cols_names_i = len(cols_names) - 1
                        worksheet.write(0, cols_names_i + 1, temp)
                        continue
                worksheet.write(row_i, cols_names_i + 1, temp)
    row_i += 1

cols_names=[]
cols_names_i = 0
row_i = 1
workbook = xlsxwriter.Workbook('all_appended.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write(0, 0, "Title")
cpap_spider(1)
workbook.close()
You have an extra space in your class name: {"class":"specifications "}. Because of that space, soup.find returns None, so each_item returns before writing anything. Once I removed it, the Excel file was generated with multiple specs columns and data rows.
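To see why the trailing space matters, here is a minimal sketch using a made-up HTML snippet (the markup and the spec values are assumptions for illustration, not copied from the site). BeautifulSoup compares the search string against the element's class names exactly, so "specifications " with a space does not match the class "specifications":

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a product page's specifications table.
html = """
<table class="specifications">
  <tr><td>Weight Capacity:</td><td>300 lbs</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The trailing space makes the search string differ from the class name,
# so nothing matches and find() returns None.
with_space = soup.find("table", {"class": "specifications "})

# Without the trailing space the table is found.
without_space = soup.find("table", {"class": "specifications"})

print(with_space is None)
print(without_space is not None)
```

In the question's code, a None result sends each_item down the else branch, which returns immediately. That would explain why no specification cells were written.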
As a suggestion, if you're willing to add some extra libraries, you can use pandas to read the specifications tables as data frames with pd.read_html, and use the included function df.to_excel to write an Excel file (which can use the same xlsxwriter engine you're already using) without worrying about incrementing rows and columns.
import requests
from bs4 import BeautifulSoup
import pandas as pd
from functools import reduce
from io import StringIO

AGENT = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
BASE_URL = "https://www.1800wheelchair.com/"
CATG_URL = "category/369/transport-wheelchairs/?p="

def cpap_spider(max_pages):
    chair_names = []
    chair_tables = ''
    page = 1
    while page <= max_pages:
        url = BASE_URL+CATG_URL+str(page)
        soup = BeautifulSoup(requests.get(
            url, headers=AGENT).content, 'html.parser')
        for link in soup.findAll("h2", {"class": "product-name"}):
            href = link.find("a")['href']
            table_html = each_item(href)
            print(href)
            # skip products without a specifications table so that
            # names and tables stay aligned
            if table_html:
                chair_name = href.replace(BASE_URL+"product/", "")
                chair_names.append(chair_name[:20])
                chair_tables += table_html
        page += 1
    return [chair_names, chair_tables]

def each_item(item_url):
    soup = BeautifulSoup(requests.get(
        item_url, headers=AGENT).content, 'html.parser')
    table = soup.find("table", {"class": "specifications"})
    # return the table's HTML, or an empty string if the page has none
    return str(table) if table else ''

chair_names, chair_html = cpap_spider(1)
# create a list of dataframes from the html tables
# (StringIO because newer pandas versions deprecate passing literal HTML)
dfs = pd.read_html(StringIO(chair_html))
# name each table's columns ("Specs" plus the chair name) so that
# merged columns never collide
dfs = [df.rename(columns={0: "Specs", 1: name}) for df, name in zip(dfs, chair_names)]
# merge the spec. tables list into one dataframe
all_chairs = reduce(lambda left, right: pd.merge(left, right, on="Specs", how='outer'), dfs)
# use the spec names as the index
all_chairs.set_index("Specs", drop=True, inplace=True)
# transpose to get chairs as index and specs as columns
all_chairs = all_chairs.T
all_chairs.to_excel("all_appended.xlsx")
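
As a variation on the merge step, the tables can also be combined with pd.concat instead of pairwise merges. This is only a sketch under stated assumptions: the two small data frames and the names chair-a and chair-b are made up, shaped like the header-less two-column tables pd.read_html is assumed to return here, and spec names are assumed to be unique within each table. The helper combine_spec_tables is hypothetical, not part of pandas.

```python
import pandas as pd

# Hypothetical two-column spec tables (columns 0 and 1), standing in for
# what pd.read_html would return for each product page.
tables = {
    "chair-a": pd.DataFrame([["Weight", "19 lbs"], ["Seat Width", "18 in"]]),
    "chair-b": pd.DataFrame([["Weight", "23 lbs"], ["Folds", "Yes"]]),
}

def combine_spec_tables(tables):
    """Return one row per chair and one column per spec name."""
    columns = {
        # spec name -> value, as a Series indexed by spec name
        name: df.set_index(0)[1]
        for name, df in tables.items()
    }
    # concat aligns the Series on spec name; specs a chair lacks become NaN
    combined = pd.concat(columns, axis=1)
    # transpose to get chairs as index and specs as columns
    return combined.T

all_chairs = combine_spec_tables(tables)
print(all_chairs)
```

The result could then be written with all_chairs.to_excel("all_appended.xlsx", engine="xlsxwriter"), assuming xlsxwriter is installed. Keying the dictionary by chair name keeps each column label tied to its table, so there is no separate list of names to keep in sync.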
[Image: output from all_appended.xlsx]