Web scraping in Python - but problems exporting data to Excel
I am trying to export some data to Excel. I am a beginner, so apologies in advance for any silly questions.
I am practicing scraping data from the demo site webscraper.io - so far I have found the data I want, namely the laptop names and the product links:
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html)

css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)

for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    print(text)
    print(full_url)
I am having a lot of trouble figuring out how to export text + full_url to Excel.
I have seen it done like this:
import pandas as pd
df = pd.DataFrame(laptops)
df.to_excel("laptops_testing.xlsx", encoding="utf-8")
But when I do that, I get an .xlsx file containing a lot of data and markup that I don't want. I only want the data I have been printing with print(text)
and print(full_url).
The data I see in Excel looks like this:
<div class="thumbnail">
<img alt="item" class="img-responsive" src="/images/test-sites/e-commerce/items/cart2.png"/>
<div class="caption">
<h4 class="pull-right price">$295.99</h4>
<h4>
<a class="title" href="/test-sites/e-commerce/allinone/product/545" title="Asus VivoBook X441NA-GA190">Asus VivoBook X4...</a>
</h4>
<p class="description">Asus VivoBook X441NA-GA190 Chocolate Black, 14", Celeron N3450, 4GB, 128GB SSD, Endless OS, ENG kbd</p>
</div>
<div class="ratings">
<p class="pull-right">14 reviews</p>
<p data-rating="3">
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
</p>
</div>
</div>
Screenshot from Google Sheets:
This is not hard to fix. With the code below, you simply append the URLs and the names to two lists, turn them into a pandas DataFrame, and then write that DataFrame to a new Excel file.
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html)

css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)

laptop_name = []
laptop_url = []

for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    print(text)
    # appending the names of the laptops
    laptop_name.append(text)
    print(full_url)
    # appending the urls
    laptop_url.append(full_url)

# turning the lists into a dataframe
new_df = pd.DataFrame({'Laptop Name': laptop_name, 'Laptop url': laptop_url})
print(new_df)

# writing the excel file
file_name = 'laptop.xlsx'
new_df.to_excel(file_name)
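If the row numbers that pandas writes as an extra first column are part of the unwanted data, both to_excel and to_csv accept index=False. A minimal sketch (it uses to_csv with an in-memory buffer so the effect is visible as text, but the same keyword works for to_excel; the sample values are taken from the page above):

```python
import io
import pandas as pd

df = pd.DataFrame({
    "Laptop Name": ["Asus VivoBook X4..."],
    "Laptop url": ["https://webscraper.io/test-sites/e-commerce/allinone/product/545"],
})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)

# index=False drops that column.
without_index = io.StringIO()
df.to_csv(without_index, index=False)

print(with_index.getvalue().splitlines()[0])     # ,Laptop Name,Laptop url
print(without_index.getvalue().splitlines()[0])  # Laptop Name,Laptop url
```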
Use the soup.select function to find the elements with an extended CSS selector.
Here is a short solution:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

laptops = [(a.getText(), requests.compat.urljoin(url, a.get('href')))
           for a in soup.select("div.col-sm-4.col-lg-4.col-md-4 a")]

df = pd.DataFrame(laptops)
df.to_excel("laptops_testing.xlsx")
The final document looks like this:
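As a side note, requests.compat.urljoin is the standard library's urllib.parse.urljoin; it resolves the root-relative hrefs found on the page against the page URL, which is why no hard-coded "https://webscraper.io" prefix is needed here:

```python
from urllib.parse import urljoin  # requests.compat.urljoin is this same function

page_url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
href = "/test-sites/e-commerce/allinone/product/545"  # root-relative href from the page

# A root-relative href replaces the whole path of the base URL.
full_url = urljoin(page_url, href)
print(full_url)  # https://webscraper.io/test-sites/e-commerce/allinone/product/545
```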
Try this. Remember to import pandas, and try not to run the code over and over, since every run sends a new request to the website:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html)

css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)

data = []
for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    data.append([text, full_url])

df = pd.DataFrame(data, columns=["laptop name", "Url"])
df.to_csv("name.csv")
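The advice above about not re-requesting the page on every run can be sketched by caching the HTML to a local file. The function name get_html, the cache file name, and the stand-in fetcher below are all illustrative, not part of the original answer:

```python
import pathlib

def get_html(url, cache_path, fetch):
    """Return the page HTML, calling fetch(url) only when no cached copy exists."""
    cache = pathlib.Path(cache_path)
    if cache.exists():
        return cache.read_text(encoding="utf-8")  # reuse the saved copy
    html = fetch(url)  # e.g. lambda u: requests.get(u).text
    cache.write_text(html, encoding="utf-8")      # save it for the next run
    return html

# Demo with a stand-in fetcher so no real request is made.
pathlib.Path("demo_cache.html").unlink(missing_ok=True)  # start from a clean state
calls = []
def fake_fetch(u):
    calls.append(u)
    return "<html>cached page</html>"

first = get_html("https://example.com", "demo_cache.html", fake_fetch)
second = get_html("https://example.com", "demo_cache.html", fake_fetch)
print(len(calls))  # the fetcher ran only once; the second call hit the cache
```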