Web scraping in Python - but problems exporting data to Excel

I'm trying to export some data to Excel. I'm a total beginner, so I apologise for any dumb questions.

I'm practising scraping on the demo site webscraper.io, and so far I have scraped the data I want: the laptop names and the links to the products.

import requests
from bs4 import BeautifulSoup
from pprint import pprint

url ="https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"

r = requests.get(url)

html = r.text

soup = BeautifulSoup(html)

css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}

laptops = soup.find_all("div", attrs=css_selector)

for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    print(text)
    print (full_url)

I'm having major difficulty wrapping my head around how to export text and full_url to Excel.

I have seen it done like this:

import pandas as pd

df = pd.DataFrame(laptops)

df.to_excel("laptops_testing.xlsx", encoding="utf-8")

But when I do that, I get an .xlsx file that contains a lot of data and markup that I don't want. I just want the data I have been printing (text and full_url).

The data I'm seeing in Excel looks like this:

<div class="thumbnail">  
<img alt="item" class="img-responsive" src="/images/test-sites/e-commerce/items/cart2.png"/> 
<div class="caption">  
<h4 class="pull-right price">$295.99</h4>  
<h4>  
<a class="title" href="/test-sites/e-commerce/allinone/product/545" title="Asus VivoBook X441NA-GA190">Asus VivoBook X4...</a>  
</h4>  
<p class="description">Asus VivoBook X441NA-GA190 Chocolate Black, 14", Celeron N3450, 4GB, 128GB SSD, Endless OS, ENG kbd</p>  
</div>

<div class="ratings">  
<p class="pull-right">14 reviews</p>  
<p data-rating="3">  
<span class="glyphicon glyphicon-star"></span>  
<span class="glyphicon glyphicon-star"></span>  
<span class="glyphicon glyphicon-star"></span>  
</p>  
</div>  
</div>

Screenshot from Google Sheets: (image not included)

This is not hard to solve. Append the names and URLs to two lists, turn the lists into a pandas DataFrame, and then write that DataFrame to a new Excel file.

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"

r = requests.get(url)

html = r.text

# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, "html.parser")

css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}

laptops = soup.find_all("div", attrs=css_selector)

laptop_name = []
laptop_url = []
for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    print(text)
    # append the laptop name
    laptop_name.append(text)
    print(full_url)
    # append the URL
    laptop_url.append(full_url)

# turn the two lists into a DataFrame
new_df = pd.DataFrame({'Laptop Name': laptop_name, 'Laptop url': laptop_url})

print(new_df)

# write the DataFrame to an Excel file
file_name = 'laptop.xlsx'
new_df.to_excel(file_name)
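
A likely reason the first attempt filled the sheet with HTML is that `laptops` holds BeautifulSoup `Tag` objects rather than plain strings, so pandas receives whole tags instead of the two values that were printed. The sketch below shows the same idea as the loop above, but collects one dictionary per laptop and runs against a small inline HTML sample instead of the live site. The names `SAMPLE_HTML` and `extract_rows` are made up for this illustration, and the sample markup is a trimmed copy of the output shown in the question.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Trimmed copy of one product card from the question (illustrative sample only).
SAMPLE_HTML = """
<div class="col-sm-4 col-lg-4 col-md-4">
  <div class="thumbnail">
    <div class="caption">
      <h4 class="pull-right price">$295.99</h4>
      <h4><a class="title" href="/test-sites/e-commerce/allinone/product/545" title="Asus VivoBook X441NA-GA190">Asus VivoBook X4...</a></h4>
    </div>
  </div>
</div>
"""


def extract_rows(html, base="https://webscraper.io"):
    """Return one dict of plain strings per product card found in html."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", attrs={"class": "col-sm-4 col-lg-4 col-md-4"}):
        link = card.find("a")
        if link is None or not link.get("href"):
            continue  # skip cards without a usable link
        rows.append({
            "Laptop Name": link.get_text(strip=True),
            "Laptop url": base + link["href"],
        })
    return rows


rows = extract_rows(SAMPLE_HTML)
# Each dict becomes one row; the dict keys become the column headers.
df = pd.DataFrame(rows)
print(df)
```

With real data, `df.to_excel("laptop.xlsx", index=False)` would write the same two columns without the extra index column. Note that the link text on the page is truncated ("Asus VivoBook X4..."); the full name appears to be available in the link's `title` attribute, as the HTML in the question shows.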

Use the soup.select function to find elements by extended CSS selectors.

Here's a short solution:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')
# one (text, absolute URL) tuple per link inside a product card
laptops = [(a.getText(), requests.compat.urljoin(url, a.get('href')))
           for a in soup.select("div.col-sm-4.col-lg-4.col-md-4 a")]
df = pd.DataFrame(laptops)
df.to_excel("laptops_testing.xlsx")

The final document would look like: (image not included)
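
The short solution builds absolute links with `requests.compat.urljoin`, which on Python 3 is the standard library's `urllib.parse.urljoin`. The sketch below shows how that function resolves a few kinds of `href` against the page URL. The helper name `absolute` and the `example.com` link are made up for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"


def absolute(href, page_url=PAGE_URL):
    """Resolve href against the URL of the page it was found on."""
    return urljoin(page_url, href)


# A root-relative href (starts with "/") keeps only the scheme and host of the page.
root_relative = absolute("/test-sites/e-commerce/allinone/product/545")

# A path-relative href replaces the last path segment of the page URL.
path_relative = absolute("product/545")

# An href that is already absolute is returned unchanged.
already_absolute = absolute("https://example.com/item/1")

print(root_relative)
print(path_relative)
print(already_absolute)
```

Compared with gluing strings together in an f-string, this approach also copes with links that are already absolute, at the cost of one extra import.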

Try this. Remember to import pandas, and try not to run the code too many times, because each run sends a new request to the website.

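import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
r = requests.get(url)

html = r.text

soup = BeautifulSoup(html, "html.parser")

css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}

laptops = soup.find_all("div", attrs=css_selector)
data = []

for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    # one [name, url] pair per laptop
    data.append([text, full_url])

df = pd.DataFrame(data, columns=["laptop name", "Url"])

# writes CSV text to a file called "name"; use df.to_excel(...) for an .xlsx file
df.to_csv("name")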
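
One way to follow the advice about not hitting the website on every run is to save the downloaded HTML to disk and reuse it while experimenting with the parsing code. The following is only a minimal sketch of that idea. The helper `load_html`, the file name `laptops.html`, and the `fake_fetch` stand-in are all hypothetical and not part of the answer above.

```python
import tempfile
from pathlib import Path


def load_html(cache_path, fetch):
    """Return cached HTML if the file exists, otherwise call fetch() and cache the result."""
    path = Path(cache_path)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch()
    path.write_text(html, encoding="utf-8")
    return html


# Demonstration with a stand-in for the real download, so no request is sent.
calls = []


def fake_fetch():
    calls.append(1)  # record that a "download" happened
    return "<html>demo</html>"


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "laptops.html"
    first = load_html(cache_file, fake_fetch)   # calls fake_fetch and writes the file
    second = load_html(cache_file, fake_fetch)  # reads the file, no second call

print(len(calls))
```

In the scraper itself, the stand-in could be replaced with something like `load_html("laptops.html", lambda: requests.get(url).text)`; deleting the cached file forces a fresh download.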
