Web scraping in Python - but problems exporting data to Excel
I am trying to export some data to Excel. I am a beginner, so apologies in advance for any silly questions.
I am practicing scraping data from the demo site webscraper.io - so far I have found the data I want, namely the laptop names and the product links:
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html)

css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)

for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    print(text)
    print(full_url)
I am having a lot of trouble figuring out how to export text + full_url to Excel.
I have seen it done like this:
import pandas as pd
df = pd.DataFrame(laptops)
df.to_excel("laptops_testing.xlsx", encoding="utf-8")
But when I do that, I get an .xlsx file containing a lot of data and markup that I don't want. I only want the data I have been printing with print(text)
and print(full_url).
The data I see in Excel looks like this:
<div class="thumbnail">
<img alt="item" class="img-responsive" src="/images/test-sites/e-commerce/items/cart2.png"/>
<div class="caption">
<h4 class="pull-right price">$295.99</h4>
<h4>
<a class="title" href="/test-sites/e-commerce/allinone/product/545" title="Asus VivoBook X441NA-GA190">Asus VivoBook X4...</a>
</h4>
<p class="description">Asus VivoBook X441NA-GA190 Chocolate Black, 14", Celeron N3450, 4GB, 128GB SSD, Endless OS, ENG kbd</p>
</div>
<div class="ratings">
<p class="pull-right">14 reviews</p>
<p data-rating="3">
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
</p>
</div>
</div>
Screenshot from Google Sheets:
This is not hard to fix. With the code below, you simply append the URLs and the names to two lists, turn them into a pandas DataFrame, and then write that DataFrame to a new Excel file.
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html)

css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)

laptop_name = []
laptop_url = []

for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    print(text)
    # appending the names of the laptops
    laptop_name.append(text)
    print(full_url)
    # appending the urls
    laptop_url.append(full_url)

# turning the lists into a dataframe
new_df = pd.DataFrame({'Laptop Name': laptop_name, 'Laptop url': laptop_url})
print(new_df)

# writing the excel file
file_name = 'laptop.xlsx'
new_df.to_excel(file_name)
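If the row numbers that pandas writes as an extra first column are part of the unwanted data, both to_excel and to_csv accept index=False. A minimal sketch (it uses to_csv with an in-memory buffer so the effect is visible as text, but the same keyword works for to_excel; the sample values are taken from the page above):

```python
import io
import pandas as pd

df = pd.DataFrame({
    "Laptop Name": ["Asus VivoBook X4..."],
    "Laptop url": ["https://webscraper.io/test-sites/e-commerce/allinone/product/545"],
})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)

# index=False drops that column.
without_index = io.StringIO()
df.to_csv(without_index, index=False)

print(with_index.getvalue().splitlines()[0])     # ,Laptop Name,Laptop url
print(without_index.getvalue().splitlines()[0])  # Laptop Name,Laptop url
```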
Use the soup.select function to find the elements with an extended CSS selector.
Here is a short solution:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

laptops = [(a.getText(), requests.compat.urljoin(url, a.get('href')))
           for a in soup.select("div.col-sm-4.col-lg-4.col-md-4 a")]

df = pd.DataFrame(laptops)
df.to_excel("laptops_testing.xlsx")
The final document looks like this:
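As a side note, requests.compat.urljoin is the standard library's urllib.parse.urljoin; it resolves the root-relative hrefs found on the page against the page URL, which is why no hard-coded "https://webscraper.io" prefix is needed here:

```python
from urllib.parse import urljoin  # requests.compat.urljoin is this same function

page_url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
href = "/test-sites/e-commerce/allinone/product/545"  # root-relative href from the page

# A root-relative href replaces the whole path of the base URL.
full_url = urljoin(page_url, href)
print(full_url)  # https://webscraper.io/test-sites/e-commerce/allinone/product/545
```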
Try this. Remember to import pandas, and try not to run the code over and over, since every run sends a new request to the website:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html)

css_selector = {"class": "col-sm-4 col-lg-4 col-md-4"}
laptops = soup.find_all("div", attrs=css_selector)

data = []
for laptop in laptops:
    laptop_link = laptop.find('a')
    text = laptop_link.get_text()
    href = laptop_link['href']
    full_url = f"https://webscraper.io{href}"
    data.append([text, full_url])

df = pd.DataFrame(data, columns=["laptop name", "Url"])
df.to_csv("name.csv")
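The advice above about not re-requesting the page on every run can be sketched by caching the HTML to a local file. The function name get_html, the cache file name, and the stand-in fetcher below are all illustrative, not part of the original answer:

```python
import pathlib

def get_html(url, cache_path, fetch):
    """Return the page HTML, calling fetch(url) only when no cached copy exists."""
    cache = pathlib.Path(cache_path)
    if cache.exists():
        return cache.read_text(encoding="utf-8")  # reuse the saved copy
    html = fetch(url)  # e.g. lambda u: requests.get(u).text
    cache.write_text(html, encoding="utf-8")      # save it for the next run
    return html

# Demo with a stand-in fetcher so no real request is made.
pathlib.Path("demo_cache.html").unlink(missing_ok=True)  # start from a clean state
calls = []
def fake_fetch(u):
    calls.append(u)
    return "<html>cached page</html>"

first = get_html("https://example.com", "demo_cache.html", fake_fetch)
second = get_html("https://example.com", "demo_cache.html", fake_fetch)
print(len(calls))  # the fetcher ran only once; the second call hit the cache
```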