Separate Python web scraped data by comma

I am new to web scraping with Python and found a quick tutorial online with some sample code. I adjusted some of the code to add another field to the result (output as a CSV file). The code scrapes information about different laptops (name, price, rating, specs).

The issue I am having is separating the specs with commas in the output.

Here is the code I am using:

    from selenium import webdriver
    from bs4 import BeautifulSoup
    import pandas as pd
    from webdriver_manager.chrome import ChromeDriverManager

    driver = webdriver.Chrome(ChromeDriverManager().install())

    products=[]
    prices=[]
    ratings=[]
    specs=[]
    driver.get('https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2')

    content = driver.page_source
    soup = BeautifulSoup(content)
    for a in soup.find_all('a', href=True, attrs={'class':'_31qSD5'}):
        name = a.find('div', attrs={'class':'_3wU53n'})
        price = a.find('div', attrs={'class':'_1vC4OE _2rQ-NK'})
        rating = a.find('div', attrs={'class':'hGSR34'})
        spec = a.find('div', attrs={'class':'_3ULzGw'})
        products.append(name.text)
        prices.append(price.text)
        ratings.append(rating.text)
        specs.append(spec.text)

    df = pd.DataFrame({'Product name':products, 'Price':prices, 'Rating':ratings, 'Tech Specs':specs})
    df.to_csv('products.csv', index=False, encoding='utf-8')

Here is the current output:

Tech Specs

Intel Core i5 Processor (5th Gen)8 GB DDR3 RAM64 bit Mac OS Operating System128 GB SSD33.78 cm (13.3 inch) Display1 Year Carry In Warranty

Pre-installed Genuine Windows 10 Operating System (Includes Built-in Security, Free Automated Updates, Latest Features)Intel Core i5 Processor (7th Gen)8 GB DDR4 RAM64 bit Windows 10 Operating System1 TB HDD39.62 cm (15.6 inch) Display1 Year Onsite Warranty

Here is how I would like the output to look:

Tech Specs

Intel Core i5 Processor (5th Gen), 8 GB DDR3 RAM, 64 bit Mac OS Operating System, 128 GB SSD, 33.78 cm (13.3 inch) Display, 1 Year Carry In Warranty

Pre-installed Genuine Windows 10 Operating System (Includes Built-in Security, Free Automated Updates, Latest Features), Intel Core i5 Processor (7th Gen), 8 GB DDR4 RAM, 64 bit Windows 10 Operating System, 1 TB HDD, 39.62 cm (15.6 inch) Display, 1 Year Onsite Warranty

Any help is appreciated. Thanks in advance!

If you want the specs separated, you need to split them yourself. Right now you write the text of the whole spec block as it is. Please take a look at the updated code:

    from selenium import webdriver
    from bs4 import BeautifulSoup
    import pandas as pd
    from webdriver_manager.chrome import ChromeDriverManager

    # Selenium 3 style call; Selenium 4 expects the driver path wrapped as
    # webdriver.Chrome(service=Service(ChromeDriverManager().install())).
    driver = webdriver.Chrome(ChromeDriverManager().install())

    products=[]
    prices=[]
    ratings=[]
    specs=[]
    driver.get('https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2')

    content = driver.page_source
    driver.quit()  # the browser is no longer needed once the HTML is captured

    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(content, 'html.parser')
    for a in soup.find_all('a', href=True, attrs={'class':'_31qSD5'}):
        name = a.find('div', attrs={'class':'_3wU53n'})
        price = a.find('div', attrs={'class':'_1vC4OE _2rQ-NK'})
        rating = a.find('div', attrs={'class':'hGSR34'})
        spec = a.find('div', attrs={'class':'_3ULzGw'})
        products.append(name.text)
        prices.append(price.text)
        ratings.append(rating.text)
        # Each spec is its own <li>; join them instead of taking spec.text,
        # which concatenates everything with no separator.
        specs.append(", ".join([l.text for l in spec.find_all('li')]))

    df = pd.DataFrame({'Product name':products, 'Price':prices, 'Rating':ratings, 'Tech Specs':specs})
    # Use ; between columns because the specs now contain commas.
    df.to_csv('products.csv', index=False, encoding='utf-8', sep=";")

I iterate over all the spec items and combine them using join. I also changed the CSV column separator to a semicolon (;).
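To see the join step in isolation, here is a minimal sketch that runs without a browser. The HTML snippet and the "Sample laptop" row are made up for illustration; only the class name _3ULzGw comes from the question. It also suggests that, as far as I can tell, pandas quotes any field that contains the delimiter by default, so the default comma separator should survive a round trip too, and the switch to ; is a matter of preference.

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one product's spec block.
SAMPLE_HTML = """
<div class="_3ULzGw">
  <ul>
    <li>Intel Core i5 Processor (5th Gen)</li>
    <li>8 GB DDR3 RAM</li>
    <li>128 GB SSD</li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
spec = soup.find("div", attrs={"class": "_3ULzGw"})

# Join the text of each <li> so the items stay separated.
joined = ", ".join(li.get_text(strip=True) for li in spec.find_all("li"))

# Shortcut: let get_text insert the separator between the strings it finds.
shortcut = spec.get_text(separator=", ", strip=True)

# Write with the default comma separator and read the CSV back.
df = pd.DataFrame({"Product name": ["Sample laptop"], "Tech Specs": [joined]})
csv_text = df.to_csv(index=False)
round_trip = pd.read_csv(io.StringIO(csv_text))

print(joined)
print(csv_text)
```

The shortcut only matches the per-li join when the spec block contains nothing but the list items; if the div holds other text, iterating over the li elements is the safer choice.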

Selenium is not the preferred tool for this case, because the data is already present in an HTML script tag. You can use the requests module with bs4 to load the JSON data, as below:

    import requests
    from bs4 import BeautifulSoup
    import re
    import json


    def main(url):
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # The page state is embedded as JSON inside <script id="is_script">.
        script = soup.find("script", id="is_script").text
        # Capture the object literal assigned to __INITIAL_STATE__.
        target = re.search(r"__INITIAL_STATE__ = ({.+});$", script).group(1)
        data = json.loads(target)
        # Use print(json.dumps(data, indent=4)) for a readable view.
        print(data.keys())  # data is a dict, so you can work with it directly


    main("https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo,b5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2")
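If you want to try the extraction step without hitting the network, the following sketch runs the same idea against an inline page. The sample HTML, the helper name extract_initial_state, and the "products"/"specs" keys are all hypothetical; the real Flipkart payload has its own structure, which you would need to inspect first. The regex here is also a slightly more tolerant variant of the one above, not the same pattern.

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical page: the JSON layout below is invented for illustration.
SAMPLE_HTML = """
<html><body>
<script id="is_script">
window.__INITIAL_STATE__ = {"products": [{"name": "Sample laptop", "specs": ["8 GB DDR3 RAM", "128 GB SSD"]}]};
</script>
</body></html>
"""


def extract_initial_state(html):
    """Return the embedded state as a dict, or None if it is not found."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.find("script", id="is_script")
    if script is None or script.string is None:
        return None
    # Allow flexible whitespace and a payload that spans several lines.
    match = re.search(
        r"__INITIAL_STATE__\s*=\s*(\{.*\})\s*;", script.string, re.DOTALL
    )
    if match is None:
        return None
    return json.loads(match.group(1))


state = extract_initial_state(SAMPLE_HTML)
# Once the data is a dict, joining the specs is a plain string join.
specs = ", ".join(state["products"][0]["specs"])
print(specs)
```

Returning None instead of raising keeps the caller in control when the site changes its markup, which seems likely given how often the class names in the question have changed.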
