
How to get data in another format using Scrapy

I'm trying to scrape data about laptops from Amazon.

My code:

import scrapy

class AmazonData(scrapy.Spider):
    name = 'amazon_laptops'
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['amazon.com']

    # generate links from a DataFrame (df is defined elsewhere)
    start_urls = ['https://www.amazon.com' + str(i) for i in df.link.values]

    def parse(self, response):
        for vals in response.xpath("//table[@id='productDetails_techSpec_section_1']"):
            # the leading "." keeps each query inside the current table
            yield {
                'parameters': [x.strip() for x in vals.xpath(".//tr/th[@class='a-color-secondary a-size-base prodDetSectionEntry']/text()").getall()],
                'values': [x.strip() for x in vals.xpath(".//tr/td[@class='a-size-base']/text()").getall()]
            }

Output looks like this:

[{'parameters': ['resolution', 'ram', ...], 'values': ['1920x1080', '8gb', ...]}]

This is not readable, and after saving the output to a CSV file I cannot load it into a DataFrame for further data manipulation.

I have no idea how to get a DataFrame that looks like this:

  resolution  ram  ...
0  1920x1080  8gb  ...
1  1366x768   4gb  ...

URL examples: Link 1, Link 2

Because the spec tables vary between products, this is not entirely straightforward, but here is one way to solve it:
You could create an empty DataFrame with all the columns you are interested in, then scrape the parameters and values from the table, combine them into a dictionary, and add the dictionary entries to the DataFrame. Looping over the entries and checking each key against the DataFrame's columns accounts for values that are missing from some tables and for row order that differs between tables.
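
The pairing and filtering step can be tried out without Scrapy. The following is only a minimal sketch: table_to_row is a hypothetical helper (not part of Scrapy or pandas), and the keys and values are made-up sample data standing in for two scraped tables with different row order.

```python
# Hypothetical helper for illustration; not part of Scrapy or pandas.
def table_to_row(keys, values, columns):
    """Pair scraped headers with values and keep only the wanted columns."""
    scraped = dict(zip(keys, values))
    # Look up each wanted column by name, so the table's row order does not
    # matter; columns that a table lacks become None.
    return {col: scraped.get(col) for col in columns}


columns = ["Screen Size", "RAM", "Hard Drive"]

# Made-up sample data: the two tables list different fields in different order
row_a = table_to_row(["RAM", "Screen Size"], ["8 GB", "15.6 Inches"], columns)
row_b = table_to_row(["Screen Size", "Hard Drive"], ["14 Inches", "256 GB SSD"], columns)

print(row_a)
print(row_b)
```

A list of such rows, for example [row_a, row_b], can be passed to pandas.DataFrame to get one row per product.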

This code is based on the two example URLs you provided:

import scrapy
import pandas as pd

class AmazonData(scrapy.Spider):
    name = 'amazon_laptops'
    # Empty DataFrame holding only the columns of interest
    df = pd.DataFrame(columns=['Screen Size', 'Screen Resolution', 'Max Screen Resolution', 'Processor', 'RAM', 'Hard Drive', 'Graphics Coprocessor', 'Chipset Brand', 'Card Description', 'Graphics Card Ram Size', 'Wireless Type', 'Number of USB 2.0 Ports', 'Number of USB 3.0 Ports'])

    start_urls = ['https://www.amazon.com/dp/B081945D2S', 'https://www.amazon.com/dp/B081721LTM']

    def parse(self, response):
        # Use the product ID (the last URL segment) as the row label
        product = response.url.rstrip("/").split("/")[-1]
        rows = response.xpath("//table[@id='productDetails_techSpec_section_1']//tr")

        # Read the header and value of each row together so that they stay paired
        table_dict = {
            row.xpath("normalize-space(./th)").get(): row.xpath("normalize-space(./td)").get()
            for row in rows
        }

        # Keep only the columns of interest; anything a table lacks stays NaN
        for key, val in table_dict.items():
            if key in self.df.columns:
                self.df.loc[product, key] = val
        print(self.df)
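
If the data has already been scraped in the format shown in the question, it can also be reshaped afterwards. The sketch below is only an illustration under assumptions: the scraped list is made-up sample data in the question's shape, and to_rows and to_csv_text are hypothetical helper names. It turns each item into one dictionary per product and writes a CSV whose header is the union of all parameters.

```python
import csv
import io

# Made-up sample data in the shape shown in the question
scraped = [
    {"parameters": ["resolution", "ram"], "values": ["1920x1080", "8gb"]},
    {"parameters": ["ram", "resolution", "weight"], "values": ["4gb", "1366x768", "2kg"]},
]


def to_rows(items):
    """Turn each {'parameters': [...], 'values': [...]} item into one dict."""
    return [dict(zip(item["parameters"], item["values"])) for item in items]


def to_csv_text(rows):
    """Write rows as CSV text, using the union of all keys as the header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen order
    buffer = io.StringIO()
    # restval fills in cells for parameters that a product does not have
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = to_rows(scraped)
csv_text = to_csv_text(rows)
print(csv_text)
```

A file written this way has one row per product and can be loaded with pandas.read_csv; alternatively, rows can be passed directly to pandas.DataFrame.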


 