
Scrapy to df how to not overwrite data

So, noob here, having a hard time writing scraped data to xlsx. Well, the first page works great; the problem is that the following pages overwrite the previous ones. I think it is due to the yield behaviour, but honestly I can't clearly understand why.

So, as you can see in the code below, I can read all the information I need, but when I send everything to Excel and start a new page, I can't see a way to keep that information from being overwritten. Can anyone help me? Appreciated!
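For context: pandas' DataFrame.to_excel writes a complete workbook on every call, so calling it again with the same path replaces whatever the previous call wrote. A minimal standalone demo of just that behaviour (demo.xlsx is a throwaway name, not from the question):

import pandas as pd

pd.DataFrame({'page': [1]}).to_excel('demo.xlsx')  # creates the file with one row
pd.DataFrame({'page': [2]}).to_excel('demo.xlsx')  # rewrites the whole file
print(pd.read_excel('demo.xlsx'))                  # only the second frame survives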

import scrapy
import pandas as pd
from scrapy import Selector
from urllib.parse import urljoin

def parse(self, response):
    product_name = response.css('.a-text-ellipsis .a-link-normal').css('::text').extract()  # when chaining selectors, put ::text at the end; otherwise put it inside the tag selector
    #series_product_to_fill_df = pd.Series(product_name)

    date = response.css('.review-date::text').extract()
    rating_text = response.css('.review-rating').extract()
    rate = []
    for x in rating_text:
        extracting_stars = Selector(text=x).xpath('//span/text()').extract_first()
        rate.append(extracting_stars)
    title = response.css('.a-text-bold span::text').extract()
    reviewer_name = response.css('.a-profile-name::text').extract()
    badge = response.css('.c7y-badge-text::text').extract()
    review = response.css('.review-text-content span::text').extract()

    print('******************************************')

    df = pd.DataFrame(columns=['Date', 'Rate', 'Title', 'Reviewer_name', 'Badge', 'Review'])  # empty frame; assigning Series below fills the columns without index-alignment errors
    
    df['Date'] = pd.Series(date)
    df['Rate'] = pd.Series(rate)
    df['Title'] = pd.Series(title)
    df['Reviewer_name'] = pd.Series(reviewer_name)
    df['Badge'] = pd.Series(badge)
    df['Review'] = pd.Series(review)
    df['Product_name'] = pd.Series(product_name)

    #Reordering cols
    df = df[['Product_name','Date','Rate','Title','Reviewer_name','Badge', 'Review']]
    print(df)
    # filling all rows in the "Product_name" column
    df['Product_name'].fillna(method='ffill', inplace=True)  # takes the only entry in the column and repeats it
    
    #excel_path
    destination_path = "C:\\Users\\xxxxxx\\export_dataframe.xlsx" 
    
    #excel
    df.to_excel(destination_path)

    # note: to_excel returns None, so this yields None and writes the file a second time
    yield df.to_excel(destination_path)

    #Next page scraping
    next_page = response.css('li.a-last > a::attr(href)').extract_first()  # link to the next page, if any
    if next_page: 
        yield scrapy.Request(urljoin(response.url, next_page),callback=self.parse)
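
A common way around this, sketched below under stated assumptions (the class name ReviewSpider, the placeholder start URL, and the single Title column are illustrative, not from the question), is to accumulate each page's DataFrame on the spider and write the workbook once, after the crawl finishes, in Scrapy's closed() hook:

import scrapy
import pandas as pd


class ReviewSpider(scrapy.Spider):  # hypothetical name, for illustration only
    name = 'reviews'
    start_urls = ['https://example.com/reviews']  # placeholder URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.pages = []  # one DataFrame per parsed page

    def parse(self, response):
        # build this page's DataFrame as in the question; a single column here for brevity
        df = pd.DataFrame({'Title': response.css('.a-text-bold span::text').extract()})
        self.pages.append(df)  # accumulate instead of writing to disk
        next_page = response.css('li.a-last > a::attr(href)').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def closed(self, reason):
        # called once when the spider finishes; concat keeps every page's rows
        if self.pages:
            pd.concat(self.pages, ignore_index=True).to_excel('export_dataframe.xlsx')

An alternative in the same spirit is to yield plain dicts as items and let Scrapy's feed exports or an item pipeline do the writing; the list-and-concat version above just keeps the pandas workflow from the question.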
