CSV output from web crawling using Scrapy

I am saving the output of a Scrapy web crawl to a CSV file. The crawling itself seems to work correctly, but I am not happy with the format of the output saved in the CSV file. I crawl 20 web pages, each of which contains 100 job titles and their respective URLs, so I expect the output to look like this:

url1, title1
url2, title2
...
...
url1999, title1999
url2000, title2000

However, the actual output in the CSV file looks like this:

url1 url2 ... url100, title1 title2 ... title100
url101 url102 ... url200, title101 title102 ... title200
...
url1901 url1902 ... url2000, title1901 title1902 ... title2000

My Spider code is:

import scrapy

class TextPostItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()

class MySpider(scrapy.Spider):
    name = "craig_spider"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/npo"]

    def parse(self, response):
        number = 0
        for page in range(0, 20):
            yield scrapy.Request("http://sfbay.craigslist.org/search/npo?=%s" % number, callback=self.parse_item, dont_filter=True)
            number += 100

    def parse_item(self, response):
        item = TextPostItem()
        item['title'] = response.xpath("//span[@class='pl']/a/text()").extract()
        item['link'] = response.xpath("//span[@class='pl']/a/@href").extract()
        return item

The command I use to export to CSV is:

scrapy crawl craig_spider -o craig.csv -t csv

Any suggestions? Thanks.

The problem is that each response contains many elements matching //span[@class='pl']/a. Your parse_item loads the text() of all of them into one list and assigns it to item['title'], then loads every @href into another list and assigns it to item['link'], so each page produces a single item.

In other words, for the first response you are essentially doing the following:

item['title'] = [title1, title2, ..., title100]
item['link'] = [url1, url2, ..., url100]

That single item is then written to the CSV as one row:

title,link
[title1, title2, ..., title100],[url1, url2, ..., url100]
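To make the shape of the problem concrete, here is a minimal sketch using only the standard library, not Scrapy itself. The `item` dict and the `write_item_as_single_row` helper are hypothetical stand-ins, and joining the values with a comma is an assumption about how a list-valued field gets flattened into one cell. The point is only that one item with list fields yields one CSV row, however many postings it holds.

```python
import csv
import io

# Hypothetical stand-in for the item built from one page:
# every field holds a whole list of values.
item = {
    "title": ["title1", "title2", "title3"],
    "link": ["url1", "url2", "url3"],
}


def write_item_as_single_row(item, join_with=","):
    """Write a header plus ONE data row, flattening each list into a single cell."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "link"])
    writer.writerow([join_with.join(item["title"]), join_with.join(item["link"])])
    return buffer.getvalue()


single_row_csv = write_item_as_single_row(item)
print(single_row_csv)
```

Three postings go in, but only one data row comes out, which matches the one-row-per-page output described in the question.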

To fix this, loop over each //span[@class='pl']/a element and yield a separate item for each one:

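def parse_item(self, response):
    # Each matching <a> element is one job posting; yield one item per posting.
    for anchor in response.xpath("//span[@class='pl']/a"):
        item = TextPostItem()
        # The leading "." makes each XPath relative to this <a> only.
        # extract_first() returns a single string instead of a one-element list.
        item['title'] = anchor.xpath(".//text()").extract_first()
        item['link'] = anchor.xpath(".//@href").extract_first()
        yield item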
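If you would rather keep the two page-wide .extract() calls, another option is to pair the two lists element by element. The sketch below shows the idea with plain Python and the standard csv module. The `pair_rows` helper and the sample values are hypothetical, and it assumes both lists come back in the same order and with the same length.

```python
import csv
import io


def pair_rows(links, titles):
    """Pair the n-th link with the n-th title, one (link, title) tuple per posting."""
    # zip() stops at the shorter list, so a missing title or link would
    # silently drop the trailing entries instead of raising an error.
    return list(zip(links, titles))


# Hypothetical values standing in for what the two .extract() calls return.
links = ["url1", "url2", "url3"]
titles = ["title1", "title2", "title3"]

rows = pair_rows(links, titles)

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["link", "title"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Looping over each element, as in the answer's parse_item, is the safer choice. The title and link are read from the same node, so they cannot drift out of alignment if one posting lacks a value.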
