
How to get scrapy spider to add information to an item based on a CSV file

As some of you may have gathered, I am learning scrapy in order to scrape some data from Google Scholar for a research project I am running. I have a file with the titles of many articles for which I want citation data. I read the file in with pandas, generate the URLs that need scraping, and then start scraping.

One problem I face is 503 errors. Google shuts me out quickly, and many entries remain unscraped. I am working around this with some middleware provided by Crawlera.
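
(Aside: independent of Crawlera, Scrapy's own throttling and retry settings can slow the request rate enough to draw fewer 503s. A minimal sketch of the relevant settings.py options - the values here are illustrative guesses, not tuned for Scholar:)

# settings.py - illustrative values, not tuned for Google Scholar
DOWNLOAD_DELAY = 5                  # seconds between requests to the same domain
RANDOMIZE_DOWNLOAD_DELAY = True     # vary the delay so requests look less mechanical
AUTOTHROTTLE_ENABLED = True         # back off automatically when responses slow down
RETRY_ENABLED = True
RETRY_HTTP_CODES = [503]            # re-queue requests that were refused
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # never hit Scholar in parallel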

Another problem I face is that when I export my scraped data, I have a hard time matching the scraped data with what I was looking for. My input data is a CSV file with three fields - 'Authors', 'Title', 'pid' - where 'pid' is a unique identifier.
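
For illustration, one row of that file might look like this (using the example paper mentioned further down; the exact author formatting is a guess):

Authors,Title,pid
"Brooks, Rodney",Elephants Don't Play Chess,5067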

I read the file in with pandas and use the titles to generate the Scholar URLs. Each time a given URL is scraped, my spider works through the Scholar page and picks up the title, publication info, and cites for each article listed on that page.

Here is how I generate the links to scrape:

class ScholarSpider(Spider):
    name = "scholarscrape"
    allowed_domains = ["scholar.google.com"]

    # get the data
    data = read_csv("../../data/master_jeea.csv")
    # get the titles
    queries = data.Title.apply(urllib.quote)
    # generate a var to store links
    links = []
    # create the URLs to crawl
    for entry in queries:
        links.append("http://scholar.google.com/scholar?q=allintitle%3A"+entry)
    # give the URLs to scrapy
    start_urls = links

For example, a title in my data file might be the paper "Elephants Don't Play Chess" by Rodney Brooks, with 'pid' 5067. The spider then goes to

http://scholar.google.com/scholar?q=allintitle%3Aelephants+don%27t+play+chess

Now on this page there are six hits. The spider gets all six hits, but each of them needs to be assigned the same 'pid'. I know I need to insert a line somewhere that says something like item['pid'] = data.pid.apply("something"), but I can't figure out exactly what.

Below is the rest of the code for my spider. I am sure the way to do this is pretty straightforward, but I can't think of how to get the spider to know which entry of data.pid it should look for.

def parse(self, response):
    # initialize something to hold the data
    items=[]
    sel = Selector(response)
    # get each 'entry' on the page
    # an entry is a self contained div
    # that has the title, publication info
    # and cites
    entries = sel.xpath('//div[@class="gs_ri"]')
    # a counter for the entry that is being scraped
    count = 1
    for entry in entries:
        item = ScholarscrapeItem()
        # get the title
        title = entry.xpath('.//h3[@class="gs_rt"]/a//text()').extract()
        # the title is messy
        # clean up
        item['title'] = "".join(title)
        # get publication info
        # clean up
        author = entry.xpath('.//div[@class="gs_a"]//text()').extract()
        item['authors'] = "".join(author)
        # get the portion that contains citations
        cite_string = entry.xpath('.//div[@class="gs_fl"]//text()').extract()
        # find the part that says "Cited by"
        match = re.search("Cited by \d+",str(cite_string))
        # if it exists, note the number
        if match:
            cites = re.search("\d+",match.group()).group()
        # if not, there is no citation info
        else:
            cites = None
        item['cites'] = cites
        item['entry'] = count
        # iterate the counter
        count += 1
        # append this item to the list
        items.append(item)
    return items
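
(Aside: the ScholarscrapeItem class used above is not shown in the question. Given the fields assigned in parse(), plus the 'pid' field the question is trying to add, a plausible sketch of scholarscrape/items.py would be the following; the real file may differ.)

from scrapy.item import Item, Field

class ScholarscrapeItem(Item):
    # fields inferred from the assignments in parse()
    title = Field()
    authors = Field()
    cites = Field()
    entry = Field()
    pid = Field()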

I hope this question is well-defined, but please let me know if I can be more clear. Apart from some lines at the top importing things, there really isn't anything else in my scraper.

编辑1 :根据以下建议,我对代码进行了如下修改:

# test-case: http://scholar.google.com/scholar?q=intitle%3Amigratory+birds
import re
from pandas import read_csv
import urllib

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request

from scholarscrape.items import ScholarscrapeItem

class ScholarSpider(Spider):
    name = "scholarscrape"
    allowed_domains = ["scholar.google.com"]

    # get the data
    data = read_csv("../../data/master_jeea.csv")
    # get the titles
    queries = data.Title.apply(urllib.quote)
    pid = data.pid
    # generate a var to store links
    urls = []
    # create the URLs to crawl
    for entry in queries:
        urls.append("http://scholar.google.com/scholar?q=allintitle%3A"+entry)
    # give scrapy one (url, pid) tuple per row, so each request
    # carries the pid of the title that generated it
    start_urls = zip(urls, pid)

    def make_requests_from_url(self, (url,pid)):
        return Request(url, meta={'pid':pid}, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # initialize something to hold the data
        items=[]
        sel = Selector(response)
        # get each 'entry' on the page
        # an entry is a self contained div
        # that has the title, publication info
        # and cites
        entries = sel.xpath('//div[@class="gs_ri"]')
        # a counter for the entry that is being scraped
        count = 1
        for entry in entries:
            item = ScholarscrapeItem()
            # get the title
            title = entry.xpath('.//h3[@class="gs_rt"]/a//text()').extract()
            # the title is messy
            # clean up
            item['title'] = "".join(title)
            # get publication info
            # clean up
            author = entry.xpath('.//div[@class="gs_a"]//text()').extract()
            item['authors'] = "".join(author)
            # get the portion that contains citations
            cite_string = entry.xpath('.//div[@class="gs_fl"]//text()').extract()
            # find the part that says "Cited by"
            match = re.search("Cited by \d+",str(cite_string))
            # if it exists, note the number
            if match:
                cites = re.search("\d+",match.group()).group()
            # if not, there is no citation info
            else:
                cites = None
            item['cites'] = cites
            item['entry'] = count
            item['pid'] = response.meta['pid']
            # iterate the counter
            count += 1
            # append this item to the list
            items.append(item)
        return items

You need to populate the list start_urls with tuples (url, pid). Now redefine the method make_requests_from_url(url):

class ScholarSpider(Spider):
    name = "ScholarSpider"
    allowed_domains = ["scholar.google.com"]
    start_urls = (
        ('http://www.scholar.google.com/', 100),
        )

    def make_requests_from_url(self, (url, pid)):
        return Request(url, meta={'pid': pid}, callback=self.parse, dont_filter=True)

    def parse(self, response):
        pid = response.meta['pid']
        print '!!!!!!!!!!!', pid, '!!!!!!!!!!!!'
        pass
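
(Aside: the same idea can be written without the tuple-unpacking signature - which is Python 2 only - by overriding start_requests instead of make_requests_from_url. A minimal sketch, reusing the data frame and parse() from the edit above:)

import urllib

from pandas import read_csv
from scrapy.spider import Spider
from scrapy.http import Request

class ScholarSpider(Spider):
    name = "scholarscrape"
    allowed_domains = ["scholar.google.com"]

    def start_requests(self):
        # one Request per CSV row; the row's pid rides along in
        # request.meta so parse() can copy it onto every item
        data = read_csv("../../data/master_jeea.csv")
        for title, pid in zip(data.Title, data.pid):
            url = ("http://scholar.google.com/scholar?q=allintitle%3A"
                   + urllib.quote(title))
            yield Request(url, meta={'pid': pid}, dont_filter=True)

    # parse() stays exactly as in Edit 1, reading response.meta['pid']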
