
How to get scrapy spider to add information to an item based on a CSV file

As some of you may have gathered, I am learning scrapy in order to scrape some data from Google Scholar for a research project I am running. I have a file with the titles of many articles for which I want citation data. I read the file in with pandas, generate the URLs that need scraping, and then start scraping.

One problem I face is 503 errors. Google shuts me out quickly, and many entries remain unscraped. I am working around this with some middleware provided by Crawlera.
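
(Aside: independent of Crawlera, Scrapy's own throttling and retry settings can slow the request rate enough to draw fewer 503s. A minimal sketch of the relevant settings.py options - the values here are illustrative guesses, not tuned for Scholar:)

# settings.py - illustrative values, not tuned for Google Scholar
DOWNLOAD_DELAY = 5                  # seconds between requests to the same domain
RANDOMIZE_DOWNLOAD_DELAY = True     # vary the delay so requests look less mechanical
AUTOTHROTTLE_ENABLED = True         # back off automatically when responses slow down
RETRY_ENABLED = True
RETRY_HTTP_CODES = [503]            # re-queue requests that were refused
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # never hit Scholar in parallel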

Another problem I face is that when I export my scraped data, I have a hard time matching the scraped data with what I was looking for. My input data is a CSV file with three fields - 'Authors', 'Title', 'pid' - where 'pid' is a unique identifier.
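
For illustration, one row of that file might look like this (using the example paper mentioned further down; the exact author formatting is a guess):

Authors,Title,pid
"Brooks, Rodney",Elephants Don't Play Chess,5067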

I read the file in with pandas and use the titles to generate the Scholar URLs. Each time a given URL is scraped, my spider works through the Scholar page and picks up the title, publication info, and cites for each article listed on that page.

Here is how I generate the links to scrape:

class ScholarSpider(Spider):
    name = "scholarscrape"
    allowed_domains = ["scholar.google.com"]

    # get the data
    data = read_csv("../../data/master_jeea.csv")
    # get the titles
    queries = data.Title.apply(urllib.quote)
    # generate a var to store links
    links = []
    # create the URLs to crawl
    for entry in queries:
        links.append("http://scholar.google.com/scholar?q=allintitle%3A"+entry)
    # give the URLs to scrapy
    start_urls = links

For example, a title in my data file might be the paper "Elephants Don't Play Chess" by Rodney Brooks, with 'pid' 5067. The spider then goes to

http://scholar.google.com/scholar?q=allintitle%3Aelephants+don%27t+play+chess

Now on this page there are six hits. The spider gets all six hits, but each of them needs to be assigned the same 'pid'. I know I need to insert a line somewhere that says something like item['pid'] = data.pid.apply("something"), but I can't figure out exactly what.

Below is the rest of the code for my spider. I am sure the way to do this is pretty straightforward, but I can't think of how to get the spider to know which entry of data.pid it should look for.

def parse(self, response):
    # initialize something to hold the data
    items=[]
    sel = Selector(response)
    # get each 'entry' on the page
    # an entry is a self contained div
    # that has the title, publication info
    # and cites
    entries = sel.xpath('//div[@class="gs_ri"]')
    # a counter for the entry that is being scraped
    count = 1
    for entry in entries:
        item = ScholarscrapeItem()
        # get the title
        title = entry.xpath('.//h3[@class="gs_rt"]/a//text()').extract()
        # the title is messy
        # clean up
        item['title'] = "".join(title)
        # get publication info
        # clean up
        author = entry.xpath('.//div[@class="gs_a"]//text()').extract()
        item['authors'] = "".join(author)
        # get the portion that contains citations
        cite_string = entry.xpath('.//div[@class="gs_fl"]//text()').extract()
        # find the part that says "Cited by"
        match = re.search("Cited by \d+",str(cite_string))
        # if it exists, note the number
        if match:
            cites = re.search("\d+",match.group()).group()
        # if not, there is no citation info
        else:
            cites = None
        item['cites'] = cites
        item['entry'] = count
        # iterate the counter
        count += 1
        # append this item to the list
        items.append(item)
    return items
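
(Aside: the ScholarscrapeItem class used above is not shown in the question. Given the fields assigned in parse(), plus the 'pid' field the question is trying to add, a plausible sketch of scholarscrape/items.py would be the following; the real file may differ.)

from scrapy.item import Item, Field

class ScholarscrapeItem(Item):
    # fields inferred from the assignments in parse()
    title = Field()
    authors = Field()
    cites = Field()
    entry = Field()
    pid = Field()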

I hope this question is well-defined, but please let me know if I can be more clear. Apart from some lines at the top importing things, there really isn't anything else in my scraper.

编辑1 :根据以下建议,我对代码进行了如下修改:

# test-case: http://scholar.google.com/scholar?q=intitle%3Amigratory+birds
import re
from pandas import read_csv
import urllib

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request

from scholarscrape.items import ScholarscrapeItem

class ScholarSpider(Spider):
    name = "scholarscrape"
    allowed_domains = ["scholar.google.com"]

    # get the data
    data = read_csv("../../data/master_jeea.csv")
    # get the titles
    queries = data.Title.apply(urllib.quote)
    pid = data.pid
    # generate a var to store links
    urls = []
    # create the URLs to crawl
    for entry in queries:
        urls.append("http://scholar.google.com/scholar?q=allintitle%3A"+entry)
    # give scrapy one (url, pid) tuple per row, so each request
    # carries the pid of the title that generated it
    start_urls = zip(urls, pid)

    def make_requests_from_url(self, (url,pid)):
        return Request(url, meta={'pid':pid}, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # initialize something to hold the data
        items=[]
        sel = Selector(response)
        # get each 'entry' on the page
        # an entry is a self contained div
        # that has the title, publication info
        # and cites
        entries = sel.xpath('//div[@class="gs_ri"]')
        # a counter for the entry that is being scraped
        count = 1
        for entry in entries:
            item = ScholarscrapeItem()
            # get the title
            title = entry.xpath('.//h3[@class="gs_rt"]/a//text()').extract()
            # the title is messy
            # clean up
            item['title'] = "".join(title)
            # get publication info
            # clean up
            author = entry.xpath('.//div[@class="gs_a"]//text()').extract()
            item['authors'] = "".join(author)
            # get the portion that contains citations
            cite_string = entry.xpath('.//div[@class="gs_fl"]//text()').extract()
            # find the part that says "Cited by"
            match = re.search("Cited by \d+",str(cite_string))
            # if it exists, note the number
            if match:
                cites = re.search("\d+",match.group()).group()
            # if not, there is no citation info
            else:
                cites = None
            item['cites'] = cites
            item['entry'] = count
            item['pid'] = response.meta['pid']
            # iterate the counter
            count += 1
            # append this item to the list
            items.append(item)
        return items

You need to populate the list start_urls with tuples (url, pid). Now redefine the method make_requests_from_url(url):

class ScholarSpider(Spider):
    name = "ScholarSpider"
    allowed_domains = ["scholar.google.com"]
    start_urls = (
        ('http://www.scholar.google.com/', 100),
        )

    def make_requests_from_url(self, (url, pid)):
        return Request(url, meta={'pid': pid}, callback=self.parse, dont_filter=True)

    def parse(self, response):
        pid = response.meta['pid']
        print '!!!!!!!!!!!', pid, '!!!!!!!!!!!!'
        pass
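
(Aside: the same idea can be written without the tuple-unpacking signature - which is Python 2 only - by overriding start_requests instead of make_requests_from_url. A minimal sketch, reusing the data frame and parse() from the edit above:)

import urllib

from pandas import read_csv
from scrapy.spider import Spider
from scrapy.http import Request

class ScholarSpider(Spider):
    name = "scholarscrape"
    allowed_domains = ["scholar.google.com"]

    def start_requests(self):
        # one Request per CSV row; the row's pid rides along in
        # request.meta so parse() can copy it onto every item
        data = read_csv("../../data/master_jeea.csv")
        for title, pid in zip(data.Title, data.pid):
            url = ("http://scholar.google.com/scholar?q=allintitle%3A"
                   + urllib.quote(title))
            yield Request(url, meta={'pid': pid}, dont_filter=True)

    # parse() stays exactly as in Edit 1, reading response.meta['pid']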
