How to get scrapy spider to add information to an item based on a CSV file
As some of you may have gathered, I am learning scrapy in order to scrape some data off Google Scholar for a research project I am running. I have a file that contains many article titles, and I am scraping citation information for each of them. I read in the file using pandas, generate the URLs that need scraping, and start scraping.
One problem I face is 503 errors: Google shuts me out fairly quickly, and many entries remain unscraped. I am working around this using some middleware provided by Crawlera.
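For reference, here is a minimal sketch of what enabling that middleware can look like in settings.py, assuming the scrapy-crawlera package rather than whatever version of the middleware was current at the time; the API key is a placeholder:

# settings.py - a minimal sketch, assuming the scrapy-crawlera package
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = 'your-crawlera-api-key'  # placeholder, not a real key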
The other problem I face is that when I export my scraped data, I have a hard time matching the scraped data to what I was looking for. My input data is a CSV file with three fields, 'Authors', 'Title' and 'pid', where 'pid' is a unique identifier.
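For illustration, the input file might look something like this (a single hypothetical row, built from the example paper discussed below):

Authors,Title,pid
"Brooks, R.","Elephants Don't Play Chess",5067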
I use pandas to read in the file and generate the Scholar URLs from the titles. Each time a given URL is scraped, my spider goes through the Scholar page and picks up the title, the publication information and the citation count for each article listed on that page.
Here is how I generate the links to scrape:
class ScholarSpider(Spider):
    name = "scholarscrape"
    allowed_domains = ["scholar.google.com"]
    # get the data
    data = read_csv("../../data/master_jeea.csv")
    # get the titles
    queries = data.Title.apply(urllib.quote)
    # generate a var to store links
    links = []
    # create the URLs to crawl
    for entry in queries:
        links.append("http://scholar.google.com/scholar?q=allintitle%3A" + entry)
    # give the URLs to scrapy
    start_urls = links
For example, one title from my data file could be the paper "Elephants Don't Play Chess" by Rodney Brooks, whose 'pid' is 5067. The spider would then fetch this URL:
http://scholar.google.com/scholar?q=allintitle%3Aelephants+don%27t+play+chess
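To see how such a URL is built, here is a quick interactive sketch (Python 2, matching the urllib.quote call in the spider above). Note that quote percent-encodes spaces as %20, while quote_plus produces the +-separated form shown in the URL above; Google Scholar accepts both:

>>> import urllib
>>> urllib.quote("Elephants Don't Play Chess")
'Elephants%20Don%27t%20Play%20Chess'
>>> urllib.quote_plus("Elephants Don't Play Chess")
'Elephants+Don%27t+Play+Chess'
>>> "http://scholar.google.com/scholar?q=allintitle%3A" + urllib.quote_plus("Elephants Don't Play Chess")
'http://scholar.google.com/scholar?q=allintitle%3AElephants+Don%27t+Play+Chess'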
Now on this page there are six hits. The spider will get all six hits, but they all need to be assigned the same 'pid'. I know I need to insert a line somewhere that says something like item['pid'] = data.pid.apply("something"), but I cannot figure out exactly how to do that.
Below is the rest of the code for my spider. I am sure the way to do this is pretty straightforward, but I cannot think of how to get the spider to know which entry of data.pid it should look for.
def parse(self, response):
    # initialize something to hold the data
    items = []
    sel = Selector(response)
    # get each 'entry' on the page
    # an entry is a self-contained div
    # that has the title, publication info
    # and cites
    entries = sel.xpath('//div[@class="gs_ri"]')
    # a counter for the entry that is being scraped
    count = 1
    for entry in entries:
        item = ScholarscrapeItem()
        # get the title
        title = entry.xpath('.//h3[@class="gs_rt"]/a//text()').extract()
        # the title is messy, clean up
        item['title'] = "".join(title)
        # get publication info, clean up
        author = entry.xpath('.//div[@class="gs_a"]//text()').extract()
        item['authors'] = "".join(author)
        # get the portion that contains citations
        cite_string = entry.xpath('.//div[@class="gs_fl"]//text()').extract()
        # find the part that says "Cited by"
        match = re.search(r"Cited by \d+", str(cite_string))
        # if it exists, note the number
        if match:
            cites = re.search(r"\d+", match.group()).group()
        # if not, there is no citation info
        else:
            cites = None
        item['cites'] = cites
        item['entry'] = count
        # iterate the counter
        count += 1
        # append this item to the list
        items.append(item)
    return items
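For context, the XPath expressions above assume that each hit on the results page looks roughly like this (a simplified, hypothetical sketch of Scholar's markup at the time; the class names are the ones the spider selects on, the link and citation count are made up):

<div class="gs_ri">
  <h3 class="gs_rt"><a href="http://example.com/paper">Elephants Don't Play Chess</a></h3>
  <div class="gs_a">RA Brooks - Robotics and autonomous systems, 1990 - Elsevier</div>
  <div class="gs_fl"><a href="#">Cited by 1234</a></div>
</div>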
I hope this question is well-defined, but please let me know if I can be more clear. There is really nothing else in my scraper apart from some lines at the top importing things.
Edit 1: Based on a suggestion below, I have modified my code as follows:
# test-case: http://scholar.google.com/scholar?q=intitle%3Amigratory+birds
import re
from pandas import *
import urllib
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scholarscrape.items import ScholarscrapeItem

class ScholarSpider(Spider):
    name = "scholarscrape"
    allowed_domains = ["scholar.google.com"]
    # get the data
    data = read_csv("../../data/master_jeea.csv")
    # get the titles
    queries = data.Title.apply(urllib.quote)
    pid = data.pid
    # generate a var to store links
    urls = []
    # create the URLs to crawl
    for entry in queries:
        urls.append("http://scholar.google.com/scholar?q=allintitle%3A" + entry)
    # give scrapy one (url, pid) tuple per title
    start_urls = zip(urls, pid)

    def make_requests_from_url(self, (url, pid)):
        return Request(url, meta={'pid': pid}, callback=self.parse, dont_filter=True)
    def parse(self, response):
        # initialize something to hold the data
        items = []
        sel = Selector(response)
        # get each 'entry' on the page
        # an entry is a self-contained div
        # that has the title, publication info
        # and cites
        entries = sel.xpath('//div[@class="gs_ri"]')
        # a counter for the entry that is being scraped
        count = 1
        for entry in entries:
            item = ScholarscrapeItem()
            # get the title
            title = entry.xpath('.//h3[@class="gs_rt"]/a//text()').extract()
            # the title is messy, clean up
            item['title'] = "".join(title)
            # get publication info, clean up
            author = entry.xpath('.//div[@class="gs_a"]//text()').extract()
            item['authors'] = "".join(author)
            # get the portion that contains citations
            cite_string = entry.xpath('.//div[@class="gs_fl"]//text()').extract()
            # find the part that says "Cited by"
            match = re.search(r"Cited by \d+", str(cite_string))
            # if it exists, note the number
            if match:
                cites = re.search(r"\d+", match.group()).group()
            # if not, there is no citation info
            else:
                cites = None
            item['cites'] = cites
            item['entry'] = count
            item['pid'] = response.meta['pid']
            # iterate the counter
            count += 1
            # append this item to the list
            items.append(item)
        return items
You need to populate the list start_urls with (url, pid) tuples. Now redefine the method make_requests_from_url(url):
class ScholarSpider(Spider):
    name = "ScholarSpider"
    allowed_domains = ["scholar.google.com"]
    start_urls = (
        ('http://www.scholar.google.com/', 100),
    )

    def make_requests_from_url(self, (url, pid)):
        return Request(url, meta={'pid': pid}, callback=self.parse, dont_filter=True)

    def parse(self, response):
        pid = response.meta['pid']
        print '!!!!!!!!!!!', pid, '!!!!!!!!!!!!'
        pass
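A side note for anyone adapting this today: the (url, pid) parameter of make_requests_from_url relies on Python 2's tuple parameter unpacking, which was removed in Python 3 (PEP 3113), and make_requests_from_url itself has since been deprecated in Scrapy. A minimal sketch of the same meta-passing idea in a Python 3 spider, assuming the (url, pid) pairs are built from the CSV as above, is to override start_requests instead:

import scrapy

class ScholarSpider(scrapy.Spider):
    name = "scholarscrape"
    allowed_domains = ["scholar.google.com"]
    # hypothetical pairs; in practice build these from the CSV as above
    url_pid_pairs = [
        ("http://scholar.google.com/scholar?q=allintitle%3Amigratory+birds", 100),
    ]

    def start_requests(self):
        # attach each row's pid to its request so parse() can read it back
        for url, pid in self.url_pid_pairs:
            yield scrapy.Request(url, meta={'pid': pid}, dont_filter=True)

    def parse(self, response):
        # the pid travels with the response, so every item scraped from
        # this page can be tagged with the right identifier
        pid = response.meta['pid']
        yield {'pid': pid}  # fill in title/author/cites extraction as in the question

Running the spider with something like scrapy crawl scholarscrape -o results.csv then exports the items with the pid column included, which makes matching the output back to the input file straightforward.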