Python: Problems sending 'list' of urls to scrapy spider to scrape

I'm trying to send a 'list' of URLs for scrapy to crawl with a certain spider by passing them as one long string, then splitting the string inside the spider. I've tried copying the format that was given in this answer.

The list I'm trying to send to the crawler is future_urls:

    >>> print future_urls
    set(['https://ca.finance.yahoo.com/q/hp?s=ALXN&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'http://finance.yahoo.com/q/hp?s=TFW.L&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'https://ca.finance.yahoo.com/q/hp?s=DLTR&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'https://ca.finance.yahoo.com/q/hp?s=AGNC&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'https://ca.finance.yahoo.com/q/hp?s=HMSY&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m', 'http://finance.yahoo.com/q/hp?s=BATS.L&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m'])

Then I send it to the crawler through:

    command4 = ("scrapy crawl future -o future_portfolios_{0} -t csv -a future_urls={1}").format(input_file, str(','.join(list(future_urls))))

    >>> print command4
    scrapy crawl future -o future_portfolios_input_10062008_10062012_ver_1.csv -t csv -a future_urls=https://ca.finance.yahoo.com/q/hp?s=ALXN&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,http://finance.yahoo.com/q/hp?s=TFW.L&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,https://ca.finance.yahoo.com/q/hp?s=DLTR&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,https://ca.finance.yahoo.com/q/hp?s=AGNC&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,https://ca.finance.yahoo.com/q/hp?s=HMSY&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m,http://finance.yahoo.com/q/hp?s=BATS.L&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m
    >>> type(command4)
    <type 'str'>

My crawler (partial):

    import scrapy

    class FutureSpider(scrapy.Spider):
        name = "future"
        allowed_domains = ["finance.yahoo.com", "ca.finance.yahoo.com"]
        start_urls = ['https://ca.finance.yahoo.com/q/hp?s=%5EIXIC']

        def __init__(self, *args, **kwargs):
            super(FutureSpider, self).__init__(*args, **kwargs)
            self.future_urls = kwargs.get('future_urls').split(',')
            self.rate_returns_len_min = 12
            self.required_amount_of_returns = 12
            for x in self.future_urls:
                print "Going to scrape:"
                print x

        def parse(self, response):
            if self.future_urls:
                for x in self.future_urls:
                    yield scrapy.Request(x, self.stocks1)

However, what the print "Going to scrape:" and print x statements actually print out is:

    Going to scrape:
    https://ca.finance.yahoo.com/q/hp?s=ALXN

Only one URL, and it's only a portion of the first URL in future_urls, which is obviously problematic.

I can't seem to figure out why the crawler won't scrape all of the URLs in future_urls...

I think it's stopping when it hits the ampersand (&); you can escape it by using urllib.quote.

For example:

    import urllib

    escapedurl = urllib.quote('https://ca.finance.yahoo.com/q/hp?s=ALXN&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m')

Then, to get it back to normal, you can do:

    >>> urllib.unquote(escapedurl)
    'https://ca.finance.yahoo.com/q/hp?s=ALXN&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m'
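
Putting that together with the setup in the question, here is a minimal round-trip sketch (Python 2) of how the quoting could be applied on both sides. The future_urls and input_file values below are just stand-ins copied from the question, and passing safe='' is an assumption so that every reserved character, including &, gets percent-encoded:

    import urllib

    # Stand-in values copied from the question (not the full set).
    future_urls = set([
        'https://ca.finance.yahoo.com/q/hp?s=ALXN&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m',
        'http://finance.yahoo.com/q/hp?s=TFW.L&a=06&b=10&c=2012&d=06&e=10&f=2015&g=m',
    ])
    input_file = 'input_10062008_10062012_ver_1.csv'

    # Sending side: percent-encode each URL (safe='' also encodes '/' and ':'),
    # then join with commas, so no raw '&' ever reaches the shell.
    quoted = ','.join(urllib.quote(u, safe='') for u in future_urls)
    command4 = ("scrapy crawl future -o future_portfolios_{0} -t csv "
                "-a future_urls={1}").format(input_file, quoted)
    print command4

    # Receiving side: what the spider's __init__ would do with the argument:
    # split on the commas, then unquote each URL back to its original form.
    decoded = [urllib.unquote(u) for u in quoted.split(',')]
    for url in decoded:
        print "Going to scrape:"
        print url

Note that the spider's __init__ would then need to call urllib.unquote on each piece after split(','), otherwise the requests would go to the percent-encoded URLs.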
