
Scrapy - Importing Excel .csv as start_url

So I'm building a scraper that imports a .csv Excel file with one row of ~2,400 websites (each website in its own column) and uses them as the start_urls. I keep getting an error saying that I am passing in a list and not a string. I think this is caused by the fact that my list basically just has one really long list in it that represents the row. How can I overcome this and put each website from my .csv as its own separate string within the list?

raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
    exceptions.TypeError: Request url must be str or unicode, got list:


import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse
from tutorial.items import DanishItem
from scrapy.http import Request
import csv

with open('websites.csv', 'rbU') as csv_file:
  data = csv.reader(csv_file)
  scrapurls = []
  for row in data:
    scrapurls.append(row)

class DanishSpider(scrapy.Spider):
  name = "dmoz"
  allowed_domains = []
  start_urls = scrapurls

  def parse(self, response):
    for sel in response.xpath('//link[@rel="icon" or @rel="shortcut icon"]'):
      item = DanishItem()
      item['website'] = response
      item['favicon'] = sel.xpath('./@href').extract()
      yield item
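
For reference, csv.reader yields each row as a list of cell strings, so the loop above appends one big list instead of individual URL strings. A quick check makes this visible (a sketch, assuming websites.csv holds that single row of ~2,400 URLs):

import csv

with open('websites.csv', 'r', newline='') as f:
    rows = list(csv.reader(f))

print(len(rows))      # 1 -- the whole file is a single row
print(type(rows[0]))  # <class 'list'> -- so scrapurls becomes [[url1, url2, ...]]
print(rows[0][0])     # the first URL string lives one level down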

Thanks!

Joey

Just generating a list for start_urls does not work, as is clearly stated in the Scrapy documentation.

From the documentation:

You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests.

I would rather do it this way:

def get_urls_from_csv():
    with open('websites.csv', 'r', newline='') as csv_file:
        data = csv.reader(csv_file)
        scrapurls = []
        for row in data:
            # each cell in the row is one URL string, so extend rather than
            # append the whole row (appending gives a list of lists)
            scrapurls.extend(row)
        return scrapurls


class DanishSpider(scrapy.Spider):

    ...

    def start_requests(self):
        return [scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv()]
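
In recent Scrapy versions the same idea is usually written with the top-level scrapy.Request alias and a generator; a minimal sketch of an equivalent method:

def start_requests(self):
    for url in get_urls_from_csv():
        # parse() is the default callback, but naming it keeps the intent explicit
        yield scrapy.Request(url=url, callback=self.parse)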

Try opening the .csv file inside the class (not outside, as you did before) and appending to start_urls. This solution worked for me. Hope this helps :-)

    class DanishSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = []
        start_urls = []

        f = open('websites.csv', 'r')
        for i in f:
            u = i.split('\n')
            start_urls.append(u[0])
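
A minimal variant of the same idea that also closes the file once the URLs are read (a sketch, assuming one URL per line in websites.csv):

class DanishSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = []

    # the with-block closes the file as soon as start_urls is built
    with open('websites.csv', 'r') as f:
        start_urls = [line.strip() for line in f if line.strip()]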
Regarding the original loop:

  for row in data:
      scrapurls.append(row)

row is a list: [column1, column2, ...]. So I think you need to extract the columns and append them to your start_urls.

for row in data:
    # assuming every cell in the row is a URL string
    for column in row:
        scrapurls.append(column)
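
Equivalently, the nested loop can be flattened into a single comprehension (same assumption that every cell holds a URL):

scrapurls = [column for row in data for column in row]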

Try this way also:

filee = open("filename.csv", "r")

# Strip the '\n' newline from the end of each URL
start_urls = [line.replace('\n', '') for line in filee]

I find the following useful when needed:

import csv
import scrapy

class DanishSpider(scrapy.Spider):
    name = "rei"
    with open("output.csv","r") as f:
        reader = csv.DictReader(f)
        start_urls = [item['Link'] for item in reader]

    def parse(self, response):
        yield {"link":response.url}
