Scrapy Getting Start_Urls

Question

OK, I'll keep this short, since I need to rush off to a meeting.

I am trying to get the start URLs in Scrapy, and no matter what I try, I can't seem to accomplish it. Here is my code (the spider):

import scrapy
import csv

from scrapycrawler.items import DmozItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request


class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["snipplr.com"]

    def start_requests(self):
        for i in range(1, 230):
            yield self.make_requests_from_url("http://www.snipplr.com/view/%d" % i)

    def make_requests_from_url(self, url):
        item = DmozItem()

        # assign url
        item['link'] = url
        request = Request(url, dont_filter=True)

        # set the meta['item'] to use the item in the next call back
        request.meta['item'] = item
        return request

    # Rules only apply before
    rules = (
        Rule(LxmlLinkExtractor(deny_domains=('http:\/\/www.snipplr.com\/snippet-not-found\/',)), callback="parse", follow=True),
    )

    def parse(self, response):
        sel = Selector(response)
        item = response.meta['item']
        item['title'] = sel.xpath('//div[@class="post"]/h1/text()').extract()
        # start_url
        item['link'] = response.url
        return item

I have tried everything, and so far all I get is an "h" in the url column of my database.

This is my database pipeline:

import csv
from scrapy.exceptions import DropItem
from scrapy import log
import sys
import mysql.connector

class CsvWriterPipeline(object):

    def __init__(self):
        self.connection = mysql.connector.connect(host='localhost', user='ws', passwd='ps', db='ws')
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        self.cursor.execute("SELECT title,url FROM items WHERE title= %s", item['title'])
        result = self.cursor.fetchone()
        if result:

            log.msg("Item already in database: %s" % item, level=log.DEBUG)
        else:
            self.cursor.execute(
               "INSERT INTO items (title, url) VALUES (%s, %s)",
                    (item['title'][0], item['link'][0]))
            self.connection.commit()

            log.msg("Item stored: %s" % item, level=log.DEBUG)
        return item

    def handle_error(self, e):
        log.err(e)

As you can see from here, it is clearly working.

How would I get the start URL, or rather, how would I parse it? I believe the "h" means that the field is empty. The database is MySQL.

Thanks for reading and for your help.

Regards, Charlie

Answer 1

item['link'], as opposed to item['title'], is just a string, not a list, so item['link'][0] returns only its first character (the "h" of "http://..."). Pass the whole string instead:

self.cursor.execute("INSERT INTO items (title, url) VALUES (%s, %s)",
                    (item['title'][0], item['link']))
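
To see why the url column ends up holding "h", here is a minimal sketch in plain Python, with no Scrapy or MySQL required. The item contents and the first_value helper are hypothetical and only illustrate the point: a plain dict stands in for DmozItem, and the title text is made up.

```python
# Hypothetical item values, mirroring the shape of what the spider produces.
item = {
    "title": ["Example snippet title"],        # .extract() returns a list of strings
    "link": "http://www.snipplr.com/view/1",   # response.url is a plain string
}

first_title = item["title"][0]  # first element of the list: the whole title
first_link = item["link"][0]    # first character of the string only


def first_value(value):
    """Hypothetical helper: first element of a list/tuple, or the value itself."""
    if isinstance(value, (list, tuple)):
        return value[0] if value else None
    return value


# Parameters in the shape the INSERT statement expects: (title, url)
params = (first_value(item["title"]), first_value(item["link"]))

print(first_link)
print(params)
```

A helper like this is optional. Dropping the [0] from item['link'], as shown above, is enough. It may be handy only if a field can arrive either as a list or as a string.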

Answer 1 was accepted (2015-01-24 01:55:36).