Scrapy pipeline.py not inserting items to MySQL from spider

I am using Scrapy for scraping news headlines, and I am a rookie at Scrapy and scraping as a whole. I have been having huge issues for a few days now pipelining my scraped data into my SQL database. I have two classes in my pipelines.py file: one for inserting items into the database, and another for backing up the scraped data into a JSON file, for front-end web development reasons.

This is the code for my spider. It extracts news headlines from the start_urls, picks up this data as strings using extract(), and later loops through all of them, using strip() to remove whitespace for better formatting.

from scrapy.spider import Spider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from Aljazeera.items import AljazeeraItem
from datetime import date, datetime


class AljazeeraSpider(Spider):
    name = "aljazeera"
    allowed_domains = ["aljazeera.com"]
    start_urls = [
        "http://www.aljazeera.com/news/europe/",
        "http://www.aljazeera.com/news/middleeast/",
        "http://www.aljazeera.com/news/asia/",
        "http://www.aljazeera.com/news/asia-pacific/",
        "http://www.aljazeera.com/news/americas/",
        "http://www.aljazeera.com/news/africa/",
        "http://blogs.aljazeera.com/"

    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//td[@valign="bottom"]')
        contents = sel.xpath('//div[@class="indexSummaryText"]')
        items = []

        for site,content in zip(sites, contents):
            item = AljazeeraItem()
            item['headline'] = site.xpath('div[3]/text()').extract()
            item['content'] = site.xpath('div/a/text()').extract()
            item['date'] = str(date.today())
            for headline, content in zip(item['content'], item['headline']):
              item['headline'] = headline.strip()
              item['content'] = content.strip()
              items.append(item)
        return items
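For reference, the extract()-then-strip() pattern the spider relies on can be sketched in isolation (the sample strings below are made up for illustration; no Scrapy needed):

```python
# extract() returns a *list* of strings, one per matched node, and the
# text usually carries surrounding whitespace from the page layout.
raw_headlines = ["\n  Turkey court says Twitter ban violates rights  "]
raw_contents = ["\t Ruling is binding, but it is unclear whether the ban will be lifted. \n"]

# Pair each headline with its summary and strip the whitespace,
# as the spider's inner loop does.
items = [{"headline": h.strip(), "content": c.strip()}
         for h, c in zip(raw_headlines, raw_contents)]
```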

The code for my pipelines.py is as follows:

import sys
import MySQLdb
import hashlib
from scrapy.exceptions import DropItem
from scrapy.http import Request
import json
import os.path

class SQLStore(object):
  def __init__(self):
    self.conn = MySQLdb.connect(user='root', passwd='', db='aj_db', host='localhost', charset="utf8", use_unicode=True)
    self.cursor = self.conn.cursor()
    #log data to json file


def process_item(self, item, spider): 

    try:
        self.cursor.execute("""INSERT INTO scraped_data(headlines, contents, dates) VALUES (%s, %s, %s)""", (item['headline'].encode('utf-8'), item['content'].encode('utf-8'), item['date'].encode('utf-8')))
        self.conn.commit()

    except MySQLdb.Error, e:
        print "Error %d: %s" % (e.args[0], e.args[1])

        return item



#log runs into back file 
class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('backDataOfScrapes.json', "w")

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write("item === " + line)
        return item
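Since JsonWriterPipeline writes one JSON object per line with an `item === ` prefix, a front end reading the backup file has to strip that prefix before parsing each line. A minimal sketch of such a reader (the helper name read_backup and the file handling are my own, not part of the original code):

```python
import json

def read_backup(path):
    """Parse the backup file written by JsonWriterPipeline.

    Each line looks like: item === {"headline": ..., "content": ..., "date": ...}
    so the 'item === ' prefix must be removed before json.loads().
    """
    prefix = "item === "
    items = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(prefix):
                items.append(json.loads(line[len(prefix):]))
    return items
```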

And the settings.py is as follows:

BOT_NAME = 'Aljazeera'

SPIDER_MODULES = ['Aljazeera.spiders']
NEWSPIDER_MODULE = 'Aljazeera.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Aljazeera (+http://www.yourdomain.com)'

ITEM_PIPELINES = {
    'Aljazeera.pipelines.JsonWriterPipeline': 300,
    'Aljazeera.pipelines.SQLStore': 300,
}
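As a side note, both pipelines above are registered with the same priority value, 300. Scrapy runs item pipelines in ascending order of this number (values typically range from 0 to 1000, lower first), so giving each pipeline a distinct value makes the execution order explicit. A minimal sketch with made-up priority values:

```python
# Distinct priorities make the run order unambiguous: the SQL store
# (300) runs before the JSON backup writer (800).
ITEM_PIPELINES = {
    'Aljazeera.pipelines.SQLStore': 300,
    'Aljazeera.pipelines.JsonWriterPipeline': 800,
}

# The effective order is simply the dict keys sorted by priority value.
order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```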

My SQL settings are all OK, and after running scrapy crawl aljazeera it works and even outputs the items in JSON format as follows:

item === {"headline": "Turkey court says Twitter ban violates rights", "content": "Although ruling by Turkey's highest court is binding, it is unclear whether the government will overturn the ban.", "date": "2014-04-02"}

I really don't know, or can't see, what I am missing here. I would really appreciate it if you guys could help me out.

Thanks for your time,

Your indentation is wrong in the SQLStore pipeline: process_item is defined at module level instead of as a method of the class, so Scrapy never calls it. I've tested with the correct indentation and it's working fine. Copy the below and it should be perfect.

class SQLStore(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user='root', passwd='', db='aj_db', host='localhost', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            self.cursor.execute("""INSERT INTO scraped_data(headlines, contents, dates) VALUES (%s, %s, %s)""", (item['headline'].encode('utf-8'), item['content'].encode('utf-8'), item['date'].encode('utf-8')))
            self.conn.commit()
        except MySQLdb.Error, e:
            print "Error %d: %s" % (e.args[0], e.args[1])
        return item
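To see why the indentation matters, here is a standalone sketch (dummy pipeline classes, no Scrapy or MySQL required): when process_item sits at module level, the pipeline object has no such method, so Scrapy has nothing to call and the item never reaches the database.

```python
# With the original (broken) indentation, process_item ends up as a
# plain module-level function -- the pipeline class has no such method.
class BrokenPipeline(object):
    def __init__(self):
        self.stored = []

def process_item(self, item, spider):  # NOT a method of BrokenPipeline
    self.stored.append(item)
    return item

# With process_item indented inside the class body, it becomes a real
# method that Scrapy can find and call for every item.
class FixedPipeline(object):
    def __init__(self):
        self.stored = []

    def process_item(self, item, spider):
        self.stored.append(item)
        return item
```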
