Scrapy pipeline.py not inserting items to MySQL from spider

I am using Scrapy for scraping news headlines, and I am a rookie at Scrapy and scraping as a whole. I have been having huge issues for a few days now pipelining my scraped data into my SQL database. I have two classes in my pipelines.py file: one for inserting items into the database, and another for backing up the scraped data into a JSON file, for front-end web development reasons.

This is the code for my spider. It extracts news headlines from the start_urls, picks up this data as strings using extract(), and later loops through all of them, using strip() to remove whitespace for better formatting.

from scrapy.spider import Spider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from Aljazeera.items import AljazeeraItem
from datetime import date, datetime


class AljazeeraSpider(Spider):
    name = "aljazeera"
    allowed_domains = ["aljazeera.com"]
    start_urls = [
        "http://www.aljazeera.com/news/europe/",
        "http://www.aljazeera.com/news/middleeast/",
        "http://www.aljazeera.com/news/asia/",
        "http://www.aljazeera.com/news/asia-pacific/",
        "http://www.aljazeera.com/news/americas/",
        "http://www.aljazeera.com/news/africa/",
        "http://blogs.aljazeera.com/"

    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//td[@valign="bottom"]')
        contents = sel.xpath('//div[@class="indexSummaryText"]')
        items = []

        for site,content in zip(sites, contents):
            item = AljazeeraItem()
            item['headline'] = site.xpath('div[3]/text()').extract()
            item['content'] = site.xpath('div/a/text()').extract()
            item['date'] = str(date.today())
            for headline, content in zip(item['content'], item['headline']):
              item['headline'] = headline.strip()
              item['content'] = content.strip()
              items.append(item)
        return items
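For reference, the extract()-then-strip() pattern the spider relies on can be sketched in isolation (the sample strings below are made up for illustration; no Scrapy needed):

```python
# extract() returns a *list* of strings, one per matched node, and the
# text usually carries surrounding whitespace from the page layout.
raw_headlines = ["\n  Turkey court says Twitter ban violates rights  "]
raw_contents = ["\t Ruling is binding, but it is unclear whether the ban will be lifted. \n"]

# Pair each headline with its summary and strip the whitespace,
# as the spider's inner loop does.
items = [{"headline": h.strip(), "content": c.strip()}
         for h, c in zip(raw_headlines, raw_contents)]
```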

The code for my pipelines.py is as follows:

import sys
import MySQLdb
import hashlib
from scrapy.exceptions import DropItem
from scrapy.http import Request
import json
import os.path

class SQLStore(object):
  def __init__(self):
    self.conn = MySQLdb.connect(user='root', passwd='', db='aj_db', host='localhost', charset="utf8", use_unicode=True)
    self.cursor = self.conn.cursor()
    #log data to json file


def process_item(self, item, spider): 

    try:
        self.cursor.execute("""INSERT INTO scraped_data(headlines, contents, dates) VALUES (%s, %s, %s)""", (item['headline'].encode('utf-8'), item['content'].encode('utf-8'), item['date'].encode('utf-8')))
        self.conn.commit()

    except MySQLdb.Error, e:
        print "Error %d: %s" % (e.args[0], e.args[1])

        return item



#log runs into back file 
class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('backDataOfScrapes.json', "w")

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write("item === " + line)
        return item
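Since JsonWriterPipeline writes one JSON object per line with an `item === ` prefix, a front end reading the backup file has to strip that prefix before parsing each line. A minimal sketch of such a reader (the helper name read_backup and the file handling are my own, not part of the original code):

```python
import json

def read_backup(path):
    """Parse the backup file written by JsonWriterPipeline.

    Each line looks like: item === {"headline": ..., "content": ..., "date": ...}
    so the 'item === ' prefix must be removed before json.loads().
    """
    prefix = "item === "
    items = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith(prefix):
                items.append(json.loads(line[len(prefix):]))
    return items
```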

And the settings.py is as follows:

BOT_NAME = 'Aljazeera'

SPIDER_MODULES = ['Aljazeera.spiders']
NEWSPIDER_MODULE = 'Aljazeera.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Aljazeera (+http://www.yourdomain.com)'

ITEM_PIPELINES = {
    'Aljazeera.pipelines.JsonWriterPipeline': 300,
    'Aljazeera.pipelines.SQLStore': 300,
}
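As a side note, both pipelines above are registered with the same priority value, 300. Scrapy runs item pipelines in ascending order of this number (values typically range from 0 to 1000, lower first), so giving each pipeline a distinct value makes the execution order explicit. A minimal sketch with made-up priority values:

```python
# Distinct priorities make the run order unambiguous: the SQL store
# (300) runs before the JSON backup writer (800).
ITEM_PIPELINES = {
    'Aljazeera.pipelines.SQLStore': 300,
    'Aljazeera.pipelines.JsonWriterPipeline': 800,
}

# The effective order is simply the dict keys sorted by priority value.
order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```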

My SQL settings are all OK, and after running scrapy crawl aljazeera it works and even outputs the items in JSON format as follows:

item === {"headline": "Turkey court says Twitter ban violates rights", "content": "Although ruling by Turkey's highest court is binding, it is unclear whether the government will overturn the ban.", "date": "2014-04-02"}

I really don't know, or can't see, what I am missing here. I would really appreciate it if you guys could help me out.

Thanks for your time,

Your indentation is wrong in the SQLStore pipeline: process_item is defined at module level instead of as a method of the class, so Scrapy never calls it. I've tested with the correct indentation and it's working fine. Copy the below and it should be perfect.

class SQLStore(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user='root', passwd='', db='aj_db', host='localhost', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            self.cursor.execute("""INSERT INTO scraped_data(headlines, contents, dates) VALUES (%s, %s, %s)""", (item['headline'].encode('utf-8'), item['content'].encode('utf-8'), item['date'].encode('utf-8')))
            self.conn.commit()
        except MySQLdb.Error, e:
            print "Error %d: %s" % (e.args[0], e.args[1])
        return item
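To see why the indentation matters, here is a standalone sketch (dummy pipeline classes, no Scrapy or MySQL required): when process_item sits at module level, the pipeline object has no such method, so Scrapy has nothing to call and the item never reaches the database.

```python
# With the original (broken) indentation, process_item ends up as a
# plain module-level function -- the pipeline class has no such method.
class BrokenPipeline(object):
    def __init__(self):
        self.stored = []

def process_item(self, item, spider):  # NOT a method of BrokenPipeline
    self.stored.append(item)
    return item

# With process_item indented inside the class body, it becomes a real
# method that Scrapy can find and call for every item.
class FixedPipeline(object):
    def __init__(self):
        self.stored = []

    def process_item(self, item, spider):
        self.stored.append(item)
        return item
```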
