
PYTHON Scrapy | inserting into MySQL from items

I have been trying to scrape a news site to store each article in a MySQL database. My goal is to store the following data for each article on the news site: date, title, summary, link.

I have been trying different methods and, after a few weeks, decided to come here to Stack Overflow to get a solution to my problem. (Note: I have one piece of code that comes close to solving my problem, but it takes out all of the items at once rather than one by one, so I tried a new approach, and this is where I hit the wall.)

SPIDER.PY

    import scrapy
    from ..items import WebspiderItem


    class NewsSpider(scrapy.Spider):
        name = 'news'
        start_urls = [
            'https://www.coindesk.com/feed'
        ]

        def parse(self, response):

            for date in response.xpath('//pubDate/text()').extract():
                yield WebspiderItem(date = date)


            for title in response.xpath('//title/text()').extract():
                yield WebspiderItem(title = title)


            for summary in response.xpath('//description/text()').extract():
                yield WebspiderItem(summary = summary)


            for link in response.xpath('//link/text()').extract():
                yield WebspiderItem(link = link)

ITEMS.PY

import scrapy


class WebspiderItem(scrapy.Item):
    date = scrapy.Field()
    title = scrapy.Field()
    summary = scrapy.Field()
    link = scrapy.Field()

PIPELINES.PY

import mysql.connector


class WebspiderPipeline(object):

    def __init__(self):
        self.create_connection()
        self.create_table()

    def create_connection(self):
        self.conn = mysql.connector.connect(
            host='localhost',
            user='root',
            passwd='HIDDENPASSWORD',
            database='news_db'
        )
        self.curr = self.conn.cursor()

    def create_table(self):
        self.curr.execute("""DROP TABLE IF EXISTS news_tb""")
        self.curr.execute("""create table news_tb(
                        date text,
                        title text,
                        summary text,
                        link text
                        )""")

    def process_item(self, item, spider):
        self.store_db(item)
        return item

    def store_db(self, item):
        self.curr.execute("""insert into news_tb values (%s, %s, %s, %s)""", (
            item['date'],
            item['title'],
            item['summary'],
            item['link']

        ))
        self.conn.commit()
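
For reference, this pipeline only runs because it is enabled in the project settings; a minimal sketch of the relevant settings.py entry, assuming the package is named webspider as the traceback paths below suggest:

# settings.py -- register the pipeline so Scrapy calls process_item for every yielded item.
# The dotted path is assumed from the project layout shown in the traceback; adjust if yours differs.
ITEM_PIPELINES = {
    'webspider.pipelines.WebspiderPipeline': 300,
}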

The response contains multiple errors like this:

2020-03-17 07:54:32 [scrapy.core.scraper] ERROR: Error processing {'link': 'https://www.coindesk.com/makerdaos-problems-are-a-textbook-case-of-governance-failure'}
Traceback (most recent call last):
  File "c:\users\r\pycharmprojects\project\venv\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\r\PycharmProjects\Project\webspider v3 RSS\webspider\pipelines.py", line 36, in process_item
    self.store_db(item)
  File "C:\Users\r\PycharmProjects\Project\webspider v3 RSS\webspider\pipelines.py", line 41, in store_db
    item['date'],
  File "c:\users\r\pycharmprojects\_project\venv\lib\site-packages\scrapy\item.py", line 91, in __getitem__
    return self._values[key]
KeyError:

You should yield all the data at once; don't yield it field by field inside separate loops. Python reads the code from top to bottom, so you yield the date first, the pipeline receives that item and tries to find the values for title, summary and link, and since they are missing it returns a KeyError.

import scrapy
from ..items import WebspiderItem


class NewsSpider(scrapy.Spider):
    name = 'news'

    def start_requests(self):
        page = 'https://www.coindesk.com/feed'
        yield scrapy.Request(url=page, callback=self.parse)

    def parse(self, response):
        # Follow every article link listed in the RSS feed.
        links = response.xpath('//link/text()').extract()
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_contents)

    def parse_contents(self, response):
        # Build one complete item per article page, so the pipeline always
        # receives date, title, summary and link together.
        url = response.url
        article_title = response.xpath('//h1/text()').extract()[0]
        pub_date = response.xpath('//div[@class="article-hero-datetime"]/time/@datetime').extract()[0]
        description = response.xpath('//meta[@name="description"]/@content').extract()[0]
        item = WebspiderItem()
        item['date'] = pub_date
        item['title'] = article_title
        item['summary'] = description
        item['link'] = url

        yield item
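
If the fields already present in the RSS feed are enough, another option is to keep the four values grouped by iterating over each <item> node of the feed instead of yielding them in separate loops. A minimal sketch, assuming the feed is parsed as XML as in the original spider, and untested against the live feed:

    def parse(self, response):
        # Each <item> node carries all four fields together, so every
        # yielded item is complete and the pipeline never hits a KeyError.
        for node in response.xpath('//item'):
            yield WebspiderItem(
                date=node.xpath('./pubDate/text()').get(),
                title=node.xpath('./title/text()').get(),
                summary=node.xpath('./description/text()').get(),
                link=node.xpath('./link/text()').get(),
            )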
