Saving data in a PostgreSQL database

Question

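I don't understand how to save scraped data in a PostgreSQL database. I tried using Psycopg2, without any success... I have learned that Django models can be used for this.

The scraper should scrape every blog post on every page, and the scraped data should go into a PostgreSQL database, which will be used to compute the following statistics:

1. The 10 most common words and their counts, available at the address /stats

2. The 10 most common words and their counts per author, available at the address /stats//

3. Post authors, with their names as they appear in the address /stats//, available at the address /authors/

For example, in the code below I try to get the author's name, but I get this error: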

authors = Author(name='author name')
TypeError: 'NoneType' object is not callable

Importing the models into the scraper doesn't help either...

This is my scraper:

import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from collections import Counter


url = 'https://teonite.com/blog/page/{}/index.html'
all_links = []


headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0'
}
with requests.Session() as s:
    r = s.get('https://teonite.com/blog/')
    soup = bs(r.content, 'lxml')
    article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
    all_links.append(article_links)
    num_pages = int(soup.select_one('.page-number').text.split('/')[1])


    for page in range(2, num_pages + 1):
        r = s.get(url.format(page))
        soup = bs(r.content, 'lxml')
        article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
        all_links.append(article_links)



    all_links = [item for i in all_links for item in i]

    d = webdriver.Chrome(ChromeDriverManager().install())

    contents = []
    authors = []

    for article in all_links:
        d.get(article)
        soup = bs(d.page_source, 'lxml')
        [t.extract() for t in soup(['style', 'script', '[document]', 'head', 'title'])]
        visible_text = soup.getText()
        content = soup.find('section', attrs={'class': 'post-content'})
        contents.append(content)
        Author = soup.find('span', attrs={'class': 'author-content'})
        authors.append(Author)

        # The two lines below are where the error occurs

        authors = Author(name='author name')
        Author.save()

        unique_authors = list(set(authors))
        unique_contents = list(set(contents))
        try:
            print(soup.select_one('.post-title').text)
        except:
            print(article)
            print(soup.select_one('h1').text)
            break  # for debugging
    d.quit()

Models:

from django.db import models

class Author(models.Model):
    author_id = models.CharField(primary_key=True, max_length=50, editable=False)
    author_name = models.CharField(max_length=50)

    class Meta:
        ordering = ['-author_id']
        db_table = 'author'


class Stats(models.Model):
    content = models.CharField(max_length=50)
    stats = models.IntegerField()

    class Meta:
        ordering = ['-stats']
        db_table = 'stats'



class AuthorStats(models.Model):
    author_id = models.CharField(max_length=100)
    content = models.CharField(max_length=100)
    stats = models.IntegerField()

    class Meta:
        ordering = ['stats']
        db_table = 'author_stats'
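
The statistics described in the question (top 10 words overall and per author) can be computed before anything is written to the Stats or AuthorStats tables. The following is only a minimal sketch under stated assumptions: the function names top_words and top_words_per_author are hypothetical, the input is assumed to be plain text already extracted from each post, and a "word" is assumed to be any run of letters.

```python
import re
from collections import Counter, defaultdict

# Assumption: a word is a run of letters (no digits or underscores).
WORD_RE = re.compile(r"[^\W\d_]+")


def top_words(texts, n=10):
    """Return the n most common words across all texts as (word, count) pairs."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts.most_common(n)


def top_words_per_author(posts, n=10):
    """Return {author: [(word, count), ...]} for (author, text) pairs."""
    per_author = defaultdict(Counter)
    for author, text in posts:
        per_author[author].update(WORD_RE.findall(text.lower()))
    return {author: c.most_common(n) for author, c in per_author.items()}
```

Each (word, count) pair maps naturally onto the content and stats fields of the models above, though how the rows are stored is left open here.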

Answer 1

You have rebound Author to something other than the model:

Author = soup.find('span', attrs={'class': 'author-content'})

Import the Author model, and don't shadow it.

(And you are doing something similar with authors.)
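
To illustrate why the shadowing produces this exact error, here is a small sketch that runs without Django. The Author class and find_author_span function are stand-ins (assumptions for the demo, not the real model or the real soup.find call); the point is only that once the name Author refers to None, calling it fails.

```python
class Author:
    """Stand-in for the Django model, so the demo runs without a database."""

    def __init__(self, author_name):
        self.author_name = author_name


def find_author_span():
    """Stand-in for soup.find(...), which returns None when nothing matches."""
    return None


def broken():
    # The local name Author now refers to the search result, not the class.
    Author = find_author_span()
    return Author(author_name='author name')  # raises TypeError


def fixed():
    # A different name for the search result leaves the class reachable.
    author_raw = find_author_span()
    return Author(author_name='author name')


try:
    broken()
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

Calling fixed() returns an Author instance, while broken() fails because the name was rebound inside the function.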

Answer 2

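You are using Author the wrong way. You need to:

1. Import your models.py.
2. Drop the local authors list; you don't need it.
3. Change Author = soup.find('span', attrs={'class': 'author-content'}) to author_raw = soup.find('span', attrs={'class': 'author-content'}).
4. Create the author with author = models.Author(name=author_raw.name) and save it with author.save(). (I don't know what soup.find returns for the author, so adjust the arguments passed to the Author model's constructor accordingly.)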
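
The steps above could look roughly like the sketch below. It is written under several assumptions: it uses the author_id and author_name fields from the models shown in the question (that model has no name field), it assumes the span's text is the author's name, and the helper names slugify_name and save_author, as well as the way author_id is derived, are hypothetical. The model class is passed in as a parameter so the function does not depend on how the Django project is laid out.

```python
import re


def slugify_name(name):
    """Derive a primary-key string from a display name (hypothetical scheme)."""
    return re.sub(r'[^a-z0-9]+', '-', name.lower()).strip('-')


def save_author(author_tag, author_model):
    """Store one author row and return it, or None if there is no name.

    author_tag   -- the result of soup.find(...); None when nothing matched
    author_model -- the imported Django model class, e.g. models.Author
    """
    if author_tag is None:
        return None
    name = author_tag.get_text(strip=True)
    if not name:
        return None
    # get_or_create avoids inserting the same author once per article.
    author, _created = author_model.objects.get_or_create(
        author_id=slugify_name(name)[:50],      # max_length=50 in the model
        defaults={'author_name': name[:50]},
    )
    return author
```

In the scraper loop this would be called as save_author(author_raw, models.Author). When the scraper runs as a standalone script rather than inside the Django project, the DJANGO_SETTINGS_MODULE environment variable has to be set and django.setup() called before the models are imported.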

Answer 1
0 2019-07-10 08:41:02

Answer 2
0 2019-07-10 08:45:04
