
Save the scraped data in the PostgreSQL database

I can't figure out how to save the scraped data in the PostgreSQL database. I tried to use psycopg2 without success... I learned that I can use Django models for this.

The scraper should scrape every blog post on each page. Data from the scraper should go to the PostgreSQL database, where the following statistics will be counted:

  1. The 10 most common words along with their counts, available under the address /stats (see the sketch below)

  2. The 10 most common words with their counts per author, available under the address /stats//

  3. Blog post authors, with the name used in the /stats// address, available under the address /authors/
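
For the word statistics I have something like this in mind, a minimal sketch assuming the post contents have already been collected as plain-text strings (getting the resulting pairs into the Stats model shown further down is the part I'm missing):

from collections import Counter

def top_ten_words(contents):
    # contents: list of plain-text strings, one per blog post
    counter = Counter()
    for text in contents:
        counter.update(word.lower() for word in text.split())
    return counter.most_common(10)  # list of (word, count) pairs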

In the code below, for example, I tried to get the names of the authors, but I get this error:

authors = Author(name='author name')
TypeError: 'NoneType' object is not callable

Importing the models into the scraper does not help either...

Here is my scraper:

import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from collections import Counter


url = 'https://teonite.com/blog/page/{}/index.html'
all_links = []


headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0'
}
with requests.Session() as s:
    r = s.get('https://teonite.com/blog/')
    soup = bs(r.content, 'lxml')
    article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
    all_links.append(article_links)
    num_pages = int(soup.select_one('.page-number').text.split('/')[1])


    for page in range(2, num_pages + 1):
        r = s.get(url.format(page))
        soup = bs(r.content, 'lxml')
        article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
        all_links.append(article_links)



    all_links = [item for i in all_links for item in i]

    d = webdriver.Chrome(ChromeDriverManager().install())

    contents = []
    authors = []

    for article in all_links:
        d.get(article)
        soup = bs(d.page_source, 'lxml')
        [t.extract() for t in soup(['style', 'script', '[document]', 'head', 'title'])]
        visible_text = soup.getText()
        content = soup.find('section', attrs={'class': 'post-content'})
        contents.append(content)
        Author = soup.find('span', attrs={'class': 'author-content'})
        authors.append(Author)

##  Below are the two lines of code where the error occurs

        authors = Author(name='author name')
        Author.save()

        unique_authors = list(set(authors))
        unique_contents = list(set(contents))
        try:
            print(soup.select_one('.post-title').text)
        except:
            print(article)
            print(soup.select_one('h1').text)
            break  # for debugging
    d.quit()

Models:

from django.db import models

class Author(models.Model):
    author_id = models.CharField(primary_key=True, max_length=50, editable=False)
    author_name = models.CharField(max_length=50)

    class Meta:
        ordering = ['-author_id']
        db_table = 'author'


class Stats(models.Model):
    content = models.CharField(max_length=50)
    stats = models.IntegerField()

    class Meta:
        ordering = ['-stats']
        db_table = 'stats'



class AuthorStats(models.Model):
    author_id = models.CharField(max_length=100)
    content = models.CharField(max_length=100)
    stats = models.IntegerField()

    class Meta:
        ordering = ['stats']
        db_table = 'author_stats'

You've explicitly set Author to be something other than the model:

Author = soup.find('span', attrs={'class': 'author-content'})

Import the model Author and don't shadow it.

(And you're doing something similar with the authors list.)
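
A minimal sketch of that fix, assuming the scraper lives inside the Django project and the app is called blog (a placeholder name, adjust to yours):

from blog.models import Author  # placeholder app name; the model keeps this name

# ... inside the loop over articles ...
author_tag = soup.find('span', attrs={'class': 'author-content'})  # no longer shadows the model
if author_tag is not None:
    name = author_tag.get_text(strip=True)
    author = Author(author_id=name, author_name=name)  # fields from your models.py; using the name as id is just for illustration
    author.save()

Note that if the script runs outside manage.py, Django's settings have to be loaded before the model import works.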

You are using Author the wrong way. You need to:

  1. Import your models.py.
  2. Drop the local authors list; you don't need it.
  3. Change Author = soup.find('span', attrs={'class': 'author-content'}) to something like author_raw = soup.find('span', attrs={'class': 'author-content'}).
  4. Create the author with author = models.Author(name=author_raw.name) and save it with author.save(). (I don't know what soup.find returns for the author span, so adjust the constructor parameters to match your Author model; see the sketch below.)
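
A minimal sketch of the whole save step when the scraper runs as a standalone script outside manage.py; the module path myproject.settings and the app name blog are placeholders, not names from the question:

import os
import django

# Django has to be configured before any model import works in a standalone script.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')  # placeholder path
django.setup()

from blog.models import Author  # placeholder app name

def save_author(author_tag):
    # author_tag is the <span class="author-content"> element found by BeautifulSoup
    name = author_tag.get_text(strip=True)
    # get_or_create avoids duplicate rows when the same author appears on several posts
    author, _created = Author.objects.get_or_create(
        author_id=name.lower().replace(' ', '-'),  # simple slug used as the primary key
        defaults={'author_name': name},
    )
    return author

With that in place, the loop in the scraper can call save_author(...) right after the span is found, instead of constructing and saving the model by hand.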
