Save the data in the PostgreSQL database
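I don't understand how to save the scraped data in a PostgreSQL database. I tried using Psycopg2, without any effect... I have learned that Django models can be used for this.
The scraper should scrape every blog post on every page. The data from the scraper should go into the PostgreSQL database, which will compute the following statistics:
1. The 10 most common words and their counts, available under the address /stats
2. The 10 most common words and their counts per author, available under the address /stats//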
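Both statistics above come down to counting words, once over all posts and once per author. A minimal sketch with `collections.Counter`, assuming the posts have already been scraped into `(author, text)` pairs. The `posts` sample data, the `tokenize` helper, and both function names are hypothetical and not part of the original scraper:

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into words (ASCII letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(posts, n=10):
    """Return the n most common words across all posts as (word, count) pairs."""
    counts = Counter()
    for _author, text in posts:
        counts.update(tokenize(text))
    return counts.most_common(n)


def top_words_per_author(posts, n=10):
    """Return {author: [(word, count), ...]} with the n most common words per author."""
    per_author = defaultdict(Counter)
    for author, text in posts:
        per_author[author].update(tokenize(text))
    return {author: counts.most_common(n) for author, counts in per_author.items()}


# Hypothetical sample data standing in for scraped blog posts.
posts = [
    ("alice", "Django makes Django apps easy"),
    ("bob", "Postgres stores data and Django reads data"),
]
```

Computing the counts in Python and storing only the resulting rows is one option; the counts could also be produced by a database query, which this sketch does not cover.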
For example, in the code below I try to get the author's name, but I get this error:
authors = Author(name='author name')
TypeError: 'NoneType' object is not callable
Importing the models into the scraper does not help either...
Here is my scraper:
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from collections import Counter

url = 'https://teonite.com/blog/page/{}/index.html'
all_links = []

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0'
}

with requests.Session() as s:
    r = s.get('https://teonite.com/blog/')
    soup = bs(r.content, 'lxml')
    article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
    all_links.append(article_links)
    num_pages = int(soup.select_one('.page-number').text.split('/')[1])

    for page in range(2, num_pages + 1):
        r = s.get(url.format(page))
        soup = bs(r.content, 'lxml')
        article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
        all_links.append(article_links)

all_links = [item for i in all_links for item in i]

d = webdriver.Chrome(ChromeDriverManager().install())

contents = []
authors = []

for article in all_links:
    d.get(article)
    soup = bs(d.page_source, 'lxml')
    [t.extract() for t in soup(['style', 'script', '[document]', 'head', 'title'])]
    visible_text = soup.getText()
    content = soup.find('section', attrs={'class': 'post-content'})
    contents.append(content)
    Author = soup.find('span', attrs={'class': 'author-content'})
    authors.append(Author)
    ## The two lines below are where the error occurs
    authors = Author(name='author name')
    Author.save()
    unique_authors = list(set(authors))
    unique_contents = list(set(contents))
    try:
        print(soup.select_one('.post-title').text)
    except:
        print(article)
        print(soup.select_one('h1').text)
        break  # for debugging

d.quit()
Models:
from django.db import models


class Author(models.Model):
    author_id = models.CharField(primary_key=True, max_length=50, editable=False)
    author_name = models.CharField(max_length=50)

    class Meta:
        ordering = ['-author_id']
        db_table = 'author'


class Stats(models.Model):
    content = models.CharField(max_length=50)
    stats = models.IntegerField()

    class Meta:
        ordering = ['-stats']
        db_table = 'stats'


class AuthorStats(models.Model):
    author_id = models.CharField(max_length=100)
    content = models.CharField(max_length=100)
    stats = models.IntegerField()

    class Meta:
        ordering = ['stats']
        db_table = 'author_stats'
You have set Author to something other than your model:
Author = soup.find('span', attrs={'class': 'author-content'})
Import your Author model and do not shadow it.
(You are doing something similar with authors.)
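To illustrate the shadowing, here is a minimal sketch that reproduces the same TypeError without Django or BeautifulSoup. The plain `Author` class and the `find_author_span` helper are hypothetical stand-ins for the model and for `soup.find(...)`, which returns None when nothing matches. The error message in the question suggests that is what happened on that page, although only the real page can confirm it:

```python
class Author:
    """Stand-in for the Django model; a plain class keeps the demo self-contained."""

    def __init__(self, name):
        self.name = name


def find_author_span():
    """Stand-in for soup.find(...), returning None as it does when nothing matches."""
    return None


model_class = Author          # keep a reference to the class before it is shadowed
Author = find_author_span()   # rebinding: the name Author now refers to None

try:
    Author(name="author name")    # calls None, not the class
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Giving the scraped value its own name leaves the class callable.
author_raw = find_author_span()
author = model_class(name="author name")
```

After the rebinding, the name `Author` no longer refers to the class at all, which is why importing the model at the top of the scraper makes no difference.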
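You are using Author the wrong way. You need to:
1. Stop reassigning the authors list.
2. Change Author = soup.find('span', attrs={'class': 'author-content'}) to author_raw = soup.find('span', attrs={'class': 'author-content'}).
3. Create the author with author = models.Author(name=author_raw.name) and save it with author.save(). (I don't know what soup.find returns for the author, so you can adjust that part of the arguments passed to the Author model's constructor.)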
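The steps above can be sketched as a small helper. Two details differ from the snippet in the answer and are worth checking against your own project: the Author model in the question defines author_id and author_name rather than name, and on a BeautifulSoup tag the .name attribute is the tag name (for example 'span'), while the visible text comes from get_text(). The `save_author` and `make_author_id` names and the rule for deriving the primary key are assumptions made for illustration. The model class is passed in as a parameter, so the helper itself imports nothing from Django:

```python
def make_author_id(name):
    """Hypothetical primary-key rule: the lowercased name with all whitespace removed."""
    return "".join(name.lower().split())


def save_author(author_model, author_raw):
    """Create or fetch the database row for one scraped author.

    author_model: the Django model class (models.Author in the question).
    author_raw:   the result of soup.find(...), either a tag or None.
    """
    if author_raw is None:              # the span was not found on this page
        return None
    # Collapse runs of whitespace and newlines inside the tag text.
    name = " ".join(author_raw.get_text().split())
    if not name:
        return None
    # get_or_create avoids a duplicate row when the same author wrote several posts.
    author, _created = author_model.objects.get_or_create(
        author_id=make_author_id(name),
        defaults={"author_name": name},
    )
    return author
```

Inside the scraping loop this would be called as author = save_author(Author, author_raw), with Author imported from the models module and never reassigned. In a standalone script, Django has to be configured first (DJANGO_SETTINGS_MODULE set and django.setup() called) before the models can be imported.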