Extracting values from meta tags and img URLs given only a website URL, in a Django app written in Python

Question

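I would like to know how to write this in Python and hook it into a Django app: given only a website's URL, extract values from its meta tags and its img URLs, the same way Facebook does when a user pastes a link.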

Answer 1

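Personally, I would use the excellent Requests, BeautifulSoup and lxml libraries to solve this.

Assuming the following model in models.py, we can override the save() method to populate the title, description and keywords attributes: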

from bs4 import BeautifulSoup
import requests

from django.db import models

class Link(models.Model):
    url = models.URLField(blank=True)
    title = models.CharField(max_length=20, blank=True)
    description = models.TextField(blank=True)
    keywords = models.TextField(blank=True)

    def save(self, *args, **kwargs):
        if self.url and not (self.title or self.keywords or self.description):
            # optionally, use 'html' instead of 'lxml' if you don't have lxml installed
            soup = BeautifulSoup(requests.get(self.url).content, "lxml")
            self.title = soup.title.string
            meta = soup.find_all('meta')
            for tag in meta:
                if 'name' in tag.attrs and tag.attrs['name'].lower() in ['description', 'keywords']:
                    setattr(self, tag.attrs['name'].lower(), tag.attrs['content'])

        super(Link, self).save(*args, **kwargs)

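The logic in the overridden save() method could just as well live in a view or a utility function, or even in another method on the Link model that is called conditionally.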
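As a rough sketch of the utility-function variant, the parsing can be separated from the network call so it works on any HTML string. The name extract_metadata, the base_url parameter and the og:image handling are assumptions added for illustration (the question asks about img URLs, which the model above does not store); "html.parser" is used so lxml is not required.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_metadata(html, base_url=""):
    """Return title, description, keywords and image URLs found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed
    data = {"title": "", "description": "", "keywords": "", "images": []}

    # soup.title is None when the page has no <title> tag
    if soup.title and soup.title.string:
        data["title"] = soup.title.string.strip()

    for tag in soup.find_all("meta"):
        name = (tag.get("name") or "").lower()
        if name in ("description", "keywords"):
            data[name] = tag.get("content", "")
        # Open Graph image, a tag that link previews commonly read
        if tag.get("property") == "og:image" and tag.get("content"):
            data["images"].append(urljoin(base_url, tag["content"]))

    # Plain <img> tags; relative paths are resolved against base_url
    for img in soup.find_all("img", src=True):
        url = urljoin(base_url, img["src"])
        if url not in data["images"]:
            data["images"].append(url)

    return data
```

Inside save() or a view, this could be called as extract_metadata(requests.get(self.url).content, self.url) and the resulting values copied onto the model fields.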

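The above works with Django 1.4. No guarantees, but it should also run on earlier versions.

Edit: fixed a syntax error and mentioned the alternative parser, thanks @jinesh and @stonefury.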

Answer 2

jnovinger's answer basically worked for me in Django 1.5, but I had to make a few adjustments. First, the code itself seems to contain a typo. The line

soup = BeatifulSoup(requests.get(self.url).contents, "lxml")

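raises AttributeError: 'Response' object has no attribute 'contents'. According to the Requests documentation, I believe the correct attribute is requests.get(self.url).content, although requests.get(self.url).text also seems to work.

Before putting this into a Django project I first tried it in a simple script, and there I also ran into the following error: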

requests.exceptions.MissingSchema: Invalid URL u'www.example.com/': No schema supplied

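This happened because I naively left out the http:// in front of the URL. In Django this is done automatically, but I mention it in case other beginners make the same mistake and do not understand what "missing schema" means.

The last problem I ran into was caused by the lack of validation of content length. When the checked link yields a title, keywords or description longer than the limit on the model's Field object (the max_length parameter), this raises a DatabaseError when trying to save the new Link.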
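For a standalone script, where nothing normalises the URL first, a small helper can add the scheme before calling requests.get(). This is only a sketch: the name ensure_scheme and the default of "http" are assumptions, not part of Requests or Django.

```python
def ensure_scheme(url, default_scheme="http"):
    """Prepend a scheme when the URL has none, e.g. 'www.example.com/'."""
    url = url.strip()
    # Only look before the query string, so '?next=http://...' is not mistaken for a scheme
    if "://" in url.split("?", 1)[0]:
        return url
    # lstrip('/') also handles protocol-relative URLs such as '//example.com'
    return f"{default_scheme}://{url.lstrip('/')}"
```

With this, requests.get(ensure_scheme("www.example.com/")) would no longer raise MissingSchema.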

DatabaseError: value too long for type character varying(20)

Here is my rough workaround; there may be a better way, but this seems to work.

from bs4 import BeautifulSoup
import requests

from django.db import models

class Link(models.Model):
    url = models.URLField(blank=True)
    title = models.CharField(max_length=100, blank=True)
    description = models.TextField(blank=True)
    keywords = models.TextField(blank=True)

    def save(self, *args, **kwargs):
        if self.url and not (self.title or self.keywords or self.description):
            soup = BeautifulSoup(requests.get(self.url).content, "lxml")
            limit = self._meta.get_field('title').max_length    # check field max_length
            self.title = soup.title.string[:limit]              # limit title to max_length
            meta = soup.find_all('meta')
            for tag in meta:
                if 'name' in tag.attrs and tag.attrs['name'].lower() in ['description', 'keywords']:
                    field = tag.attrs['name'].lower()                   # check whether description or keywords
                    limit = self._meta.get_field(field).max_length      # check field max_length
                    content = tag.attrs['content'][:limit]              # limit field to max_length
                    setattr(self, field, content)                       # reuse the field name computed above

        super(Link, self).save(*args, **kwargs)

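Note that this simply cuts the title, description and keyword list in the middle of a word rather than stopping at the last complete word, so you may end up with meaningless word fragments.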
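One possible way around the word-fragment issue is to cut at the last space before the limit. The helper below is a hypothetical sketch (truncate_words is not a Django or BeautifulSoup function); it treats a limit of None, which is what a TextField reports as max_length, as "no limit".

```python
def truncate_words(text, limit):
    """Cut text to at most `limit` characters without splitting a word."""
    if limit is None or len(text) <= limit:
        return text  # nothing to cut (TextField has max_length=None)
    cut = text[:limit]
    # The cut happens to end exactly at a word boundary: keep it as is
    if text[limit].isspace():
        return cut.rstrip()
    # Otherwise drop the partial last word
    head, sep, _ = cut.rpartition(" ")
    # A single word longer than the limit falls back to a hard cut
    return head.rstrip() if sep else cut
```

In the save() method above, the slices such as soup.title.string[:limit] could then be replaced with truncate_words(soup.title.string, limit).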


Answer 1
4, accepted, 2012-06-19 04:12:15

Answer 2
2, 2013-04-21 17:38:35
