
Publishing date in newspaper library always returning None

I've been using the newspaper library lately. The only issue I'm finding is that article.publish_date always returns None.

from newspaper import Article


class NewsArticle:
    def __init__(self,url):
        self.article = Article(url)
        self.article.download()
        self.article.parse()
        self.article.nlp()

    def getKeywords(self):
        x = self.article.keywords
        for i in range(0,len(x)):
            x[i] = x[i].encode('ascii', 'ignore')
        return x

    def getSummary(self):
        return self.article.summary.encode('ascii', 'ignore')

    def getAuthors(self):
        x = self.article.authors
        for i in range(0,len(x)):
            x[i] = x[i].encode('ascii', 'ignore')
        return x

    def thumbnail_url(self):
        return self.article.top_image.encode('ascii', 'ignore')

    def date_made(self):
        print(self.article.publish_date)
        return self.article.publish_date

    def get_videos(self):
        x = self.article.movies
        for i in range(0, len(x)):
            x[i] = x[i].encode('ascii', 'ignore')
        return x

    def get_title(self):
        return self.article.title.encode('ascii', 'ignore')

I'm going over a bunch of URLs. You can see I'm printing publish_date before returning it.

As I said before, this is what I get:

None
None
None
[output continues: None for each of the remaining URLs]

All the other functions are working as intended. The documentation on the site shows this example:

>>> article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)

I'm pretty sure I'm doing the same thing. Can anyone spot my issue?

I'm 100% sure that you have solved this issue in the last 5ish years, but I wanted to throw in my knowledge on newspaper.

This Python library isn't perfect, because it's designed to make a best effort at harvesting specific elements, such as an article's title, author's name, published date and several other items. Even with a best effort, newspaper will miss content that isn't in a place where it's designed to look.

For example, this is from the extraction code of newspaper:

3 strategies for publishing date extraction. The strategies are descending in accuracy and the next strategy is only attempted if a preferred one fails.

1. Pubdate from URL
2. Pubdate from metadata
3. Raw regex searches in the HTML + added heuristics
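
To make the "only try the next strategy if the preferred one fails" idea concrete, here is a rough sketch of such a fallback chain. It is not newspaper's code: the function names, the two regular expressions and the single meta key are assumptions made up for illustration, and the third step is a plain regex without the extra heuristics the library mentions.

```python
import re
from datetime import datetime

# Hypothetical patterns for illustration only; newspaper's own patterns differ.
URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")
ISO_DATE = re.compile(r"(?<!\d)(\d{4})-(\d{2})-(\d{2})(?!\d)")


def _to_date(match):
    """Turn a (year, month, day) regex match into a datetime, or None."""
    if match is None:
        return None
    try:
        year, month, day = (int(part) for part in match.groups())
        return datetime(year, month, day)
    except ValueError:  # for example month 13 or day 32
        return None


def date_from_url(url, meta, html):
    """Strategy 1: look for /YYYY/MM/DD in the URL path."""
    return _to_date(URL_DATE.search(url))


def date_from_meta(url, meta, html):
    """Strategy 2: look at one known meta key (assumed to be a flat dict)."""
    value = meta.get("article:published_time")
    return _to_date(ISO_DATE.search(value)) if value else None


def date_from_html(url, meta, html):
    """Strategy 3: raw regex search over the whole page."""
    return _to_date(ISO_DATE.search(html))


def find_publish_date(url, meta, html):
    """Try each strategy in order; return the first date found, else None."""
    for strategy in (date_from_url, date_from_meta, date_from_html):
        found = strategy(url, meta, html)
        if found is not None:
            return found
    return None
```

When every strategy comes up empty, the chain returns None, which is the same symptom described in the question.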

If newspaper doesn't find a date in the URL, it moves on to the meta tags, but only these:

PUBLISH_DATE_TAGS = [
    {'attribute': 'property', 'value': 'rnews:datePublished',
     'content': 'content'},
    {'attribute': 'property', 'value': 'article:published_time',
     'content': 'content'},
    {'attribute': 'name', 'value': 'OriginalPublicationDate',
     'content': 'content'},
    {'attribute': 'itemprop', 'value': 'datePublished',
     'content': 'datetime'},
    {'attribute': 'property', 'value': 'og:published_time',
     'content': 'content'},
    {'attribute': 'name', 'value': 'article_date_original',
     'content': 'content'},
    {'attribute': 'name', 'value': 'publication_date',
     'content': 'content'},
    {'attribute': 'name', 'value': 'sailthru.date',
     'content': 'content'},
    {'attribute': 'name', 'value': 'PublishDate',
     'content': 'content'},
    {'attribute': 'pubdate', 'value': 'pubdate',
     'content': 'datetime'},
    {'attribute': 'name', 'value': 'publish_date',
     'content': 'content'},
]
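
To show how a list shaped like this can drive a lookup, here is a small sketch using only the standard library. It is an approximation, not newspaper's implementation: the class and function names are made up, only three of the entries are copied in, and it returns the first matching element in document order rather than following the library's own search order.

```python
from html.parser import HTMLParser

# A subset of the entries listed above, in the same shape.
PUBLISH_DATE_TAGS = [
    {'attribute': 'property', 'value': 'article:published_time',
     'content': 'content'},
    {'attribute': 'itemprop', 'value': 'datePublished',
     'content': 'datetime'},
    {'attribute': 'name', 'value': 'publish_date',
     'content': 'content'},
]


class PublishDateTagFinder(HTMLParser):
    """Remember the raw date string of the first element matching a known tag."""

    def __init__(self):
        super().__init__()
        self.date_string = None

    def handle_starttag(self, tag, attrs):
        if self.date_string is not None:
            return  # already found one, ignore the rest of the page
        attributes = dict(attrs)
        for known in PUBLISH_DATE_TAGS:
            # 'attribute'/'value' identify the element,
            # 'content' names the attribute that holds the date.
            if attributes.get(known['attribute']) == known['value']:
                value = attributes.get(known['content'])
                if value:
                    self.date_string = value
                    return


def find_date_string(html):
    """Return the raw date string from a known tag, or None if no tag matches."""
    finder = PublishDateTagFinder()
    finder.feed(html)
    finder.close()
    return finder.date_string
```

A page whose only date sits in a tag that is not on the list, such as `<meta name="dcterms.created" ...>`, makes `find_date_string` return None.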

Fox News stores its dates in the meta tag section, but in a tag that newspaper doesn't query. To extract the dates from Fox News articles you would do this:

article_meta_data = article.meta_data

article_published_date = str({value for (key, value) in article_meta_data.items() if key == 'dcterms.created'})
print(article_published_date)
{'2020-10-11T12:51:53-04:00'}
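
If you want a datetime object rather than the string form of a set, a direct key lookup is one option. This is a sketch under assumptions: `published_from_meta` is a hypothetical helper, `sample_meta` is a stand-in for `article.meta_data` (assumed to behave like a dictionary, as the snippet above relies on), and the value is assumed to be an ISO 8601 timestamp like the one printed above.

```python
from datetime import datetime


def published_from_meta(meta_data, key='dcterms.created'):
    """Return the timestamp stored under `key` as a datetime, or None."""
    raw = meta_data.get(key)
    if not isinstance(raw, str) or not raw:
        return None  # key missing, or not a plain string
    try:
        return datetime.fromisoformat(raw)
    except ValueError:
        return None  # the value is not in a format fromisoformat accepts


# Stand-in for article.meta_data, using the value shown above.
sample_meta = {'dcterms.created': '2020-10-11T12:51:53-04:00'}
published = published_from_meta(sample_meta)
print(published.isoformat())
```

Returning None for a missing or unparseable value keeps the behaviour in line with `article.publish_date`, so the helper can be used as a fallback when the library finds nothing.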

Sometimes a source has its published dates in a section that newspaper doesn't look at. When this happens you have to wrap some additional code around newspaper to harvest the date.

For example, BBC stores its dates in the script application/ld+json. Newspaper isn't designed to query or extract from this script. To extract the dates from BBC articles you would do this:

import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(article.html, 'html.parser')
bbc_dictionary = json.loads("".join(soup.find("script", {"type":"application/ld+json"}).contents))

date_published = [value for (key, value) in bbc_dictionary.items() if key == 'datePublished']
print(date_published)
['2020-10-11T20:11:33.000Z']
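
The snippet above assumes the page has exactly one JSON-LD script and that it holds a single object. A slightly more defensive variant might look like the sketch below. It uses only the standard library so it runs without BeautifulSoup; the names `JsonLdCollector` and `find_date_published` are hypothetical, and it only checks top-level objects (or a top-level list of objects) for `datePublished`.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and dict(attrs).get('type') == 'application/ld+json':
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'script' and self._inside:
            self.blocks.append(''.join(self._buffer))
            self._inside = False


def find_date_published(html):
    """Return the first 'datePublished' string in any JSON-LD block, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except ValueError:
            continue  # skip malformed JSON instead of failing
        # JSON-LD may be a single object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get('datePublished'):
                return item['datePublished']
    return None
```

With newspaper you would call it as `find_date_published(article.html)` after `article.download()`; it returns None instead of raising when the script is absent.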

I published a Newspaper Usage Document on GitHub that discusses various collection strategies and other topics surrounding this library.
