Unable to scrape the timestamp attached to a post's text from a webpage

Question

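I'm trying to scrape the timestamp attached to a post's text from a webpage. I can scrape the text without any trouble, but I can't locate its timestamp. I can, however, scrape the other timestamps on the page, the ones attached to comments: they appear inside a script tag as the values of created_at. The one I'm after is not among them.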

Website link

I've tried:

import re
import json
import requests

url = 'https://www.instagram.com/p/CEuX_8iH95S/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'
    r = s.get(url)
    script_tag = json.loads(re.findall(r"window\._sharedData = (.*?});",r.text)[0])
    post_content = script_tag['entry_data']['PostPage'][0]['graphql']['shortcode_media']['edge_media_to_caption']['edges'][0]['node']['text']
    print(post_content)

How can I parse the timestamp attached to the post text on the site above?

Answer 1

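The post's own timestamp is stored under the taken_at_timestamp key of shortcode_media, in the same JSON you are already parsing. You can convert it with the datetime.datetime.fromtimestamp() method from the datetime module.

Here is how: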

import datetime
import re
import json
import requests

url = 'https://www.instagram.com/p/CEuX_8iH95S/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'
    r = s.get(url)
    script_tag = json.loads(re.findall(r'window\._sharedData = (.*?});', r.text)[0])
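    # 'taken_at_timestamp' is the post's creation time as a Unix epoch in seconds
    post_date = script_tag['entry_data']['PostPage'][0]['graphql']['shortcode_media']['taken_at_timestamp']

    # fromtimestamp() without a tz argument converts to the machine's local time
    print(datetime.datetime.fromtimestamp(post_date).isoformat())
    print(datetime.datetime.fromtimestamp(post_date).strftime("%b %d %Y %H:%M:%S"))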

This prints:

2020-09-04T20:25:49
Sep 04 2020 20:25:49
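
Since the printed time depends on the time zone of the machine running the script, it may help to convert to UTC explicitly and to guard against the page layout changing. The following is only a sketch: extract_post_time and SAMPLE_HTML are hypothetical names, the HTML fragment is hand-written for illustration (not real Instagram output), and the epoch value in it is a made-up example. It assumes the page still embeds its data as window._sharedData, which is not guaranteed.

```python
import json
import re
from datetime import datetime, timezone

# Same pattern as in the answer; it assumes the page embeds its data as
# `window._sharedData = {...};`, which the site may change at any time.
SHARED_DATA_RE = re.compile(r"window\._sharedData = (.*?});")


def extract_post_time(html):
    """Return the post's taken_at_timestamp as an aware UTC datetime, or None."""
    match = SHARED_DATA_RE.search(html)
    if match is None:
        return None  # the script tag was not found
    try:
        data = json.loads(match.group(1))
        media = data["entry_data"]["PostPage"][0]["graphql"]["shortcode_media"]
        epoch = media["taken_at_timestamp"]
    except (ValueError, KeyError, IndexError, TypeError):
        return None  # invalid JSON or an unexpected structure
    # Passing tz makes the result independent of the local time zone.
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


# Hypothetical, hand-written page fragment with a made-up epoch value.
SAMPLE_HTML = (
    '<script>window._sharedData = {"entry_data": {"PostPage": [{"graphql": '
    '{"shortcode_media": {"taken_at_timestamp": 1599251149}}}]}};</script>'
)

if __name__ == "__main__":
    print(extract_post_time(SAMPLE_HTML).isoformat())  # 2020-09-04T20:25:49+00:00
```

In the real script you would pass r.text instead of SAMPLE_HTML. Returning None rather than raising is just one design choice; raising a descriptive exception would work equally well.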

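If you want to learn more about date formats, see the strftime() and strptime() format codes in the Python datetime documentation.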
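
As a quick illustration of those format codes, here is a small sketch that formats a fixed datetime matching the output shown above and parses a string back again. The names post_time and format_examples are made up for this example, and only numeric codes are used so the results do not depend on the locale.

```python
from datetime import datetime

# Fixed, naive datetime matching the output printed above.
post_time = datetime(2020, 9, 4, 20, 25, 49)


def format_examples(dt):
    """Format one datetime in a few common numeric layouts."""
    return {
        "date_only": dt.strftime("%Y-%m-%d"),        # year-month-day
        "day_first": dt.strftime("%d.%m.%Y %H:%M"),  # day.month.year hour:minute
        "compact": dt.strftime("%Y%m%dT%H%M%S"),     # no separators
    }


if __name__ == "__main__":
    for name, text in format_examples(post_time).items():
        print(name, text)

    # strptime() is the inverse: it parses a string using the same codes.
    parsed = datetime.strptime("2020-09-04 20:25:49", "%Y-%m-%d %H:%M:%S")
    print(parsed == post_time)  # True
```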

Answer 1
1 · accepted · 2020-09-13 12:55:32
