
Retrieving comments (disqus) embedded in another web page with python

I'm scraping a web site using Python 3.5 (BeautifulSoup). I can read everything in the source code, but I've been trying to retrieve the embedded Disqus comments with no success (they are loaded by a referenced script).

The relevant piece of the HTML source looks like this:

var disqus_identifier = "node/XXXXX";
<script type='text/javascript' src='https://disqus.com/forums/siteweb/embed.js'></script>

The src attribute points to an external script.
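For what it's worth, static values such as disqus_identifier can be read straight out of the downloaded page source with a regular expression, without executing any JavaScript. A minimal sketch, run against a stand-in snippet (the HTML below is an illustration, not the real page):

```python
import re

# Stand-in for the downloaded page source.
html = """
<script type='text/javascript'>
var disqus_identifier = "node/12345";
</script>
<script type='text/javascript' src='https://disqus.com/forums/siteweb/embed.js'></script>
"""

# Capture the quoted value assigned to disqus_identifier.
match = re.search(r'var\s+disqus_identifier\s*=\s*"([^"]+)"', html)
identifier = match.group(1) if match else None
print(identifier)  # node/12345
```

This only recovers values written literally into the page; the comments themselves are fetched by embed.js at runtime, which is why the approach in the answer below queries Disqus directly.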

I've read the suggestions on Stack Overflow about using Selenium, but I had a really hard time trying to make it work, with no success. I understand that Selenium emulates a browser (which I believe is too heavy for what I want), and I also had problems getting the webdrivers to run correctly, so I dropped this option.

I would like to be able to execute the script and retrieve the comments the .js loads. I found that a possible solution is PyV8, but I can't import it in Python. I've read posts on the internet and googled it, but it's not working.

I installed Sublime Text 3 and downloaded pyv8-win64-p3 manually into:

C:\Users\myusername\AppData\Roaming\Sublime Text 3\Installed Packages\PyV8\pyv8-win64-p3

But I keep getting:

ImportError: No module named 'PyV8'

If somebody can help me, I'll be very thankful.

You can construct the Disqus API request by studying the site's network traffic; all the required data is present in the page source, and the Disqus API just needs it sent as a query string. I recently extracted comments from the Disqus API; here is the sample code.

Example: here soup is the page source, and params_dict = json.loads(str(soup).split("embedVars = ")[1].split(";")[0])
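To make that one-liner concrete, here is the embedVars extraction run against a stand-in snippet of page source (the field names match those used in the function below; the values are invented for illustration):

```python
import json

# Stand-in for the article's page source: the site assigns a JSON object
# to embedVars inside a <script> block.
page_source = (
    '<script>var embedVars = {"disqusShortname": "siteweb", '
    '"disqusIdentifier": "node/12345", '
    '"disqusUrl": "http://example.com/article", '
    '"disqusTitle": "Example article"};</script>'
)

# Take everything between "embedVars = " and the closing ";" and parse it.
params_dict = json.loads(page_source.split("embedVars = ")[1].split(";")[0])
print(params_dict["disqusShortname"])  # siteweb
```

Note the split on ";" assumes the JSON value itself contains no semicolons, which holds for typical embedVars payloads.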

import json
import requests
from bs4 import BeautifulSoup

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0'
}

def getLink(url):
    # Fetch a URL and return the parsed page.
    return BeautifulSoup(requests.get(url, headers=HEADERS).text, 'html.parser')

def getJson(url):
    # Fetch a URL and parse the response body as JSON.
    return requests.get(url, headers=HEADERS).json()

def disqus(params_dict, soup):
    comments_list = []
    base = 'default'
    s_o = 'default'
    version = '25916d2dd3d996eaedf6cdb92f03e7dd'
    f = params_dict['disqusShortname']
    t_i = params_dict['disqusIdentifier']
    t_u = params_dict['disqusUrl']
    t_e = params_dict['disqusTitle']
    t_d = soup.head.title.text
    t_t = params_dict['disqusTitle']
    url = ('http://disqus.com/embed/comments/?base=%s&version=%s&f=%s&t_i=%s&t_u=%s'
           '&t_e=%s&t_d=%s&t_t=%s&s_o=%s&l=' % (base, version, f, t_i, t_u, t_e, t_d, t_t, s_o))
    comment_soup = getLink(url)
    # The embed page carries the thread metadata in a JSON <script> block.
    temp_dict = json.loads(
        str(comment_soup).split('threadData" type="text/json">')[1].split('</script')[0])
    thread_id = temp_dict['response']['thread']['id']
    forumname = temp_dict['response']['thread']['forum']
    i = 1
    while True:
        disqus_url = ('http://disqus.com/api/3.0/threads/listPostsThreaded?limit=100'
                      '&thread=' + thread_id + '&forum=' + forumname +
                      '&order=popular&cursor=' + str(i) + ':0:0'
                      '&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F')
        comment_json = getJson(disqus_url)
        # Each page carries the posts plus a cursor describing pagination;
        # stop when the cursor reports there is no next page.
        comments_list.extend(comment_json['response'])
        if not comment_json.get('cursor', {}).get('hasNext'):
            break
        i += 1
    return comments_list

It will return JSON in which you can find and extract the comments. Hope this helps.
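As a sketch of that last step, here is how the comment bodies can be pulled out of one page of listPostsThreaded JSON. The structure below mirrors an observed API response (each post carries its text in a raw_message field); the field names come from inspection, not official documentation, and the sample data is invented:

```python
# Stand-in for one page of the listPostsThreaded JSON response.
page = {
    "cursor": {"hasNext": False},
    "response": [
        {"id": "1", "raw_message": "First comment", "author": {"name": "alice"}},
        {"id": "2", "raw_message": "Second comment", "author": {"name": "bob"}},
    ],
}

# Each entry in "response" is one post: keep (author, text) pairs.
comments = [(p["author"]["name"], p["raw_message"]) for p in page["response"]]
print(comments)  # [('alice', 'First comment'), ('bob', 'Second comment')]
```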

For Facebook embedded comments, you can use Facebook's Graph API to extract the comments in JSON format.

Example:

Facebook comments: https://graph.facebook.com/comments/?ids= "link of page"
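A sketch of building that Graph API request with the page URL properly encoded. Note this only constructs the URL; current Graph API versions additionally require an access_token parameter, which is omitted here, and the page URL is a hypothetical example:

```python
from urllib.parse import urlencode

# Hypothetical page whose embedded Facebook comments we want.
page_url = "http://example.com/article"

# The Graph API comments endpoint takes the page URL in the "ids" parameter.
url = "https://graph.facebook.com/comments/?" + urlencode({"ids": page_url})
print(url)
```

Requesting this URL (with a valid token) returns the comments as JSON keyed by the page URL.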
