
Python - Twitter crawler

I would like to ask whether there is any way to make my crawler scroll all the way down to the bottom of the page and wait for the page to load (so that the HTML of the newly loaded posts is added). Twitter's HTML only contains a few posts, and you have to scroll down manually for the HTML to be refreshed once the posts at the bottom have loaded. The `<html></html>` tag only contains the currently loaded posts, so my crawler stops there.

def spider(targetname, DOMAIN, g_data):
    for item in g_data:
        try:
            # Extract the tweet's fields from the rendered HTML
            name = item.find_all("strong", {"class": "fullname show-popup-with-id "})[0].text
            username = item.find_all("span", {"class": "username u-dir"})[0].text
            post = item.find_all("p", {"class": "TweetTextSize TweetTextSize--normal js-tweet-text tweet-text"})[0].text
            replies = item.find_all("span", {"class": "u-hiddenVisually"})[3].text
            retweets = item.find_all("span", {"class": "u-hiddenVisually"})[4].text
            likes = item.find_all("span", {"class": "u-hiddenVisually"})[5].text
            retweetby = item.find_all("a", {"href": "/" + targetname})[0].text
            datas = item.find_all('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
            if not datas:
                continue  # no permalink/timestamp means this isn't a full tweet
            for data in datas:
                link = DOMAIN + data['href']
                date = data['title']
            append_to_file(crawledfile, name, username, post, link, replies, retweets, likes, retweetby, date)
        except (IndexError, KeyError):
            # Skip items that don't have all the expected fields
            # (a bare "except: pass" would also hide real bugs)
            continue

That would require the crawler to execute JavaScript while crawling, which I believe most crawlers won't. You may find you can do whatever you're trying to do using Twitter's official REST API instead.

Also, using APIs where possible will usually be more reliable than scraping web pages. ;)
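If you do need the rendered page rather than the API, one common workaround (not mentioned in the answers, just a sketch) is to drive a real browser that *does* execute JavaScript, e.g. with Selenium. The function below is a minimal sketch under that assumption; `scroll_to_bottom` and its parameters are my own names, and it only assumes a Selenium-style driver object exposing `execute_script` and `page_source`.

```python
import time

def scroll_to_bottom(driver, pause=2.0, max_scrolls=20):
    """Keep scrolling a Selenium-style driver until the page height
    stops growing, so lazily loaded posts end up in the DOM."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        # Scroll to the current bottom, then give the page time to load more posts
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded; we reached the real bottom
        last_height = new_height
    return driver.page_source  # full HTML, including the loaded posts
```

With a real browser you would first create the driver and open the page (e.g. `driver = webdriver.Firefox()` then `driver.get(url)`), call `scroll_to_bottom(driver)`, and feed the returned HTML to BeautifulSoup exactly as before.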

In addition to what swalladge mentioned, there are plenty of Twitter packages for Python, which means you don't even need to read Twitter's API documentation to do what you are trying to do! Just search for "Twitter Python" to get numerous suggestions.
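For illustration only, tweepy is one such package (whether it fits your exact needs is an assumption on my part; the credential strings are placeholders you obtain from Twitter's developer site). A minimal sketch of fetching a user's timeline through the REST API instead of scraping the rendered page:

```python
def fetch_tweets(consumer_key, consumer_secret, access_token, access_secret,
                 screen_name, count=200):
    """Fetch a user's recent tweets via the official REST API
    using the tweepy package (hypothetical credentials)."""
    import tweepy  # imported inside the function so the sketch stays optional
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    api = tweepy.API(auth)
    tweets = api.user_timeline(screen_name=screen_name, count=count)
    # Return the same kind of fields the scraper was collecting
    return [(t.created_at, t.text, t.retweet_count, t.favorite_count)
            for t in tweets]
```

This replaces all of the fragile CSS-class lookups in the scraper with structured fields, and the API handles pagination for you, so there is no need to simulate scrolling at all.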

A crawler can't execute JavaScript functions and get new output, so what you see is all you get. If a website that uses AJAX wants to be crawlable, it needs to provide HTML snapshots of how the page would look to a normal user.

In your case that would mean outputting all the tweets, but who knows how much data that is. However, Twitter likes to be crawled, since that makes tweets easily viewable in search engines, so there's an API you can use.
