How to remove retweets from a collected dataset
I have a dataset of tweets that I collected in Python (Jupyter Notebook), but it contains many duplicate tweets.
How can I remove them programmatically with Python (Jupyter Notebook)?
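import csv
import tweepy

# `api` is assumed to be an authenticated tweepy.API instance created earlier.
# newline='' and encoding='utf-8' let the csv module write tweet text as-is.
csvFile = open('ua.csv', 'a', newline='', encoding='utf-8')
csvWriter = csv.writer(csvFile)
search_words = "corona"
date_since = "2020-10-13"
# Query string that asks the search API to exclude retweets
new_search = search_words + " -filter:retweets"
new_search  # notebook cell output: shows the resulting query string
for tweet in tweepy.Cursor(api.search, q=search_words, count=100,
                           lang="id",
                           since=date_since).items():
    print(tweet.created_at, tweet.text)
    # Write the text as str; encoding to bytes would store a b'...' literal
    csvWriter.writerow([tweet.created_at, tweet.text])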
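While you iterate over the tweets, you can keep the IDs of the tweets you have already written in a set, and skip any tweet whose ID is already in it. Note also that the question builds new_search with "-filter:retweets" but passes search_words to the query; passing new_search instead asks the API to leave retweets out in the first place.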
tweet_set = set()  # IDs of tweets that have already been written
# Pass new_search (with -filter:retweets) so the API also excludes retweets
for tweet in tweepy.Cursor(api.search, q=new_search, count=100,
                           lang="id",
                           since=date_since).items():
    if tweet.id not in tweet_set:
        print(tweet.created_at, tweet.text)
        # Write str, not bytes; assumes csvFile was opened with encoding='utf-8'
        csvWriter.writerow([tweet.created_at, tweet.text])
        tweet_set.add(tweet.id)  # remember this ID so repeats are skipped
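The loop above only helps while collecting. For a CSV that has already been collected, a minimal sketch of the same set-based idea is shown below. It uses only the standard library, and the function names (clean_tweet_rows, clean_csv) and the text_index parameter are hypothetical, not part of Tweepy. It assumes the tweet text is in the second column, treats rows with identical text as duplicates, and relies on the heuristic that retweet text from the standard search API starts with "RT @". If the text column was saved as a bytes literal (for example because the text was passed through .encode('utf-8') before writing), the prefix check would need adjusting.

```python
import csv


def clean_tweet_rows(rows, text_index=1):
    """Drop retweets and repeated texts, keeping the first occurrence of each."""
    seen_texts = set()
    cleaned = []
    for row in rows:
        if len(row) <= text_index:
            continue  # skip blank or malformed rows
        text = row[text_index].strip()
        # Heuristic (assumption): retweets begin with "RT @username:"
        if text.startswith("RT @"):
            continue
        if text in seen_texts:
            continue  # same text already kept once
        seen_texts.add(text)
        cleaned.append(row)
    return cleaned


def clean_csv(src_path, dst_path, text_index=1):
    """Read src_path, write the cleaned rows to dst_path, return rows removed."""
    with open(src_path, newline="", encoding="utf-8") as src:
        rows = list(csv.reader(src))
    cleaned = clean_tweet_rows(rows, text_index)
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        csv.writer(dst).writerows(cleaned)
    return len(rows) - len(cleaned)
```

Calling clean_csv('ua.csv', 'ua_clean.csv') would write the cleaned rows to a new file (ua_clean.csv is just an example name) and leave the original untouched, so the result can be checked before replacing anything.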