
How to remove retweets from a collected dataset

I have a dataset of tweets that I collected in Python (Jupyter Notebook), but it contains many duplicate tweets. How can I remove them programmatically with Python?

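import csv
import tweepy

# `api` is assumed to be an authenticated tweepy.API instance created earlier in the notebook.
csvFile = open('ua.csv', 'a', newline='', encoding='utf-8')
csvWriter = csv.writer(csvFile)
search_words = "corona"
date_since = "2020-10-13"
# Ask the search API to leave retweets out of the results.
new_search = search_words + " -filter:retweets"
# Pass new_search (not search_words) so the retweet filter is actually applied.
for tweet in tweepy.Cursor(api.search, q=new_search, count=100,
                           lang="id",
                           since=date_since).items():
    print(tweet.created_at, tweet.text)
    # Write the text as str; encoding to bytes would store a literal b'...' in the CSV.
    csvWriter.writerow([tweet.created_at, tweet.text])
csvFile.close()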

While iterating over the tweets, you can keep the IDs of the tweets you have already written in a set, and skip any tweet whose ID is already in it.

tweet_set = set()  # IDs of tweets that have already been written
for tweet in tweepy.Cursor(api.search, q=search_words, count=100,
                           lang="id",
                           since=date_since).items():
    if tweet.id not in tweet_set:
        print(tweet.created_at, tweet.text)
        # Write the text as str; encoding to bytes would store a literal b'...' in the CSV.
        csvWriter.writerow([tweet.created_at, tweet.text])
        tweet_set.add(tweet.id)  # remember this tweet so it is not written again
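
The set-based check above only helps while collecting. For a CSV that has already been written, a similar idea can be applied after the fact. The following is a minimal sketch, not part of the original answer. It assumes each row is `[created_at, text]` with the text stored as plain text, that retweets can be recognised by the conventional "RT @" prefix, and that tweets whose text matches after trimming and case-folding count as duplicates. The names `dedupe_tweets`, `dedupe_csv` and the output file `ua_clean.csv` are hypothetical.

```python
import csv


def dedupe_tweets(rows):
    """Return rows without retweets or repeated texts, keeping first occurrences.

    Assumes each row looks like [created_at, text]; shorter rows are skipped.
    """
    seen_texts = set()  # normalised texts that have already been kept
    kept = []
    for row in rows:
        if len(row) < 2:
            continue  # blank or malformed row
        text = row[1].strip()
        if text.startswith("RT @"):
            continue  # assumption: retweets carry the usual "RT @user" prefix
        key = text.casefold()
        if key in seen_texts:
            continue  # same text was already kept
        seen_texts.add(key)
        kept.append(row)
    return kept


def dedupe_csv(src_path, dst_path):
    """Read src_path, write the cleaned rows to dst_path, return rows dropped."""
    with open(src_path, newline="", encoding="utf-8") as src:
        rows = list(csv.reader(src))
    kept = dedupe_tweets(rows)
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        csv.writer(dst).writerows(kept)
    return len(rows) - len(kept)
```

Calling `dedupe_csv("ua.csv", "ua_clean.csv")` would write the cleaned copy to a separate file and leave the original untouched. Matching on text is a fallback for files that do not store the tweet ID; if the ID is saved as a column, comparing IDs as in the answer above is the more reliable check.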
