
Extract tweets and save them in separate CSV files using loops

I am trying to extract tweets for three journalists in three time periods, using zip() in a for loop over cnn_search_query, start_time, and end_time. The extracted data for each period should be saved to a separate CSV file by the for day in date: loop, so that I end up with three CSV files: 2022-02-02.csv, 2022-01-31.csv, and 2022-01-28.csv.

My code does extract tweets and append them to all three CSV files. However, it does not iterate through the zip() loop correctly: the tweets appended to the three CSV files are all the same.

Can anyone please spot the mistake in the way I loop through cnn_search_query, start_time, and end_time?

import csv
from time import sleep
# helper functions auth(), create_headers(), create_url(), connect_to_endpoint(),
# and append_to_csv() are defined elsewhere in my script

# A look at the four lists I loop over
date = ['2022-02-02', '2022-01-31', '2022-01-28'] # names for csv files
cnn_search_query = ['(Jim acosta OR "@Acosta") -is:retweet', '(Jim acosta OR "@Acosta") -is:retweet', '(Jake tapper OR "@jaketapper") -is:retweet'] # search queries 
start_time = ['2022-01-27T21:58:59.000Z', '2022-01-25T21:58:59.000Z', '2022-01-22T21:58:59.000Z'] # start datetime for extraction of tweets
end_time = ['2022-02-06T21:58:59.000Z', '2022-02-04T21:58:59.000Z', '2022-02-01T21:58:59.000Z'] # end datetime for extraction of tweets
# Inputs for tweets
bearer_token = auth()
headers = create_headers(bearer_token)
max_results = 500

# Total number of tweets we collected from the loop
total_tweets = 0

# code that extracts tweets and appends them to separate CSV files
for day in date:

    # Inputs
    count = 0  # Counting tweets per time period
    max_count = 500  # Max tweets per time period
    flag = True
    next_token = None

    # create a CSV file named after the current day in the date list
    csvFile = open(day + ".csv", "a", newline="", encoding='utf-8')
    csvWriter = csv.writer(csvFile)

    # write the header row: author_id, created_at, id, and tweet
    csvWriter.writerow(
        ['author_id', 'created_at', 'id', 'tweet'])
    csvFile.close()

    # loop over cnn_search_query, start_time, and end_time
    for query, start, end in zip(cnn_search_query, start_time, end_time):

        url = create_url(query, start, end, max_results)
        json_response = connect_to_endpoint(url[0], headers, url[1], next_token)
        result_count = json_response['meta']['result_count']

        # Check if flag is true
        while flag:

            # Check if max_count reached
            if count >= max_count:
                break
            print("-------------------")
            print("Token: ", next_token)


            if 'next_token' in json_response['meta']:
                #  Save the token to use for next call
                next_token = json_response['meta']['next_token']
                print("Next Token: ", next_token)
                if result_count is not None and result_count > 0 and next_token is not None:
                    print("Start Date: ", s)
                    append_to_csv(json_response, day + ".csv")
                    count += result_count
                    total_tweets += result_count
                    print("Total # of Tweets added: ", total_tweets)
                    print("-------------------")
                    sleep(5)
            # If no next token exists
            else:
                if result_count is not None and result_count > 0:
                    print("-------------------")
                    print("Start Date: ", s)
                    append_to_csv(json_response, day + ".csv")
                    count += result_count
                    total_tweets += result_count
                    print("Total # of Tweets added: ", total_tweets)
                    print("-------------------")
                    sleep(5)

                # Since this is the final request, turn flag to false to move to the next time period.
                flag = False
                next_token = None
            sleep(5)
print("Total number of results: ", total_tweets)

Load saved CSV files

import pandas as pd

jimacosta1 = pd.read_csv(r"path...\2022-01-28.csv")
jimacosta2 = pd.read_csv(r"path...\2022-01-31.csv")
tapper = pd.read_csv(r"path...\2022-02-02.csv")

As can be seen below, all three CSV files contain the same tweets. Note that the date ranges for jimacosta1 and jimacosta2 overlap; however, that is not why the tweets match, because the tapper data frame contains the same tweets as well. Thus, the issue is with the code above.

jimacosta1.head()
             author_id  ...                                              tweet
0  1483878706477256704  ...                                @Acosta Love Duke !
1  1452815367206838281  ...  @tomselliott @brianstelter @Acosta Your entire...
2  1454178957931208704  ...  @Acosta Hey Jimbo.  How many of the paltry 350...
3           3365876644  ...  @JeffreyHallett @ConserveLetters @Acosta Does ...
4  1450129826870927372  ...  @Acosta 😂🤣😂🤣😂🤣😂😂😂🤣😂🤣🤣🤣😂🤣😂🤣😂🤣🤣😂🤣😂🤣😂🤣😂😂😂🤣😂🤣😂🤣🤣🤣😂...
[5 rows x 4 columns]
jimacosta2.head()
             author_id  ...                                              tweet
0  1483878706477256704  ...                                @Acosta Love Duke !
1  1452815367206838281  ...  @tomselliott @brianstelter @Acosta Your entire...
2  1454178957931208704  ...  @Acosta Hey Jimbo.  How many of the paltry 350...
3           3365876644  ...  @JeffreyHallett @ConserveLetters @Acosta Does ...
4  1450129826870927372  ...  @Acosta 😂🤣😂🤣😂🤣😂😂😂🤣😂🤣🤣🤣😂🤣😂🤣😂🤣🤣😂🤣😂🤣😂🤣😂😂😂🤣😂🤣😂🤣🤣🤣😂...
[5 rows x 4 columns]
tapper.head()
             author_id  ...                                              tweet
0  1483878706477256704  ...                                @Acosta Love Duke !
1  1452815367206838281  ...  @tomselliott @brianstelter @Acosta Your entire...
2  1454178957931208704  ...  @Acosta Hey Jimbo.  How many of the paltry 350...
3           3365876644  ...  @JeffreyHallett @ConserveLetters @Acosta Does ...
4  1450129826870927372  ...  @Acosta 😂🤣😂🤣😂🤣😂😂😂🤣😂🤣🤣🤣😂🤣😂🤣😂🤣🤣😂🤣😂🤣😂🤣😂😂😂🤣😂🤣😂🤣🤣🤣😂...
[5 rows x 4 columns]

I can provide code for the function append_to_csv() if needed.
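
For reference, a minimal sketch of what such an append_to_csv() might look like (a hypothetical reconstruction, not the code from the question), assuming create_url() requests the author_id and created_at tweet fields:

def append_to_csv(json_response, file_name):
    # Hypothetical sketch: append one row per tweet, matching the header
    # ['author_id', 'created_at', 'id', 'tweet'] written earlier.
    with open(file_name, "a", newline="", encoding='utf-8') as f:
        writer = csv.writer(f)
        for tweet in json_response.get('data', []):
            writer.writerow([tweet['author_id'], tweet['created_at'],
                             tweet['id'], tweet['text']])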

I printed the loop variables as below:

for day in date:
    for query, start, end in zip(cnn_search_query, start_time, end_time):
        print(day,'\t|', query,'\t|', start,'\t|', end)

The output (shown as a screenshot in the original post) has nine rows: each of the three (query, start, end) tuples is printed once for every day.

With this I see that you are repeating the same three queries for every day; the only difference is the CSV file name. That is why you get the same content in all three CSV files.
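
A minimal sketch of one way to fix it, assuming the four lists are meant to line up index by index: zip date together with the other three lists, so each file name is paired with exactly one (query, start, end) tuple, and move the per-period state into that single loop.

# One flattened loop instead of two nested ones: zip() pairs the four
# lists element by element, so each day gets its own query and window.
for day, query, start, end in zip(date, cnn_search_query, start_time, end_time):
    count = 0
    max_count = 500
    flag = True
    next_token = None

    # create the CSV file for this day and write the header row
    csvFile = open(day + ".csv", "a", newline="", encoding='utf-8')
    csvWriter = csv.writer(csvFile)
    csvWriter.writerow(['author_id', 'created_at', 'id', 'tweet'])
    csvFile.close()

    url = create_url(query, start, end, max_results)
    json_response = connect_to_endpoint(url[0], headers, url[1], next_token)
    result_count = json_response['meta']['result_count']

    # ... the while-flag pagination block from the question goes here,
    # unchanged, now running exactly once per (day, query, start, end) tuple

With this structure each CSV file only receives the tweets from its own request. Note that zip() stops at the shortest input, so all four lists must have the same length (here, three).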
