I am trying to extract tweets for three journalists in three time periods using zip() in a for loop over cnn_search_query, start_time, and end_time. The extracted data for each journalist should be saved to a separate CSV file by the outer for day in date: loop, so that I end up with three CSV files: 2022-02-02.csv, 2022-01-31.csv, and 2022-01-28.csv.
My code does extract tweets and append them to the three CSV files. However, it does not loop correctly through zip(): the tweets appended to the three CSV files are all identical. Can anyone spot the mistake in the way I loop over cnn_search_query, start_time, and end_time?
# A look at the four lists I loop over
date = ['2022-02-02', '2022-01-31', '2022-01-28'] # names for csv files
cnn_search_query = [
    '(Jim acosta OR "@Acosta") -is:retweet',
    '(Jim acosta OR "@Acosta") -is:retweet',
    '(Jake tapper OR "@jaketapper") -is:retweet',
]  # search queries
start_time = ['2022-01-27T21:58:59.000Z', '2022-01-25T21:58:59.000Z', '2022-01-22T21:58:59.000Z'] # start datetime for extraction of tweets
end_time = ['2022-02-06T21:58:59.000Z', '2022-02-04T21:58:59.000Z', '2022-02-01T21:58:59.000Z'] # end datetime for extraction of tweets
# Inputs for tweets
bearer_token = auth()
headers = create_headers(bearer_token)
max_results = 500
# Total number of tweets we collected from the loop
total_tweets = 0
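For reference, this is how I expect zip() to pair the three lists, one tuple per journalist/time window (a minimal standalone sketch using only the lists above, with the request code left out):

```python
# Standalone sketch: zip() pairs the lists by position, so each
# iteration should describe exactly one extraction run.
cnn_search_query = [
    '(Jim acosta OR "@Acosta") -is:retweet',
    '(Jim acosta OR "@Acosta") -is:retweet',
    '(Jake tapper OR "@jaketapper") -is:retweet',
]
start_time = ['2022-01-27T21:58:59.000Z', '2022-01-25T21:58:59.000Z', '2022-01-22T21:58:59.000Z']
end_time = ['2022-02-06T21:58:59.000Z', '2022-02-04T21:58:59.000Z', '2022-02-01T21:58:59.000Z']

runs = list(zip(cnn_search_query, start_time, end_time))
for query, start, end in runs:
    print(query, '|', start, '|', end)
```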
# code that extracts tweets and appends them to separate CSV files
for day in date:
    # Inputs
    count = 0  # Counting tweets per time period
    max_count = 500  # Max tweets per time period
    flag = True
    next_token = None
    # create a csv file named after the current element of the list date
    csvFile = open(day + ".csv", "a", newline="", encoding='utf-8')
    csvWriter = csv.writer(csvFile)
    # write headers for the four variables: author_id, created_at, id, and tweet
    csvWriter.writerow(['author_id', 'created_at', 'id', 'tweet'])
    csvFile.close()
    # loop over cnn_search_query, start_time, and end_time
    for query, start, end in zip(cnn_search_query, start_time, end_time):
        url = create_url(query, start, end, max_results)
        json_response = connect_to_endpoint(url[0], headers, url[1], next_token)
        result_count = json_response['meta']['result_count']
        # Check if flag is true
        while flag:
            # Check if max_count reached
            if count >= max_count:
                break
            print("-------------------")
            print("Token: ", next_token)
            if 'next_token' in json_response['meta']:
                # Save the token to use for the next call
                next_token = json_response['meta']['next_token']
                print("Next Token: ", next_token)
                if result_count is not None and result_count > 0 and next_token is not None:
                    print("Start Date: ", start)
                    append_to_csv(json_response, day + ".csv")
                    count += result_count
                    total_tweets += result_count
                    print("Total # of Tweets added: ", total_tweets)
                    print("-------------------")
                    sleep(5)
            # If no next token exists
            else:
                if result_count is not None and result_count > 0:
                    print("-------------------")
                    print("Start Date: ", start)
                    append_to_csv(json_response, day + ".csv")
                    count += result_count
                    total_tweets += result_count
                    print("Total # of Tweets added: ", total_tweets)
                    print("-------------------")
                    sleep(5)
                # Since this is the final request, turn flag to false to move to the next time period.
                flag = False
                next_token = None
            sleep(5)
print("Total number of results: ", total_tweets)
Loading the saved CSV files:
jimacosta1 = pd.read_csv(r"path...\2022-01-28.csv")
jimacosta2 = pd.read_csv(r"path...\2022-01-31.csv")
tapper = pd.read_csv(r"path...\2022-02-02.csv")
As can be seen, all three CSV files contain the same tweets. Note that the dates for jimacosta1 and jimacosta2 overlap; however, that is not why the tweets match, because the tapper data frame contains the same tweets as well. The issue must therefore be in the code above.
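One quick way to confirm that the frames really are identical row for row is pandas' DataFrame.equals(). This sketch uses dummy frames standing in for the loaded CSV files, since the real paths are local:

```python
import pandas as pd

# Dummy frames standing in for the CSV files loaded above.
jimacosta1 = pd.DataFrame({'author_id': [1483878706477256704],
                           'tweet': ['@Acosta Love Duke !']})
tapper = pd.DataFrame({'author_id': [1483878706477256704],
                       'tweet': ['@Acosta Love Duke !']})

# DataFrame.equals() is True only when shape, dtypes, and values all match.
print(jimacosta1.equals(tapper))  # True here, which is exactly the problem
```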
jimacosta1.head()
author_id ... tweet
0 1483878706477256704 ... @Acosta Love Duke !
1 1452815367206838281 ... @tomselliott @brianstelter @Acosta Your entire...
2 1454178957931208704 ... @Acosta Hey Jimbo. How many of the paltry 350...
3 3365876644 ... @JeffreyHallett @ConserveLetters @Acosta Does ...
4 1450129826870927372 ... @Acosta 😂🤣😂🤣😂🤣😂😂😂🤣😂🤣🤣🤣😂🤣😂🤣😂🤣🤣😂🤣😂🤣😂🤣😂😂😂🤣😂🤣😂🤣🤣🤣😂...
[5 rows x 4 columns]
jimacosta2.head()
author_id ... tweet
0 1483878706477256704 ... @Acosta Love Duke !
1 1452815367206838281 ... @tomselliott @brianstelter @Acosta Your entire...
2 1454178957931208704 ... @Acosta Hey Jimbo. How many of the paltry 350...
3 3365876644 ... @JeffreyHallett @ConserveLetters @Acosta Does ...
4 1450129826870927372 ... @Acosta 😂🤣😂🤣😂🤣😂😂😂🤣😂🤣🤣🤣😂🤣😂🤣😂🤣🤣😂🤣😂🤣😂🤣😂😂😂🤣😂🤣😂🤣🤣🤣😂...
[5 rows x 4 columns]
tapper.head()
author_id ... tweet
0 1483878706477256704 ... @Acosta Love Duke !
1 1452815367206838281 ... @tomselliott @brianstelter @Acosta Your entire...
2 1454178957931208704 ... @Acosta Hey Jimbo. How many of the paltry 350...
3 3365876644 ... @JeffreyHallett @ConserveLetters @Acosta Does ...
4 1450129826870927372 ... @Acosta 😂🤣😂🤣😂🤣😂😂😂🤣😂🤣🤣🤣😂🤣😂🤣😂🤣🤣😂🤣😂🤣😂🤣😂😂😂🤣😂🤣😂🤣🤣🤣😂...
[5 rows x 4 columns]
I can provide the code for the append_to_csv() function if needed.
I printed the loop variables as below:
for day in date:
    for query, start, end in zip(cnn_search_query, start_time, end_time):
        print(day, '\t|', query, '\t|', start, '\t|', end)
This output shows that the same three queries are repeated for every day; the only difference is the CSV file name. That is why you get the same content in all three CSV files.
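One way to fix it is to drop the nesting and zip the file names together with the queries and time windows in a single loop, so each CSV file corresponds to exactly one query. A sketch using only the four lists (the request and CSV-writing code inside the loop would stay as in the question):

```python
date = ['2022-02-02', '2022-01-31', '2022-01-28']
cnn_search_query = [
    '(Jim acosta OR "@Acosta") -is:retweet',
    '(Jim acosta OR "@Acosta") -is:retweet',
    '(Jake tapper OR "@jaketapper") -is:retweet',
]
start_time = ['2022-01-27T21:58:59.000Z', '2022-01-25T21:58:59.000Z', '2022-01-22T21:58:59.000Z']
end_time = ['2022-02-06T21:58:59.000Z', '2022-02-04T21:58:59.000Z', '2022-02-01T21:58:59.000Z']

# One loop, one (file, query, start, end) tuple per run: no nesting,
# so each day's CSV is written from exactly one query and time window.
runs = list(zip(date, cnn_search_query, start_time, end_time))
for day, query, start, end in runs:
    print(day, '|', query, '|', start, '|', end)
```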