I have a dictionary called "output" that contains several nested dictionaries, as shown below:
>>> output.keys()
dict_keys(['posts', 'totalResults', 'moreResultsAvailable', 'next', 'requestsLeft', 'warnings'])
>>> output['posts'][0].keys()
dict_keys(['thread', 'uuid', 'url', 'ord_in_thread', 'parent_url', 'author', 'published', 'title','text', 'highlightText', 'highlightTitle', 'highlightThreadTitle', 'language', 'external_links', 'external_images', 'entities', 'rating', 'crawled', 'updated'])
>>> output['posts'][0]['thread'].keys()
dict_keys(['uuid', 'url', 'site_full', 'site', 'site_section', 'site_categories', 'section_title', 'title', 'title_full', 'published', 'replies_count', 'participants_count', 'site_type', 'country', 'spam_score', 'main_image', 'performance_score', 'domain_rank', 'reach', 'social'])
>>> output['posts'][0]['thread']['social'].keys()
dict_keys(['facebook', 'gplus', 'pinterest', 'linkedin', 'stumbledupon', 'vk'])
I want to create a CSV file in which each row holds the values for a selected list of keys from output['posts'][0], output['posts'][0]['thread'] and output['posts'][0]['thread']['social']. I came up with this code:
post_keys = output['posts'][0].keys()
post_thread_keys = output['posts'][0]['thread'].keys()
social_keys = output['posts'][0]['thread']['social'].keys()
with open('file.csv', 'w', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=post_thread_keys)
    writer.writeheader()
    for i in range(len(output['posts'])):
        for key in output['posts'][i]['thread']:
            writer.writerow(output['posts'][i]['thread'])
It only works for the first level of the dictionary, output['posts'][0]['thread'], not for the dictionaries nested inside it. It also doubles the number of rows: the file now has 200 rows instead of 100.
Please have a look at the output file I stored on Google Drive for a more concrete example: file.csv
You need a function to create the sub-keys in the format you have specified. The same function can also be called to give you the list of extra column names needed for the header.
As you are adding three sub-entries (thread, entities and social), they can be removed from their parent dictionaries with .pop() to avoid duplication in the columns.
import webhoseio
import csv

def get_social_entries(social):
    social_entries = {}
    for social_key, social_values in social.items():
        for key, value in social_values.items():
            social_entries[f'{social_key}_{key}'] = value
    return social_entries

# <<Get output here>>

csv_columns = []
first_post = output['posts'][0]
for key in first_post['thread']:
    csv_columns.append(key)
for key in first_post:
    if key not in ['entities', 'thread', 'social']:
        csv_columns.append(key)
for key in first_post['entities']:
    csv_columns.append(key)
csv_columns.extend(list(get_social_entries(first_post['thread']['social']).keys()))

with open('file.csv', 'w', encoding='utf-8', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
    writer.writeheader()
    for post in output['posts']:
        thread = post.pop('thread')
        entities = post.pop('entities')
        social = thread.pop('social')
        social_entries = get_social_entries(social)
        writer.writerow(post | thread | entities | social_entries) # | operator needs Python 3.9
The | operator for merging dictionaries requires Python 3.9 or later. On older versions you could use something like:
row = post
row.update(thread)
row.update(entities)
row.update(social_entries)
writer.writerow(row)
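Both variants above modify the post dictionaries in place (through .pop() and .update()). If output needs to stay intact afterwards, one option is to build each row with dictionary unpacking, which works on Python 3.5 and later. The following is only a sketch on made-up sample data: the helper name merge_row and the sample values are hypothetical, and the flattening of social is inlined so the block runs on its own.

```python
def merge_row(post):
    """Build one flat CSV row from a post without modifying the post."""
    thread = post['thread']
    entities = post['entities']
    # Same flattening as get_social_entries(): 'facebook' + 'likes' -> 'facebook_likes'
    social_entries = {
        f'{site}_{name}': value
        for site, counts in thread['social'].items()
        for name, value in counts.items()
    }
    nested = ('thread', 'entities')
    # Later dictionaries win when keys collide, e.g. thread['uuid'] replaces post['uuid']
    return {
        **{k: v for k, v in post.items() if k not in nested},
        **{k: v for k, v in thread.items() if k != 'social'},
        **entities,
        **social_entries,
    }

# Made-up sample shaped like one element of output['posts'] (values are assumptions)
sample_post = {
    'uuid': 'post-1',
    'text': 'hello',
    'thread': {'uuid': 'thread-1', 'site': 'example.com',
               'social': {'facebook': {'likes': 3, 'shares': 1}}},
    'entities': {'persons': [], 'locations': []},
}
row = merge_row(sample_post)
```

Because the post and its thread share key names such as uuid, url, title and published, the merged row keeps only the thread's value for those keys. This is also true of the | and .update() versions.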
Note: newline='' is added to remove the extra blank lines in the output.
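The reason is that the csv writer ends every row with \r\n by default. If the file is opened in text mode without newline='', Windows translates the \n into \r\n again, giving \r\r\n, which shows up as an empty line after each row. A small sketch to check the raw bytes that end up in the file (the temporary path and the sample rows are made up for illustration):

```python
import csv
import os
import tempfile

# Write two rows to a throwaway file, opened the same way as in the answer
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
with open(path, 'w', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['a', 'b'])
    writer.writerow([1, 2])

# Read the file back in binary mode to see the exact line endings
with open(path, 'rb') as f:
    raw = f.read()
print(raw)
```

With newline='' each row ends in a single \r\n on every platform, so no blank lines appear between rows.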
You could use a similar approach to also expand the entities.
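As a rough sketch of what that might look like, the helper below mirrors get_social_entries. The structure of entities is not shown in the question, so this assumes each value is a list of dictionaries that have a 'name' key; the helper name get_entity_entries, the column prefix and the sample data are all hypothetical.

```python
def get_entity_entries(entities):
    """Flatten entities into one column per category (hypothetical helper).

    Assumes each value is a list of dicts that contain a 'name' key.
    """
    return {
        f'entities_{category}': '; '.join(item['name'] for item in items)
        for category, items in entities.items()
    }

# Made-up sample data; the real structure may differ
sample_entities = {
    'persons': [{'name': 'ada lovelace', 'sentiment': 'none'}],
    'organizations': [],
    'locations': [{'name': 'london', 'sentiment': 'none'},
                  {'name': 'paris', 'sentiment': 'none'}],
}
entity_entries = get_entity_entries(sample_entities)
```

The keys of the returned dictionary would then be added to csv_columns in place of the raw entity keys, and the dictionary merged into each row in the same way as social_entries.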