I am a Data Science student with little coding experience so far.
My issue is: how can I obtain a list of dicts from a string that is already written in the form of a list of dicts, but that pandas treats as a string?
Here is the dataset (credits): https://www.kaggle.com/tmdb/tmdb-movie-metadata/data
In the columns 'cast' and 'crew' I have cells like this:
[
{"credit_id": "52fe420dc3a36847f800012d", "department": "Directing", "gender": 1, "id": 3110, "job": "Director", "name": "Allison Anders"},
{"credit_id": "52fe420dc3a36847f80001c9", "department": "Writing", "gender": 1, "id": 3110, "job": "Writer", "name": "Allison Anders"}
]
(obviously there are dozens of dicts for each cell)
My main problem is that, after I have loaded the file and created a data frame, the cells of these two columns (cast and crew) are seen by pandas as strings, and not as a list of dicts, and so I cannot perform the operations I need.
creditsB = pd.read_csv('folder\\tmdb_5000_credits.csv')
creditsDF = pd.DataFrame(creditsB)
type(creditsDF.loc[0,'crew'])
# str
And if I try to apply list() on it, it just creates a list of single characters.
dct = list(creditsDF.loc[0,'crew'])
dct
# output:
['[',
'{',
'"',
'c',
'r',
'e',
# and so on
How can I make Python understand that it's actually a list of dicts, and treat it as such?
I have to do some basic operations like "for each movie, compute the number of cast members" or "for each movie, compute the number of directors". These would be really easy if I just solved this big issue.
Thanks in advance for any help!
You need the dicts in an actual list; then you can loop over it and count its entries:
movies = [ {"credit_id": "52fe420dc3a36847f800012d", "department": "Directing", "gender": 1, "id": 3110, "job": "Director", "name": "Allison Anders"}, {"credit_id": "52fe420dc3a36847f80001c9", "department": "Writing", "gender": 1, "id": 3110, "job": "Writer", "name": "Allison Anders"} ]
for movie in movies:
    print(movie["name"])
# count the entries in the list
print(len(movies))
Try ast.literal_eval:
import ast
text = '''
[
{"credit_id": "52fe420dc3a36847f800012d", "department": "Directing", "gender": 1, "id": 3110, "job": "Director", "name": "Allison Anders"},
{"credit_id": "52fe420dc3a36847f80001c9", "department": "Writing", "gender": 1, "id": 3110, "job": "Writer", "name": "Allison Anders"}
]
'''
dicts = ast.literal_eval(text)
# [{'name': 'Allison Anders', 'department': 'Directing', 'credit_id': '52fe420dc3a36847f800012d', 'gender': 1, 'job': 'Director', 'id': 3110},
# {'name': 'Allison Anders', 'department': 'Writing', 'credit_id': '52fe420dc3a36847f80001c9', 'gender': 1, 'job': 'Writer', 'id': 3110}]
print(len(dicts))
# 2
print(dicts[0]['department'])
# Directing
To convert a whole column at once, use apply:
df['col'] = df['col'].apply(ast.literal_eval)
Extracting desired fields from dictionaries:
dicts = ast.literal_eval(text)
[d['department'] for d in dicts]
# ['Directing', 'Writing']
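Combining apply with a per-cell count gives the per-movie numbers asked about in the question. The following is only a minimal sketch: the small `credits` frame and its values are made up to stand in for tmdb_5000_credits.csv, and the column names `n_cast` and `n_directors` are hypothetical.

```python
import ast
import pandas as pd

# Tiny made-up stand-in for tmdb_5000_credits.csv (values are illustrative only).
credits = pd.DataFrame({
    "movie_id": [1, 2],
    "cast": [
        '[{"name": "A"}, {"name": "B"}, {"name": "C"}]',
        '[{"name": "D"}]',
    ],
    "crew": [
        '[{"job": "Director", "name": "X"}, {"job": "Writer", "name": "X"}]',
        '[{"job": "Director", "name": "Y"}, {"job": "Director", "name": "Z"}]',
    ],
})

# Parse each string cell into a real list of dicts.
for col in ("cast", "crew"):
    credits[col] = credits[col].apply(ast.literal_eval)

# Number of cast members per movie: the length of each parsed list.
credits["n_cast"] = credits["cast"].apply(len)

# Number of directors per movie: count crew entries whose job is "Director".
credits["n_directors"] = credits["crew"].apply(
    lambda crew: sum(1 for member in crew if member.get("job") == "Director")
)

print(credits[["movie_id", "n_cast", "n_directors"]])
```

Because the parsed lists stay in their original rows, each count remains attached to its movie.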
So you have a list of dictionaries, but they appear in your dataframe as strings. This is extremely inefficient. You should aim to improve the workflow upstream so that you read dictionaries directly into Python.
However, given what you have, you can use ast.literal_eval to read your strings literally, then feed the result into pd.DataFrame. This works because pd.DataFrame accepts a list of dictionaries directly.
Once the data is in a dataframe, len(df.index) gives the number of rows, and df.loc[df['job'] == 'Director', 'name'] will filter for names of directors. Here's an example:
import pandas as pd
from itertools import chain
from ast import literal_eval
s = pd.Series(['[{"credit_id": "52fe420dc3a36847f800012d", "department": "Directing", "gender": 1, "id": 3110, "job": "Director", "name": "Allison Anders"},{"credit_id": "52fe420dc3a36847f80001c9", "department": "Writing", "gender": 1, "id": 3110, "job": "Writer", "name": "DEF GHI"}]',
'[{"credit_id": "52fe420dc3a36847f800012e", "department": "Costume", "gender": 0, "id": 4110, "job": "Dresser", "name": "A B"},{"credit_id": "52fe420dc3a36847f80001c8", "department": "Videography", "gender": 1, "id": 3111, "job": "Other", "name": "Joe Smith"}]',
'[{"credit_id": "52fe420dc3a36847f800012f", "department": "Music", "gender": 1, "id": 5110, "job": "Composer", "name": "C D"},{"credit_id": "52fe420dc3a36847f80001c7", "department": "Production", "gender": 0, "id": 3112, "job": "Writer", "name": "Ben Andrews"}]'])
print(s)
# 0 [{"credit_id": "52fe420dc3a36847f800012d", "de...
# 1 [{"credit_id": "52fe420dc3a36847f800012e", "de...
# 2 [{"credit_id": "52fe420dc3a36847f800012f", "de...
# dtype: object
chained = chain.from_iterable(literal_eval(i) for i in s)
df = pd.DataFrame(list(chained))
print(df)
# credit_id department gender id job \
# 0 52fe420dc3a36847f800012d Directing 1 3110 Director
# 1 52fe420dc3a36847f80001c9 Writing 1 3110 Writer
# 2 52fe420dc3a36847f800012e Costume 0 4110 Dresser
# 3 52fe420dc3a36847f80001c8 Videography 1 3111 Other
# 4 52fe420dc3a36847f800012f Music 1 5110 Composer
# 5 52fe420dc3a36847f80001c7 Production 0 3112 Writer
# name
# 0 Allison Anders
# 1 DEF GHI
# 2 A B
# 3 Joe Smith
# 4 C D
# 5 Ben Andrews
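Note that chaining all the lists together drops the link between a crew member and the movie they belong to. If per-movie results are needed, one option is to carry an identifier into each row. This is a sketch under assumptions: the `movie_id` column and the sample values are hypothetical, and `crew_long` is just an illustrative name.

```python
import pandas as pd
from ast import literal_eval

# Made-up sample; "movie_id" is an assumed identifier column.
credits = pd.DataFrame({
    "movie_id": [10, 20],
    "crew": [
        '[{"job": "Director", "name": "Allison Anders"}, {"job": "Writer", "name": "DEF GHI"}]',
        '[{"job": "Director", "name": "A B"}, {"job": "Director", "name": "Joe Smith"}]',
    ],
})

# One output row per crew member, tagged with the movie it came from.
rows = [
    {"movie_id": movie_id, **member}
    for movie_id, crew in zip(credits["movie_id"], credits["crew"].apply(literal_eval))
    for member in crew
]
crew_long = pd.DataFrame(rows)

# Count directors per movie by filtering and grouping on the identifier.
directors_per_movie = (
    crew_long[crew_long["job"] == "Director"].groupby("movie_id").size()
)
print(directors_per_movie)
```

Movies with no director at all will not appear in `directors_per_movie`, since the filter removes all of their rows before grouping.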