
Python sees list of dicts as string: how to parse?

I am a student in Data Science but have little code experience so far.

My issue is: how can I obtain a list of dicts from a string that already has the form of a list of dicts, but that pandas treats as a plain string?

Here is the dataset (credits): https://www.kaggle.com/tmdb/tmdb-movie-metadata/data

In the columns 'cast' and 'crew' I have cells like this:

[
{"credit_id": "52fe420dc3a36847f800012d", "department": "Directing", "gender": 1, "id": 3110, "job": "Director", "name": "Allison Anders"}, 
{"credit_id": "52fe420dc3a36847f80001c9", "department": "Writing", "gender": 1, "id": 3110, "job": "Writer", "name": "Allison Anders"}
]

(obviously there are dozens of dicts for each cell)

My main problem is that, after I have loaded the file and created a data frame, the cells of these two columns (cast and crew) are seen by pandas as strings, and not as a list of dicts, and so I cannot perform the operations I need.

import pandas as pd

# read_csv already returns a DataFrame, so no extra pd.DataFrame() call is needed
creditsDF = pd.read_csv('folder\\tmdb_5000_credits.csv')
type(creditsDF.loc[0,'crew'])
# str

And if I try to apply list() on it, it just creates a list of single characters.

dct = list(creditsDF.loc[0,'crew'])
dct
 # output:
 ['[',
 '{',
 '"',
 'c',
 'r',
 'e',
 # and so on

How can I make Python understand that it's actually a list of dicts, and treat it as such?

I have to do some basic operations like "for each movie, compute the number of cast members" or "for each movie, compute the number of directors". These would be really easy if I just solved this big issue.

Thanks in advance for any help!

You need the dicts as items of an actual Python list:

    movies = [ {"credit_id": "52fe420dc3a36847f800012d", "department": "Directing", "gender": 1, "id": 3110, "job": "Director", "name": "Allison Anders"}, {"credit_id": "52fe420dc3a36847f80001c9", "department": "Writing", "gender": 1, "id": 3110, "job": "Writer", "name": "Allison Anders"} ]

    # print the name stored in each dict
    for movie in movies:
        print(movie["name"])

    # count the dicts in the list
    print(len(movies))

Try ast.literal_eval:

import ast

text = '''
[
{"credit_id": "52fe420dc3a36847f800012d", "department": "Directing", "gender": 1, "id": 3110, "job": "Director", "name": "Allison Anders"}, 
{"credit_id": "52fe420dc3a36847f80001c9", "department": "Writing", "gender": 1, "id": 3110, "job": "Writer", "name": "Allison Anders"}
]
'''

dicts = ast.literal_eval(text)
# [{'name': 'Allison Anders', 'department': 'Directing', 'credit_id': '52fe420dc3a36847f800012d', 'gender': 1, 'job': 'Director', 'id': 3110}, 
# {'name': 'Allison Anders', 'department': 'Writing', 'credit_id': '52fe420dc3a36847f80001c9', 'gender': 1, 'job': 'Writer', 'id': 3110}]
print(len(dicts))
# 2
print(dicts[0]['department'])
# Directing

To convert a whole column at once, use apply:

df['col'] = df['col'].apply(ast.literal_eval)
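The same conversion can also be done while the file is being read, through the converters parameter of pd.read_csv. The following is a minimal sketch, assuming a tiny in-memory CSV as a stand-in for tmdb_5000_credits.csv; the movie_id value and the names sample, csv_text and credits_df are made up for illustration.

```python
import ast
import io

import pandas as pd

# Build a tiny CSV in memory as a stand-in for the real file (made-up data).
sample = pd.DataFrame({
    "movie_id": [1],
    "crew": ['[{"job": "Director", "name": "Allison Anders"}]'],
})
csv_text = sample.to_csv(index=False)

# converters maps a column name to a function applied to each raw cell string.
credits_df = pd.read_csv(
    io.StringIO(csv_text),
    converters={"crew": ast.literal_eval},
)

print(type(credits_df.loc[0, "crew"]))
# <class 'list'>
print(credits_df.loc[0, "crew"][0]["name"])
# Allison Anders
```

With the real file, the io.StringIO object would be replaced by the path to the CSV, and 'cast' could be added to the converters dict in the same way.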

Extracting desired fields from dictionaries:

dicts = ast.literal_eval(text)
[d['department'] for d in dicts]
# ['Directing', 'Writing']
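Putting these pieces together for the counts asked about in the question (crew members per movie, directors per movie) might look like the sketch below. It assumes a small made-up frame in place of the real file: the crew column and the job field come from the question, while the titles and the names credits_df, n_crew and n_directors are illustrative.

```python
import ast

import pandas as pd

# Made-up stand-in for the real data: each 'crew' cell is a string
# that looks like a list of dicts, as in the question.
credits_df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "crew": [
        '[{"job": "Director", "name": "Allison Anders"}, '
        '{"job": "Writer", "name": "Allison Anders"}]',
        '[{"job": "Director", "name": "A B"}, '
        '{"job": "Director", "name": "C D"}, '
        '{"job": "Composer", "name": "E F"}]',
    ],
})

# Parse every cell once so the column holds real lists of dicts.
credits_df["crew"] = credits_df["crew"].apply(ast.literal_eval)

# Number of crew members per movie.
credits_df["n_crew"] = credits_df["crew"].apply(len)

# Number of directors per movie.
credits_df["n_directors"] = credits_df["crew"].apply(
    lambda members: sum(1 for m in members if m["job"] == "Director")
)

print(credits_df[["title", "n_crew", "n_directors"]])
```

The same pattern should carry over to the 'cast' column, since it is stored in the same string form.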

So you have lists of dictionaries, but they appear in your dataframe as strings. This is extremely inefficient. You should aim to improve the workflow upstream so that the dictionaries are read directly into Python.

However, given what you have, you can use ast.literal_eval to parse each string into the Python objects it spells out, then feed the result into pd.DataFrame. This works because pd.DataFrame accepts a list of dictionaries directly.

Once in a dataframe, you can:

  • Count the number of dictionaries via len(df.index).
  • Use pandas Boolean indexing to filter, e.g. df.loc[df['job'] == 'Director', 'name'] will select the names of directors.

Here's an example:

import pandas as pd
from itertools import chain
from ast import literal_eval

s = pd.Series(['[{"credit_id": "52fe420dc3a36847f800012d", "department": "Directing", "gender": 1, "id": 3110, "job": "Director", "name": "Allison Anders"},{"credit_id": "52fe420dc3a36847f80001c9", "department": "Writing", "gender": 1, "id": 3110, "job": "Writer", "name": "DEF GHI"}]',
               '[{"credit_id": "52fe420dc3a36847f800012e", "department": "Costume", "gender": 0, "id": 4110, "job": "Dresser", "name": "A B"},{"credit_id": "52fe420dc3a36847f80001c8", "department": "Videography", "gender": 1, "id": 3111, "job": "Other", "name": "Joe Smith"}]',
               '[{"credit_id": "52fe420dc3a36847f800012f", "department": "Music", "gender": 1, "id": 5110, "job": "Composer", "name": "C D"},{"credit_id": "52fe420dc3a36847f80001c7", "department": "Production", "gender": 0, "id": 3112, "job": "Writer", "name": "Ben Andrews"}]'])

print(s)

# 0    [{"credit_id": "52fe420dc3a36847f800012d", "de...
# 1    [{"credit_id": "52fe420dc3a36847f800012e", "de...
# 2    [{"credit_id": "52fe420dc3a36847f800012f", "de...
# dtype: object

chained = chain.from_iterable(literal_eval(i) for i in s)

df = pd.DataFrame(list(chained))

print(df)

#                   credit_id   department  gender    id       job  \
# 0  52fe420dc3a36847f800012d    Directing       1  3110  Director   
# 1  52fe420dc3a36847f80001c9      Writing       1  3110    Writer   
# 2  52fe420dc3a36847f800012e      Costume       0  4110   Dresser   
# 3  52fe420dc3a36847f80001c8  Videography       1  3111     Other   
# 4  52fe420dc3a36847f800012f        Music       1  5110  Composer   
# 5  52fe420dc3a36847f80001c7   Production       0  3112    Writer   

#              name  
# 0  Allison Anders  
# 1         DEF GHI  
# 2             A B  
# 3       Joe Smith  
# 4             C D  
# 5     Ben Andrews  
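Note that chaining all cells into one frame drops the information about which cell (movie) each row came from. If per-movie counts are needed, one option is to tag each dict with its source index before building the frame. The following is a minimal sketch on shortened made-up data; the 'movie' key and the names rows, total, directors and per_movie are assumptions introduced here, not part of the dataset.

```python
from ast import literal_eval

import pandas as pd

# Shortened stand-in for the series above (made-up data).
s = pd.Series([
    '[{"job": "Director", "name": "Allison Anders"}, {"job": "Writer", "name": "DEF GHI"}]',
    '[{"job": "Dresser", "name": "A B"}, {"job": "Other", "name": "Joe Smith"}]',
])

# Tag each dict with the index of the cell it came from (hypothetical 'movie' key).
rows = [
    {"movie": idx, **member}
    for idx, cell in s.items()
    for member in literal_eval(cell)
]
df = pd.DataFrame(rows)

# Total number of dicts across all cells.
total = len(df.index)

# Names of directors, via Boolean indexing.
directors = df.loc[df["job"] == "Director", "name"]

# Number of crew members per movie.
per_movie = df.groupby("movie").size()

print(total)
print(directors.tolist())
print(per_movie.tolist())
```

On the real data, the series index could be replaced by whatever column identifies a movie, so that the grouped counts can be matched back to the original rows.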
