Python sees list of dicts as string: how to parse?
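I am a data science student, but so far I have almost no coding experience.
My question is: how do I get a list of dicts out of a string that already looks like a list of dicts, but that pandas treats as a string?
Here is the dataset (credit): https://www.kaggle.com/tmdb/tmdb-movie-metadata/data
In the "cast" and "crew" columns, I have cells like the following: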
[
{"credit_id": "52fe420dc3a36847f800012d", "department": "Directing", "gender": 1, "id": 3110, "job": "Director", "name": "Allison Anders"},
{"credit_id": "52fe420dc3a36847f80001c9", "department": "Writing", "gender": 1, "id": 3110, "job": "Writer", "name": "Allison Anders"}
]
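(Obviously, each cell contains dozens of dicts.)
My main problem is that after loading the file and creating the dataframe, pandas treats the cells of these two columns (cast and crew) as strings rather than lists of dicts, so I cannot perform the operations I need.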
import pandas as pd

creditsB = pd.read_csv('folder\\tmdb_5000_credits.csv')
creditsDF = pd.DataFrame(creditsB)  # redundant: read_csv already returns a DataFrame
type(creditsDF.loc[0,'crew'])
# str
If I try to apply list() to it, it just creates a list of single characters.
dct = list(creditsDF.loc[0,'crew'])
dct
# output:
['[',
'{',
'"',
'c',
'r',
'e',
# and so on
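How can I make Python understand that this is actually a list of dicts, and work with it as one?
I have to do some basic operations, such as "for each movie, count the number of actors" or "for each movie, count the number of directors". These would be very easy if I could just solve this one big problem.
Thanks in advance for your help!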
You have to put the dicts in a list:
movies = [
    {"credit_id": "52fe420dc3a36847f800012d", "department": "Directing", "gender": 1, "id": 3110, "job": "Director", "name": "Allison Anders"},
    {"credit_id": "52fe420dc3a36847f80001c9", "department": "Writing", "gender": 1, "id": 3110, "job": "Writer", "name": "Allison Anders"},
]
for movie in movies:
    print(movie["name"])
# count the entries in the list
print(len(movies))
Try ast.literal_eval:
import ast
text = '''
[
{"credit_id": "52fe420dc3a36847f800012d", "department": "Directing", "gender": 1, "id": 3110, "job": "Director", "name": "Allison Anders"},
{"credit_id": "52fe420dc3a36847f80001c9", "department": "Writing", "gender": 1, "id": 3110, "job": "Writer", "name": "Allison Anders"}
]
'''
dicts = ast.literal_eval(text)
# [{'name': 'Allison Anders', 'department': 'Directing', 'credit_id': '52fe420dc3a36847f800012d', 'gender': 1, 'job': 'Director', 'id': 3110},
# {'name': 'Allison Anders', 'department': 'Writing', 'credit_id': '52fe420dc3a36847f80001c9', 'gender': 1, 'job': 'Writer', 'id': 3110}]
print(len(dicts))
# 2
print(dicts[0]['department'])
# Directing
To apply the change efficiently to a whole column, try apply:
df['col'] = df['col'].apply(lambda x: ast.literal_eval(x))
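As a variation on the apply call above, the parsing can also be done while the file is being read, through the converters argument of pd.read_csv. The sketch below is only an illustration: the two-row sample table and the movie_id column are made up, and the CSV is built in memory so that no file on disk is needed. It also assumes every cell holds a valid Python literal; an empty cell would make ast.literal_eval raise an error.

```python
import ast
import io

import pandas as pd

# Hypothetical two-movie sample standing in for tmdb_5000_credits.csv.
sample = pd.DataFrame({
    "movie_id": [1, 2],
    "crew": [
        '[{"job": "Director", "name": "Allison Anders"}, {"job": "Writer", "name": "Allison Anders"}]',
        '[{"job": "Composer", "name": "C D"}]',
    ],
})

# Round-trip through CSV text so the example does not depend on a real file.
csv_text = sample.to_csv(index=False)

# converters maps a column name to a function called on each raw cell string.
credits = pd.read_csv(io.StringIO(csv_text), converters={"crew": ast.literal_eval})

print(type(credits.loc[0, "crew"]))  # the cell is now a list, not a str
print(len(credits.loc[0, "crew"]))   # number of crew dicts for the first movie
```

With the real dataset, the same converters dictionary could list both "cast" and "crew".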
To extract the desired field from the dicts:
dicts = ast.literal_eval(text)
[d['department'] for d in dicts]
# ['Directing', 'Writing']
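Putting the two pieces together, the per-movie counts asked about in the question could be sketched roughly as follows. The miniature table, its titles and the n_cast / n_directors column names are invented for illustration; only the "cast" and "crew" column names and the "job" key come from the question.

```python
import ast

import pandas as pd

# Hypothetical miniature of the credits table; the real file has many more rows.
credits = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "cast": [
        '[{"name": "Actor 1"}, {"name": "Actor 2"}, {"name": "Actor 3"}]',
        '[{"name": "Actor 4"}]',
    ],
    "crew": [
        '[{"job": "Director", "name": "Allison Anders"}, {"job": "Writer", "name": "Allison Anders"}]',
        '[{"job": "Director", "name": "A B"}, {"job": "Director", "name": "C D"}]',
    ],
})

# Turn each string cell into a real list of dicts.
for col in ("cast", "crew"):
    credits[col] = credits[col].apply(ast.literal_eval)

# Number of cast members per movie is just the length of each list.
credits["n_cast"] = credits["cast"].apply(len)

# Number of directors per movie: count crew dicts whose job is "Director".
credits["n_directors"] = credits["crew"].apply(
    lambda members: sum(1 for m in members if m.get("job") == "Director")
)

print(credits[["title", "n_cast", "n_directors"]])
```

Using m.get("job") rather than m["job"] is a defensive choice in case some crew entries lack that key.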
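So you have lists of dictionaries, but they show up as strings in the dataframe. That is extremely inefficient. You should aim to improve your upstream workflow so that the dictionaries are read straight into Python.
However, given what you have, you can use ast.literal_eval to read the strings literally, then feed the result to pd.DataFrame. This works because pd.DataFrame accepts a list of dictionaries directly.
Once the data is in a dataframe, len(df.index) counts the number of dictionaries, and df.loc[df['job'] == 'Director', 'name'] selects the directors' names. Here is an example: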
import pandas as pd
from itertools import chain
from ast import literal_eval
s = pd.Series(['[{"credit_id": "52fe420dc3a36847f800012d", "department": "Directing", "gender": 1, "id": 3110, "job": "Director", "name": "Allison Anders"},{"credit_id": "52fe420dc3a36847f80001c9", "department": "Writing", "gender": 1, "id": 3110, "job": "Writer", "name": "DEF GHI"}]',
'[{"credit_id": "52fe420dc3a36847f800012e", "department": "Costume", "gender": 0, "id": 4110, "job": "Dresser", "name": "A B"},{"credit_id": "52fe420dc3a36847f80001c8", "department": "Videography", "gender": 1, "id": 3111, "job": "Other", "name": "Joe Smith"}]',
'[{"credit_id": "52fe420dc3a36847f800012f", "department": "Music", "gender": 1, "id": 5110, "job": "Composer", "name": "C D"},{"credit_id": "52fe420dc3a36847f80001c7", "department": "Production", "gender": 0, "id": 3112, "job": "Writer", "name": "Ben Andrews"}]'])
print(s)
# 0 [{"credit_id": "52fe420dc3a36847f800012d", "de...
# 1 [{"credit_id": "52fe420dc3a36847f800012e", "de...
# 2 [{"credit_id": "52fe420dc3a36847f800012f", "de...
# dtype: object
chained = chain.from_iterable(literal_eval(i) for i in s)
df = pd.DataFrame(list(chained))
print(df)
# credit_id department gender id job \
# 0 52fe420dc3a36847f800012d Directing 1 3110 Director
# 1 52fe420dc3a36847f80001c9 Writing 1 3110 Writer
# 2 52fe420dc3a36847f800012e Costume 0 4110 Dresser
# 3 52fe420dc3a36847f80001c8 Videography 1 3111 Other
# 4 52fe420dc3a36847f800012f Music 1 5110 Composer
# 5 52fe420dc3a36847f80001c7 Production 0 3112 Writer
# name
# 0 Allison Anders
# 1 DEF GHI
# 2 A B
# 3 Joe Smith
# 4 C D
# 5 Ben Andrews
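Note that the flattened dataframe above no longer records which movie each row came from. If per-movie counts are needed, one possible variation (a sketch, not part of the original answer) is to use Series.explode, which repeats the original row label for every dict. The "movie" column name is an assumption, the sample strings are shortened, and the code assumes every list is non-empty, since explode turns an empty list into NaN.

```python
from ast import literal_eval

import pandas as pd

# Shortened version of the sample Series above (fewer keys per dict).
s = pd.Series([
    '[{"job": "Director", "name": "Allison Anders"}, {"job": "Writer", "name": "DEF GHI"}]',
    '[{"job": "Dresser", "name": "A B"}, {"job": "Other", "name": "Joe Smith"}]',
    '[{"job": "Composer", "name": "C D"}, {"job": "Writer", "name": "Ben Andrews"}]',
])

# One dict per row; the original row label is repeated for each dict.
exploded = s.apply(literal_eval).explode()

# Build the long table and keep the repeated labels as a "movie" column.
long_df = pd.DataFrame(exploded.tolist(), index=exploded.index)
long_df = long_df.rename_axis("movie").reset_index()

# Crew members per movie, and writers per movie.
crew_per_movie = long_df.groupby("movie").size()
writers_per_movie = (long_df["job"] == "Writer").groupby(long_df["movie"]).sum()

print(crew_per_movie)
print(writers_per_movie)
```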