Data cleaning: How to remove certain values from a pandas dataframe column?
I am analyzing the user profile interests of a social network. From an export of the social network's database, I generated a dataframe with the user ID, name, and user interests. I expected to get only keywords in the 'User interest' column, but I actually got a mix of keywords and user IDs:
User ID displayName interests
0 5705952d0eb2063205ca1d3c Jane Catch []
1 5705e99ac391580e00ea87c9 Heidi Kent [{u'text': u'psychology', u'_id': {u'$oid': u'...
2 5705efb6c391580e00ea87ca Rob Tuckinson [{u'text': u'learning', u'_id': {u'$oid': u'57...
I would like to clean the interests column so that it keeps only the keywords.
Today, I have this data:
User ID,displayName,interests
"570df0f2a40cc20e00c15e09,Alejandra Zara,""[{u'text': u'pretend-play', u'_id': {u'$oid': u'570e57eba40cc20e00c161ea'}}, {u'text': u'autobiographical-memory', u'_id': {u'$oid': u'570e57eba40cc20e00c161e9'}}]"""
For the first row, I would like to keep only the following:
"570df0f2a40cc20e00c15e09,Alejandra Zara,pretend-play', autobiographical-memory'
Any ideas for data cleaning techniques? For each keyword, I need to remove the information related to the user ID (different for each row), such as:
u'_id': {u'$oid': u'570e57eba40cc20e00c161ea'}}
and remove {u'text': u (which is placed at the beginning of each keyword).
If I'm reading the question correctly, what you have in your interests column is the string representation of a Python list of dicts, from which you want to extract specific values. If so, you can use ast.literal_eval to parse it:
In [24]: df
Out[24]:
User ID displayName \
0 570df0f2a40cc20e00c15e09 Alejandra Zara
interests
0 [{u'text': u'pretend-play', u'_id': {u'$oid': ...
In [25]: df['interests'].map(lambda x: ','.join(i['text'] for i in ast.literal_eval(x)))
Out[25]:
0 pretend-play,autobiographical-memory
Name: interests, dtype: object
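For completeness, here is a self-contained sketch of the same approach that can be run as a script. It assumes every value in the interests column is a string holding a valid Python literal (including the empty list "[]"); the helper name extract_keywords and the two sample rows are hypothetical, modelled on the data in the question.

```python
import ast

import pandas as pd


def extract_keywords(raw):
    """Return the 'text' values of a stringified list of dicts, comma-joined."""
    # literal_eval only evaluates Python literals, so it is safer than eval().
    # The u'...' prefixes from the Python 2 export are accepted by Python 3.
    items = ast.literal_eval(raw)
    return ",".join(item["text"] for item in items)


# Hypothetical sample rows modelled on the question's data.
df = pd.DataFrame({
    "User ID": ["5705952d0eb2063205ca1d3c", "570df0f2a40cc20e00c15e09"],
    "displayName": ["Jane Catch", "Alejandra Zara"],
    "interests": [
        "[]",
        "[{u'text': u'pretend-play', u'_id': {u'$oid': u'570e57eba40cc20e00c161ea'}}, "
        "{u'text': u'autobiographical-memory', u'_id': {u'$oid': u'570e57eba40cc20e00c161e9'}}]",
    ],
})

# Overwrite the column with the cleaned, comma-separated keywords.
df["interests"] = df["interests"].map(extract_keywords)
print(df["interests"].tolist())
# ['', 'pretend-play,autobiographical-memory']
```

If some cells are missing (NaN) or malformed, ast.literal_eval will raise an exception, so you may want to filter or guard those rows first.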