Slicing Pandas Dataframe based on a value present in a column which is a list of lists
I have a pandas DataFrame with a million rows (ids). One of its columns holds a list of tokens in each row, so the column as a whole is a list of lists, e.g.
df = pd.DataFrame({'id': [1, 2, 3, 4], 'token_list': [['a', 'b', 'c'], ['c', 'd'], ['a', 'e', 'f'], ['c', 'f']]})
I want to create a dictionary with all the unique tokens ('a', 'b', 'c', 'd', 'e', 'f', which I already have as a separate list) as keys, and all the ids that each token is associated with as values. For example, {'a': [1, 3], 'b': [1], 'c': [1, 2, 4], ...} and so on.
My problem is that there are 12000 such tokens, and I do not want to use loops to run through each row in the first frame. And `isin` does not seem to work.
Use np.repeat with numpy.concatenate for flattening first, then groupby with list, and finally to_dict:
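a = np.repeat(df['id'], df['token_list'].str.len())  # each id repeated once per token in its list
b = np.concatenate(df['token_list'].values)  # all token lists flattened into one array
d = a.groupby(b).apply(list).to_dict()  # ids collected into a list per token
print(d)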
{'c': [1, 2, 4], 'a': [1, 3], 'b': [1], 'd': [2], 'e': [3], 'f': [3, 4]}
Detail:
print (a)
0 1
0 1
0 1
1 2
1 2
2 3
2 3
2 3
3 4
3 4
Name: id, dtype: int64
print (b)
['a' 'b' 'c' 'c' 'd' 'a' 'e' 'f' 'c' 'f']
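On pandas 0.25 or later, `DataFrame.explode` performs the same flattening that `a` and `b` do above in a single call. This is not part of the original answer, only a sketch of an equivalent route, rebuilt on the sample frame from the question so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'token_list': [['a', 'b', 'c'], ['c', 'd'], ['a', 'e', 'f'], ['c', 'f']],
})

# explode() emits one row per list element and repeats the id next to it
flat = df.explode('token_list')

# collect the ids for each token, then turn the resulting Series into a dict
d = flat.groupby('token_list')['id'].apply(list).to_dict()
print(d)
```

Whether this is faster than the np.repeat version on a million rows would need to be measured on the real data.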
Another option is to expand the lists into columns with apply(pd.Series), stack them, and group:
df.set_index('id')['token_list'].\
apply(pd.Series).stack().reset_index(name='V').\
groupby('V')['id'].apply(list).to_dict()
Out[359]: {'a': [1, 3], 'b': [1], 'c': [1, 2, 4], 'd': [2], 'e': [3], 'f': [3, 4]}
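To check either result on a small sample, a plain-Python baseline can be handy. It does loop over the rows, which the question wants to avoid, so it is meant only as a reference for verifying output, not as a recommended solution. The name `expected` is made up for this sketch:

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'token_list': [['a', 'b', 'c'], ['c', 'd'], ['a', 'e', 'f'], ['c', 'f']],
})

# append each row's id to the list kept for every token in that row
expected = defaultdict(list)
for row_id, tokens in zip(df['id'], df['token_list']):
    for token in tokens:
        expected[token].append(row_id)

expected = dict(expected)
print(expected)
```

Comparing `expected` with the dictionary produced by a vectorised approach on a slice of the real frame is a quick way to confirm the two agree.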