Python Pandas dataframe: Collect values of a column
I have the following data frame:
var_1 var_2 item_list
0 0 1 [beer, apple, pear, rice]
1 0 1 [egg, banana, oil, pear]
2 0 1 [beer, noodle]
3 1 0 [tomato, milk]
4 1 0 [apple]
Is it possible to collect all items in item_list using the DataFrame apply function? The output should be something like [beer, apple, pear, rice, egg, banana, oil, ...], without duplicates in the list.
Or do I have to iterate cell by cell to collect all the values into one list?
If your DataFrame is df, then you can use
import itertools
itertools.chain.from_iterable(df.item_list)
to create an iterable of all the items. If you do
list(itertools.chain.from_iterable(df.item_list))
then it will become a list.
Example
>>> import itertools
>>> import pandas as pd
>>> df = pd.DataFrame({'item_list': [[1, 2], [3, 4]]})
>>> list(itertools.chain.from_iterable(df.item_list.values))
[1, 2, 3, 4]
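The question also asks for a list without duplicates, which the chained list alone does not give. One way to get it is to pass the chained items through dict.fromkeys, which keeps the first occurrence of each item. A minimal sketch, assuming the data frame from the question is rebuilt as df and that the items are hashable; the names all_items and unique_items are illustrative only:

```python
import itertools

import pandas as pd

# Rebuild the data frame from the question (values copied from the post).
df = pd.DataFrame({
    'var_1': [0, 0, 0, 1, 1],
    'var_2': [1, 1, 1, 0, 0],
    'item_list': [
        ['beer', 'apple', 'pear', 'rice'],
        ['egg', 'banana', 'oil', 'pear'],
        ['beer', 'noodle'],
        ['tomato', 'milk'],
        ['apple'],
    ],
})

# Chain all per-row lists into one stream of items.
all_items = itertools.chain.from_iterable(df['item_list'])

# dict.fromkeys keeps the first occurrence of each key in insertion order
# (guaranteed for dicts since Python 3.7), so duplicates are dropped
# without reordering the items.
unique_items = list(dict.fromkeys(all_items))
print(unique_items)
```

Unlike list(set(...)), this keeps the items in the order they first appear in the column.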
I think you can apply Series, then stack, and convert the result with tolist:
print(df['item_list'].apply(pd.Series).stack().tolist())
['beer', 'apple', 'pear', 'rice', 'egg', 'banana', 'oil',
'pear', 'beer', 'noodle', 'tomato', 'milk', 'apple']
If you need to remove duplicates, use drop_duplicates or set:
print(df['item_list'].apply(pd.Series).stack().drop_duplicates().tolist())
['beer', 'apple', 'pear', 'rice', 'egg', 'banana', 'oil', 'noodle', 'tomato', 'milk']
print(list(set(df['item_list'].apply(pd.Series).stack().tolist())))
['tomato', 'oil', 'apple', 'pear', 'milk', 'beer', 'noodle', 'rice', 'egg', 'banana']
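On newer pandas versions (0.25 and later), Series.explode may be a simpler route to the same result, since it avoids building a wide intermediate frame with apply(pd.Series). This is not part of the original answer, only a hedged alternative sketch; the small df below is made-up sample data and the names exploded and unique_items are illustrative:

```python
import pandas as pd

# Small made-up sample with the same shape as the question's item_list column.
df = pd.DataFrame({
    'item_list': [['beer', 'apple'], ['apple', 'pear'], ['beer']],
})

# explode() gives every list element its own row (requires pandas >= 0.25).
exploded = df['item_list'].explode()

# unique() keeps values in order of first appearance; tolist() converts
# the resulting array to a plain Python list.
unique_items = exploded.unique().tolist()
print(unique_items)
```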
EDIT:
If you need to remove duplicates within each row first:
print(df['item_list'].apply(lambda x: pd.Series(list(set(x)))).stack().drop_duplicates().tolist())
import pprint
l = list(df['item_list'])  # one list per row
# Flatten the nested lists, then deduplicate with a set
flattened_list = [item for sublist in l for item in sublist]
flattened = set(flattened_list)
pprint.pprint(flattened)
{'apple',
'banana',
'beer',
'egg',
'milk',
'noodle',
'oil',
'pear',
'rice',
'tomato'}
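If a list is wanted rather than a set, the same idea can be written with set().union and sorted. A small sketch, assuming the data frame from the question is rebuilt as df; the names unique_items and sorted_items are illustrative:

```python
import pandas as pd

# Rebuild the item_list column from the question.
df = pd.DataFrame({
    'item_list': [
        ['beer', 'apple', 'pear', 'rice'],
        ['egg', 'banana', 'oil', 'pear'],
        ['beer', 'noodle'],
        ['tomato', 'milk'],
        ['apple'],
    ],
})

# set().union accepts any number of iterables and merges them into one set,
# so unpacking the column passes each row's list as a separate argument.
unique_items = set().union(*df['item_list'])

# A set has no defined order; sorted() returns a list in alphabetical order.
sorted_items = sorted(unique_items)
print(sorted_items)
```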
Hope that helps.