Python Pandas DataFrame: collecting the values of a column

Question

I have the following data frame:

      var_1     var_2                         item_list 
0         0         1         [beer, apple, pear, rice]    
1         0         1          [egg, banana, oil, pear]   
2         0         1                    [beer, noodle]    
3         1         0                    [tomato, milk]    
4         1         0                           [apple]

Is it possible to collect all the items in item_list using the DataFrame apply function? The output should be something like [beer, apple, pear, rice, egg, banana, oil, ...], with no duplicates in the list.

Or do I have to iterate cell by cell to collect all the values into one list?

Answer 1

If your DataFrame is df, then you can use

import itertools

itertools.chain.from_iterable(df.item_list)

to create an iterable over all the items. If you do

list(itertools.chain.from_iterable(df.item_list))

then you get a list instead.

Example

>>> import itertools
>>> import pandas as pd
>>> df = pd.DataFrame({'item_list': [[1, 2], [3, 4]]})
>>> list(itertools.chain.from_iterable(df.item_list.values))
[1, 2, 3, 4]
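The question also asks for a list without duplicates, which chaining alone does not give. A minimal sketch, assuming Python 3.7+ (where dict keeps insertion order) and using the data from the question, passes the chained items through dict.fromkeys so that only the first occurrence of each item is kept. The names all_items and unique_items are only illustrative.

```python
import itertools

import pandas as pd

# Data copied from the question; only the item_list column matters here.
df = pd.DataFrame({
    'item_list': [
        ['beer', 'apple', 'pear', 'rice'],
        ['egg', 'banana', 'oil', 'pear'],
        ['beer', 'noodle'],
        ['tomato', 'milk'],
        ['apple'],
    ]
})

# Lazily walk every item of every row's list.
all_items = itertools.chain.from_iterable(df['item_list'])

# dict.fromkeys keeps the first occurrence of each key, in order.
unique_items = list(dict.fromkeys(all_items))
print(unique_items)
```

Unlike set, this keeps the items in the order in which they first appear in the column.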

Answer 2

I think you can apply Series, then stack, then convert the result with tolist:

print(df['item_list'].apply(pd.Series).stack().tolist())
['beer', 'apple', 'pear', 'rice', 'egg', 'banana', 'oil', 
 'pear', 'beer', 'noodle', 'tomato', 'milk', 'apple']

If you need to remove duplicates, use drop_duplicates or set:

print(df['item_list'].apply(pd.Series).stack().drop_duplicates().tolist())
['beer', 'apple', 'pear', 'rice', 'egg', 'banana', 'oil', 'noodle', 'tomato', 'milk']

# A set has no defined order, so the order of your output may differ.
print(list(set(df['item_list'].apply(pd.Series).stack().tolist())))
['tomato', 'oil', 'apple', 'pear', 'milk', 'beer', 'noodle', 'rice', 'egg', 'banana']

EDIT:

If you need to remove duplicates within each row first:

print(df['item_list'].apply(lambda x: pd.Series(list(set(x)))).stack().drop_duplicates().tolist())
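On newer pandas versions (0.25 and later), Series.explode may be a simpler way to get the same flattening without building a wide intermediate frame through apply(pd.Series). This is a different technique from the one in this answer, shown here only as a sketch on a small made-up frame; the names exploded and unique_items are illustrative.

```python
import pandas as pd

# Small made-up frame with a column of lists, in the shape of the question's data.
df = pd.DataFrame({
    'item_list': [['beer', 'apple'], ['apple', 'pear'], ['beer']]
})

# explode() turns each list element into its own row, repeating the index.
exploded = df['item_list'].explode()

# drop_duplicates() keeps the first occurrence of each value, in order.
unique_items = exploded.drop_duplicates().tolist()
print(unique_items)
```

One thing to watch for: a row holding an empty list becomes NaN after explode, so you may want to call dropna() before collecting the values.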

Answer 3

import pprint

l = list(df['item_list'])
flattened_list = [item for sublist in l for item in sublist]
flattened = set(flattened_list)
pprint.pprint(flattened)
{'apple',
 'banana',
 'beer',
 'egg',
 'milk',
 'noodle',
 'oil',
 'pear',
 'rice',
 'tomato'}
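If a list with a predictable order is preferred over a set, the same idea can be written in one step. The sketch below, using the data from the question, builds the union of all rows with set().union and sorts the result; unique_sorted is just an illustrative name.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    'item_list': [
        ['beer', 'apple', 'pear', 'rice'],
        ['egg', 'banana', 'oil', 'pear'],
        ['beer', 'noodle'],
        ['tomato', 'milk'],
        ['apple'],
    ]
})

# set().union accepts any number of iterables, so each row's list
# is unpacked as a separate argument; sorted() returns a list.
unique_sorted = sorted(set().union(*df['item_list']))
print(unique_sorted)
```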

Hope that helps.


Answer 1
2 2016-04-04 19:33:43

Answer 2
2 accepted 2016-04-04 20:13:17

Answer 3
1 2016-04-04 19:35:21
