Create lists of text from rows in a dataframe based on column category

Question

I have a dataframe of categories and text strings:

category    strings

pets        leash cat dog
pets        cat dog frog
candy       chocolate frog
candy       jelly beans lollipops
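For anyone who wants to reproduce this, the frame above can be rebuilt roughly as follows. This is a sketch that assumes both columns hold plain strings and that the frame is named df, as in the code further down.

```python
import pandas as pd

# Rebuild the example frame from the question (column names taken from the table above)
df = pd.DataFrame({
    "category": ["pets", "pets", "candy", "candy"],
    "strings": [
        "leash cat dog",
        "cat dog frog",
        "chocolate frog",
        "jelly beans lollipops",
    ],
})

print(df)
```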

I would like 2 lists:

petlist = ['leash', 'cat', 'dog', 'cat', 'dog', 'frog']
candylist = ['chocolate', 'frog', 'jelly', 'beans', 'lollipops']

The following code makes one list of all of the words from the strings column:

all_words = df['strings'].str.cat(sep=' ').split()

How can I split this up into 2 lists based on the category and put the 2 lists in a dictionary?

Here is what I tried:

all_words = {}
for cata in df['category']:
    all_words['wordlist_%s'% cata]=[]
for cata in df['category']:
    df_cata = df.loc[df['category'] == cata]
    all_words['wordlist_%s'% cata].append(df_cata['strings'].str.cat(sep=' ').split())

It has the correct keys, but each key gives me the words from the first row of that category over and over. So I've got a dictionary with one list that says leash cat dog leash cat dog and another list that says chocolate frog chocolate frog, so it's clearly starting over in a way that I don't want it to.
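A likely cause of the repetition: the second loop runs once per row, so each category is processed as many times as it has rows, and append() adds the whole word list as a single nested element on every pass. A minimal sketch of the same loop with those two points changed (iterate over the distinct categories, assign instead of append), assuming the example frame from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["pets", "pets", "candy", "candy"],
    "strings": ["leash cat dog", "cat dog frog",
                "chocolate frog", "jelly beans lollipops"],
})

all_words = {}
# Visit each distinct category once instead of once per row
for cata in df["category"].unique():
    df_cata = df.loc[df["category"] == cata]
    # Assign the flat word list directly; append() would nest it inside another list
    all_words["wordlist_%s" % cata] = df_cata["strings"].str.cat(sep=" ").split()

print(all_words)
```

This keeps the original key naming (wordlist_pets, wordlist_candy); the answers below do the same grouping without an explicit loop.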

Answer 1

You can set the index first, then split, then group on the index, concatenate all the lists with sum, and make a dict out of the result.

df.set_index('category').strings.str.split().groupby(level='category').sum().to_dict()

Output

{'candy': ['chocolate', 'frog', 'jelly', 'beans', 'lollipops'],
 'pets': ['leash', 'cat', 'dog', 'cat', 'dog', 'frog']}
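Summing lists builds a new list at every step, which may get slow on large frames. As an alternative sketch (not the method from this answer), the words can be exploded into one row each and then collected per category. This assumes pandas 0.25 or later for explode; the column name word is made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["pets", "pets", "candy", "candy"],
    "strings": ["leash cat dog", "cat dog frog",
                "chocolate frog", "jelly beans lollipops"],
})

# One row per word; "word" is a hypothetical helper column
words = df.assign(word=df["strings"].str.split()).explode("word")

# Collect the words of each category back into a plain list, then into a dict
result = words.groupby("category")["word"].agg(list).to_dict()

print(result)
```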

Answer 2

This should do it:

df.groupby('category').strings.apply(' '.join).str.split()

category
candy    [chocolate, frog, jelly, beans, lollipops]
pets              [leash, cat, dog, cat, dog, frog]
Name: strings, dtype: object
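The result above is a Series, while the question asked for a dictionary. A small sketch of the extra step, assuming the example frame from the question; the names wordlists, petlist and candylist are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["pets", "pets", "candy", "candy"],
    "strings": ["leash cat dog", "cat dog frog",
                "chocolate frog", "jelly beans lollipops"],
})

# Join each category's strings, split into words, then turn the Series into a dict
wordlists = df.groupby("category")["strings"].apply(" ".join).str.split().to_dict()

petlist = wordlists["pets"]
candylist = wordlists["candy"]

print(petlist)
print(candylist)
```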

Extra credit: get a unique list per category (np.unique needs NumPy imported as np and returns the words sorted):

df.groupby('category').strings.apply(' '.join).str.split().apply(np.unique)

category
candy    [beans, chocolate, frog, jelly, lollipops]
pets                        [cat, dog, frog, leash]
Name: strings, dtype: object
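Since np.unique sorts its result, the original word order is lost. If first-seen order matters, one possible sketch is to deduplicate with dict.fromkeys instead; this is a swapped-in technique, not part of the answer, and it assumes the example frame from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["pets", "pets", "candy", "candy"],
    "strings": ["leash cat dog", "cat dog frog",
                "chocolate frog", "jelly beans lollipops"],
})

per_category = df.groupby("category")["strings"].apply(" ".join).str.split()

# dict.fromkeys keeps the first occurrence of each word in insertion order
unique_in_order = per_category.apply(lambda ws: list(dict.fromkeys(ws)))

print(unique_in_order.to_dict())
```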

Overachiever: value_counts, because I think it's interesting:

df.groupby('category').strings.apply(' '.join).str.split(expand=True) \
    .stack().groupby(level=0).apply(pd.value_counts)

 category           
candy     jelly        1
          frog         1
          lollipops    1
          beans        1
          chocolate    1
pets      cat          2
          dog          2
          leash        1
          frog         1
dtype: int64
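As far as I know, newer pandas releases mark the top-level pd.value_counts as deprecated, so here is a sketch of the same count using the groupby value_counts method instead. It assumes pandas 0.25 or later for explode, and the column name word is made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["pets", "pets", "candy", "candy"],
    "strings": ["leash cat dog", "cat dog frog",
                "chocolate frog", "jelly beans lollipops"],
})

# One row per word; "word" is a hypothetical helper column
words = df.assign(word=df["strings"].str.split()).explode("word")

# Count how often each word occurs within each category
counts = words.groupby("category")["word"].value_counts()

print(counts)
```

The result is indexed by (category, word), so a single count can be read with counts.loc[("pets", "cat")].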

Answer 1
3 2016-12-21 23:10:20

Answer 2
2 accepted 2016-12-21 23:08:40
