Creating lists of text from rows in a dataframe based on column category

I have a dataframe of categories and text strings:

category    strings
pets        leash cat dog
pets        cat dog frog
candy       chocolate frog
candy       jelly beans lollipops

I would like two lists:

petlist = ['leash', 'cat', 'dog', 'cat', 'dog', 'frog']
candylist = ['chocolate', 'frog', 'jelly', 'beans', 'lollipops']

The following code makes one list of all of the words from the strings column:

all_words = df['strings'].str.cat(sep=' ').split()

How can I split this up into two lists based on the category and put the two lists in a dictionary?

Here is what I tried:

all_words = {}
for cata in df['category']:
    all_words['wordlist_%s'% cata]=[]
for cata in df['category']:
    df_cata = df.loc[df['category'] == cata]
    all_words['wordlist_%s'% cata].append(df_cata['strings'].str.cat(sep=' ').split())

It has the correct keys, but each key gives me the words from the first row of that category over and over. So I've got a dictionary with one list that says leash cat dog leash cat dog and another list that says chocolate frog chocolate frog (so it's clearly starting over in a way that I don't want it to).
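One likely cause of the repetition: df['category'] holds one entry per row, so the loop body runs once per row rather than once per category, and append adds a whole word list as a single nested element each time. A minimal sketch of the same loop with those two points changed is below. The sample dataframe is rebuilt from the table in the question; iterating over unique() and assigning instead of appending are my adjustments, not part of the original attempt.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "category": ["pets", "pets", "candy", "candy"],
    "strings": ["leash cat dog", "cat dog frog",
                "chocolate frog", "jelly beans lollipops"],
})

all_words = {}
# Visit each category once instead of once per row.
for cata in df["category"].unique():
    df_cata = df.loc[df["category"] == cata]
    # Assign the flat word list directly; append would nest it inside another list.
    all_words["wordlist_%s" % cata] = df_cata["strings"].str.cat(sep=" ").split()

print(all_words)
```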

You can set the index first, then split the strings, group on the index, concatenate all the lists with sum, and make a dict out of the result.

df.set_index('category').strings.str.split().groupby(level='category').sum().to_dict()

Output

{'candy': ['chocolate', 'frog', 'jelly', 'beans', 'lollipops'],
 'pets': ['leash', 'cat', 'dog', 'cat', 'dog', 'frog']}
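To make the "concatenate the lists with sum" step explicit, here is a self-contained sketch of the same pipeline. It swaps the groupby .sum() for .agg with Python's built-in sum(lists, []), which joins lists by repeated +. The sample dataframe is rebuilt from the question, and the name words_by_category is mine.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "category": ["pets", "pets", "candy", "candy"],
    "strings": ["leash cat dog", "cat dog frog",
                "chocolate frog", "jelly beans lollipops"],
})

# Each row becomes a list of words, indexed by its category.
split_rows = df.set_index("category")["strings"].str.split()

# sum(lists, []) starts from an empty list and adds each row's list to it,
# which flattens all rows of a group into one list.
words_by_category = (
    split_rows.groupby(level="category")
              .agg(lambda lists: sum(lists, []))
              .to_dict()
)

print(words_by_category)
```

Repeated list addition copies the accumulated list at every step, so for large groups something like itertools.chain may be a better fit.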

This should do it:

df.groupby('category').strings.apply(' '.join).str.split()

category
candy    [chocolate, frog, jelly, beans, lollipops]
pets              [leash, cat, dog, cat, dog, frog]
Name: strings, dtype: object
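The result above is a Series, while the question asks for a dictionary. A small sketch, assuming the sample data from the question, that appends .to_dict() to the same expression (the name word_lists is mine):

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "category": ["pets", "pets", "candy", "candy"],
    "strings": ["leash cat dog", "cat dog frog",
                "chocolate frog", "jelly beans lollipops"],
})

# Join each category's rows into one string, split it into words,
# then convert the resulting Series into a plain dict keyed by category.
word_lists = (
    df.groupby("category")["strings"]
      .apply(" ".join)
      .str.split()
      .to_dict()
)

petlist = word_lists["pets"]
candylist = word_lists["candy"]
```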

Extra credit
Get a unique list (requires import numpy as np):

df.groupby('category').strings.apply(' '.join).str.split().apply(np.unique)

category
candy    [beans, chocolate, frog, jelly, lollipops]
pets                        [cat, dog, frog, leash]
Name: strings, dtype: object
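Note that np.unique returns the words sorted alphabetically, as the output shows. If first-appearance order matters, one possible variant (my suggestion, not part of the answer) is to deduplicate with dict.fromkeys, which keeps insertion order:

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "category": ["pets", "pets", "candy", "candy"],
    "strings": ["leash cat dog", "cat dog frog",
                "chocolate frog", "jelly beans lollipops"],
})

words = df.groupby("category")["strings"].apply(" ".join).str.split()

# dict.fromkeys keeps the first occurrence of each word, in order of appearance.
unique_in_order = words.apply(lambda ws: list(dict.fromkeys(ws)))

print(unique_in_order)
```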

Overachiever
value_counts, because I think it's interesting:

df.groupby('category').strings.apply(' '.join).str.split(expand=True) \
    .stack().groupby(level=0).apply(pd.value_counts)

 category           
candy     jelly        1
          frog         1
          lollipops    1
          beans        1
          chocolate    1
pets      cat          2
          dog          2
          leash        1
          frog         1
dtype: int64
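Newer pandas releases emit a deprecation warning for the top-level pd.value_counts. As a hedged alternative that avoids it, the same per-category counts can be sketched with collections.Counter from the standard library. The sample data is rebuilt from the question, and the name counts is mine.

```python
from collections import Counter

import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "category": ["pets", "pets", "candy", "candy"],
    "strings": ["leash cat dog", "cat dog frog",
                "chocolate frog", "jelly beans lollipops"],
})

# Iterating over a grouped Series yields (category, Series of strings) pairs.
counts = {
    category: Counter(" ".join(group).split())
    for category, group in df.groupby("category")["strings"]
}

print(counts["pets"].most_common())
```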
