Creating lists of text from rows in a dataframe based on column category
I have a dataframe of categories and text strings:
category  strings
pets      leash cat dog
pets      cat dog frog
candy     chocolate frog
candy     jelly beans lollipops
I would like 2 lists:
petlist = ['leash', 'cat', 'dog', 'cat', 'dog', 'frog']
candylist = ['chocolate', 'frog', 'jelly', 'beans', 'lollipops']
The following code makes one list of all of the words in the strings column:
all_words = df['strings'].str.cat(sep=' ').split()
How can I split this up into 2 lists based on the category and put the 2 lists in a dictionary?
Here is what I tried:
all_words = {}
for cata in df['category']:
    all_words['wordlist_%s'% cata]=[]
for cata in df['category']:
    df_cata = df.loc[df['category'] == cata]
    all_words['wordlist_%s'% cata].append(df_cata['strings'].str.cat(sep=' ').split())
It has the correct keys, but each key gives me the words from the first row of that category over and over. So I've got a dictionary with one list that says leash cat dog leash cat dog and another list that says chocolate frog chocolate frog, so it's clearly starting over in a way that I don't want it to.
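A note on the attempt above: the second loop runs once per row, not once per category, and `append` adds the whole word list as a single nested element each time, which would explain the repetition. A minimal sketch of a corrected loop, assuming the dataframe looks like the table in the question (the `df` construction below is reconstructed from that table, not taken from the original post):

```python
import pandas as pd

# Sample data reconstructed from the table in the question (an assumption).
df = pd.DataFrame({
    "category": ["pets", "pets", "candy", "candy"],
    "strings": ["leash cat dog", "cat dog frog",
                "chocolate frog", "jelly beans lollipops"],
})

all_words = {}
# Visit each category once instead of once per row.
for cata in df["category"].unique():
    df_cata = df.loc[df["category"] == cata]
    # Assign the flat word list directly rather than appending it
    # as a nested list inside another list.
    all_words["wordlist_%s" % cata] = df_cata["strings"].str.cat(sep=" ").split()

print(all_words)
```

The keys keep the `wordlist_` prefix used in the question; dropping the prefix gives plain category names as keys.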
You can set the index first, then split, then group on the index, concatenate all the lists with sum, and make a dict out of the result.
df.set_index('category').strings.str.split().groupby(level='category').sum().to_dict()
Output
{'candy': ['chocolate', 'frog', 'jelly', 'beans', 'lollipops'],
'pets': ['leash', 'cat', 'dog', 'cat', 'dog', 'frog']}
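Concatenating lists with `sum` builds a new list at every step, which can get slow when a category has many rows. One alternative, sketched here under the assumption that your pandas version provides `Series.explode` (0.25 or later), is to explode the split words into one row each and collect them per category. The `df` below is reconstructed from the table in the question, and the `word` column name is just an illustrative choice:

```python
import pandas as pd

# Sample data reconstructed from the table in the question (an assumption).
df = pd.DataFrame({
    "category": ["pets", "pets", "candy", "candy"],
    "strings": ["leash cat dog", "cat dog frog",
                "chocolate frog", "jelly beans lollipops"],
})

words_by_category = (
    df.assign(word=df["strings"].str.split())  # one list of words per row
      .explode("word")                         # one word per row
      .groupby("category")["word"]
      .agg(list)                               # collect words per category
      .to_dict()
)

print(words_by_category)
```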
This should do it:
df.groupby('category').strings.apply(' '.join).str.split()
category
candy    [chocolate, frog, jelly, beans, lollipops]
pets              [leash, cat, dog, cat, dog, frog]
Name: strings, dtype: object
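The result above is a Series indexed by category, while the question asks for a dictionary. A small sketch of the same approach with `.to_dict()` appended, assuming the sample data from the question (the `df` construction is reconstructed from its table):

```python
import pandas as pd

# Sample data reconstructed from the table in the question (an assumption).
df = pd.DataFrame({
    "category": ["pets", "pets", "candy", "candy"],
    "strings": ["leash cat dog", "cat dog frog",
                "chocolate frog", "jelly beans lollipops"],
})

# Join each category's strings, split into words, then convert to a dict.
word_lists = (
    df.groupby("category")["strings"]
      .apply(" ".join)
      .str.split()
      .to_dict()
)

petlist = word_lists["pets"]
candylist = word_lists["candy"]
print(petlist, candylist)
```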
Extra credit: get a list of unique words
df.groupby('category').strings.apply(' '.join).str.split().apply(np.unique)
category
candy    [beans, chocolate, frog, jelly, lollipops]
pets                        [cat, dog, frog, leash]
Name: strings, dtype: object
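Note that `np.unique` returns the words sorted, as the output above shows. If the order of first appearance matters, one option is `dict.fromkeys`, which keeps insertion order in Python 3.7+. This is a sketch on the question's sample data (the `df` construction is reconstructed from its table), not part of the original answer:

```python
import pandas as pd

# Sample data reconstructed from the table in the question (an assumption).
df = pd.DataFrame({
    "category": ["pets", "pets", "candy", "candy"],
    "strings": ["leash cat dog", "cat dog frog",
                "chocolate frog", "jelly beans lollipops"],
})

words = df.groupby("category")["strings"].apply(" ".join).str.split()

# dict.fromkeys drops duplicates while keeping first-seen order.
unique_in_order = words.apply(lambda ws: list(dict.fromkeys(ws)))

print(unique_in_order.to_dict())
```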
Overachiever: value_counts, because I think it's interesting
df.groupby('category').strings.apply(' '.join).str.split(expand=True) \
.stack().groupby(level=0).apply(pd.value_counts)
category
candy     jelly        1
          frog         1
          lollipops    1
          beans        1
          chocolate    1
pets      cat          2
          dog          2
          leash        1
          frog         1
dtype: int64
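If plain dictionaries of counts are enough, the same tallies can be built with `collections.Counter` from the standard library. This also avoids the top-level `pd.value_counts` helper, which I believe is deprecated in recent pandas releases (treat that as an assumption and check your version). The `df` below is reconstructed from the table in the question:

```python
from collections import Counter

import pandas as pd

# Sample data reconstructed from the table in the question (an assumption).
df = pd.DataFrame({
    "category": ["pets", "pets", "candy", "candy"],
    "strings": ["leash cat dog", "cat dog frog",
                "chocolate frog", "jelly beans lollipops"],
})

# Iterating a grouped Series yields (category, Series of strings) pairs;
# join each group's strings, split into words, and count them.
counts = {
    category: Counter(" ".join(group).split())
    for category, group in df.groupby("category")["strings"]
}

print(counts["pets"].most_common())
```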