
Split multi-word strings into individual words for Pandas series containing list of strings

I have a Pandas DataFrame whose column values are lists of strings. Each list may have one or more strings. For strings that have more than one word, I'd like to split them into individual words, so that each list contains only individual words. In the following DataFrame, only the sent_tags column has lists which contain strings of variable length.

DataFrame:

import pandas as pd    
pd.set_option('display.max_colwidth', None)  # -1 is deprecated in newer pandas
df = pd.DataFrame({"fruit_tags": [["'apples'", "'oranges'", "'pears'"], ["'melons'", "'peaches'", "'kiwis'"]], "sent_tags":[["'apples'", "'sweeter than oranges'", "'pears sweeter than apples'"], ["'melons'", "'sweeter than peaches'", "'kiwis sweeter than melons'"]]})
print(df)  

    fruit_tags                        sent_tags
0   ['apples', 'oranges', 'pears']  ['apples', 'sweeter than oranges', 'pears sweeter than apples']
1   ['melons', 'peaches', 'kiwis']  ['melons', 'sweeter than peaches', 'kiwis sweeter than melons']

My attempt:

I decided to use word_tokenize from the NLTK library to break such strings into individual words. I do get the tokenized words for a particular selection inside the list, but I cannot combine them back into a single list per row:

from nltk.tokenize import word_tokenize
df['sent_tags'].str[1].str.strip("'").apply(lambda x:word_tokenize(x.lower()))
#Output
0    [sweeter, than, oranges]
1    [sweeter, than, peaches]
Name: sent_tags, dtype: object
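The per-element call above can be extended over the whole list by flattening inside the apply. A minimal sketch, substituting the built-in str.split for word_tokenize so it runs without NLTK's tokenizer data:

```python
import pandas as pd

df = pd.DataFrame({"sent_tags": [["'apples'", "'sweeter than oranges'", "'pears sweeter than apples'"],
                                 ["'melons'", "'sweeter than peaches'", "'kiwis sweeter than melons'"]]})

# apply the same strip/lower/tokenize pipeline to every string in each list
# and flatten the nested results into one list per row; str.split stands in
# for word_tokenize here, which is enough for whitespace-separated words
out = df['sent_tags'].apply(
    lambda lst: [w for s in lst for w in s.strip("'").lower().split()]
)
print(out[0])  # ['apples', 'sweeter', 'than', 'oranges', 'pears', 'sweeter', 'than', 'apples']
```

If punctuation-aware tokenization matters, word_tokenize(s.strip("'").lower()) can replace the .split() call inside the comprehension.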

Desired result:

    fruit_tags                        sent_tags
0   ['apples', 'oranges', 'pears']  ['apples', 'sweeter', 'than', 'oranges', 'pears', 'sweeter', 'than', 'apples']
1   ['melons', 'peaches', 'kiwis']  ['melons', 'sweeter', 'than', 'peaches', 'kiwis', 'sweeter', 'than', 'melons']

Use a list comprehension with flattening, combined with the string methods strip, lower and split:

s = df['sent_tags'].apply(lambda x: [z for y in x for z in y.strip("'").lower().split()])

Or:

s = [[z for y in x for z in y.strip("'").lower().split()] for x in df['sent_tags']]

df['sent_tags'] = s

print(df) 
                       fruit_tags  \
0  ['apples', 'oranges', 'pears']   
1  ['melons', 'peaches', 'kiwis']   

                                                        sent_tags  
0  [apples, sweeter, than, oranges, pears, sweeter, than, apples]  
1  [melons, sweeter, than, peaches, kiwis, sweeter, than, melons]  
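An alternative sketch without a nested comprehension, using Series.explode (available in pandas 0.25+) to flatten, the vectorized str accessors to clean and split, and a groupby on the original index to rebuild the lists:

```python
import pandas as pd

df = pd.DataFrame({"sent_tags": [["'apples'", "'sweeter than oranges'"],
                                 ["'melons'", "'sweeter than peaches'"]]})

# explode the lists to one string per row, clean and split each string,
# explode again to one word per row, then regroup by the original index
flat = (df['sent_tags']
        .explode()
        .str.strip("'")
        .str.lower()
        .str.split()
        .explode()
        .groupby(level=0)
        .agg(list))
print(flat[0])  # ['apples', 'sweeter', 'than', 'oranges']
```

The comprehension is typically faster for small frames, but the explode chain keeps every step as a vectorized pandas operation.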

Another possible approach could be:

df['sent_tags'].apply(lambda x: [item for elem in [y.split() for y in x] for item in elem])
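The nested comprehension there can also be written with itertools.chain.from_iterable. Note that, unlike the accepted approach, this variant only splits: without strip and lower, the surrounding quote characters stay attached to the first and last fragment of each string:

```python
from itertools import chain
import pandas as pd

df = pd.DataFrame({"sent_tags": [["'apples'", "'sweeter than oranges'"]]})

# chain.from_iterable flattens the per-string split results; since nothing
# strips the quotes, they remain on the outer fragments
out = df['sent_tags'].apply(lambda x: list(chain.from_iterable(y.split() for y in x)))
print(out[0])  # ["'apples'", "'sweeter", "than", "oranges'"]
```

Adding .strip("'").lower() before .split() inside the generator would match the desired output exactly.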
