
Remove all the words except those in a list

I have a pandas dataframe like the one below. It contains sentences of words, and I have another list called vocab. I want to remove all the words from each sentence except those that are in the vocab list.

Example df:

                                 sentence
0  packag come differ what about tomorrow
1        Hello dear truth is hard to tell

Example vocab:

['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']

Expected O/P:

                                   sentence                  res
0   packag come differ what about tomorrow     packag differ tomorrow
1         Hello dear truth is hard to tell    dear truth hard tell

I first tried to use .str.replace to remove all the important words from sentence and store the result into t1. Then I do the same thing again with t1 and sentence, so that I'll get my expected output. But it's not working as I expected.

My attempt:

import pandas as pd

vocab_lis = ['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']
# pattern of space-padded vocab words, e.g. ' packag | differ | ... '
vocab_regex = ' ' + ' | '.join(vocab_lis) + ' '
df = pd.DataFrame()
s = pd.Series(["packag come differ what about tomorrow", "Hello dear truth is hard to tell"])
df['sentence'] = s
# pad each sentence with spaces so boundary words can match too
df['sentence'] = ' ' + df['sentence'] + ' '

# t1: the sentence with the vocab words removed (only the non-vocab words remain)
df['t1'] = df['sentence'].str.replace(vocab_regex, ' ', regex=True)
# t2: remove the t1 words from sentence, hoping to be left with only vocab words
df['t2'] = df.apply(lambda x: pd.Series(x['sentence']).str.replace(' | '.join(x['t1'].split()), ' ', regex=True), axis=1)

Is there any simple way to achieve my above task? I know that my code is not working because of spaces. How to solve this?
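For reference, the space-padded pattern breaks whenever two vocab words are adjacent: the single space between them is consumed by the first match, so the second word no longer has a leading space and is skipped. A minimal illustration with Python's re module (not part of the original attempt):

import re

vocab_regex = ' ' + ' | '.join(['dear', 'truth']) + ' '
padded = ' Hello dear truth is hard to tell '
# ' dear ' matches and consumes the space before 'truth',
# so ' truth ' can no longer match afterwards
print(re.sub(vocab_regex, ' ', padded))  # ' Hello truth is hard to tell '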

Use a nested list comprehension with split by whitespace:

df['res'] = [' '.join(y for y in x.split() if y in vocab_lis) for x in df['sentence']]
print (df)
                                 sentence                     res
0  packag come differ what about tomorrow  packag differ tomorrow
1        Hello dear truth is hard to tell    dear truth hard tell
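If the vocab list is large, checking membership against a set is faster than against a list; a minimal variant of the same comprehension (the set conversion is an addition, not part of the original answer):

vocab_set = set(vocab_lis)
df['res'] = [' '.join(y for y in x.split() if y in vocab_set) for x in df['sentence']]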

And if you want to keep your original .str.replace approach, build the pattern with \b word boundaries instead of spaces:

vocab_regex = '|'.join(r"\b{}\b".format(x) for x in vocab_lis)
df['t1'] = df['sentence'].str.replace(vocab_regex, '', regex=True)
print (df)
                                 sentence                  t1
0  packag come differ what about tomorrow   come  what about 
1        Hello dear truth is hard to tell     Hello   is  to
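As a side note, the same word-boundary pattern can also extract the kept words directly instead of removing the unwanted ones; a small sketch (this findall/join variant is an addition, not from the original answer):

# collect only the vocab words, in order of appearance, then rejoin them
df['res'] = df['sentence'].str.findall(vocab_regex).str.join(' ')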

using np.array

data

                                   sentence
0    packag come differ what about tomorrow
1          Hello dear truth is hard to tell

Vocab

v = ['packag', 'differ', 'tomorrow', 'dear', 'truth', 'hard', 'tell']

First split the sentence to make a list, then use np.in1d to check for common elements between the two lists. Then just join the list back into a string.

import numpy as np

data['res'] = data['sentence'].apply(lambda x: ' '.join(np.array(x.split(' '))[np.in1d(x.split(' '), v)]))

Output

                                   sentence                     res
0    packag come differ what about tomorrow  packag differ tomorrow
1          Hello dear truth is hard to tell    dear truth hard tell
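Note that newer NumPy releases document np.isin as the replacement for np.in1d; assuming NumPy 1.13 or later, the equivalent call would look like:

data['res'] = data['sentence'].apply(lambda x: ' '.join(np.array(x.split(' '))[np.isin(x.split(' '), v)]))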
