Remove Strings from Column in data frame per row that aren't in a list
Say I have a list of words:
listOfWords = ['Apple','Orange','Banana','Potato']
And my data frame looks like this:
In:
ColumnA:
['Apple','Turnip','Banana','Potato']
['Apple','Orange','Banana','Potato']
['Apple','Orange','Pastry','Potato']
['Melon','Orange','Banana','Potato']
['Apple','Orange','Banana','Sandwich']
I am currently running the following code to retrieve the desired output:
for index, row in df.iterrows():
    for word in df['Column']:
        if word not in listOfWords:
            word.replace(word,"")
Out:
ColumnA:
['Apple','Banana','Potato']
['Apple','Orange','Banana','Potato']
['Apple','Orange','Potato']
['Orange','Banana','Potato']
['Apple','Orange','Banana']
I am currently running this on 12,000 records and a list of length 12,000. It has been running without errors for a few hours; however, I am unsure if this is the most efficient way to do this.
Use a list comprehension in apply, or a nested list comprehension:
df['ColumnA'] = df['ColumnA'].apply(lambda x: [y for y in x if y in listOfWords])
# another solution
# df['ColumnA'] = [[y for y in x if y in listOfWords] for x in df['ColumnA']]
print(df)
ColumnA
0 [Apple, Banana, Potato]
1 [Apple, Orange, Banana, Potato]
2 [Apple, Orange, Potato]
3 [Orange, Banana, Potato]
4 [Apple, Orange, Banana]
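With a word list as long as the 12,000 entries mentioned in the question, `y in listOfWords` scans the list for every word. A minimal sketch of the same order-preserving filter with a set lookup instead, assuming every word is a hashable string (the name `allowed` is introduced here for illustration and is not part of the original answer):

```python
import pandas as pd

listOfWords = ['Apple', 'Orange', 'Banana', 'Potato']
df = pd.DataFrame({'ColumnA': [
    ['Apple', 'Turnip', 'Banana', 'Potato'],
    ['Apple', 'Orange', 'Banana', 'Potato'],
    ['Apple', 'Orange', 'Pastry', 'Potato'],
    ['Melon', 'Orange', 'Banana', 'Potato'],
    ['Apple', 'Orange', 'Banana', 'Sandwich'],
]})

# Build the set once; membership tests against a set take constant time
# on average, whereas `y in listOfWords` walks the list on every test.
allowed = set(listOfWords)  # hypothetical name, used only in this sketch

# Same filter as the nested list comprehension, keeping the order in each row.
df['ColumnA'] = [[y for y in x if y in allowed] for x in df['ColumnA']]
print(df)
```

Unlike the set intersection approach, this keeps duplicates and the original word order inside each row.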
Or, if order is not important, use sets with intersection:
s = set(listOfWords)
df['ColumnA'] = df['ColumnA'].apply(lambda x: list(set(x) & s))
print(df)
ColumnA
0 [Banana, Potato, Apple]
1 [Banana, Potato, Orange, Apple]
2 [Potato, Orange, Apple]
3 [Banana, Potato, Orange]
4 [Banana, Orange, Apple]
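Two side effects of the intersection approach are worth keeping in mind: duplicates within a row collapse to a single entry, and the resulting order is arbitrary. A small sketch on a made-up row (the `row` data and the `rank` helper are assumptions for illustration only), which also shows one possible way to get a reproducible order by sorting on each word's position in `listOfWords`:

```python
listOfWords = ['Apple', 'Orange', 'Banana', 'Potato']
s = set(listOfWords)

# Hypothetical row containing a duplicate and a word not in the list.
row = ['Potato', 'Apple', 'Turnip', 'Apple']

# Intersection keeps each matching word once, in no guaranteed order.
unordered = list(set(row) & s)

# Optional: sort by position in listOfWords for a reproducible order.
rank = {w: i for i, w in enumerate(listOfWords)}
ordered = sorted(set(row) & s, key=rank.__getitem__)
print(ordered)
```

If duplicates or the original per-row order matter, the list comprehension variant is the safer choice.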