Keep substrings if substrings are in a list

I have a column that contains strings and a list of strings that I wish to preserve in the column. If a substring is not present in the list, remove it. Note that the resulting strings must not contain double whitespace, or whitespace at the beginning or end.

How can I accomplish this efficiently?

df['column']
>>>
0    good day happy night
1    good bird sad day
2    day over ready

ls = ['good', 'day']

Output:

df['column']
>>>
0    good day
1    good day
2    day

Use Series.str.findall with the entries of ls joined by | (regex OR), then Series.str.join to join the resulting lists with spaces:

ls = ['good', 'day']

df['column'] = df['column'].str.findall('|'.join(ls)).str.join(' ')
print (df)
     column
0  good day
1  good day
2       day
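For reference, here is a self-contained sketch of the approach above. Wrapping each entry in re.escape is an addition that is not in the original answer; it should only matter if ls might contain regex metacharacters such as . or +.

```python
import re

import pandas as pd

df = pd.DataFrame({'column': ['good day happy night',
                              'good bird sad day',
                              'day over ready']})
ls = ['good', 'day']

# Build an alternation pattern. re.escape guards against entries that
# contain regex metacharacters (an assumption beyond the original answer).
pat = '|'.join(re.escape(x) for x in ls)

# findall returns a list of matches per row; str.join glues them with a
# single space, so no leading, trailing, or doubled whitespace remains.
result = df['column'].str.findall(pat).str.join(' ')
print(result.tolist())
```

Because the matches are re-joined with exactly one space, the whitespace requirement from the question is met without a separate strip step.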

If you need to match only whole words, wrap each value in word boundaries (\b):

# 'day' changed to 'daytime' in the first row
print (df)
                     column
0  good daytime happy night 
1         good bird sad day
2            day over ready

ls = ['good', 'day']

pat = '|'.join(r"\b{}\b".format(x) for x in ls)
df['column1'] = df['column'].str.findall(pat).str.join(' ')
df['column2'] = df['column'].str.findall('|'.join(ls)).str.join(' ')
print (df)
                     column   column1   column2
0  good daytime happy night      good  good day
1         good bird sad day  good day  good day
2            day over ready       day       day
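The word-boundary pattern can also be written with one pair of \b around a non-capturing group, which should behave the same as repeating \b for every entry. A minimal sketch, where the re.escape call is again an added precaution and not part of the original answer:

```python
import re

import pandas as pd

df = pd.DataFrame({'column': ['good daytime happy night',
                              'good bird sad day',
                              'day over ready']})
ls = ['good', 'day']

# \b(?:good|day)\b : the group is non-capturing, so findall still
# returns the whole matched word rather than a sub-group.
pat = r'\b(?:{})\b'.format('|'.join(re.escape(x) for x in ls))

whole_words = df['column'].str.findall(pat).str.join(' ')
print(whole_words.tolist())
```

Here 'daytime' is no longer reduced to 'day', because there is no word boundary between 'day' and 'time'.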

Another idea is to use a lambda function that looks up each word in the list converted to a set:

sets = set(ls)
df['column1'] = df['column'].apply(lambda x: ' '.join(y for y in x.split() if y in sets))
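A runnable sketch of this token-filtering idea follows. The sample rows and the helper name keep_tokens are made up for illustration; they are chosen to show that the original order and any repeated words are preserved, and that runs of whitespace collapse because str.split() with no argument splits on any whitespace.

```python
import pandas as pd

# Hypothetical sample data, including a doubled space and a repeated word.
df = pd.DataFrame({'column': ['good daytime happy night',
                              'day  bird good day',
                              'over ready']})
keep = set(['good', 'day'])


def keep_tokens(text):
    """Keep only whitespace-separated tokens that are in `keep`."""
    return ' '.join(tok for tok in text.split() if tok in keep)


filtered = df['column'].apply(keep_tokens)
print(filtered.tolist())
```

Unlike the plain 'good|day' regex, this matches whole tokens only, so 'daytime' is dropped entirely, and a row with no matches becomes an empty string.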

With apply and a lambda function:

df['column'].apply(lambda row: ' '.join(set(row.split()) & set(ls)))

Note that because sets are unordered, the order of the words may change, and repeated words are collapsed into one.

MCVE:

import pandas as pd

df = pd.DataFrame({'column': ['good day happy night', 'good bird sad day', 'day over ready']})
#                  column
# 0  good day happy night
# 1     good bird sad day
# 2        day over ready
ls = ['good', 'day']
ls = set(ls)  # set intersection (&) needs a set on both sides
df['column'].apply(lambda row: ' '.join(list(set(row.split()) & ls)))
# 0    good day
# 1    good day
# 2         day
# Name: column, dtype: object
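If the original word order matters, one possible workaround is to sort the intersection by each word's first position in the row. This is a sketch and not part of the original answer; the helper name intersect_in_order and the reordered second row are illustrative only.

```python
import pandas as pd

# Second row is reordered (hypothetical data) so that order is observable.
df = pd.DataFrame({'column': ['good day happy night',
                              'day bird sad good',
                              'day over ready']})
keep = {'good', 'day'}


def intersect_in_order(text):
    """Intersect with `keep`, then restore first-occurrence order."""
    words = text.split()
    common = set(words) & keep
    # list.index returns the first position of each word in the row.
    return ' '.join(sorted(common, key=words.index))


ordered = df['column'].apply(intersect_in_order)
print(ordered.tolist())
```

Repeated words are still collapsed into one, since the intersection is a set.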
