Keep substrings if substrings are in a list
I have a column of strings and a list of strings that I want to keep in the column. Any substring that is not in the list should be removed.
The resulting strings must not contain double whitespace, or whitespace at the beginning or end.
How can I accomplish this efficiently?
df['column']
>>>
0 good day happy night
1 good bird sad day
2 day over ready
ls = ['good', 'day']
Output:
df['column']
>>>
0 good day
1 good day
2 day
Use Series.str.findall with a pattern built by joining ls with | (regex OR), then Series.str.join to join the resulting lists with a single space:
ls = ['good', 'day']
df['column'] = df['column'].str.findall('|'.join(ls)).str.join(' ')
print (df)
column
0 good day
1 good day
2 day
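The pattern above joins the raw list items with |, so it works as long as none of them contains regex metacharacters. A minimal sketch, assuming the items might contain such characters, escapes each one with re.escape first. The helper name keep_listed is hypothetical and not part of the original answer.

```python
import re
import pandas as pd

def keep_listed(series, words):
    # Hypothetical helper: escape each word so characters such as '+' or '.'
    # are matched literally rather than as regex operators.
    pattern = '|'.join(re.escape(w) for w in words)
    # findall returns a list of matches per row; str.join glues them
    # together with single spaces, so no extra whitespace is produced.
    return series.str.findall(pattern).str.join(' ')

df = pd.DataFrame({'column': ['good day happy night', 'good bird sad day', 'day over ready']})
result = keep_listed(df['column'], ['good', 'day'])
print(result.tolist())
# ['good day', 'good day', 'day']
```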
If you need to match only whole words, wrap each value in word boundaries (\b):
# 'day' in the first row changed to 'daytime'
print (df)
column
0 good daytime happy night
1 good bird sad day
2 day over ready
ls = ['good', 'day']
pat = '|'.join(r"\b{}\b".format(x) for x in ls)
df['column1'] = df['column'].str.findall(pat).str.join(' ')
df['column2'] = df['column'].str.findall('|'.join(ls)).str.join(' ')
print (df)
                     column   column1   column2
0  good daytime happy night      good  good day
1         good bird sad day  good day  good day
2            day over ready       day       day
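The word-boundary pattern can also be written as one non-capturing group inside a single pair of \b anchors, which should match the same words. This is only a sketch of that variant. The re.escape call is an added precaution and an assumption, not something the answer uses.

```python
import re
import pandas as pd

df = pd.DataFrame({'column': ['good daytime happy night', 'good bird sad day', 'day over ready']})
ls = ['good', 'day']

# One non-capturing group (?:...) wrapped in a single pair of word boundaries.
# A capturing group is avoided so that findall returns the whole match.
pat = r'\b(?:' + '|'.join(re.escape(x) for x in ls) + r')\b'
whole_words = df['column'].str.findall(pat).str.join(' ')
print(whole_words.tolist())
# ['good', 'good day', 'day']
```

Here 'daytime' no longer contributes 'day', because the boundary after 'day' fails when a letter follows.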
Another idea is to use a lambda function that looks up each word in the list converted to a set:
sets = set(ls)
df['column1'] = df['column'].apply(lambda x: ' '.join(y for y in x.split() if y in sets))
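As a self-contained sketch of the set-lookup idea, the snippet below uses slightly different sample rows (an assumption for illustration) to show that word order and repeated words are preserved, and that a row with no listed words becomes an empty string. The names keep and filter_words are hypothetical.

```python
import pandas as pd

# Sample rows chosen for illustration; they differ from the question's data.
df = pd.DataFrame({'column': ['good day happy night', 'day bird  good day ', 'over ready']})
keep = {'good', 'day'}  # set membership tests are O(1) on average

def filter_words(text):
    # str.split() with no argument splits on any run of whitespace and drops
    # leading/trailing whitespace, so the result has no stray spaces.
    return ' '.join(word for word in text.split() if word in keep)

filtered = df['column'].apply(filter_words)
print(filtered.tolist())
# ['good day', 'day good day', '']
```

Unlike the regex approach, this compares whole whitespace-separated tokens, so 'daytime' would not match 'day'.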
With apply and a lambda function (both operands of & must be sets, so ls is converted with set()):
df['column'].apply(lambda row: ' '.join(set(row.split()) & set(ls)))
Note that because sets are unordered, the order of the words might change, and repeated words are kept only once.
MCVE:
import pandas as pd
df = pd.DataFrame({'column':['good day happy night', 'good bird sad day', 'day over ready']})
# column
# 0 good day happy night
# 1 good bird sad day
# 2 day over ready
ls = ['good', 'day']
ls = set(ls)
df['column'].apply(lambda row: ' '.join(list(set(row.split()) & ls)))
# 0 good day
# 1 good day
# 2 day
# Name: column, dtype: object
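Since a set intersection does not guarantee word order, the output shown above is not guaranteed for every input. One possible variant, sketched here as an assumption rather than as the answerer's method, keeps each word once but in order of first appearance by using dict.fromkeys (dicts preserve insertion order in Python 3.7+). The name unique_listed is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'column': ['good day happy night', 'good bird sad day', 'day over ready']})
keep = {'good', 'day'}

def unique_listed(row):
    # dict.fromkeys removes duplicate words while keeping first-seen order,
    # then only the words present in `keep` are joined back together.
    return ' '.join(w for w in dict.fromkeys(row.split()) if w in keep)

ordered = df['column'].apply(unique_listed)
print(ordered.tolist())
# ['good day', 'good day', 'day']
```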