Compare two columns of pandas dataframe with a list of strings
This is my dataframe:
import pandas as pd
df = pd.DataFrame({'a': ['axy a', 'xyz b'], 'b': ['obj e', 'oaw r']})
And I have a list of strings:
s1 = 'lorem obj e'
s2 = 'lorem obj e lorem axy a'
s3 = 'lorem xyz b lorem oaw r'
s4 = 'lorem lorem oaw r'
s5 = 'lorem lorem axy a lorem obj e'
s_all = [s1, s2, s3, s4, s5]
Now I want to take every row and check whether both columns of the row are present in any of the strings in s_all. For example, for the first row I select 'axy a' and 'obj e' and check whether both of them are present in the strings of s_all. Both of them are present in s2 and s5.
The outcome that I want looks like this:
       a      b                              c
0  axy a  obj e        lorem obj e lorem axy a
1  axy a  obj e  lorem lorem axy a lorem obj e
2  xyz b  oaw r        lorem xyz b lorem oaw r
Here is my attempt, but it didn't work:
l = []
for sentence in s_all:
    for i in range(len(df)):
        if df.a.values[i] in sentence and df.b.values[i] in sentence:
            l.append(sentence)
        else:
            l.append(np.nan)
I tried to append the results to a list and then use that list to create the c column that I want, but it didn't work.
You can create a new Series using apply and explode, and then concat it with your DataFrame:
match_series = df.apply(lambda row: [s for s in s_all if row['a'] in s and row['b'] in s], axis=1).explode()
pd.concat([df, match_series], axis=1)
Output:
       a      b                              0
0  axy a  obj e        lorem obj e lorem axy a
0  axy a  obj e  lorem lorem axy a lorem obj e
1  xyz b  oaw r        lorem xyz b lorem oaw r
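If the unnamed column 0 is inconvenient, one possible variation is to name the exploded Series and attach it with DataFrame.join, which aligns on the index. This is only a sketch and not part of the answer above; the column name c and the final reset_index call are assumptions made to match the desired output in the question.

```python
import pandas as pd

df = pd.DataFrame({'a': ['axy a', 'xyz b'], 'b': ['obj e', 'oaw r']})
s_all = [
    'lorem obj e',
    'lorem obj e lorem axy a',
    'lorem xyz b lorem oaw r',
    'lorem lorem oaw r',
    'lorem lorem axy a lorem obj e',
]

# One list of matching sentences per row, then one row per match.
match_series = df.apply(
    lambda row: [s for s in s_all if row['a'] in s and row['b'] in s],
    axis=1,
).explode().rename('c')

# join aligns on the index, so a row of df repeats once per match.
result = df.join(match_series).reset_index(drop=True)
print(result)
```

Naming the Series first means the new column arrives as c rather than 0, so no rename is needed afterwards.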
You can write a small helper function and apply it row by row to your df:
def func(row):
    out = []
    a, b = row
    for s in s_all:
        if all([a in s, b in s]):
            out.append(s)
    return out

# if you have more than 2 columns or don't know how many, here is a more general approach
# other than that, same function as above
def func(row):
    out = []
    for s in s_all:
        if all([string in s for string in row.tolist()]):
            out.append(s)
    return out

df['c'] = df.apply(func, axis=1)
Or as a one-liner with a lambda function:
df['c'] = df.apply(lambda row: [s for s in s_all if all(string in s for string in row.tolist())], axis=1)
The function returns a list of results. To make each list element its own row, we use explode:
df = df.explode(column='c')
print(df)
Output:
       a      b                              c
0  axy a  obj e        lorem obj e lorem axy a
0  axy a  obj e  lorem lorem axy a lorem obj e
1  xyz b  oaw r        lorem xyz b lorem oaw r
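The exploded frame keeps the original row labels (0, 0, 1), whereas the desired output in the question is numbered 0, 1, 2. A minimal sketch of one way to get there, assuming pandas 1.1 or later, where explode accepts ignore_index:

```python
import pandas as pd

df = pd.DataFrame({'a': ['axy a', 'xyz b'], 'b': ['obj e', 'oaw r']})
s_all = [
    'lorem obj e',
    'lorem obj e lorem axy a',
    'lorem xyz b lorem oaw r',
    'lorem lorem oaw r',
    'lorem lorem axy a lorem obj e',
]

def func(row):
    # Keep every sentence that contains all of the row's values.
    return [s for s in s_all if all(value in s for value in row.tolist())]

df['c'] = df.apply(func, axis=1)

# ignore_index=True renumbers the rows 0..n-1 after exploding.
df = df.explode('c', ignore_index=True)
print(df)
```

On older pandas versions, df.explode('c').reset_index(drop=True) should give the same numbering.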
Because the patterns in a and b can occur in more than one reference string, you need to repeat their entries as well. This is done by appending to l_a and l_b. A new dataframe df_new is then constructed from the three lists. Modifying your for loop will do.
l = []
l_a = []
l_b = []
for i in range(len(df)):
    for sentence in s_all:
        if df.a.values[i] in sentence and df.b.values[i] in sentence:
            l.append(sentence)
            l_a.append(df.a.values[i])
            l_b.append(df.b.values[i])
df_new = pd.DataFrame({'a' : l_a, 'b' : l_b, 'c' : l})
This yields:
       a      b                              c
0  axy a  obj e        lorem obj e lorem axy a
1  axy a  obj e  lorem lorem axy a lorem obj e
2  xyz b  oaw r        lorem xyz b lorem oaw r
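The same loop can also be written more compactly. The sketch below is one possible variant, not the answerer's code: it iterates over the rows with itertuples and collects the matching (a, b, sentence) triples directly, so the three parallel lists are not needed.

```python
import pandas as pd

df = pd.DataFrame({'a': ['axy a', 'xyz b'], 'b': ['obj e', 'oaw r']})
s_all = [
    'lorem obj e',
    'lorem obj e lorem axy a',
    'lorem xyz b lorem oaw r',
    'lorem lorem oaw r',
    'lorem lorem axy a lorem obj e',
]

# One (a, b, sentence) triple per match; rows drive the outer loop,
# so matches for the same row stay next to each other.
rows = [
    (a, b, sentence)
    for a, b in df[['a', 'b']].itertuples(index=False)
    for sentence in s_all
    if a in sentence and b in sentence
]

df_new = pd.DataFrame(rows, columns=['a', 'b', 'c'])
print(df_new)
```

Rows of df that match no sentence simply do not appear in df_new, which is the same behaviour as the loop above.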