python - pass dataframe column as argument in apply function
I have the following dataframe:
In[1]: df = DataFrame({"A": ['I love cooking','I love rowing'], "B": [['cooking','rowing'],['cooking','rowing']]})
The output that I get is:
In[2]: df
Out[1]:
                A                  B
0  I love cooking  [cooking, rowing]
1   I love rowing  [cooking, rowing]
I want to create a 'C' column that counts, for each row, how many elements of 'B' occur in 'A'.
The function I wrote is:
def count_keywords(x,y):
    a = 0
    for element in y:
        if element in x:
            a += 1
    return a
and then do:
df['A'].apply(count_keywords,args=(df['B'],))
In this case, I am passing the entire pandas Series as the argument, so each element of df['B'] is a list, not a string (the strings are the elements of that list).
So I get:
TypeError: 'in <string>' requires string as left operand, not list
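To make the cause of the error concrete, here is a minimal sketch (assuming pandas is imported as `pd`; the `seen` list and the `record_types` function are hypothetical names used only for illustration) that records what the function actually receives when a Series is passed through `args`:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["I love cooking", "I love rowing"],
    "B": [["cooking", "rowing"], ["cooking", "rowing"]],
})

seen = []  # hypothetical helper list, only records what apply passes in

def record_types(x, y):
    # x is one element of df['A']; y is whatever was handed over via args
    seen.append((type(x).__name__, type(y).__name__))
    return 0

df["A"].apply(record_types, args=(df["B"],))

# Each call gets one string from 'A' but the whole Series for 'B'
print(seen)

# Iterating over that Series yields a whole list per row, not a keyword
first_element = next(iter(df["B"]))
print(type(first_element).__name__)

# Testing a list against a string with `in` is what raises the TypeError
try:
    first_element in "I love cooking"
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

The point of the sketch is that `args` is forwarded unchanged to every call, so it cannot be used to pair up two columns row by row.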
However, if I adjust the function so that:
def count_keywords(x,y):
    a = 0
    for element in y:
        for new_element in element:
            if new_element in x:
                a += 1
    return a
and then do:
In[3]: df['A'].apply(count_keywords,args=(df['B'],))
the output is:
Out[2]:
0 2
1 2
This happens because, for every string in df['A'], the function loops through every element of the whole df['B'] series and then through every element of each list.
How can I get the function to check, row by row, the element of df['B'] against the element of df['A'], so that the output is:
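The count of 2 can be traced by hand for the first row. The following is a plain-Python sketch of what the nested loop does for one string (the names `rows_of_B`, `text` and `matches` are made up for this illustration):

```python
# The two lists stored in df['B'], one per dataframe row
rows_of_B = [["cooking", "rowing"], ["cooking", "rowing"]]

# The single string from df['A'] that one call of the function receives
text = "I love cooking"

# The nested loop visits all four keywords, regardless of which row they came from
matches = [kw for row in rows_of_B for kw in row if kw in text]

print(matches)       # ['cooking', 'cooking']
print(len(matches))  # 2
```

"cooking" is found once in each row's list, which is why every row ends up with 2 instead of 1.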
Out[2]:
0 1
1 1
Thanks a lot!
You have to apply over the other axis, i.e. call apply on the DataFrame with axis=1 so that the function receives one row at a time.
def count_keywords(row):
    counter = 0
    for e in row['B']:
        if e in row['A']:
            counter += 1
    row['C'] = counter
    return row
df2 = df.apply(count_keywords,axis=1)
Gives you:
                A                  B  C
0  I love cooking  [cooking, rowing]  1
1   I love rowing  [cooking, rowing]  1
Then df2['C'] should give you the 1, 1 series you mention.
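A possible variant of the same idea, sketched here as an alternative rather than as the answer's own code, returns just the count from the row function and assigns the result to a new column. This avoids modifying the row inside the function:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["I love cooking", "I love rowing"],
    "B": [["cooking", "rowing"], ["cooking", "rowing"]],
})

def count_keywords(row):
    # Count the keywords in row['B'] that occur as substrings of row['A']
    return sum(1 for keyword in row["B"] if keyword in row["A"])

# axis=1 passes each row to the function; scalar results come back as a Series
df["C"] = df.apply(count_keywords, axis=1)

print(df["C"].tolist())  # [1, 1]
```

Because the function returns a scalar, `df.apply(..., axis=1)` yields a Series aligned with the original index, which can be assigned directly as column 'C'.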
Another way you could do this is by using a set intersection to calculate the size. In theory this may be faster than iterating over the elements, since set is designed for this kind of thing:
df['C'] = df.apply(lambda x: len(set(x.B).intersection(set(x.A.split()))), axis = 1)
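One thing to keep in mind is that the set version compares whole whitespace-separated words, while the loop version checks substrings, so the two can disagree. The sketch below illustrates this with a list comprehension over `zip`, which is an alternative to `apply` and not part of the answers above. The third row and the column names `C_substring` and `C_words` are hypothetical, added only to show the difference:

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["I love cooking", "I love rowing", "rowing, then cooking"],  # third row is made up
    "B": [["cooking", "rowing"]] * 3,
})

# Substring matching, equivalent to the `if e in row['A']` loop
df["C_substring"] = [
    sum(keyword in text for keyword in keywords)
    for text, keywords in zip(df["A"], df["B"])
]

# Whole-word matching, equivalent to the set intersection on text.split()
df["C_words"] = [
    len(set(keywords) & set(text.split()))
    for text, keywords in zip(df["A"], df["B"])
]

print(df[["C_substring", "C_words"]])
```

In the made-up third row, "rowing," keeps its comma after `split()`, so the set intersection finds only "cooking", whereas the substring check finds both keywords. Which behavior is right depends on what should count as an occurrence.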