Pandas compare items in list in one column with single value in another column
Consider this two-column df. I would like to create an apply function that compares each item in the list in the "other_yrs" column with the single integer in the "cur" column and counts the items in the list that are less than or equal to the value in "cur". I cannot figure out how to get pandas to do this with apply. I am using apply functions for other purposes and they are working well. Any ideas would be very much appreciated.
   cur  other_yrs
1  11   [11, 11]
2  12   [16, 13, 12, 9, 9, 6, 6, 3, 3, 3, 2, 1, 0]
4  16   [15, 85]
5  17   [17, 17, 16]
6  13   [8, 8]
Below is the function I used to extract the values into the "other_yrs" column. I am thinking I can just insert into this function some way of comparing each successive value in the list with the "cur" column value and keeping count. I really only need to store the count of how many of the list items are <= the value in the "cur" column.
def col_check(col_string):
    cs_yr_lst = []
    count = 0
    if len(col_string) < 1:  # avoids col values of 0 meaning no other cases.
        pass
    else:
        case_lst = col_string.split(", ")  # splits the string of cases into a list
        for i in case_lst:
            cs_yr = int(i[3:5])  # gets the case year from each individual case number
            cs_yr_lst.append(cs_yr)  # stores those integers in a list and then into a new column using apply
    return cs_yr_lst
The expected output would be this:
   cur  other_yrs                                   count
1  11   [11, 11]                                    2
2  12   [16, 13, 12, 9, 9, 6, 6, 3, 3, 3, 2, 1, 0]  11
4  16   [15, 85]                                    1
5  17   [17, 17, 16]                                3
6  13   [8, 8]                                      2
Use zip inside a list comprehension to zip the columns cur and other_yrs, and use np.sum on a boolean mask:
df['count'] = [np.sum(np.array(b) <= a) for a, b in zip(df['cur'], df['other_yrs'])]
Another idea:
df['count'] = pd.DataFrame(df['other_yrs'].tolist(), index=df.index).le(df['cur'], axis=0).sum(1)
Result:
   cur  other_yrs                                   count
1  11   [11, 11]                                    2
2  12   [16, 13, 12, 9, 9, 6, 6, 3, 3, 3, 2, 1, 0]  11
4  16   [15, 85]                                    1
5  17   [17, 17, 16]                                3
6  13   [8, 8]                                      2
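For anyone who wants to run this end to end, here is a minimal sketch. The way df is built is an assumption about how the data is stored (integer cur, plain Python lists in other_yrs), rebuilt from the sample in the question. It repeats the zip approach and adds a row-wise apply variant, since the question asked for apply specifically; count_zip and count_apply are illustrative names.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (assumed storage: ints and Python lists).
df = pd.DataFrame(
    {
        "cur": [11, 12, 16, 17, 13],
        "other_yrs": [
            [11, 11],
            [16, 13, 12, 9, 9, 6, 6, 3, 3, 3, 2, 1, 0],
            [15, 85],
            [17, 17, 16],
            [8, 8],
        ],
    },
    index=[1, 2, 4, 5, 6],
)

# zip approach: one boolean mask per row, summed to a count.
count_zip = [int(np.sum(np.array(b) <= a)) for a, b in zip(df["cur"], df["other_yrs"])]

# apply approach: axis=1 passes each row, so both columns are visible at once.
count_apply = df.apply(
    lambda row: sum(yr <= row["cur"] for yr in row["other_yrs"]), axis=1
)

print(count_zip)
print(count_apply.tolist())
```

The apply version is usually the slower of the two on large frames because it builds a Series for every row, but it is the closest match to what the question describes.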
You can consider explode, then compare, group on level=0 and sum:
u = df.explode('other_yrs')
# Series.sum(level=0) was removed in pandas 2.0; group on the index level instead.
df['Count'] = u['cur'].ge(u['other_yrs']).groupby(level=0).sum().astype(int)
print(df)
   cur  other_yrs                                   Count
1  11   [11, 11]                                    2
2  12   [16, 13, 12, 9, 9, 6, 6, 3, 3, 3, 2, 1, 0]  11
4  16   [15, 85]                                    1
5  17   [17, 17, 16]                                3
6  13   [8, 8]                                      2
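A self-contained sketch of the explode approach, assuming the same sample data as in the question. It groups on the index level with groupby(level=0) before summing, because the level argument of Series.sum is not available in pandas 2.0 and later. The pd.to_numeric step is my addition: after explode the other_yrs column has object dtype, and an empty list would become NaN.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed storage: ints and Python lists).
df = pd.DataFrame(
    {
        "cur": [11, 12, 16, 17, 13],
        "other_yrs": [
            [11, 11],
            [16, 13, 12, 9, 9, 6, 6, 3, 3, 3, 2, 1, 0],
            [15, 85],
            [17, 17, 16],
            [8, 8],
        ],
    },
    index=[1, 2, 4, 5, 6],
)

# One row per list item; the original index label is repeated for each item.
u = df.explode("other_yrs")

# Compare numerically; NaN (from an empty list) compares as False.
mask = u["cur"].ge(pd.to_numeric(u["other_yrs"]))

# Sum the True values per original row and align back on the index.
df["Count"] = mask.groupby(level=0).sum().astype(int)
print(df)
```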
If the columns in both dataframes contain millions of records and each element in the first column has to be compared with all the elements in the second column, then the following code might be helpful.
for element in Dataframe1.Column1:
    Dataframe2[Dataframe2.Column2.isin([element])]
The above code snippet will return, one by one, the rows of dataframe2 where an element from dataframe1 is found in dataframe2.column2.
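With millions of rows, a Python-level loop that calls isin once per element can be slow. If what is needed is the set of matching rows, a single isin call over the whole column may be enough. The sketch below uses small hypothetical frames (the data, the value column, and the names matches and rows_by_element are made up for illustration).

```python
import pandas as pd

# Hypothetical frames; the column names mirror the snippet above.
Dataframe1 = pd.DataFrame({"Column1": [1, 3, 5]})
Dataframe2 = pd.DataFrame({"Column2": [5, 2, 3, 3, 7], "value": list("abcde")})

# One membership test for the whole column instead of one per element.
matches = Dataframe2[Dataframe2["Column2"].isin(Dataframe1["Column1"])]

# If the rows are still needed per element, split the matches once by value.
rows_by_element = {key: grp for key, grp in matches.groupby("Column2")}

print(matches)
```

Elements of Dataframe1.Column1 that never occur in Dataframe2.Column2 (here the value 1) simply have no entry in rows_by_element.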